Re: [PATCH RFC v3 28/35] arm64: mte: swap: Handle tag restoring when missing tag storage

2024-02-02 Thread Peter Collingbourne

Re: [PATCH RFC v3 28/35] arm64: mte: swap: Handle tag restoring when missing tag storage

2024-02-02 Thread Evgenii Stepanov

Re: [PATCH RFC v3 28/35] arm64: mte: swap: Handle tag restoring when missing tag storage

2024-02-02 Thread Alexandru Elisei

Re: [PATCH RFC v3 28/35] arm64: mte: swap: Handle tag restoring when missing tag storage

2024-02-01 Thread Peter Collingbourne

[PATCH AUTOSEL 6.7 13/39] ring-buffer: Do no swap cpu buffers if order is different

2024-01-28 Thread Sasha Levin
From: "Steven Rostedt (Google)" 

[ Upstream commit b81e03a24966dca0b119eff0549a4e44befff419 ]

As all the subbuffer orders (subbuffer sizes) must be the same throughout
the ring buffer, check the order of the buffers that are doing a CPU
buffer swap in ring_buffer_swap_cpu() to make sure they are the same.

If they are not the same, then fail to do the swap, otherwise the ring
buffer will think the CPU buffer has a specific subbuffer size when it
does not.

Link: https://lore.kernel.org/linux-trace-kernel/20231219185629.467894...@goodmis.org

Cc: Masami Hiramatsu 
Cc: Mark Rutland 
Cc: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Tzvetomir Stoyanov 
Cc: Vincent Donnefort 
Cc: Kent Overstreet 
Signed-off-by: Steven Rostedt (Google) 
Signed-off-by: Sasha Levin 
---
 kernel/trace/ring_buffer.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9286f88fcd32..f9d9309884d1 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5461,6 +5461,9 @@ int ring_buffer_swap_cpu(struct trace_buffer *buffer_a,
if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages)
goto out;
 
+   if (buffer_a->subbuf_order != buffer_b->subbuf_order)
+   goto out;
+
ret = -EAGAIN;
 
	if (atomic_read(&cpu_buffer_a->record_disabled))
-- 
2.43.0
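
For context, the subbuffer "order" compared above is the power-of-two number
of pages backing each subbuffer, so comparing orders is equivalent to
comparing subbuffer sizes. A small illustrative helper (written for this
note, not taken from the patch) makes the relationship explicit:

	/* Illustration only: the order encodes the subbuffer size as
	 * PAGE_SIZE << order, so equal orders imply equal subbuffer sizes,
	 * which is the invariant ring_buffer_swap_cpu() now enforces. */
	static inline unsigned long subbuf_size(unsigned int subbuf_order)
	{
		return PAGE_SIZE << subbuf_order; /* 0 = 1 page, 1 = 2 pages, ... */
	}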




[PATCH RFC v3 28/35] arm64: mte: swap: Handle tag restoring when missing tag storage

2024-01-25 Thread Alexandru Elisei
Linux restores tags when a page is swapped in and there are tags associated
with the swap entry which the new page will replace. The saved tags are
restored even if the page will not be mapped as tagged, to protect against
cases where the page is shared between different VMAs, and is tagged in
some, but untagged in others. By using this approach, the process can still
access the correct tags following an mprotect(PROT_MTE) on the non-MTE
enabled VMA.

But this poses a challenge for managing tag storage: in the scenario above,
when a new page is allocated to be swapped in for the process where it will
be mapped as untagged, the corresponding tag storage block is not reserved.
mte_restore_page_tags_by_swp_entry(), when it restores the saved tags, will
overwrite data in the tag storage block associated with the new page,
leading to data corruption if the block is in use by a process.

Get around this issue by saving the tags in a new xarray, this time indexed
by the page pfn, and then restoring them when tag storage is reserved for
the page.

Signed-off-by: Alexandru Elisei 
---

Changes since rfc v2:

* Restore saved tags **before** setting the PG_tag_storage_reserved bit to
eliminate a brief window of opportunity where userspace can access uninitialized
tags (Peter Collingbourne).

 arch/arm64/include/asm/mte_tag_storage.h |   8 ++
 arch/arm64/include/asm/pgtable.h |  11 +++
 arch/arm64/kernel/mte_tag_storage.c  |  12 ++-
 arch/arm64/mm/mteswap.c  | 110 +++
 4 files changed, 140 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index 50bdae94cf71..40590a8c3748 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -36,6 +36,14 @@ bool page_is_tag_storage(struct page *page);
 
 vm_fault_t handle_folio_missing_tag_storage(struct folio *folio, struct vm_fault *vmf,
					    bool *map_pte);
+vm_fault_t mte_try_transfer_swap_tags(swp_entry_t entry, struct page *page);
+
+void tags_by_pfn_lock(void);
+void tags_by_pfn_unlock(void);
+
+void *mte_erase_tags_for_pfn(unsigned long pfn);
+bool mte_save_tags_for_pfn(void *tags, unsigned long pfn);
+void mte_restore_tags_for_pfn(unsigned long start_pfn, int order);
 #else
 static inline bool tag_storage_enabled(void)
 {
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0174e292f890..87ae59436162 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1085,6 +1085,17 @@ static inline void arch_swap_invalidate_area(int type)
mte_invalidate_tags_area_by_swp_entry(type);
 }
 
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+#define __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE
+static inline vm_fault_t arch_swap_prepare_to_restore(swp_entry_t entry,
+ struct folio *folio)
+{
+   if (tag_storage_enabled())
+   return mte_try_transfer_swap_tags(entry, &folio->page);
+   return 0;
+}
+#endif
+
 #define __HAVE_ARCH_SWAP_RESTORE
 static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
 {
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index afe2bb754879..ac7b9c9c585c 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -567,6 +567,7 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
}
}
 
+   mte_restore_tags_for_pfn(page_to_pfn(page), order);
page_set_tag_storage_reserved(page, order);
 out_unlock:
	mutex_unlock(&tag_blocks_lock);
@@ -595,7 +596,8 @@ void free_tag_storage(struct page *page, int order)
struct tag_region *region;
unsigned long page_va;
unsigned long flags;
-   int ret;
+   void *tags;
+   int i, ret;
 
ret = tag_storage_find_block(page, _block, );
	if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
@@ -605,6 +607,14 @@ void free_tag_storage(struct page *page, int order)
/* Avoid writeback of dirty tag cache lines corrupting data. */
dcache_inval_tags_poc(page_va, page_va + (PAGE_SIZE << order));
 
+   tags_by_pfn_lock();
+   for (i = 0; i < (1 << order); i++) {
+   tags = mte_erase_tags_for_pfn(page_to_pfn(page + i));
+   if (unlikely(tags))
+   mte_free_tag_buf(tags);
+   }
+   tags_by_pfn_unlock();
+
	end_block = start_block + order_to_num_blocks(order, region->block_size_pages);
 
	xa_lock_irqsave(&tag_blocks_reserved, flags);
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index 2a43746b803f..e11495fa3c18 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -20,6 +20,112 @@ void mte_free_tag_buf(void *buf)
kfree(buf);
 }
 
+#ifd
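
As orientation for the series, here is a minimal sketch of the pfn-indexed
xarray mechanism the commit message describes. The function names mirror the
declarations earlier in the patch, but the bodies are illustrative
assumptions (the tags_by_pfn_lock()/tags_by_pfn_unlock() serialization is
omitted), not the patch's actual mteswap.c code:

	/* Sketch only -- not the patch's implementation. Saved tags are
	 * parked in an xarray keyed by pfn, then copied to the page and
	 * freed once tag storage has been reserved for it. */
	static DEFINE_XARRAY(tags_by_pfn);

	bool mte_save_tags_for_pfn(void *tags, unsigned long pfn)
	{
		/* xa_store() returns the displaced entry or an xa_err() value */
		return !xa_is_err(xa_store(&tags_by_pfn, pfn, tags, GFP_KERNEL));
	}

	void *mte_erase_tags_for_pfn(unsigned long pfn)
	{
		return xa_erase(&tags_by_pfn, pfn);
	}

	void mte_restore_tags_for_pfn(unsigned long start_pfn, int order)
	{
		unsigned long pfn;
		void *tags;

		for (pfn = start_pfn; pfn < start_pfn + (1UL << order); pfn++) {
			tags = mte_erase_tags_for_pfn(pfn);
			if (tags) {
				mte_restore_page_tags(page_address(pfn_to_page(pfn)), tags);
				mte_free_tag_buf(tags);
			}
		}
	}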

[PATCH v5 07/15] ring-buffer: Do no swap cpu buffers if order is different

2023-12-19 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

As all the subbuffer orders (subbuffer sizes) must be the same throughout
the ring buffer, check the order of the buffers that are doing a CPU
buffer swap in ring_buffer_swap_cpu() to make sure they are the same.

If they are not the same, then fail to do the swap, otherwise the ring
buffer will think the CPU buffer has a specific subbuffer size when it
does not.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 3c11e8e811ed..fdcd171b09b5 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5417,6 +5417,9 @@ int ring_buffer_swap_cpu(struct trace_buffer *buffer_a,
if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages)
goto out;
 
+   if (buffer_a->subbuf_order != buffer_b->subbuf_order)
+   goto out;
+
ret = -EAGAIN;
 
	if (atomic_read(&cpu_buffer_a->record_disabled))
-- 
2.42.0





[PATCH v4 07/15] ring-buffer: Do no swap cpu buffers if order is different

2023-12-15 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

As all the subbuffer orders (subbuffer sizes) must be the same throughout
the ring buffer, check the order of the buffers that are doing a CPU
buffer swap in ring_buffer_swap_cpu() to make sure they are the same.

If they are not the same, then fail to do the swap, otherwise the ring
buffer will think the CPU buffer has a specific subbuffer size when it
does not.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index dd03887a4737..695b64fbc1cb 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5350,6 +5350,9 @@ int ring_buffer_swap_cpu(struct trace_buffer *buffer_a,
if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages)
goto out;
 
+   if (buffer_a->subbuf_order != buffer_b->subbuf_order)
+   goto out;
+
ret = -EAGAIN;
 
	if (atomic_read(&cpu_buffer_a->record_disabled))
-- 
2.42.0





[PATCH v3 07/15] ring-buffer: Do no swap cpu buffers if order is different

2023-12-13 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

As all the subbuffer orders (subbuffer sizes) must be the same throughout
the ring buffer, check the order of the buffers that are doing a CPU
buffer swap in ring_buffer_swap_cpu() to make sure they are the same.

If they are not the same, then fail to do the swap, otherwise the ring
buffer will think the CPU buffer has a specific subbuffer size when it
does not.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index ea3c217e4e43..2c908b4f3f68 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5559,6 +5559,9 @@ int ring_buffer_swap_cpu(struct trace_buffer *buffer_a,
if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages)
goto out;
 
+   if (buffer_a->subbuf_order != buffer_b->subbuf_order)
+   goto out;
+
ret = -EAGAIN;
 
	if (atomic_read(&cpu_buffer_a->record_disabled))
-- 
2.42.0





[PATCH 07/14] ring-buffer: Do no swap cpu buffers if order is different

2023-12-09 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

As all the subbuffer orders (subbuffer sizes) must be the same throughout
the ring buffer, check the order of the buffers that are doing a CPU
buffer swap in ring_buffer_swap_cpu() to make sure they are the same.

If they are not the same, then fail to do the swap, otherwise the ring
buffer will think the CPU buffer has a specific subbuffer size when it
does not.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 4e7eb41695f5..9725aab1b5eb 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5550,6 +5550,9 @@ int ring_buffer_swap_cpu(struct trace_buffer *buffer_a,
if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages)
goto out;
 
+   if (buffer_a->subbuf_order != buffer_b->subbuf_order)
+   goto out;
+
ret = -EAGAIN;
 
	if (atomic_read(&cpu_buffer_a->record_disabled))
-- 
2.42.0





[PATCH RFC v2 22/27] arm64: mte: swap: Handle tag restoring when missing tag storage

2023-11-19 Thread Alexandru Elisei
Linux restores tags when a page is swapped in and there are tags associated
with the swap entry which the new page will replace. The saved tags are
restored even if the page will not be mapped as tagged, to protect against
cases where the page is shared between different VMAs, and is tagged in
some, but untagged in others. By using this approach, the process can still
access the correct tags following an mprotect(PROT_MTE) on the non-MTE
enabled VMA.

But this poses a challenge for managing tag storage: in the scenario above,
when a new page is allocated to be swapped in for the process where it will
be mapped as untagged, the corresponding tag storage block is not reserved.
mte_restore_page_tags_by_swp_entry(), when it restores the saved tags, will
overwrite data in the tag storage block associated with the new page,
leading to data corruption if the block is in use by a process.

Get around this issue by saving the tags in a new xarray, this time indexed
by the page pfn, and then restoring them when tag storage is reserved for
the page.

Signed-off-by: Alexandru Elisei 
---
 arch/arm64/include/asm/mte_tag_storage.h |   9 ++
 arch/arm64/include/asm/pgtable.h |  11 +++
 arch/arm64/kernel/mte_tag_storage.c  |  20 +++-
 arch/arm64/mm/mteswap.c  | 112 +++
 4 files changed, 148 insertions(+), 4 deletions(-)

diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index 6a8b19a6a758..a3c38099fe1a 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -37,6 +37,15 @@ bool page_is_tag_storage(struct page *page);
 
 vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
 vm_fault_t handle_huge_page_missing_tag_storage(struct vm_fault *vmf);
+
+void tags_by_pfn_lock(void);
+void tags_by_pfn_unlock(void);
+
+void *mte_erase_tags_for_pfn(unsigned long pfn);
+bool mte_save_tags_for_pfn(void *tags, unsigned long pfn);
+void mte_restore_tags_for_pfn(unsigned long start_pfn, int order);
+
+vm_fault_t mte_try_transfer_swap_tags(swp_entry_t entry, struct page *page);
 #else
 static inline bool tag_storage_enabled(void)
 {
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 1704411c096d..1a25b7d601c2 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1084,6 +1084,17 @@ static inline void arch_swap_invalidate_area(int type)
mte_invalidate_tags_area_by_swp_entry(type);
 }
 
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+#define __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE
+static inline vm_fault_t arch_swap_prepare_to_restore(swp_entry_t entry,
+ struct folio *folio)
+{
+   if (tag_storage_enabled())
+   return mte_try_transfer_swap_tags(entry, &folio->page);
+   return 0;
+}
+#endif
+
 #define __HAVE_ARCH_SWAP_RESTORE
 static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
 {
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 5096ce859136..6b11bb408b51 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -547,8 +547,10 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
	mutex_lock(&tag_blocks_lock);
 
/* Check again, this time with the lock held. */
-   if (page_tag_storage_reserved(page))
-   goto out_unlock;
+   if (page_tag_storage_reserved(page)) {
+   mutex_unlock(&tag_blocks_lock);
+   return 0;
+   }
 
	/* Make sure existing entries are not freed out from under our feet. */
	xa_lock_irqsave(&tag_blocks_reserved, flags);
@@ -583,9 +585,10 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
}
 
page_set_tag_storage_reserved(page, order);
-out_unlock:
	mutex_unlock(&tag_blocks_lock);
 
+   mte_restore_tags_for_pfn(page_to_pfn(page), order);
+
return 0;
 
 out_error:
@@ -612,7 +615,8 @@ void free_tag_storage(struct page *page, int order)
struct tag_region *region;
unsigned long page_va;
unsigned long flags;
-   int ret;
+   void *tags;
+   int i, ret;
 
ret = tag_storage_find_block(page, _block, );
	if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
@@ -622,6 +626,14 @@ void free_tag_storage(struct page *page, int order)
/* Avoid writeback of dirty tag cache lines corrupting data. */
dcache_inval_tags_poc(page_va, page_va + (PAGE_SIZE << order));
 
+   tags_by_pfn_lock();
+   for (i = 0; i < (1 << order); i++) {
+   tags = mte_erase_tags_for_pfn(page_to_pfn(page + i));
+   if (unlikely(tags))
+   mte_free_tag_buf(tags);
+   }
+   tags_by_pfn_unlock();
+
end_block = start_block + order_to_num_blocks(

Re: [PATCH v2 2/3] usb: typec: fsa4480: Add support to swap SBU orientation

2023-10-30 Thread Heikki Krogerus
On Fri, Oct 20, 2023 at 11:33:19AM +0200, Luca Weiss wrote:
> On some hardware designs the AUX+/- lanes are connected reversed to
> SBU1/2 compared to the design the FSA4480 expects.
> 
> To complicate matters, the otherwise compatible Orient-Chip OCP96011
> expects the lanes to be connected reversed compared to the FSA4480:
> 
> * FSA4480 block diagram shows AUX+ connected to SBU2 and AUX- to SBU1.
> * OCP96011 block diagram shows AUX+ connected to SBU1 and AUX- to SBU2.
> 
> So if the OCP96011 is used as a drop-in for the FSA4480, the orientation
> handling in the driver needs to be reversed to match the expectation of
> the OCP96011 hardware.
> 
> Support parsing the data-lanes parameter in the endpoint node to swap
> this in the driver.
> 
> The parse_data_lanes_mapping function is mostly taken from nb7vpq904m.c.
> 
> Reviewed-by: Neil Armstrong 
> Signed-off-by: Luca Weiss 

Reviewed-by: Heikki Krogerus 

> ---
>  drivers/usb/typec/mux/fsa4480.c | 71 
> +
>  1 file changed, 71 insertions(+)
> 
> diff --git a/drivers/usb/typec/mux/fsa4480.c b/drivers/usb/typec/mux/fsa4480.c
> index e0ee1f621abb..cb7cdf90cb0a 100644
> --- a/drivers/usb/typec/mux/fsa4480.c
> +++ b/drivers/usb/typec/mux/fsa4480.c
> @@ -60,6 +60,7 @@ struct fsa4480 {
>   unsigned int svid;
>  
>   u8 cur_enable;
> + bool swap_sbu_lanes;
>  };
>  
>  static const struct regmap_config fsa4480_regmap_config = {
> @@ -76,6 +77,9 @@ static int fsa4480_set(struct fsa4480 *fsa)
>   u8 enable = FSA4480_ENABLE_DEVICE;
>   u8 sel = 0;
>  
> + if (fsa->swap_sbu_lanes)
> + reverse = !reverse;
> +
>   /* USB Mode */
>   if (fsa->mode < TYPEC_STATE_MODAL ||
>   (!fsa->svid && (fsa->mode == TYPEC_MODE_USB2 ||
> @@ -179,12 +183,75 @@ static int fsa4480_mux_set(struct typec_mux_dev *mux, 
> struct typec_mux_state *st
>   return ret;
>  }
>  
> +enum {
> + NORMAL_LANE_MAPPING,
> + INVERT_LANE_MAPPING,
> +};
> +
> +#define DATA_LANES_COUNT 2
> +
> +static const int supported_data_lane_mapping[][DATA_LANES_COUNT] = {
> + [NORMAL_LANE_MAPPING] = { 0, 1 },
> + [INVERT_LANE_MAPPING] = { 1, 0 },
> +};
> +
> +static int fsa4480_parse_data_lanes_mapping(struct fsa4480 *fsa)
> +{
> + struct fwnode_handle *ep;
> + u32 data_lanes[DATA_LANES_COUNT];
> + int ret, i, j;
> +
> + ep = fwnode_graph_get_next_endpoint(dev_fwnode(>client->dev), 
> NULL);
> + if (!ep)
> + return 0;
> +
> + ret = fwnode_property_read_u32_array(ep, "data-lanes", data_lanes, 
> DATA_LANES_COUNT);
> + if (ret == -EINVAL)
> + /* Property isn't here, consider default mapping */
> + goto out_done;
> + if (ret) {
> + dev_err(>client->dev, "invalid data-lanes property: %d\n", 
> ret);
> + goto out_error;
> + }
> +
> + for (i = 0; i < ARRAY_SIZE(supported_data_lane_mapping); i++) {
> + for (j = 0; j < DATA_LANES_COUNT; j++) {
> + if (data_lanes[j] != supported_data_lane_mapping[i][j])
> + break;
> + }
> +
> + if (j == DATA_LANES_COUNT)
> + break;
> + }
> +
> + switch (i) {
> + case NORMAL_LANE_MAPPING:
> + break;
> + case INVERT_LANE_MAPPING:
> + fsa->swap_sbu_lanes = true;
> + break;
> + default:
> + dev_err(>client->dev, "invalid data-lanes mapping\n");
> + ret = -EINVAL;
> + goto out_error;
> + }
> +
> +out_done:
> + ret = 0;
> +
> +out_error:
> + fwnode_handle_put(ep);
> +
> + return ret;
> +}
> +
>  static int fsa4480_probe(struct i2c_client *client)
>  {
>   struct device *dev = >dev;
>   struct typec_switch_desc sw_desc = { };
>   struct typec_mux_desc mux_desc = { };
>   struct fsa4480 *fsa;
> + int ret;
>  
>   fsa = devm_kzalloc(dev, sizeof(*fsa), GFP_KERNEL);
>   if (!fsa)
> @@ -193,6 +260,10 @@ static int fsa4480_probe(struct i2c_client *client)
>   fsa->client = client;
>   mutex_init(>lock);
>  
> + ret = fsa4480_parse_data_lanes_mapping(fsa);
> + if (ret)
> + return ret;
> +
>   fsa->regmap = devm_regmap_init_i2c(client, _regmap_config);
>   if (IS_ERR(fsa->regmap))
>   return dev_err_probe(dev, PTR_ERR(fsa->regmap), "failed to 
> initialize regmap\n");
> 
> -- 
> 2.42.0

-- 
heikki


[PATCH v2 2/3] usb: typec: fsa4480: Add support to swap SBU orientation

2023-10-20 Thread Luca Weiss
On some hardware designs the AUX+/- lanes are connected reversed to
SBU1/2 compared to the design the FSA4480 expects.

To complicate matters, the otherwise compatible Orient-Chip OCP96011
expects the lanes to be connected reversed compared to the FSA4480:

* FSA4480 block diagram shows AUX+ connected to SBU2 and AUX- to SBU1.
* OCP96011 block diagram shows AUX+ connected to SBU1 and AUX- to SBU2.

So if the OCP96011 is used as a drop-in for the FSA4480, the orientation
handling in the driver needs to be reversed to match the expectation of
the OCP96011 hardware.

Support parsing the data-lanes parameter in the endpoint node to swap
this in the driver.

The parse_data_lanes_mapping function is mostly taken from nb7vpq904m.c.

Reviewed-by: Neil Armstrong 
Signed-off-by: Luca Weiss 
---
 drivers/usb/typec/mux/fsa4480.c | 71 +
 1 file changed, 71 insertions(+)

diff --git a/drivers/usb/typec/mux/fsa4480.c b/drivers/usb/typec/mux/fsa4480.c
index e0ee1f621abb..cb7cdf90cb0a 100644
--- a/drivers/usb/typec/mux/fsa4480.c
+++ b/drivers/usb/typec/mux/fsa4480.c
@@ -60,6 +60,7 @@ struct fsa4480 {
unsigned int svid;
 
u8 cur_enable;
+   bool swap_sbu_lanes;
 };
 
 static const struct regmap_config fsa4480_regmap_config = {
@@ -76,6 +77,9 @@ static int fsa4480_set(struct fsa4480 *fsa)
u8 enable = FSA4480_ENABLE_DEVICE;
u8 sel = 0;
 
+   if (fsa->swap_sbu_lanes)
+   reverse = !reverse;
+
/* USB Mode */
if (fsa->mode < TYPEC_STATE_MODAL ||
(!fsa->svid && (fsa->mode == TYPEC_MODE_USB2 ||
@@ -179,12 +183,75 @@ static int fsa4480_mux_set(struct typec_mux_dev *mux, struct typec_mux_state *st
return ret;
 }
 
+enum {
+   NORMAL_LANE_MAPPING,
+   INVERT_LANE_MAPPING,
+};
+
+#define DATA_LANES_COUNT   2
+
+static const int supported_data_lane_mapping[][DATA_LANES_COUNT] = {
+   [NORMAL_LANE_MAPPING] = { 0, 1 },
+   [INVERT_LANE_MAPPING] = { 1, 0 },
+};
+
+static int fsa4480_parse_data_lanes_mapping(struct fsa4480 *fsa)
+{
+   struct fwnode_handle *ep;
+   u32 data_lanes[DATA_LANES_COUNT];
+   int ret, i, j;
+
+   ep = fwnode_graph_get_next_endpoint(dev_fwnode(&fsa->client->dev), NULL);
+   if (!ep)
+   return 0;
+
+   ret = fwnode_property_read_u32_array(ep, "data-lanes", data_lanes, DATA_LANES_COUNT);
+   if (ret == -EINVAL)
+   /* Property isn't here, consider default mapping */
+   goto out_done;
+   if (ret) {
+   dev_err(&fsa->client->dev, "invalid data-lanes property: %d\n", ret);
+   goto out_error;
+   }
+
+   for (i = 0; i < ARRAY_SIZE(supported_data_lane_mapping); i++) {
+   for (j = 0; j < DATA_LANES_COUNT; j++) {
+   if (data_lanes[j] != supported_data_lane_mapping[i][j])
+   break;
+   }
+
+   if (j == DATA_LANES_COUNT)
+   break;
+   }
+
+   switch (i) {
+   case NORMAL_LANE_MAPPING:
+   break;
+   case INVERT_LANE_MAPPING:
+   fsa->swap_sbu_lanes = true;
+   break;
+   default:
+   dev_err(&fsa->client->dev, "invalid data-lanes mapping\n");
+   ret = -EINVAL;
+   goto out_error;
+   }
+
+out_done:
+   ret = 0;
+
+out_error:
+   fwnode_handle_put(ep);
+
+   return ret;
+}
+
 static int fsa4480_probe(struct i2c_client *client)
 {
	struct device *dev = &client->dev;
struct typec_switch_desc sw_desc = { };
struct typec_mux_desc mux_desc = { };
struct fsa4480 *fsa;
+   int ret;
 
fsa = devm_kzalloc(dev, sizeof(*fsa), GFP_KERNEL);
if (!fsa)
@@ -193,6 +260,10 @@ static int fsa4480_probe(struct i2c_client *client)
fsa->client = client;
	mutex_init(&fsa->lock);
 
+   ret = fsa4480_parse_data_lanes_mapping(fsa);
+   if (ret)
+   return ret;
+
	fsa->regmap = devm_regmap_init_i2c(client, &fsa4480_regmap_config);
if (IS_ERR(fsa->regmap))
	return dev_err_probe(dev, PTR_ERR(fsa->regmap), "failed to initialize regmap\n");

-- 
2.42.0



Re: [PATCH 2/3] usb: typec: fsa4480: Add support to swap SBU orientation

2023-10-18 Thread Heikki Krogerus
Hi Luca,

> > Shouldn't you loop through the endpoints? In any case:
> >
> > ep = fwnode_graph_get_next_endpoint(dev_fwnode(&fsa->client->dev), NULL);
> 
> The docs only mention one endpoint so I'm assuming just next_endpoint is
> fine?

I'm mostly concerned about what we may have in the future. If one day
you have more than the one connection in your graph, then you have to
be able to identify the endpoint you are after.

But that may not be a problem in this case (maybe that "data-lanes"
device property can be used to identify the correct endpoint?).

We can worry about it then when/if we ever have another endpoint to
deal with.

thanks,

-- 
heikki


Re: [PATCH 2/3] usb: typec: fsa4480: Add support to swap SBU orientation

2023-10-17 Thread Luca Weiss
On Tue Oct 17, 2023 at 11:01 AM CEST, Heikki Krogerus wrote:
> Hi Luca,
>
> On Fri, Oct 13, 2023 at 01:38:06PM +0200, Luca Weiss wrote:
> > On some hardware designs the AUX+/- lanes are connected reversed to
> > SBU1/2 compared to the design the FSA4480 expects.
> > 
> > To complicate matters, the otherwise compatible Orient-Chip OCP96011
> > expects the lanes to be connected reversed compared to the FSA4480:
> > 
> > * FSA4480 block diagram shows AUX+ connected to SBU2 and AUX- to SBU1.
> > * OCP96011 block diagram shows AUX+ connected to SBU1 and AUX- to SBU2.
> > 
> > So if the OCP96011 is used as a drop-in for the FSA4480, the orientation
> > handling in the driver needs to be reversed to match the expectation of
> > the OCP96011 hardware.
> > 
> > Support parsing the data-lanes parameter in the endpoint node to swap
> > this in the driver.
> > 
> > The parse_data_lanes_mapping function is mostly taken from nb7vpq904m.c.
> > 
> > Signed-off-by: Luca Weiss 
> > ---
> >  drivers/usb/typec/mux/fsa4480.c | 81 +
> >  1 file changed, 81 insertions(+)
> > 
> > diff --git a/drivers/usb/typec/mux/fsa4480.c 
> > b/drivers/usb/typec/mux/fsa4480.c
> > index e0ee1f621abb..6ee467c96fb6 100644
> > --- a/drivers/usb/typec/mux/fsa4480.c
> > +++ b/drivers/usb/typec/mux/fsa4480.c
> > @@ -9,6 +9,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
>
> If you don't mind, let's keep this driver ready for ACPI, just in
> case...

I'm quite clueless about any details about ACPI but makes sense to use
the generic APIs.

>
> >  #include 
> >  #include 
> >  #include 
> > @@ -60,6 +61,7 @@ struct fsa4480 {
> > unsigned int svid;
> >  
> > u8 cur_enable;
> > +   bool swap_sbu_lanes;
> >  };
> >  
> >  static const struct regmap_config fsa4480_regmap_config = {
> > @@ -76,6 +78,9 @@ static int fsa4480_set(struct fsa4480 *fsa)
> > u8 enable = FSA4480_ENABLE_DEVICE;
> > u8 sel = 0;
> >  
> > +   if (fsa->swap_sbu_lanes)
> > +   reverse = !reverse;
> > +
> > /* USB Mode */
> > if (fsa->mode < TYPEC_STATE_MODAL ||
> > (!fsa->svid && (fsa->mode == TYPEC_MODE_USB2 ||
> > @@ -179,12 +184,84 @@ static int fsa4480_mux_set(struct typec_mux_dev *mux, struct typec_mux_state *st
> > return ret;
> >  }
> >  
> > +enum {
> > +   NORMAL_LANE_MAPPING,
> > +   INVERT_LANE_MAPPING,
> > +};
> > +
> > +#define DATA_LANES_COUNT   2
> > +
> > +static const int supported_data_lane_mapping[][DATA_LANES_COUNT] = {
> > +   [NORMAL_LANE_MAPPING] = { 0, 1 },
> > +   [INVERT_LANE_MAPPING] = { 1, 0 },
> > +};
> > +
> > +static int fsa4480_parse_data_lanes_mapping(struct fsa4480 *fsa)
> > +{
> > +   struct device_node *ep;
>
> struct fwnode_handle *ep;
>
> > +   u32 data_lanes[DATA_LANES_COUNT];
> > +   int ret, i, j;
> > +
> > +   ep = of_graph_get_next_endpoint(fsa->client->dev.of_node, NULL);
>
> Shouldn't you loop through the endpoints? In any case:
>
> ep = fwnode_graph_get_next_endpoint(dev_fwnode(&fsa->client->dev), NULL);

The docs only mention one endpoint so I'm assuming just next_endpoint is
fine?

>
> > +   if (!ep)
> > +   return 0;
> > +
> > +   ret = of_property_count_u32_elems(ep, "data-lanes");
>
> ret = fwnode_property_count_u32(ep, "data-lanes");
>
> But is this necessary at all in this case - why not just read the
> array since you expect a fixed size for it (if the read fails it fails)?

Hm yeah that should be okay.. Will check the docs for
of_property_read_u32_array (or well fwnode_property_read_u32_array) to
see if there are any gotchas if fewer or more elements are provided.

Regards
Luca

>
> > +   if (ret == -EINVAL)
> > +   /* Property isn't here, consider default mapping */
> > +   goto out_done;
> > +   if (ret < 0)
> > +   goto out_error;
> > +
> > +   if (ret != DATA_LANES_COUNT) {
> > +   dev_err(&fsa->client->dev, "expected 2 data lanes\n");
> > +   ret = -EINVAL;
> > +   goto out_error;
> > +   }
> > +
> > +   ret = of_property_read_u32_array(ep, "data-lanes", data_lanes, DATA_LANES_COUNT);
>
> ret = fwnode_property_read_u32_array(ep, "data-lanes", data_lanes, DATA_LANES_COUNT);
>
> > +   if (ret

Re: [PATCH 2/3] usb: typec: fsa4480: Add support to swap SBU orientation

2023-10-17 Thread Heikki Krogerus
Hi Luca,

On Fri, Oct 13, 2023 at 01:38:06PM +0200, Luca Weiss wrote:
> On some hardware designs the AUX+/- lanes are connected reversed to
> SBU1/2 compared to the design the FSA4480 expects.
> 
> To complicate matters, the otherwise compatible Orient-Chip OCP96011
> expects the lanes to be connected reversed compared to the FSA4480:
> 
> * FSA4480 block diagram shows AUX+ connected to SBU2 and AUX- to SBU1.
> * OCP96011 block diagram shows AUX+ connected to SBU1 and AUX- to SBU2.
> 
> So if the OCP96011 is used as a drop-in for the FSA4480, the orientation
> handling in the driver needs to be reversed to match the expectation of
> the OCP96011 hardware.
> 
> Support parsing the data-lanes parameter in the endpoint node to swap
> this in the driver.
> 
> The parse_data_lanes_mapping function is mostly taken from nb7vpq904m.c.
> 
> Signed-off-by: Luca Weiss 
> ---
>  drivers/usb/typec/mux/fsa4480.c | 81 +
>  1 file changed, 81 insertions(+)
> 
> diff --git a/drivers/usb/typec/mux/fsa4480.c b/drivers/usb/typec/mux/fsa4480.c
> index e0ee1f621abb..6ee467c96fb6 100644
> --- a/drivers/usb/typec/mux/fsa4480.c
> +++ b/drivers/usb/typec/mux/fsa4480.c
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include 

If you don't mind, let's keep this driver ready for ACPI, just in
case...

>  #include 
>  #include 
>  #include 
> @@ -60,6 +61,7 @@ struct fsa4480 {
>   unsigned int svid;
>  
>   u8 cur_enable;
> + bool swap_sbu_lanes;
>  };
>  
>  static const struct regmap_config fsa4480_regmap_config = {
> @@ -76,6 +78,9 @@ static int fsa4480_set(struct fsa4480 *fsa)
>   u8 enable = FSA4480_ENABLE_DEVICE;
>   u8 sel = 0;
>  
> + if (fsa->swap_sbu_lanes)
> + reverse = !reverse;
> +
>   /* USB Mode */
>   if (fsa->mode < TYPEC_STATE_MODAL ||
>   (!fsa->svid && (fsa->mode == TYPEC_MODE_USB2 ||
> @@ -179,12 +184,84 @@ static int fsa4480_mux_set(struct typec_mux_dev *mux, struct typec_mux_state *st
>   return ret;
>  }
>  
> +enum {
> + NORMAL_LANE_MAPPING,
> + INVERT_LANE_MAPPING,
> +};
> +
> +#define DATA_LANES_COUNT 2
> +
> +static const int supported_data_lane_mapping[][DATA_LANES_COUNT] = {
> + [NORMAL_LANE_MAPPING] = { 0, 1 },
> + [INVERT_LANE_MAPPING] = { 1, 0 },
> +};
> +
> +static int fsa4480_parse_data_lanes_mapping(struct fsa4480 *fsa)
> +{
> + struct device_node *ep;

struct fwnode_handle *ep;

> + u32 data_lanes[DATA_LANES_COUNT];
> + int ret, i, j;
> +
> + ep = of_graph_get_next_endpoint(fsa->client->dev.of_node, NULL);

Shouldn't you loop through the endpoints? In any case:

ep = fwnode_graph_get_next_endpoint(dev_fwnode(&fsa->client->dev), NULL);

> + if (!ep)
> + return 0;
> +
> + ret = of_property_count_u32_elems(ep, "data-lanes");

ret = fwnode_property_count_u32(ep, "data-lanes");

But is this necessary at all in this case - why not just read the
array since you expect a fixed size for it (if the read fails it fails)?
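
A sketch of the simplification suggested here (illustrative only: a missing
property reads back as -EINVAL, which falls through to the default mapping):

	u32 data_lanes[DATA_LANES_COUNT];
	int ret;

	ret = fwnode_property_read_u32_array(ep, "data-lanes", data_lanes,
					     DATA_LANES_COUNT);
	if (ret == -EINVAL)
		ret = 0;	/* property absent: keep the default mapping */
	else if (ret)
		dev_err(&fsa->client->dev, "invalid data-lanes property: %d\n", ret);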

> + if (ret == -EINVAL)
> + /* Property isn't here, consider default mapping */
> + goto out_done;
> + if (ret < 0)
> + goto out_error;
> +
> + if (ret != DATA_LANES_COUNT) {
> + dev_err(&fsa->client->dev, "expected 2 data lanes\n");
> + ret = -EINVAL;
> + goto out_error;
> + }
> +
> + ret = of_property_read_u32_array(ep, "data-lanes", data_lanes, DATA_LANES_COUNT);

ret = fwnode_property_read_u32_array(ep, "data-lanes", data_lanes, DATA_LANES_COUNT);

> + if (ret)
> + goto out_error;
> +
> + for (i = 0; i < ARRAY_SIZE(supported_data_lane_mapping); i++) {
> + for (j = 0; j < DATA_LANES_COUNT; j++) {
> + if (data_lanes[j] != supported_data_lane_mapping[i][j])
> + break;
> + }
> +
> + if (j == DATA_LANES_COUNT)
> + break;
> + }
> +
> + switch (i) {
> + case NORMAL_LANE_MAPPING:
> + break;
> + case INVERT_LANE_MAPPING:
> + fsa->swap_sbu_lanes = true;
> + dev_info(&fsa->client->dev, "using inverted data lanes mapping\n");

That is just noise. Please drop it.

> + break;
> + default:
> + dev_err(&fsa->client->dev, "invalid data lanes mapping\n");
> + ret = -EINVAL;
> + goto out_error;
> + }
> +
> +out_do

Re: [PATCH 2/3] usb: typec: fsa4480: Add support to swap SBU orientation

2023-10-13 Thread Neil Armstrong

On 13/10/2023 13:38, Luca Weiss wrote:

On some hardware designs the AUX+/- lanes are connected reversed to
SBU1/2 compared to the design the FSA4480 expects.

To complicate matters, the otherwise compatible Orient-Chip OCP96011
expects the lanes to be connected reversed compared to the FSA4480:

* FSA4480 block diagram shows AUX+ connected to SBU2 and AUX- to SBU1.
* OCP96011 block diagram shows AUX+ connected to SBU1 and AUX- to SBU2.

So if the OCP96011 is used as a drop-in for the FSA4480, the orientation
handling in the driver needs to be reversed to match the expectation of
the OCP96011 hardware.

Support parsing the data-lanes parameter in the endpoint node to swap
this in the driver.

The parse_data_lanes_mapping function is mostly taken from nb7vpq904m.c.


I see the inspiration :-)



Signed-off-by: Luca Weiss 
---
  drivers/usb/typec/mux/fsa4480.c | 81 +
  1 file changed, 81 insertions(+)

diff --git a/drivers/usb/typec/mux/fsa4480.c b/drivers/usb/typec/mux/fsa4480.c
index e0ee1f621abb..6ee467c96fb6 100644
--- a/drivers/usb/typec/mux/fsa4480.c
+++ b/drivers/usb/typec/mux/fsa4480.c
@@ -9,6 +9,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -60,6 +61,7 @@ struct fsa4480 {
unsigned int svid;
  
  	u8 cur_enable;

+   bool swap_sbu_lanes;
  };
  
  static const struct regmap_config fsa4480_regmap_config = {

@@ -76,6 +78,9 @@ static int fsa4480_set(struct fsa4480 *fsa)
u8 enable = FSA4480_ENABLE_DEVICE;
u8 sel = 0;
  
+	if (fsa->swap_sbu_lanes)

+   reverse = !reverse;
+
/* USB Mode */
if (fsa->mode < TYPEC_STATE_MODAL ||
(!fsa->svid && (fsa->mode == TYPEC_MODE_USB2 ||
@@ -179,12 +184,84 @@ static int fsa4480_mux_set(struct typec_mux_dev *mux, struct typec_mux_state *st
return ret;
  }
  
+enum {

+   NORMAL_LANE_MAPPING,
+   INVERT_LANE_MAPPING,
+};
+
+#define DATA_LANES_COUNT   2
+
+static const int supported_data_lane_mapping[][DATA_LANES_COUNT] = {
+   [NORMAL_LANE_MAPPING] = { 0, 1 },
+   [INVERT_LANE_MAPPING] = { 1, 0 },
+};
+
+static int fsa4480_parse_data_lanes_mapping(struct fsa4480 *fsa)
+{
+   struct device_node *ep;
+   u32 data_lanes[DATA_LANES_COUNT];
+   int ret, i, j;
+
+   ep = of_graph_get_next_endpoint(fsa->client->dev.of_node, NULL);
+   if (!ep)
+   return 0;
+
+   ret = of_property_count_u32_elems(ep, "data-lanes");
+   if (ret == -EINVAL)
+   /* Property isn't here, consider default mapping */
+   goto out_done;
+   if (ret < 0)
+   goto out_error;
+
+   if (ret != DATA_LANES_COUNT) {
+   dev_err(&fsa->client->dev, "expected 2 data lanes\n");
+   ret = -EINVAL;
+   goto out_error;
+   }
+
+   ret = of_property_read_u32_array(ep, "data-lanes", data_lanes, DATA_LANES_COUNT);
+   if (ret)
+   goto out_error;
+
+   for (i = 0; i < ARRAY_SIZE(supported_data_lane_mapping); i++) {
+   for (j = 0; j < DATA_LANES_COUNT; j++) {
+   if (data_lanes[j] != supported_data_lane_mapping[i][j])
+   break;
+   }
+
+   if (j == DATA_LANES_COUNT)
+   break;
+   }
+
+   switch (i) {
+   case NORMAL_LANE_MAPPING:
+   break;
+   case INVERT_LANE_MAPPING:
+   fsa->swap_sbu_lanes = true;
+   dev_info(&fsa->client->dev, "using inverted data lanes mapping\n");
+   break;
+   default:
+   dev_err(&fsa->client->dev, "invalid data lanes mapping\n");
+   ret = -EINVAL;
+   goto out_error;
+   }
+
+out_done:
+   ret = 0;
+
+out_error:
+   of_node_put(ep);
+
+   return ret;
+}


It's probably slightly over-engineered for 2 lanes, and at some point
this could go into a helper somewhere because we will need it for all
SBU muxes and redrivers/retimers.


+
  static int fsa4480_probe(struct i2c_client *client)
  {
	struct device *dev = &client->dev;
struct typec_switch_desc sw_desc = { };
struct typec_mux_desc mux_desc = { };
struct fsa4480 *fsa;
+   int ret;
  
  	fsa = devm_kzalloc(dev, sizeof(*fsa), GFP_KERNEL);

if (!fsa)
@@ -193,6 +270,10 @@ static int fsa4480_probe(struct i2c_client *client)
fsa->client = client;
	mutex_init(&fsa->lock);
  
+	ret = fsa4480_parse_data_lanes_mapping(fsa);

+   if (ret)
+   return ret;
+
	fsa->regmap = devm_regmap_init_i2c(client, &fsa4480_regmap_config);
if (IS_ERR(fsa->regmap))
	return dev_err_probe(dev, PTR_ERR(fsa->regmap), "failed to initialize regmap\n");



But since I did the same in nb7vpq904m, and the SBU can be inverted, LGTM

Reviewed-by: Neil Armstrong 

Neil


[PATCH 2/3] usb: typec: fsa4480: Add support to swap SBU orientation

2023-10-13 Thread Luca Weiss
On some hardware designs the AUX+/- lanes are connected reversed to
SBU1/2 compared to the design the FSA4480 expects.

To complicate matters, the otherwise compatible Orient-Chip OCP96011
expects the lanes to be connected reversed compared to the FSA4480:

* FSA4480 block diagram shows AUX+ connected to SBU2 and AUX- to SBU1.
* OCP96011 block diagram shows AUX+ connected to SBU1 and AUX- to SBU2.

So if the OCP96011 is used as a drop-in for the FSA4480, the orientation
handling in the driver needs to be reversed to match the expectation of
the OCP96011 hardware.

Support parsing the data-lanes parameter in the endpoint node to swap
this in the driver.

The parse_data_lanes_mapping function is mostly taken from nb7vpq904m.c.

Signed-off-by: Luca Weiss 
---
 drivers/usb/typec/mux/fsa4480.c | 81 +
 1 file changed, 81 insertions(+)

diff --git a/drivers/usb/typec/mux/fsa4480.c b/drivers/usb/typec/mux/fsa4480.c
index e0ee1f621abb..6ee467c96fb6 100644
--- a/drivers/usb/typec/mux/fsa4480.c
+++ b/drivers/usb/typec/mux/fsa4480.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -60,6 +61,7 @@ struct fsa4480 {
unsigned int svid;
 
u8 cur_enable;
+   bool swap_sbu_lanes;
 };
 
 static const struct regmap_config fsa4480_regmap_config = {
@@ -76,6 +78,9 @@ static int fsa4480_set(struct fsa4480 *fsa)
u8 enable = FSA4480_ENABLE_DEVICE;
u8 sel = 0;
 
+   if (fsa->swap_sbu_lanes)
+   reverse = !reverse;
+
/* USB Mode */
if (fsa->mode < TYPEC_STATE_MODAL ||
(!fsa->svid && (fsa->mode == TYPEC_MODE_USB2 ||
@@ -179,12 +184,84 @@ static int fsa4480_mux_set(struct typec_mux_dev *mux, struct typec_mux_state *st
return ret;
 }
 
+enum {
+   NORMAL_LANE_MAPPING,
+   INVERT_LANE_MAPPING,
+};
+
+#define DATA_LANES_COUNT   2
+
+static const int supported_data_lane_mapping[][DATA_LANES_COUNT] = {
+   [NORMAL_LANE_MAPPING] = { 0, 1 },
+   [INVERT_LANE_MAPPING] = { 1, 0 },
+};
+
+static int fsa4480_parse_data_lanes_mapping(struct fsa4480 *fsa)
+{
+   struct device_node *ep;
+   u32 data_lanes[DATA_LANES_COUNT];
+   int ret, i, j;
+
+   ep = of_graph_get_next_endpoint(fsa->client->dev.of_node, NULL);
+   if (!ep)
+   return 0;
+
+   ret = of_property_count_u32_elems(ep, "data-lanes");
+   if (ret == -EINVAL)
+   /* Property isn't here, consider default mapping */
+   goto out_done;
+   if (ret < 0)
+   goto out_error;
+
+   if (ret != DATA_LANES_COUNT) {
+   dev_err(&fsa->client->dev, "expected 2 data lanes\n");
+   ret = -EINVAL;
+   goto out_error;
+   }
+
+   ret = of_property_read_u32_array(ep, "data-lanes", data_lanes, DATA_LANES_COUNT);
+   if (ret)
+   goto out_error;
+
+   for (i = 0; i < ARRAY_SIZE(supported_data_lane_mapping); i++) {
+   for (j = 0; j < DATA_LANES_COUNT; j++) {
+   if (data_lanes[j] != supported_data_lane_mapping[i][j])
+   break;
+   }
+
+   if (j == DATA_LANES_COUNT)
+   break;
+   }
+
+   switch (i) {
+   case NORMAL_LANE_MAPPING:
+   break;
+   case INVERT_LANE_MAPPING:
+   fsa->swap_sbu_lanes = true;
+   dev_info(&fsa->client->dev, "using inverted data lanes mapping\n");
+   break;
+   default:
+   dev_err(&fsa->client->dev, "invalid data lanes mapping\n");
+   ret = -EINVAL;
+   goto out_error;
+   }
+
+out_done:
+   ret = 0;
+
+out_error:
+   of_node_put(ep);
+
+   return ret;
+}
+
 static int fsa4480_probe(struct i2c_client *client)
 {
	struct device *dev = &client->dev;
struct typec_switch_desc sw_desc = { };
struct typec_mux_desc mux_desc = { };
struct fsa4480 *fsa;
+   int ret;
 
fsa = devm_kzalloc(dev, sizeof(*fsa), GFP_KERNEL);
if (!fsa)
@@ -193,6 +270,10 @@ static int fsa4480_probe(struct i2c_client *client)
fsa->client = client;
mutex_init(&fsa->lock);
 
+   ret = fsa4480_parse_data_lanes_mapping(fsa);
+   if (ret)
+   return ret;
+
fsa->regmap = devm_regmap_init_i2c(client, &fsa4480_regmap_config);
if (IS_ERR(fsa->regmap))
return dev_err_probe(dev, PTR_ERR(fsa->regmap), "failed to initialize regmap\n");

-- 
2.42.0



Re: [PATCH v3 3/4] mm/swap: remove confusing checking for non_swap_entry() in swap_ra_info()

2021-04-20 Thread Huang, Ying
Miaohe Lin  writes:

> The non_swap_entry() was used for working with VMA based swap readahead
> via commit ec560175c0b6 ("mm, swap: VMA based swap readahead").

At that time, the non_swap_entry() check was necessary because the
function was called before that check was done in do_swap_page().

> Then it's
> moved to swap_ra_info() since commit eaf649ebc3ac ("mm: swap: clean up swap
> readahead").

After that, the non_swap_entry() check became unnecessary, because
swap_ra_info() is called only after non_swap_entry() has already been
checked.  The leftover check makes the resulting code confusing.

> But this makes the code confusing. The non_swap_entry() check
> looks racy because while we released the pte lock, somebody else might have
> faulted in this pte. So we should check whether it's swap pte first to
> guard against such race or swap_type will be unexpected.

The race isn't important because it will not cause any problem: the racy
PTE value is only used to size the readahead window, and every candidate
PTE is re-checked before its swap entry is actually used.

Best Regards,
Huang, Ying

> But the swap_entry
> isn't used in this function and we will have enough checking when we really
> operate the PTE entries later. So checking for non_swap_entry() is not
> really needed here and should be removed to avoid confusion.
>
> Signed-off-by: Miaohe Lin 
> ---
>  mm/swap_state.c | 6 --
>  1 file changed, 6 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 272ea2108c9d..df5405384520 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -721,7 +721,6 @@ static void swap_ra_info(struct vm_fault *vmf,
>  {
>   struct vm_area_struct *vma = vmf->vma;
>   unsigned long ra_val;
> - swp_entry_t entry;
>   unsigned long faddr, pfn, fpfn;
>   unsigned long start, end;
>   pte_t *pte, *orig_pte;
> @@ -739,11 +738,6 @@ static void swap_ra_info(struct vm_fault *vmf,
>  
>   faddr = vmf->address;
>   orig_pte = pte = pte_offset_map(vmf->pmd, faddr);
> - entry = pte_to_swp_entry(*pte);
> - if ((unlikely(non_swap_entry(entry)))) {
> - pte_unmap(orig_pte);
> - return;
> - }
>  
>   fpfn = PFN_DOWN(faddr);
>   ra_val = GET_SWAP_RA_VAL(vma);


Re: [PATCH v3 2/4] swap: fix do_swap_page() race with swapoff

2021-04-20 Thread Huang, Ying
Miaohe Lin  writes:

> When I was investigating the swap code, I found the below possible race
> window:
>
> CPU 1                                         CPU 2
> -----                                         -----
> do_swap_page
>   if (data_race(si->flags & SWP_SYNCHRONOUS_IO)
>   swap_readpage
>     if (data_race(sis->flags & SWP_FS_OPS)) {
>                                               swapoff
>                                                 p->flags &= ~SWP_VALID;
>                                                 ..
>                                                 synchronize_rcu();
>                                                 ..

You have deleted the SWP_VALID flag and the RCU solution in 1/4, so please
revise this diagram accordingly.

>                                                 p->swap_file = NULL;
>     struct file *swap_file = sis->swap_file;
>     struct address_space *mapping = swap_file->f_mapping;[oops!]
>
> Note that for the pages that are swapped in through swap cache, this isn't
> an issue. Because the page is locked, and the swap entry will be marked
> with SWAP_HAS_CACHE, so swapoff() can not proceed until the page has been
> unlocked.
>
> Using the current get/put_swap_device() to guard against concurrent swapoff
> for swap_readpage() looks terrible, because swap_readpage() may take a
> really long time. And this race may not be really pernicious, because
> swapoff is usually done only at system shutdown. To reduce the performance
> overhead on the hot path as much as possible, it appears we can use a
> percpu_ref to close this race window (as suggested by Huang, Ying).

This needs to be revised too.  Unless you squash 1/4 and 2/4.

> Fixes: 0bcac06f27d7 ("mm,swap: skip swapcache for swapin of synchronous device")
> Reported-by: kernel test robot  (auto build test ERROR)
> Signed-off-by: Miaohe Lin 
> ---
>  include/linux/swap.h | 9 +
>  mm/memory.c  | 9 +
>  2 files changed, 18 insertions(+)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index c9e7fea10b83..46d51d058d05 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -527,6 +527,15 @@ static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
>   return NULL;
>  }
>  
> +static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
> +{
> + return NULL;
> +}
> +
> +static inline void put_swap_device(struct swap_info_struct *si)
> +{
> +}
> +
>  #define swap_address_space(entry)(NULL)
>  #define get_nr_swap_pages()  0L
>  #define total_swap_pages 0L
> diff --git a/mm/memory.c b/mm/memory.c
> index 27014c3bde9f..7a2fe12cf641 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  {
>   struct vm_area_struct *vma = vmf->vma;
>   struct page *page = NULL, *swapcache;
> + struct swap_info_struct *si = NULL;
>   swp_entry_t entry;
>   pte_t pte;
>   int locked;
> @@ -3338,6 +3339,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   goto out;
>   }
>  
> + /* Prevent swapoff from happening to us. */
> + si = get_swap_device(entry);

There's

struct swap_info_struct *si = swp_swap_info(entry);

in do_swap_page(), you can remove that.
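
In other words, something like this on top of your patch (illustrative
only, untested):

	-	struct swap_info_struct *si = swp_swap_info(entry);

with the SWP_SYNCHRONOUS_IO test then reusing the si returned by
get_swap_device() above.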

Best Regards,
Huang, Ying

> + if (unlikely(!si))
> + goto out;
>  
>   delayacct_set_flag(current, DELAYACCT_PF_SWAPIN);
>   page = lookup_swap_cache(entry, vma, vmf->address);
> @@ -3514,6 +3519,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  unlock:
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
>  out:
> + if (si)
> + put_swap_device(si);
>   return ret;
>  out_nomap:
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -3525,6 +3532,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   unlock_page(swapcache);
>   put_page(swapcache);
>   }
> + if (si)
> + put_swap_device(si);
>   return ret;
>  }


[PATCH v3 0/4] close various race windows for swap

2021-04-20 Thread Miaohe Lin
Hi all,
When I was investigating the swap code, I found some possible race
windows. This series aims to fix all of these races. But using the
current get/put_swap_device() to guard against concurrent swapoff for
swap_readpage() looks terrible, because swap_readpage() may take a
really long time. To reduce the performance overhead on the hot path as
much as possible, it appears we can use a percpu_ref to close this race
window (as suggested by Huang, Ying). Patch 1 adds percpu_ref support
for swap, and most of the remaining patches use it to close various race
windows. More details can be found in the respective changelogs. Thanks!
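
For reference, the percpu_ref based guard pair that this series builds
towards could look roughly like this (a simplified sketch, not the exact
merged code):

	static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
	{
		struct swap_info_struct *si = swp_swap_info(entry);

		/* Fails if the entry is bogus or the device is being swapped off. */
		if (!si || !percpu_ref_tryget_live(&si->users))
			return NULL;
		return si;
	}

	static inline void put_swap_device(struct swap_info_struct *si)
	{
		percpu_ref_put(&si->users);
	}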

v2->v3:
  some commit log and comment enhancements per Huang, Ying
  remove ref_initialized field
  squash PATCH 1-2

v1->v2:
  reorganize the patch-2/5
  various enhancements and fixups per Huang, Ying
  Many thanks for the comments of Huang, Ying, Dennis Zhou and Tim Chen.

Miaohe Lin (4):
  mm/swapfile: use percpu_ref to serialize against concurrent swapoff
  swap: fix do_swap_page() race with swapoff
  mm/swap: remove confusing checking for non_swap_entry() in
swap_ra_info()
  mm/shmem: fix shmem_swapin() race with swapoff

 include/linux/swap.h | 14 ++--
 mm/memory.c  |  9 +
 mm/shmem.c   |  6 
 mm/swap_state.c  |  6 
 mm/swapfile.c| 79 +++-
 5 files changed, 76 insertions(+), 38 deletions(-)

-- 
2.19.1



[PATCH v3 2/4] swap: fix do_swap_page() race with swapoff

2021-04-20 Thread Miaohe Lin
When I was investigating the swap code, I found the below possible race
window:

CPU 1                                           CPU 2
-----                                           -----
do_swap_page
  if (data_race(si->flags & SWP_SYNCHRONOUS_IO)
  swap_readpage
    if (data_race(sis->flags & SWP_FS_OPS)) {
                                                swapoff
                                                  p->flags &= ~SWP_VALID;
                                                  ..
                                                  synchronize_rcu();
                                                  ..
                                                  p->swap_file = NULL;
    struct file *swap_file = sis->swap_file;
    struct address_space *mapping = swap_file->f_mapping;[oops!]

Note that for pages that are swapped in through the swap cache, this isn't
an issue: the page is locked, and the swap entry will be marked with
SWAP_HAS_CACHE, so swapoff() cannot proceed until the page has been
unlocked.

Using the current get/put_swap_device() to guard against concurrent swapoff
for swap_readpage() looks terrible, because swap_readpage() may take a
really long time. And this race may not be really pernicious, because
swapoff is usually done only at system shutdown. To reduce the performance
overhead on the hot path as much as possible, it appears we can use a
percpu_ref to close this race window (as suggested by Huang, Ying).

Fixes: 0bcac06f27d7 ("mm,swap: skip swapcache for swapin of synchronous device")
Reported-by: kernel test robot  (auto build test ERROR)
Signed-off-by: Miaohe Lin 
---
 include/linux/swap.h | 9 +
 mm/memory.c  | 9 +
 2 files changed, 18 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index c9e7fea10b83..46d51d058d05 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -527,6 +527,15 @@ static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
return NULL;
 }
 
+static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
+{
+   return NULL;
+}
+
+static inline void put_swap_device(struct swap_info_struct *si)
+{
+}
+
 #define swap_address_space(entry)  (NULL)
 #define get_nr_swap_pages()0L
 #define total_swap_pages   0L
diff --git a/mm/memory.c b/mm/memory.c
index 27014c3bde9f..7a2fe12cf641 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
struct page *page = NULL, *swapcache;
+   struct swap_info_struct *si = NULL;
swp_entry_t entry;
pte_t pte;
int locked;
@@ -3338,6 +3339,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out;
}
 
+   /* Prevent swapoff from happening to us. */
+   si = get_swap_device(entry);
+   if (unlikely(!si))
+   goto out;
 
delayacct_set_flag(current, DELAYACCT_PF_SWAPIN);
page = lookup_swap_cache(entry, vma, vmf->address);
@@ -3514,6 +3519,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
 out:
+   if (si)
+   put_swap_device(si);
return ret;
 out_nomap:
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -3525,6 +3532,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
unlock_page(swapcache);
put_page(swapcache);
}
+   if (si)
+   put_swap_device(si);
return ret;
 }
 
-- 
2.19.1



[PATCH v3 3/4] mm/swap: remove confusing checking for non_swap_entry() in swap_ra_info()

2021-04-20 Thread Miaohe Lin
The non_swap_entry() check was introduced for VMA based swap readahead
via commit ec560175c0b6 ("mm, swap: VMA based swap readahead"), and was
then moved to swap_ra_info() by commit eaf649ebc3ac ("mm: swap: clean up
swap readahead"). But this makes the code confusing. The non_swap_entry()
check looks racy, because while we released the pte lock somebody else
might have faulted in this pte, so we would have to check whether it's a
swap pte first to guard against such a race, or the swap type would be
unexpected. But the swap entry isn't used in this function, and we will
have enough checking when we really operate on the PTE entries later. So
the non_swap_entry() check is not really needed here and should be
removed to avoid confusion.
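
For context, the values read here are only used to compute the readahead
window; each candidate PTE is re-validated before its swap entry is used,
e.g. in swap_vma_readahead(), roughly like this (a simplified sketch of
the existing loop in mm/swap_state.c):

	for (i = 0, pte = ra_info.ptes; i < ra_info.nr_pte; i++, pte++) {
		pentry = *pte;
		if (pte_none(pentry))
			continue;
		if (pte_present(pentry))
			continue;
		entry = pte_to_swp_entry(pentry);
		if (unlikely(non_swap_entry(entry)))
			continue;
		/* only now is the swap entry actually used */
		...
	}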

Signed-off-by: Miaohe Lin 
---
 mm/swap_state.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 272ea2108c9d..df5405384520 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -721,7 +721,6 @@ static void swap_ra_info(struct vm_fault *vmf,
 {
struct vm_area_struct *vma = vmf->vma;
unsigned long ra_val;
-   swp_entry_t entry;
unsigned long faddr, pfn, fpfn;
unsigned long start, end;
pte_t *pte, *orig_pte;
@@ -739,11 +738,6 @@ static void swap_ra_info(struct vm_fault *vmf,
 
faddr = vmf->address;
orig_pte = pte = pte_offset_map(vmf->pmd, faddr);
-   entry = pte_to_swp_entry(*pte);
-   if ((unlikely(non_swap_entry(entry)))) {
-   pte_unmap(orig_pte);
-   return;
-   }
 
fpfn = PFN_DOWN(faddr);
ra_val = GET_SWAP_RA_VAL(vma);
-- 
2.19.1



Re: [PATCH v2 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-19 Thread Miaohe Lin
On 2021/4/19 15:52, Huang, Ying wrote:
> Miaohe Lin  writes:
> 
>> On 2021/4/19 15:09, Huang, Ying wrote:
>>> Miaohe Lin  writes:
>>>
>>>> On 2021/4/19 10:48, Huang, Ying wrote:
>>>>> Miaohe Lin  writes:
>>>>>
>>>>>> We will use percpu-refcount to serialize against concurrent swapoff. This
>>>>>> patch adds the percpu_ref support for swap.
>>>>>>
>>>>>> Signed-off-by: Miaohe Lin 
>>>>>> ---
>>>>>>  include/linux/swap.h |  3 +++
>>>>>>  mm/swapfile.c| 33 +
>>>>>>  2 files changed, 32 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>>> index 144727041e78..8be36eb58b7a 100644
>>>>>> --- a/include/linux/swap.h
>>>>>> +++ b/include/linux/swap.h
>>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>>>>>   * The in-memory structure used to track swap areas.
>>>>>>   */
>>>>>>  struct swap_info_struct {
>>>>>> +struct percpu_ref users;/* serialization against 
>>>>>> concurrent swapoff */
>>>>>
>>>>> The comments aren't general enough.  We use this to check whether the
>>>>> swap device has been fully initialized, etc. May be something as below?
>>>>>
>>>>> /* indicate and keep swap device valid */
>>>>
>>>> Looks good.
>>>>
>>>>>
>>>>>>  unsigned long   flags;  /* SWP_USED etc: see above */
>>>>>>  signed shortprio;   /* swap priority of this type */
>>>>>>  struct plist_node list; /* entry in swap_active_head */
>>>>>> @@ -260,6 +261,8 @@ struct swap_info_struct {
>>>>>>  struct block_device *bdev;  /* swap device or bdev of swap 
>>>>>> file */
>>>>>>  struct file *swap_file; /* seldom referenced */
>>>>>>  unsigned int old_block_size;/* seldom referenced */
>>>>>> +bool ref_initialized;   /* seldom referenced */
>>>>>> +struct completion comp; /* seldom referenced */
>>>>>>  #ifdef CONFIG_FRONTSWAP
>>>>>>  unsigned long *frontswap_map;   /* frontswap in-use, one bit 
>>>>>> per page */
>>>>>>  atomic_t frontswap_pages;   /* frontswap pages in-use 
>>>>>> counter */
>>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>>>> index 149e77454e3c..66515a3a2824 100644
>>>>>> --- a/mm/swapfile.c
>>>>>> +++ b/mm/swapfile.c
>>>>>> @@ -39,6 +39,7 @@
>>>>>>  #include 
>>>>>>  #include 
>>>>>>  #include 
>>>>>> +#include 
>>>>>>  
>>>>>>  #include 
>>>>>>  #include 
>>>>>> @@ -511,6 +512,14 @@ static void swap_discard_work(struct work_struct 
>>>>>> *work)
>>>>>>  spin_unlock(&si->lock);
>>>>>>  }
>>>>>>  
>>>>>> +static void swap_users_ref_free(struct percpu_ref *ref)
>>>>>> +{
>>>>>> +struct swap_info_struct *si;
>>>>>> +
>>>>>> +si = container_of(ref, struct swap_info_struct, users);
>>>>>> +complete(&si->comp);
>>>>>> +}
>>>>>> +
>>>>>>  static void alloc_cluster(struct swap_info_struct *si, unsigned long 
>>>>>> idx)
>>>>>>  {
>>>>>>  struct swap_cluster_info *ci = si->cluster_info;
>>>>>> @@ -2500,7 +2509,7 @@ static void enable_swap_info(struct 
>>>>>> swap_info_struct *p, int prio,
>>>>>>   * Guarantee swap_map, cluster_info, etc. fields are valid
>>>>>>   * between get/put_swap_device() if SWP_VALID bit is set
>>>>>>   */
>>>>>> -synchronize_rcu();
>>>>>
>>>>> You cannot remove this without changing get/put_swap_device().  It's
>>>>> better to squash at least PATCH 1-2.
>>>>
>>>> Will squash PATCH 1-2. Thanks.

Re: [PATCH v2 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-19 Thread Huang, Ying
Miaohe Lin  writes:

> On 2021/4/19 15:09, Huang, Ying wrote:
>> Miaohe Lin  writes:
>> 
>>> On 2021/4/19 10:48, Huang, Ying wrote:
>>>> Miaohe Lin  writes:
>>>>
>>>>> We will use percpu-refcount to serialize against concurrent swapoff. This
>>>>> patch adds the percpu_ref support for swap.
>>>>>
>>>>> Signed-off-by: Miaohe Lin 
>>>>> ---
>>>>>  include/linux/swap.h |  3 +++
>>>>>  mm/swapfile.c| 33 +
>>>>>  2 files changed, 32 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>> index 144727041e78..8be36eb58b7a 100644
>>>>> --- a/include/linux/swap.h
>>>>> +++ b/include/linux/swap.h
>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>>>>   * The in-memory structure used to track swap areas.
>>>>>   */
>>>>>  struct swap_info_struct {
>>>>> + struct percpu_ref users;/* serialization against concurrent 
>>>>> swapoff */
>>>>
>>>> The comments aren't general enough.  We use this to check whether the
>>>> swap device has been fully initialized, etc. May be something as below?
>>>>
>>>> /* indicate and keep swap device valid */
>>>
>>> Looks good.
>>>
>>>>
>>>>>   unsigned long   flags;  /* SWP_USED etc: see above */
>>>>>   signed shortprio;   /* swap priority of this type */
>>>>>   struct plist_node list; /* entry in swap_active_head */
>>>>> @@ -260,6 +261,8 @@ struct swap_info_struct {
>>>>>   struct block_device *bdev;  /* swap device or bdev of swap file */
>>>>>   struct file *swap_file; /* seldom referenced */
>>>>>   unsigned int old_block_size;/* seldom referenced */
>>>>> + bool ref_initialized;   /* seldom referenced */
>>>>> + struct completion comp; /* seldom referenced */
>>>>>  #ifdef CONFIG_FRONTSWAP
>>>>>   unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
>>>>>   atomic_t frontswap_pages;   /* frontswap pages in-use counter */
>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>>> index 149e77454e3c..66515a3a2824 100644
>>>>> --- a/mm/swapfile.c
>>>>> +++ b/mm/swapfile.c
>>>>> @@ -39,6 +39,7 @@
>>>>>  #include 
>>>>>  #include 
>>>>>  #include 
>>>>> +#include 
>>>>>  
>>>>>  #include 
>>>>>  #include 
>>>>> @@ -511,6 +512,14 @@ static void swap_discard_work(struct work_struct 
>>>>> *work)
>>>>>   spin_unlock(&si->lock);
>>>>>  }
>>>>>  
>>>>> +static void swap_users_ref_free(struct percpu_ref *ref)
>>>>> +{
>>>>> + struct swap_info_struct *si;
>>>>> +
>>>>> + si = container_of(ref, struct swap_info_struct, users);
>>>>> + complete(&si->comp);
>>>>> +}
>>>>> +
>>>>>  static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
>>>>>  {
>>>>>   struct swap_cluster_info *ci = si->cluster_info;
>>>>> @@ -2500,7 +2509,7 @@ static void enable_swap_info(struct 
>>>>> swap_info_struct *p, int prio,
>>>>>* Guarantee swap_map, cluster_info, etc. fields are valid
>>>>>* between get/put_swap_device() if SWP_VALID bit is set
>>>>>*/
>>>>> - synchronize_rcu();
>>>>
>>>> You cannot remove this without changing get/put_swap_device().  It's
>>>> better to squash at least PATCH 1-2.
>>>
>>> Will squash PATCH 1-2. Thanks.
>>>
>>>>
>>>>> + percpu_ref_resurrect(&p->users);
>>>>>   spin_lock(&swap_lock);
>>>>>   spin_lock(&p->lock);
>>>>>   _enable_swap_info(p);
>>>>> @@ -2621,11 +2630,18 @@ SYSCALL_DEFINE1(swapoff, const char __user *, 
>>>>> specialfile)
>>>>>   p->flags &= ~SWP_VALID; /* mark swap device as invalid */
>>>>>   spin_unlock(&p->lock);
>>>>>   spin_unlock(&swap_lock);
>>>>> +
>>>>> + percpu_ref_kill(&p->users);
>>>>

Re: [PATCH v2 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-19 Thread Miaohe Lin
On 2021/4/19 15:09, Huang, Ying wrote:
> Miaohe Lin  writes:
> 
>> On 2021/4/19 10:48, Huang, Ying wrote:
>>> Miaohe Lin  writes:
>>>
>>>> We will use percpu-refcount to serialize against concurrent swapoff. This
>>>> patch adds the percpu_ref support for swap.
>>>>
>>>> Signed-off-by: Miaohe Lin 
>>>> ---
>>>>  include/linux/swap.h |  3 +++
>>>>  mm/swapfile.c| 33 +
>>>>  2 files changed, 32 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index 144727041e78..8be36eb58b7a 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>>>   * The in-memory structure used to track swap areas.
>>>>   */
>>>>  struct swap_info_struct {
>>>> +  struct percpu_ref users;    /* serialization against concurrent 
>>>> swapoff */
>>>
>>> The comments aren't general enough.  We use this to check whether the
>>> swap device has been fully initialized, etc. May be something as below?
>>>
>>> /* indicate and keep swap device valid */
>>
>> Looks good.
>>
>>>
>>>>unsigned long   flags;  /* SWP_USED etc: see above */
>>>>signed shortprio;   /* swap priority of this type */
>>>>struct plist_node list; /* entry in swap_active_head */
>>>> @@ -260,6 +261,8 @@ struct swap_info_struct {
>>>>struct block_device *bdev;  /* swap device or bdev of swap file */
>>>>struct file *swap_file; /* seldom referenced */
>>>>unsigned int old_block_size;/* seldom referenced */
>>>> +  bool ref_initialized;   /* seldom referenced */
>>>> +  struct completion comp; /* seldom referenced */
>>>>  #ifdef CONFIG_FRONTSWAP
>>>>unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
>>>>atomic_t frontswap_pages;   /* frontswap pages in-use counter */
>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>> index 149e77454e3c..66515a3a2824 100644
>>>> --- a/mm/swapfile.c
>>>> +++ b/mm/swapfile.c
>>>> @@ -39,6 +39,7 @@
>>>>  #include 
>>>>  #include 
>>>>  #include 
>>>> +#include 
>>>>  
>>>>  #include 
>>>>  #include 
>>>> @@ -511,6 +512,14 @@ static void swap_discard_work(struct work_struct 
>>>> *work)
>>>>spin_unlock(&si->lock);
>>>>  }
>>>>  
>>>> +static void swap_users_ref_free(struct percpu_ref *ref)
>>>> +{
>>>> +  struct swap_info_struct *si;
>>>> +
>>>> +  si = container_of(ref, struct swap_info_struct, users);
>>>> +  complete(&si->comp);
>>>> +}
>>>> +
>>>>  static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
>>>>  {
>>>>struct swap_cluster_info *ci = si->cluster_info;
>>>> @@ -2500,7 +2509,7 @@ static void enable_swap_info(struct swap_info_struct 
>>>> *p, int prio,
>>>> * Guarantee swap_map, cluster_info, etc. fields are valid
>>>> * between get/put_swap_device() if SWP_VALID bit is set
>>>> */
>>>> -  synchronize_rcu();
>>>
>>> You cannot remove this without changing get/put_swap_device().  It's
>>> better to squash at least PATCH 1-2.
>>
>> Will squash PATCH 1-2. Thanks.
>>
>>>
>>>> +  percpu_ref_resurrect(&p->users);
>>>>spin_lock(&swap_lock);
>>>>spin_lock(&p->lock);
>>>>_enable_swap_info(p);
>>>> @@ -2621,11 +2630,18 @@ SYSCALL_DEFINE1(swapoff, const char __user *, 
>>>> specialfile)
>>>>p->flags &= ~SWP_VALID; /* mark swap device as invalid */
>>>>spin_unlock(&p->lock);
>>>>spin_unlock(&swap_lock);
>>>> +
>>>> +  percpu_ref_kill(&p->users);
>>>>/*
>>>> -   * wait for swap operations protected by get/put_swap_device()
>>>> -   * to complete
>>>> +   * We need synchronize_rcu() here to protect the accessing
>>>> +   * to the swap cache data structure.
>>>> */
>>>>synchronize_rcu();
>>>> +  /*
>>>> +   * Wait for swap op

Re: [PATCH v2 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-19 Thread Huang, Ying
Miaohe Lin  writes:

> On 2021/4/19 10:48, Huang, Ying wrote:
>> Miaohe Lin  writes:
>> 
>>> We will use percpu-refcount to serialize against concurrent swapoff. This
>>> patch adds the percpu_ref support for swap.
>>>
>>> Signed-off-by: Miaohe Lin 
>>> ---
>>>  include/linux/swap.h |  3 +++
>>>  mm/swapfile.c| 33 +
>>>  2 files changed, 32 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 144727041e78..8be36eb58b7a 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>>   * The in-memory structure used to track swap areas.
>>>   */
>>>  struct swap_info_struct {
>>> +   struct percpu_ref users;/* serialization against concurrent 
>>> swapoff */
>> 
>> The comments aren't general enough.  We use this to check whether the
>> swap device has been fully initialized, etc. May be something as below?
>> 
>> /* indicate and keep swap device valid */
>
> Looks good.
>
>> 
>>> unsigned long   flags;  /* SWP_USED etc: see above */
>>> signed shortprio;   /* swap priority of this type */
>>> struct plist_node list; /* entry in swap_active_head */
>>> @@ -260,6 +261,8 @@ struct swap_info_struct {
>>> struct block_device *bdev;  /* swap device or bdev of swap file */
>>> struct file *swap_file; /* seldom referenced */
>>> unsigned int old_block_size;/* seldom referenced */
>>> +   bool ref_initialized;   /* seldom referenced */
>>> +   struct completion comp; /* seldom referenced */
>>>  #ifdef CONFIG_FRONTSWAP
>>> unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
>>> atomic_t frontswap_pages;   /* frontswap pages in-use counter */
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index 149e77454e3c..66515a3a2824 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -39,6 +39,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  
>>>  #include 
>>>  #include 
>>> @@ -511,6 +512,14 @@ static void swap_discard_work(struct work_struct *work)
>>> spin_unlock(&si->lock);
>>>  }
>>>  
>>> +static void swap_users_ref_free(struct percpu_ref *ref)
>>> +{
>>> +   struct swap_info_struct *si;
>>> +
>>> +   si = container_of(ref, struct swap_info_struct, users);
>>> +   complete(&si->comp);
>>> +}
>>> +
>>>  static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
>>>  {
>>> struct swap_cluster_info *ci = si->cluster_info;
>>> @@ -2500,7 +2509,7 @@ static void enable_swap_info(struct swap_info_struct 
>>> *p, int prio,
>>>  * Guarantee swap_map, cluster_info, etc. fields are valid
>>>  * between get/put_swap_device() if SWP_VALID bit is set
>>>  */
>>> -   synchronize_rcu();
>> 
>> You cannot remove this without changing get/put_swap_device().  It's
>> better to squash at least PATCH 1-2.
>
> Will squash PATCH 1-2. Thanks.
>
>> 
>>> +   percpu_ref_resurrect(&p->users);
>>> spin_lock(&swap_lock);
>>> spin_lock(&p->lock);
>>> _enable_swap_info(p);
>>> @@ -2621,11 +2630,18 @@ SYSCALL_DEFINE1(swapoff, const char __user *, 
>>> specialfile)
>>> p->flags &= ~SWP_VALID; /* mark swap device as invalid */
>>> spin_unlock(&p->lock);
>>> spin_unlock(&swap_lock);
>>> +
>>> +   percpu_ref_kill(&p->users);
>>> /*
>>> -* wait for swap operations protected by get/put_swap_device()
>>> -* to complete
>>> +* We need synchronize_rcu() here to protect the accessing
>>> +* to the swap cache data structure.
>>>  */
>>> synchronize_rcu();
>>> +   /*
>>> +* Wait for swap operations protected by get/put_swap_device()
>>> +* to complete.
>>> +*/
>> 
>> I think the comments (after some revision) can be moved before
>> percpu_ref_kill().  The synchronize_rcu() comments can be merged.
>> 
>
> Ok.
>
>>> +   wait_for_completion(&p->comp);
>>>  
>>> flush_work(&p->discard_work);
>>>  
>>> @@ -3132,7 +3148,7 @@

Re: [PATCH v2 3/5] swap: fix do_swap_page() race with swapoff

2021-04-19 Thread Miaohe Lin
On 2021/4/19 10:23, Huang, Ying wrote:
> Miaohe Lin  writes:
> 
>> When I was investigating the swap code, I found the below possible race
>> window:
>>
>> CPU 1                                        CPU 2
>> -----                                        -----
>> do_swap_page
> 
> This is OK for swap cache cases.  So
> 
>   if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
> 
> should be shown here.

Ok.

> 
>>   swap_readpage(skip swap cache case)
>>     if (data_race(sis->flags & SWP_FS_OPS)) {
>>                                      swapoff
>>                                        p->flags &= ~SWP_VALID;
>>                                        ..
>>                                        synchronize_rcu();
>>                                        ..
>>                                        p->swap_file = NULL;
>>     struct file *swap_file = sis->swap_file;
>>     struct address_space *mapping = swap_file->f_mapping;[oops!]
>>
>> Note that for the pages that are swapped in through swap cache, this isn't
>> an issue. Because the page is locked, and the swap entry will be marked
>> with SWAP_HAS_CACHE, so swapoff() can not proceed until the page has been
>> unlocked.
>>
>> Using the current get/put_swap_device() to guard against concurrent swapoff
>> for swap_readpage() looks terrible, because swap_readpage() may take a
>> really long time. And this race may not be really pernicious, because
>> swapoff is usually done only at system shutdown. To reduce the performance
>> overhead on the hot path as much as possible, it appears we can use a
>> percpu_ref to close this race window (as suggested by Huang, Ying).
> 
> I still suggest to squash PATCH 1-3, at least PATCH 1-2.  That will
> change the relevant code together and make it easier to review.
> 

Will squash PATCH 1-2. Thanks.

> Best Regards,
> Huang, Ying
> 
>> Fixes: 0bcac06f27d7 ("mm,swap: skip swapcache for swapin of synchronous device")
>> Reported-by: kernel test robot  (auto build test ERROR)
>> Signed-off-by: Miaohe Lin 
>> ---
>>  include/linux/swap.h | 9 +
>>  mm/memory.c  | 9 +
>>  2 files changed, 18 insertions(+)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 993693b38109..523c2411a135 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -528,6 +528,15 @@ static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
>>  return NULL;
>>  }
>>  
>> +static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
>> +{
>> +return NULL;
>> +}
>> +
>> +static inline void put_swap_device(struct swap_info_struct *si)
>> +{
>> +}
>> +
>>  #define swap_address_space(entry)   (NULL)
>>  #define get_nr_swap_pages() 0L
>>  #define total_swap_pages0L
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 27014c3bde9f..7a2fe12cf641 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  {
>>  struct vm_area_struct *vma = vmf->vma;
>>  struct page *page = NULL, *swapcache;
>> +struct swap_info_struct *si = NULL;
>>  swp_entry_t entry;
>>  pte_t pte;
>>  int locked;
>> @@ -3338,6 +3339,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  goto out;
>>  }
>>  
>> +/* Prevent swapoff from happening to us. */
>> +si = get_swap_device(entry);
>> +if (unlikely(!si))
>> +goto out;
>>  
>>  delayacct_set_flag(current, DELAYACCT_PF_SWAPIN);
>>  page = lookup_swap_cache(entry, vma, vmf->address);
>> @@ -3514,6 +3519,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  unlock:
>>  pte_unmap_unlock(vmf->pte, vmf->ptl);
>>  out:
>> +if (si)
>> +put_swap_device(si);
>>  return ret;
>>  out_nomap:
>>  pte_unmap_unlock(vmf->pte, vmf->ptl);
>> @@ -3525,6 +3532,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  unlock_page(swapcache);
>>  put_page(swapcache);
>>  }
>> +if (si)
>> +put_swap_device(si);
>>  return ret;
>>  }
> .
> 



Re: [PATCH v2 4/5] mm/swap: remove confusing checking for non_swap_entry() in swap_ra_info()

2021-04-19 Thread Miaohe Lin
On 2021/4/19 9:53, Huang, Ying wrote:
> Miaohe Lin  writes:
> 
>> While we released the pte lock, somebody else might have faulted in this
>> pte. So we should check whether it's a swap pte first to guard against
>> such a race, or the swp_type would be unexpected. But the swap entry isn't
>> used in this function, and we will have enough checking when we really
>> operate on the PTE entries later. So checking for non_swap_entry() is not
>> really needed here and should be removed to avoid confusion.
> 
> Please rephrase the change log to describe why we have the code and why
> it's unnecessary now.  You can dig the git history via git-blame to find
> it out.
> 

Will try to do it. Thanks.

> The patch itself looks good to me.
> 
> Best Regards,
> Huang, Ying
> 
>> Signed-off-by: Miaohe Lin 
>> ---
>>  mm/swap_state.c | 6 --
>>  1 file changed, 6 deletions(-)
>>
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index 272ea2108c9d..df5405384520 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -721,7 +721,6 @@ static void swap_ra_info(struct vm_fault *vmf,
>>  {
>>  struct vm_area_struct *vma = vmf->vma;
>>  unsigned long ra_val;
>> -swp_entry_t entry;
>>  unsigned long faddr, pfn, fpfn;
>>  unsigned long start, end;
>>  pte_t *pte, *orig_pte;
>> @@ -739,11 +738,6 @@ static void swap_ra_info(struct vm_fault *vmf,
>>  
>>  faddr = vmf->address;
>>  orig_pte = pte = pte_offset_map(vmf->pmd, faddr);
>> -entry = pte_to_swp_entry(*pte);
>> -if ((unlikely(non_swap_entry(entry)))) {
>> -pte_unmap(orig_pte);
>> -return;
>> -}
>>  
>>  fpfn = PFN_DOWN(faddr);
>>  ra_val = GET_SWAP_RA_VAL(vma);
> .
> 



Re: [PATCH v2 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-19 Thread Miaohe Lin
On 2021/4/19 10:48, Huang, Ying wrote:
> Miaohe Lin  writes:
> 
>> We will use percpu-refcount to serialize against concurrent swapoff. This
>> patch adds the percpu_ref support for swap.
>>
>> Signed-off-by: Miaohe Lin 
>> ---
>>  include/linux/swap.h |  3 +++
>>  mm/swapfile.c| 33 +
>>  2 files changed, 32 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 144727041e78..8be36eb58b7a 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>   * The in-memory structure used to track swap areas.
>>   */
>>  struct swap_info_struct {
>> +struct percpu_ref users;/* serialization against concurrent 
>> swapoff */
> 
> The comments aren't general enough.  We use this to check whether the
> swap device has been fully initialized, etc. May be something as below?
> 
> /* indicate and keep swap device valid */

Looks good.

> 
>>  unsigned long   flags;  /* SWP_USED etc: see above */
>>  signed shortprio;   /* swap priority of this type */
>>  struct plist_node list; /* entry in swap_active_head */
>> @@ -260,6 +261,8 @@ struct swap_info_struct {
>>  struct block_device *bdev;  /* swap device or bdev of swap file */
>>  struct file *swap_file; /* seldom referenced */
>>  unsigned int old_block_size;/* seldom referenced */
>> +bool ref_initialized;   /* seldom referenced */
>> +struct completion comp; /* seldom referenced */
>>  #ifdef CONFIG_FRONTSWAP
>>  unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
>>  atomic_t frontswap_pages;   /* frontswap pages in-use counter */
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 149e77454e3c..66515a3a2824 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -39,6 +39,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -511,6 +512,14 @@ static void swap_discard_work(struct work_struct *work)
>>  spin_unlock(&si->lock);
>>  }
>>  
>> +static void swap_users_ref_free(struct percpu_ref *ref)
>> +{
>> +struct swap_info_struct *si;
>> +
>> +si = container_of(ref, struct swap_info_struct, users);
>> +complete(&si->comp);
>> +}
>> +
>>  static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
>>  {
>>  struct swap_cluster_info *ci = si->cluster_info;
>> @@ -2500,7 +2509,7 @@ static void enable_swap_info(struct swap_info_struct 
>> *p, int prio,
>>   * Guarantee swap_map, cluster_info, etc. fields are valid
>>   * between get/put_swap_device() if SWP_VALID bit is set
>>   */
>> -synchronize_rcu();
> 
> You cannot remove this without changing get/put_swap_device().  It's
> better to squash at least PATCH 1-2.

Will squash PATCH 1-2. Thanks.

> 
>> +percpu_ref_resurrect(&p->users);
>>  spin_lock(&swap_lock);
>>  spin_lock(&p->lock);
>>  _enable_swap_info(p);
>> @@ -2621,11 +2630,18 @@ SYSCALL_DEFINE1(swapoff, const char __user *, 
>> specialfile)
>>  p->flags &= ~SWP_VALID; /* mark swap device as invalid */
>>  spin_unlock(&p->lock);
>>  spin_unlock(&swap_lock);
>> +
>> +percpu_ref_kill(&p->users);
>>  /*
>> - * wait for swap operations protected by get/put_swap_device()
>> - * to complete
>> + * We need synchronize_rcu() here to protect the accessing
>> + * to the swap cache data structure.
>>   */
>>  synchronize_rcu();
>> +/*
>> + * Wait for swap operations protected by get/put_swap_device()
>> + * to complete.
>> + */
> 
> I think the comments (after some revision) can be moved before
> percpu_ref_kill().  The synchronize_rcu() comments can be merged.
> 

Ok.

>> +wait_for_completion(&p->comp);
>>  
>>  flush_work(&p->discard_work);
>>  
>> @@ -3132,7 +3148,7 @@ static bool swap_discardable(struct swap_info_struct 
>> *si)
>>  SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>  {
>>  struct swap_info_struct *p;
>> -struct filename *name;
>> +struct filename *name = NULL;
>>  struct file *swap_file = NULL;
>>  struct address_space *mapping;
>>  int prio;
>> @@ -3163,6 +3179,15 @@ SYSCALL_DEFIN

Re: [PATCH v2 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-18 Thread Huang, Ying
Miaohe Lin  writes:

> We will use percpu-refcount to serialize against concurrent swapoff. This
> patch adds the percpu_ref support for swap.
>
> Signed-off-by: Miaohe Lin 
> ---
>  include/linux/swap.h |  3 +++
>  mm/swapfile.c| 33 +
>  2 files changed, 32 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 144727041e78..8be36eb58b7a 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>   * The in-memory structure used to track swap areas.
>   */
>  struct swap_info_struct {
> + struct percpu_ref users;/* serialization against concurrent 
> swapoff */

The comments aren't general enough.  We use this to check whether the
swap device has been fully initialized, etc. Maybe something as below?

/* indicate and keep swap device valid */

>   unsigned long   flags;  /* SWP_USED etc: see above */
>   signed shortprio;   /* swap priority of this type */
>   struct plist_node list; /* entry in swap_active_head */
> @@ -260,6 +261,8 @@ struct swap_info_struct {
>   struct block_device *bdev;  /* swap device or bdev of swap file */
>   struct file *swap_file; /* seldom referenced */
>   unsigned int old_block_size;/* seldom referenced */
> + bool ref_initialized;   /* seldom referenced */
> + struct completion comp; /* seldom referenced */
>  #ifdef CONFIG_FRONTSWAP
>   unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
>   atomic_t frontswap_pages;   /* frontswap pages in-use counter */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 149e77454e3c..66515a3a2824 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -39,6 +39,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -511,6 +512,14 @@ static void swap_discard_work(struct work_struct *work)
>   spin_unlock(&si->lock);
>  }
>  
> +static void swap_users_ref_free(struct percpu_ref *ref)
> +{
> + struct swap_info_struct *si;
> +
> + si = container_of(ref, struct swap_info_struct, users);
> + complete(&si->comp);
> +}
> +
>  static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
>  {
>   struct swap_cluster_info *ci = si->cluster_info;
> @@ -2500,7 +2509,7 @@ static void enable_swap_info(struct swap_info_struct 
> *p, int prio,
>* Guarantee swap_map, cluster_info, etc. fields are valid
>* between get/put_swap_device() if SWP_VALID bit is set
>*/
> - synchronize_rcu();

You cannot remove this without changing get/put_swap_device().  It's
better to squash at least PATCH 1-2.

> + percpu_ref_resurrect(&p->users);
>   spin_lock(&swap_lock);
>   spin_lock(&p->lock);
>   _enable_swap_info(p);
> @@ -2621,11 +2630,18 @@ SYSCALL_DEFINE1(swapoff, const char __user *, 
> specialfile)
>   p->flags &= ~SWP_VALID; /* mark swap device as invalid */
>   spin_unlock(&p->lock);
>   spin_unlock(&swap_lock);
> +
> + percpu_ref_kill(&p->users);
>   /*
> -  * wait for swap operations protected by get/put_swap_device()
> -  * to complete
> +  * We need synchronize_rcu() here to protect the accessing
> +  * to the swap cache data structure.
>*/
>   synchronize_rcu();
> + /*
> +  * Wait for swap operations protected by get/put_swap_device()
> +  * to complete.
> +  */

I think the comments (after some revision) can be moved before
percpu_ref_kill().  The synchronize_rcu() comments can be merged.

> + wait_for_completion(&p->comp);
>  
>   flush_work(&p->discard_work);
>  
> @@ -3132,7 +3148,7 @@ static bool swap_discardable(struct swap_info_struct 
> *si)
>  SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  {
>   struct swap_info_struct *p;
> - struct filename *name;
> + struct filename *name = NULL;
>   struct file *swap_file = NULL;
>   struct address_space *mapping;
>   int prio;
> @@ -3163,6 +3179,15 @@ SYSCALL_DEFINE2(swapon, const char __user *, 
> specialfile, int, swap_flags)
>  
>   INIT_WORK(&p->discard_work, swap_discard_work);
>  
> + if (!p->ref_initialized) {

I don't think it's necessary to add another flag p->ref_initialized.  We
can distinguish newly allocated and reused swap_info_struct in 
alloc_swap_info().
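
That is, something like the following in alloc_swap_info() (a sketch of
the idea; only a freshly allocated swap_info_struct needs the init):

	p = kvzalloc(struct_size(p, avail_lists, nr_node_ids), GFP_KERNEL);
	if (!p)
		return ERR_PTR(-ENOMEM);
	if (percpu_ref_init(&p->users, swap_users_ref_free,
			    PERCPU_REF_INIT_DEAD, GFP_KERNEL)) {
		kvfree(p);
		return ERR_PTR(-ENOMEM);
	}
	init_completion(&p->comp);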

Best Regards,
Huang, Ying

> + error = percpu_ref_init(&p->users, swap_users_ref_free,
> + PERCPU_REF_INIT_DEAD, GFP_KERNEL);
> + if (unlikely(error))
> + goto bad_swap;
> + init_completion(&p->comp);
> + p->ref_initialized = true;
> + }
> +
>   name = getname(specialfile);
>   if (IS_ERR(name)) {
>   error = PTR_ERR(name);


Re: [PATCH v2 3/5] swap: fix do_swap_page() race with swapoff

2021-04-18 Thread Huang, Ying
Miaohe Lin  writes:

> When I was investigating the swap code, I found the below possible race
> window:
>
> CPU 1                                         CPU 2
> -----                                         -----
> do_swap_page

This is OK for swap cache cases.  So

  if (data_race(si->flags & SWP_SYNCHRONOUS_IO))

should be shown here.

>   swap_readpage(skip swap cache case)
>     if (data_race(sis->flags & SWP_FS_OPS)) {
>                                       swapoff
>                                         p->flags &= ~SWP_VALID;
>                                         ..
>                                         synchronize_rcu();
>                                         ..
>                                         p->swap_file = NULL;
>     struct file *swap_file = sis->swap_file;
>     struct address_space *mapping = swap_file->f_mapping;[oops!]
>
> Note that for the pages that are swapped in through swap cache, this isn't
> an issue. Because the page is locked, and the swap entry will be marked
> with SWAP_HAS_CACHE, so swapoff() can not proceed until the page has been
> unlocked.
>
> Using the current get/put_swap_device() to guard against concurrent swapoff
> for swap_readpage() looks terrible, because swap_readpage() may take a
> really long time. And this race may not be really pernicious, because
> swapoff is usually done only at system shutdown. To reduce the performance
> overhead on the hot path as much as possible, it appears we can use a
> percpu_ref to close this race window (as suggested by Huang, Ying).

I still suggest to squash PATCH 1-3, at least PATCH 1-2.  That will
change the relevant code together and make it easier to review.

Best Regards,
Huang, Ying

> Fixes: 0bcac06f27d7 ("mm,swap: skip swapcache for swapin of synchronous device")
> Reported-by: kernel test robot  (auto build test ERROR)
> Signed-off-by: Miaohe Lin 
> ---
>  include/linux/swap.h | 9 +
>  mm/memory.c  | 9 +
>  2 files changed, 18 insertions(+)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 993693b38109..523c2411a135 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -528,6 +528,15 @@ static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
>   return NULL;
>  }
>  
> +static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
> +{
> + return NULL;
> +}
> +
> +static inline void put_swap_device(struct swap_info_struct *si)
> +{
> +}
> +
>  #define swap_address_space(entry)(NULL)
>  #define get_nr_swap_pages()  0L
>  #define total_swap_pages 0L
> diff --git a/mm/memory.c b/mm/memory.c
> index 27014c3bde9f..7a2fe12cf641 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  {
>   struct vm_area_struct *vma = vmf->vma;
>   struct page *page = NULL, *swapcache;
> + struct swap_info_struct *si = NULL;
>   swp_entry_t entry;
>   pte_t pte;
>   int locked;
> @@ -3338,6 +3339,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   goto out;
>   }
>  
> + /* Prevent swapoff from happening to us. */
> + si = get_swap_device(entry);
> + if (unlikely(!si))
> + goto out;
>  
>   delayacct_set_flag(current, DELAYACCT_PF_SWAPIN);
>   page = lookup_swap_cache(entry, vma, vmf->address);
> @@ -3514,6 +3519,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  unlock:
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
>  out:
> + if (si)
> + put_swap_device(si);
>   return ret;
>  out_nomap:
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -3525,6 +3532,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   unlock_page(swapcache);
>   put_page(swapcache);
>   }
> + if (si)
> + put_swap_device(si);
>   return ret;
>  }


Re: [PATCH v2 4/5] mm/swap: remove confusing checking for non_swap_entry() in swap_ra_info()

2021-04-18 Thread Huang, Ying
Miaohe Lin  writes:

> While we released the pte lock, somebody else might have faulted in this
> pte. So we should check whether it's a swap pte first to guard against
> such a race, or the swp_type would be unexpected. But the swap entry isn't
> used in this function, and we will have enough checking when we really
> operate on the PTE entries later. So checking for non_swap_entry() is not
> really needed here and should be removed to avoid confusion.

Please rephrase the change log to describe why we have the code and why
it's unnecessary now.  You can dig the git history via git-blame to find
it out.
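
For example (one way to dig it out):

	git blame -L '/swap_ra_info/,+30' mm/swap_state.c
	git log -p --follow mm/swap_state.c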

The patch itself looks good to me.

Best Regards,
Huang, Ying

> Signed-off-by: Miaohe Lin 
> ---
>  mm/swap_state.c | 6 --
>  1 file changed, 6 deletions(-)
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 272ea2108c9d..df5405384520 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -721,7 +721,6 @@ static void swap_ra_info(struct vm_fault *vmf,
>  {
>   struct vm_area_struct *vma = vmf->vma;
>   unsigned long ra_val;
> - swp_entry_t entry;
>   unsigned long faddr, pfn, fpfn;
>   unsigned long start, end;
>   pte_t *pte, *orig_pte;
> @@ -739,11 +738,6 @@ static void swap_ra_info(struct vm_fault *vmf,
>  
>   faddr = vmf->address;
>   orig_pte = pte = pte_offset_map(vmf->pmd, faddr);
> - entry = pte_to_swp_entry(*pte);
> - if ((unlikely(non_swap_entry(entry)))) {
> - pte_unmap(orig_pte);
> - return;
> - }
>  
>   fpfn = PFN_DOWN(faddr);
>   ra_val = GET_SWAP_RA_VAL(vma);


[PATCH v2 0/5] close various race windows for swap

2021-04-17 Thread Miaohe Lin
Hi all,
When I was investigating the swap code, I found some possible race
windows. This series aims to fix all of these races. But using the
current get/put_swap_device() to guard against concurrent swapoff for
swap_readpage() looks terrible, because swap_readpage() may take a
really long time. To reduce the performance overhead on the hot path as
much as possible, it appears we can use a percpu_ref to close this race
window (as suggested by Huang, Ying). Patch 1 adds percpu_ref support
for swap, and most of the remaining patches use it to close various race
windows. More details can be found in the respective changelogs. Thanks!

v1->v2:
  reorganize the patch-2/5
  various enhancements and fixups per Huang, Ying
  Many thanks for the comments of Huang, Ying, Dennis Zhou and Tim Chen.

Miaohe Lin (5):
  mm/swapfile: add percpu_ref support for swap
  mm/swapfile: use percpu_ref to serialize against concurrent swapoff
  swap: fix do_swap_page() race with swapoff
  mm/swap: remove confusing checking for non_swap_entry() in
swap_ra_info()
  mm/shmem: fix shmem_swapin() race with swapoff

 include/linux/swap.h | 15 +++--
 mm/memory.c  |  9 ++
 mm/shmem.c   |  6 
 mm/swap_state.c  |  6 
 mm/swapfile.c| 74 +++-
 5 files changed, 73 insertions(+), 37 deletions(-)

-- 
2.19.1



[PATCH v2 3/5] swap: fix do_swap_page() race with swapoff

2021-04-17 Thread Miaohe Lin
When I was investigating the swap code, I found the below possible race
window:

CPU 1                                           CPU 2
-----                                           -----
do_swap_page
  swap_readpage(skip swap cache case)
    if (data_race(sis->flags & SWP_FS_OPS)) {
                                                swapoff
                                                  p->flags &= ~SWP_VALID;
                                                  ..
                                                  synchronize_rcu();
                                                  ..
                                                  p->swap_file = NULL;
    struct file *swap_file = sis->swap_file;
    struct address_space *mapping = swap_file->f_mapping;[oops!]

Note that for pages that are swapped in through the swap cache, this isn't
an issue: the page is locked, and the swap entry will be marked with
SWAP_HAS_CACHE, so swapoff() cannot proceed until the page has been
unlocked.

Using the current get/put_swap_device() to guard against concurrent swapoff
for swap_readpage() looks terrible, because swap_readpage() may take a
really long time. And this race may not be really pernicious, because
swapoff is usually done only at system shutdown. To reduce the performance
overhead on the hot path as much as possible, it appears we can use a
percpu_ref to close this race window (as suggested by Huang, Ying).

Fixes: 0bcac06f27d7 ("mm,swap: skip swapcache for swapin of synchronous device")
Reported-by: kernel test robot  (auto build test ERROR)
Signed-off-by: Miaohe Lin 
---
 include/linux/swap.h | 9 +
 mm/memory.c  | 9 +
 2 files changed, 18 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 993693b38109..523c2411a135 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -528,6 +528,15 @@ static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
return NULL;
 }
 
+static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
+{
+   return NULL;
+}
+
+static inline void put_swap_device(struct swap_info_struct *si)
+{
+}
+
 #define swap_address_space(entry)  (NULL)
 #define get_nr_swap_pages()0L
 #define total_swap_pages   0L
diff --git a/mm/memory.c b/mm/memory.c
index 27014c3bde9f..7a2fe12cf641 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
struct page *page = NULL, *swapcache;
+   struct swap_info_struct *si = NULL;
swp_entry_t entry;
pte_t pte;
int locked;
@@ -3338,6 +3339,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out;
}
 
+   /* Prevent swapoff from happening to us. */
+   si = get_swap_device(entry);
+   if (unlikely(!si))
+   goto out;
 
delayacct_set_flag(current, DELAYACCT_PF_SWAPIN);
page = lookup_swap_cache(entry, vma, vmf->address);
@@ -3514,6 +3519,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
 out:
+   if (si)
+   put_swap_device(si);
return ret;
 out_nomap:
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -3525,6 +3532,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
unlock_page(swapcache);
put_page(swapcache);
}
+   if (si)
+   put_swap_device(si);
return ret;
 }
 
-- 
2.19.1



[PATCH v2 4/5] mm/swap: remove confusing checking for non_swap_entry() in swap_ra_info()

2021-04-17 Thread Miaohe Lin
While we released the pte lock, somebody else might have faulted in this
pte. So we should check whether it's a swap pte first to guard against
such a race, or the swp_type would be unexpected. But the swap entry isn't
used in this function, and we will have enough checking when we really
operate on the PTE entries later. So checking for non_swap_entry() is not
really needed here and should be removed to avoid confusion.

Signed-off-by: Miaohe Lin 
---
 mm/swap_state.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 272ea2108c9d..df5405384520 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -721,7 +721,6 @@ static void swap_ra_info(struct vm_fault *vmf,
 {
struct vm_area_struct *vma = vmf->vma;
unsigned long ra_val;
-   swp_entry_t entry;
unsigned long faddr, pfn, fpfn;
unsigned long start, end;
pte_t *pte, *orig_pte;
@@ -739,11 +738,6 @@ static void swap_ra_info(struct vm_fault *vmf,
 
faddr = vmf->address;
orig_pte = pte = pte_offset_map(vmf->pmd, faddr);
-   entry = pte_to_swp_entry(*pte);
-   if ((unlikely(non_swap_entry(entry)))) {
-   pte_unmap(orig_pte);
-   return;
-   }
 
fpfn = PFN_DOWN(faddr);
ra_val = GET_SWAP_RA_VAL(vma);
-- 
2.19.1



[PATCH v2 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-17 Thread Miaohe Lin
We will use percpu-refcount to serialize against concurrent swapoff. This
patch adds the percpu_ref support for swap.
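
The intended lifecycle, as far as this patch goes, can be sketched as
follows (the annotations are an interpretation, not part of the patch):

	/*
	 * swapon:   percpu_ref_init(&p->users, ..., PERCPU_REF_INIT_DEAD, ...)
	 *               ref starts dead, device not yet usable
	 *           percpu_ref_resurrect(&p->users)
	 *               device becomes valid for readers
	 * reader:   percpu_ref_tryget_live(&p->users) to take a reference,
	 *           percpu_ref_put(&p->users) to drop it
	 *               (wired up by the following patches)
	 * swapoff:  percpu_ref_kill(&p->users)
	 *               no new reference can be taken
	 *           wait_for_completion(&p->comp)
	 *               swap_users_ref_free() completes it at last ref drop
	 */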

Signed-off-by: Miaohe Lin 
---
 include/linux/swap.h |  3 +++
 mm/swapfile.c| 33 +
 2 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 144727041e78..8be36eb58b7a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -240,6 +240,7 @@ struct swap_cluster_list {
  * The in-memory structure used to track swap areas.
  */
 struct swap_info_struct {
+   struct percpu_ref users;  /* serialization against concurrent swapoff */
unsigned long   flags;  /* SWP_USED etc: see above */
signed shortprio;   /* swap priority of this type */
struct plist_node list; /* entry in swap_active_head */
@@ -260,6 +261,8 @@ struct swap_info_struct {
struct block_device *bdev;  /* swap device or bdev of swap file */
struct file *swap_file; /* seldom referenced */
unsigned int old_block_size;/* seldom referenced */
+   bool ref_initialized;   /* seldom referenced */
+   struct completion comp; /* seldom referenced */
 #ifdef CONFIG_FRONTSWAP
unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
atomic_t frontswap_pages;   /* frontswap pages in-use counter */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 149e77454e3c..66515a3a2824 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -511,6 +512,14 @@ static void swap_discard_work(struct work_struct *work)
	spin_unlock(&si->lock);
 }
 
+static void swap_users_ref_free(struct percpu_ref *ref)
+{
+   struct swap_info_struct *si;
+
+   si = container_of(ref, struct swap_info_struct, users);
+   complete(&si->comp);
+}
+
 static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
 {
struct swap_cluster_info *ci = si->cluster_info;
@@ -2500,7 +2509,7 @@ static void enable_swap_info(struct swap_info_struct *p, int prio,
 * Guarantee swap_map, cluster_info, etc. fields are valid
 * between get/put_swap_device() if SWP_VALID bit is set
 */
-   synchronize_rcu();
+   percpu_ref_resurrect(&p->users);
	spin_lock(&swap_lock);
	spin_lock(&p->lock);
_enable_swap_info(p);
@@ -2621,11 +2630,18 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
p->flags &= ~SWP_VALID; /* mark swap device as invalid */
	spin_unlock(&p->lock);
	spin_unlock(&swap_lock);
+
+   percpu_ref_kill(&p->users);
    /*
-* wait for swap operations protected by get/put_swap_device()
-* to complete
+* We need synchronize_rcu() here to protect the accessing
+* to the swap cache data structure.
 */
synchronize_rcu();
+   /*
+* Wait for swap operations protected by get/put_swap_device()
+* to complete.
+*/
+   wait_for_completion(&p->comp);
 
	flush_work(&p->discard_work);
 
@@ -3132,7 +3148,7 @@ static bool swap_discardable(struct swap_info_struct *si)
 SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 {
struct swap_info_struct *p;
-   struct filename *name;
+   struct filename *name = NULL;
struct file *swap_file = NULL;
struct address_space *mapping;
int prio;
@@ -3163,6 +3179,15 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 
	INIT_WORK(&p->discard_work, swap_discard_work);
 
+   if (!p->ref_initialized) {
+   error = percpu_ref_init(&p->users, swap_users_ref_free,
+   PERCPU_REF_INIT_DEAD, GFP_KERNEL);
+   if (unlikely(error))
+   goto bad_swap;
+   init_completion(&p->comp);
+   p->ref_initialized = true;
+   }
+
name = getname(specialfile);
if (IS_ERR(name)) {
error = PTR_ERR(name);
-- 
2.19.1



Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-16 Thread Miaohe Lin
On 2021/4/16 14:25, Huang, Ying wrote:
> Miaohe Lin  writes:
> 
>> On 2021/4/15 22:31, Dennis Zhou wrote:
>>> On Thu, Apr 15, 2021 at 01:24:31PM +0800, Huang, Ying wrote:
>>>> Dennis Zhou  writes:
>>>>
>>>>> On Wed, Apr 14, 2021 at 01:44:58PM +0800, Huang, Ying wrote:
>>>>>> Dennis Zhou  writes:
>>>>>>
>>>>>>> On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote:
>>>>>>>> Dennis Zhou  writes:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
>>>>>>>>>> Miaohe Lin  writes:
>>>>>>>>>>
>>>>>>>>>>> On 2021/4/14 9:17, Huang, Ying wrote:
>>>>>>>>>>>> Miaohe Lin  writes:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 2021/4/12 15:24, Huang, Ying wrote:
>>>>>>>>>>>>>> "Huang, Ying"  writes:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Miaohe Lin  writes:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We will use percpu-refcount to serialize against concurrent 
>>>>>>>>>>>>>>>> swapoff. This
>>>>>>>>>>>>>>>> patch adds the percpu_ref support for later fixup.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Miaohe Lin 
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>  include/linux/swap.h |  2 ++
>>>>>>>>>>>>>>>>  mm/swapfile.c| 25 ++---
>>>>>>>>>>>>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>>>>>>>>>>>>> index 144727041e78..849ba5265c11 100644
>>>>>>>>>>>>>>>> --- a/include/linux/swap.h
>>>>>>>>>>>>>>>> +++ b/include/linux/swap.h
>>>>>>>>>>>>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>>>>>>>>>>>>>>>   * The in-memory structure used to track swap areas.
>>>>>>>>>>>>>>>>   */
>>>>>>>>>>>>>>>>  struct swap_info_struct {
>>>>>>>>>>>>>>>> +  struct percpu_ref users;/* serialization 
>>>>>>>>>>>>>>>> against concurrent swapoff */
>>>>>>>>>>>>>>>>unsigned long   flags;  /* SWP_USED etc: see 
>>>>>>>>>>>>>>>> above */
>>>>>>>>>>>>>>>>signed shortprio;   /* swap priority of 
>>>>>>>>>>>>>>>> this type */
>>>>>>>>>>>>>>>>struct plist_node list; /* entry in 
>>>>>>>>>>>>>>>> swap_active_head */
>>>>>>>>>>>>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>>>>>>>>>>>>>>>>struct block_device *bdev;  /* swap device or bdev 
>>>>>>>>>>>>>>>> of swap file */
>>>>>>>>>>>>>>>>struct file *swap_file; /* seldom referenced */
>>>>>>>>>>>>>>>>unsigned int old_block_size;/* seldom referenced */
>>>>>>>>>>>>>>>> +  struct completion comp; /* seldom referenced */
>>>>>>>>>>>>>>>>  #ifdef CONFIG_FRONTSWAP
>>>>>>>>>>>>>>>>unsigned long *frontswap_map;   /* frontswap in-use, 
>>>>>>>>>>>>>>>

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-16 Thread Huang, Ying
Miaohe Lin  writes:

> On 2021/4/15 22:31, Dennis Zhou wrote:
>> On Thu, Apr 15, 2021 at 01:24:31PM +0800, Huang, Ying wrote:
>>> Dennis Zhou  writes:
>>>
>>>> On Wed, Apr 14, 2021 at 01:44:58PM +0800, Huang, Ying wrote:
>>>>> Dennis Zhou  writes:
>>>>>
>>>>>> On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote:
>>>>>>> Dennis Zhou  writes:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
>>>>>>>>> Miaohe Lin  writes:
>>>>>>>>>
>>>>>>>>>> On 2021/4/14 9:17, Huang, Ying wrote:
>>>>>>>>>>> Miaohe Lin  writes:
>>>>>>>>>>>
>>>>>>>>>>>> On 2021/4/12 15:24, Huang, Ying wrote:
>>>>>>>>>>>>> "Huang, Ying"  writes:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Miaohe Lin  writes:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We will use percpu-refcount to serialize against concurrent 
>>>>>>>>>>>>>>> swapoff. This
>>>>>>>>>>>>>>> patch adds the percpu_ref support for later fixup.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Miaohe Lin 
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>  include/linux/swap.h |  2 ++
>>>>>>>>>>>>>>>  mm/swapfile.c| 25 ++---
>>>>>>>>>>>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>>>>>>>>>>>> index 144727041e78..849ba5265c11 100644
>>>>>>>>>>>>>>> --- a/include/linux/swap.h
>>>>>>>>>>>>>>> +++ b/include/linux/swap.h
>>>>>>>>>>>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>>>>>>>>>>>>>>   * The in-memory structure used to track swap areas.
>>>>>>>>>>>>>>>   */
>>>>>>>>>>>>>>>  struct swap_info_struct {
>>>>>>>>>>>>>>> +   struct percpu_ref users;/* serialization 
>>>>>>>>>>>>>>> against concurrent swapoff */
>>>>>>>>>>>>>>> unsigned long   flags;  /* SWP_USED etc: see 
>>>>>>>>>>>>>>> above */
>>>>>>>>>>>>>>> signed shortprio;   /* swap priority of 
>>>>>>>>>>>>>>> this type */
>>>>>>>>>>>>>>> struct plist_node list; /* entry in 
>>>>>>>>>>>>>>> swap_active_head */
>>>>>>>>>>>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>>>>>>>>>>>>>>> struct block_device *bdev;  /* swap device or bdev 
>>>>>>>>>>>>>>> of swap file */
>>>>>>>>>>>>>>> struct file *swap_file; /* seldom referenced */
>>>>>>>>>>>>>>> unsigned int old_block_size;/* seldom referenced */
>>>>>>>>>>>>>>> +   struct completion comp; /* seldom referenced */
>>>>>>>>>>>>>>>  #ifdef CONFIG_FRONTSWAP
>>>>>>>>>>>>>>> unsigned long *frontswap_map;   /* frontswap in-use, 
>>>>>>>>>>>>>>> one bit per page */
>>>>>>>>>>>>>>> atomic_t frontswap_pages;   /* frontswap pages 
>>>>>>>>>>>>>>> in-use counter */
>>>>>>>>>>>

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-15 Thread Miaohe Lin
On 2021/4/15 22:31, Dennis Zhou wrote:
> On Thu, Apr 15, 2021 at 01:24:31PM +0800, Huang, Ying wrote:
>> Dennis Zhou  writes:
>>
>>> On Wed, Apr 14, 2021 at 01:44:58PM +0800, Huang, Ying wrote:
>>>> Dennis Zhou  writes:
>>>>
>>>>> On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote:
>>>>>> Dennis Zhou  writes:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
>>>>>>>> Miaohe Lin  writes:
>>>>>>>>
>>>>>>>>> On 2021/4/14 9:17, Huang, Ying wrote:
>>>>>>>>>> Miaohe Lin  writes:
>>>>>>>>>>
>>>>>>>>>>> On 2021/4/12 15:24, Huang, Ying wrote:
>>>>>>>>>>>> "Huang, Ying"  writes:
>>>>>>>>>>>>
>>>>>>>>>>>>> Miaohe Lin  writes:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> We will use percpu-refcount to serialize against concurrent 
>>>>>>>>>>>>>> swapoff. This
>>>>>>>>>>>>>> patch adds the percpu_ref support for later fixup.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Miaohe Lin 
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>  include/linux/swap.h |  2 ++
>>>>>>>>>>>>>>  mm/swapfile.c| 25 ++---
>>>>>>>>>>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>>>>>>>>>>> index 144727041e78..849ba5265c11 100644
>>>>>>>>>>>>>> --- a/include/linux/swap.h
>>>>>>>>>>>>>> +++ b/include/linux/swap.h
>>>>>>>>>>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>>>>>>>>>>>>>   * The in-memory structure used to track swap areas.
>>>>>>>>>>>>>>   */
>>>>>>>>>>>>>>  struct swap_info_struct {
>>>>>>>>>>>>>> +struct percpu_ref users;/* serialization 
>>>>>>>>>>>>>> against concurrent swapoff */
>>>>>>>>>>>>>>  unsigned long   flags;  /* SWP_USED etc: see 
>>>>>>>>>>>>>> above */
>>>>>>>>>>>>>>  signed shortprio;   /* swap priority of 
>>>>>>>>>>>>>> this type */
>>>>>>>>>>>>>>  struct plist_node list; /* entry in 
>>>>>>>>>>>>>> swap_active_head */
>>>>>>>>>>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>>>>>>>>>>>>>>  struct block_device *bdev;  /* swap device or bdev 
>>>>>>>>>>>>>> of swap file */
>>>>>>>>>>>>>>  struct file *swap_file; /* seldom referenced */
>>>>>>>>>>>>>>  unsigned int old_block_size;/* seldom referenced */
>>>>>>>>>>>>>> +struct completion comp; /* seldom referenced */
>>>>>>>>>>>>>>  #ifdef CONFIG_FRONTSWAP
>>>>>>>>>>>>>>  unsigned long *frontswap_map;   /* frontswap in-use, 
>>>>>>>>>>>>>> one bit per page */
>>>>>>>>>>>>>>  atomic_t frontswap_pages;   /* frontswap pages 
>>>>>>>>>>>>>> in-use counter */
>>>>>>>>>>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>>>>>>>>>>>> index 149e77454e3c..724173cd7d0c 100644
>>>>>>>>>>>>>> --- a/mm/swapfile.c
>&g

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-15 Thread Huang, Ying
Dennis Zhou  writes:

> On Thu, Apr 15, 2021 at 01:24:31PM +0800, Huang, Ying wrote:
>> Dennis Zhou  writes:
>> 
>> > On Wed, Apr 14, 2021 at 01:44:58PM +0800, Huang, Ying wrote:
>> >> Dennis Zhou  writes:
>> >> 
>> >> > On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote:
>> >> >> Dennis Zhou  writes:
>> >> >> 
>> >> >> > Hello,
>> >> >> >
>> >> >> > On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
>> >> >> >> Miaohe Lin  writes:
>> >> >> >> 
>> >> >> >> > On 2021/4/14 9:17, Huang, Ying wrote:
>> >> >> >> >> Miaohe Lin  writes:
>> >> >> >> >> 
>> >> >> >> >>> On 2021/4/12 15:24, Huang, Ying wrote:
>> >> >> >> >>>> "Huang, Ying"  writes:
>> >> >> >> >>>>
>> >> >> >> >>>>> Miaohe Lin  writes:
>> >> >> >> >>>>>
>> >> >> >> >>>>>> We will use percpu-refcount to serialize against concurrent 
>> >> >> >> >>>>>> swapoff. This
>> >> >> >> >>>>>> patch adds the percpu_ref support for later fixup.
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> Signed-off-by: Miaohe Lin 
>> >> >> >> >>>>>> ---
>> >> >> >> >>>>>>  include/linux/swap.h |  2 ++
>> >> >> >> >>>>>>  mm/swapfile.c| 25 ++---
>> >> >> >> >>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
>> >> >> >> >>>>>>
>> >> >> >> >>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >> >> >> >>>>>> index 144727041e78..849ba5265c11 100644
>> >> >> >> >>>>>> --- a/include/linux/swap.h
>> >> >> >> >>>>>> +++ b/include/linux/swap.h
>> >> >> >> >>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>> >> >> >> >>>>>>   * The in-memory structure used to track swap areas.
>> >> >> >> >>>>>>   */
>> >> >> >> >>>>>>  struct swap_info_struct {
>> >> >> >> >>>>>> +struct percpu_ref users;/* serialization 
>> >> >> >> >>>>>> against concurrent swapoff */
>> >> >> >> >>>>>>  unsigned long   flags;  /* SWP_USED etc: see 
>> >> >> >> >>>>>> above */
>> >> >> >> >>>>>>  signed shortprio;   /* swap priority of 
>> >> >> >> >>>>>> this type */
>> >> >> >> >>>>>>  struct plist_node list; /* entry in 
>> >> >> >> >>>>>> swap_active_head */
>> >> >> >> >>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>> >> >> >> >>>>>>  struct block_device *bdev;  /* swap device or bdev 
>> >> >> >> >>>>>> of swap file */
>> >> >> >> >>>>>>  struct file *swap_file; /* seldom referenced */
>> >> >> >> >>>>>>  unsigned int old_block_size;/* seldom referenced */
>> >> >> >> >>>>>> +struct completion comp; /* seldom referenced */
>> >> >> >> >>>>>>  #ifdef CONFIG_FRONTSWAP
>> >> >> >> >>>>>>  unsigned long *frontswap_map;   /* frontswap in-use, 
>> >> >> >> >>>>>> one bit per page */
>> >> >> >> >>>>>>  atomic_t frontswap_pages;   /* frontswap pages 
>> >> >> >> >>>>>> in-use counter */
>> >> >> >> >>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> >> >> >>>&g

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-15 Thread Dennis Zhou
On Thu, Apr 15, 2021 at 01:24:31PM +0800, Huang, Ying wrote:
> Dennis Zhou  writes:
> 
> > On Wed, Apr 14, 2021 at 01:44:58PM +0800, Huang, Ying wrote:
> >> Dennis Zhou  writes:
> >> 
> >> > On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote:
> >> >> Dennis Zhou  writes:
> >> >> 
> >> >> > Hello,
> >> >> >
> >> >> > On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
> >> >> >> Miaohe Lin  writes:
> >> >> >> 
> >> >> >> > On 2021/4/14 9:17, Huang, Ying wrote:
> >> >> >> >> Miaohe Lin  writes:
> >> >> >> >> 
> >> >> >> >>> On 2021/4/12 15:24, Huang, Ying wrote:
> >> >> >> >>>> "Huang, Ying"  writes:
> >> >> >> >>>>
> >> >> >> >>>>> Miaohe Lin  writes:
> >> >> >> >>>>>
> >> >> >> >>>>>> We will use percpu-refcount to serialize against concurrent 
> >> >> >> >>>>>> swapoff. This
> >> >> >> >>>>>> patch adds the percpu_ref support for later fixup.
> >> >> >> >>>>>>
> >> >> >> >>>>>> Signed-off-by: Miaohe Lin 
> >> >> >> >>>>>> ---
> >> >> >> >>>>>>  include/linux/swap.h |  2 ++
> >> >> >> >>>>>>  mm/swapfile.c| 25 ++---
> >> >> >> >>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
> >> >> >> >>>>>>
> >> >> >> >>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> >> >> >>>>>> index 144727041e78..849ba5265c11 100644
> >> >> >> >>>>>> --- a/include/linux/swap.h
> >> >> >> >>>>>> +++ b/include/linux/swap.h
> >> >> >> >>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
> >> >> >> >>>>>>   * The in-memory structure used to track swap areas.
> >> >> >> >>>>>>   */
> >> >> >> >>>>>>  struct swap_info_struct {
> >> >> >> >>>>>> + struct percpu_ref users;/* serialization 
> >> >> >> >>>>>> against concurrent swapoff */
> >> >> >> >>>>>>   unsigned long   flags;  /* SWP_USED etc: see 
> >> >> >> >>>>>> above */
> >> >> >> >>>>>>   signed shortprio;   /* swap priority of 
> >> >> >> >>>>>> this type */
> >> >> >> >>>>>>   struct plist_node list; /* entry in 
> >> >> >> >>>>>> swap_active_head */
> >> >> >> >>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
> >> >> >> >>>>>>   struct block_device *bdev;  /* swap device or bdev 
> >> >> >> >>>>>> of swap file */
> >> >> >> >>>>>>   struct file *swap_file; /* seldom referenced */
> >> >> >> >>>>>>   unsigned int old_block_size;/* seldom referenced */
> >> >> >> >>>>>> + struct completion comp; /* seldom referenced */
> >> >> >> >>>>>>  #ifdef CONFIG_FRONTSWAP
> >> >> >> >>>>>>   unsigned long *frontswap_map;   /* frontswap in-use, 
> >> >> >> >>>>>> one bit per page */
> >> >> >> >>>>>>   atomic_t frontswap_pages;   /* frontswap pages 
> >> >> >> >>>>>> in-use counter */
> >> >> >> >>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> >> >> >>>>>> index 149e77454e3c..724173cd7d0c 100644
> >> >> >> >>>>>> --- a/mm/swapfile.c
> >> >> >> >>>>>> +++ b/mm/swapfile.c
> >> >> >> >>

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-15 Thread Miaohe Lin
On 2021/4/15 12:20, Dennis Zhou wrote:
> On Thu, Apr 15, 2021 at 11:16:42AM +0800, Miaohe Lin wrote:
>> On 2021/4/14 22:53, Dennis Zhou wrote:
>>> On Wed, Apr 14, 2021 at 01:44:58PM +0800, Huang, Ying wrote:
>>>> Dennis Zhou  writes:
>>>>
>>>>> On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote:
>>>>>> Dennis Zhou  writes:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
>>>>>>>> Miaohe Lin  writes:
>>>>>>>>
>>>>>>>>> On 2021/4/14 9:17, Huang, Ying wrote:
>>>>>>>>>> Miaohe Lin  writes:
>>>>>>>>>>
>>>>>>>>>>> On 2021/4/12 15:24, Huang, Ying wrote:
>>>>>>>>>>>> "Huang, Ying"  writes:
>>>>>>>>>>>>
>>>>>>>>>>>>> Miaohe Lin  writes:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> We will use percpu-refcount to serialize against concurrent 
>>>>>>>>>>>>>> swapoff. This
>>>>>>>>>>>>>> patch adds the percpu_ref support for later fixup.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Miaohe Lin 
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>  include/linux/swap.h |  2 ++
>>>>>>>>>>>>>>  mm/swapfile.c| 25 ++---
>>>>>>>>>>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>>>>>>>>>>> index 144727041e78..849ba5265c11 100644
>>>>>>>>>>>>>> --- a/include/linux/swap.h
>>>>>>>>>>>>>> +++ b/include/linux/swap.h
>>>>>>>>>>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>>>>>>>>>>>>>   * The in-memory structure used to track swap areas.
>>>>>>>>>>>>>>   */
>>>>>>>>>>>>>>  struct swap_info_struct {
>>>>>>>>>>>>>> +struct percpu_ref users;/* serialization 
>>>>>>>>>>>>>> against concurrent swapoff */
>>>>>>>>>>>>>>  unsigned long   flags;  /* SWP_USED etc: see 
>>>>>>>>>>>>>> above */
>>>>>>>>>>>>>>  signed shortprio;   /* swap priority of 
>>>>>>>>>>>>>> this type */
>>>>>>>>>>>>>>  struct plist_node list; /* entry in 
>>>>>>>>>>>>>> swap_active_head */
>>>>>>>>>>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>>>>>>>>>>>>>>  struct block_device *bdev;  /* swap device or bdev 
>>>>>>>>>>>>>> of swap file */
>>>>>>>>>>>>>>  struct file *swap_file; /* seldom referenced */
>>>>>>>>>>>>>>  unsigned int old_block_size;/* seldom referenced */
>>>>>>>>>>>>>> +struct completion comp; /* seldom referenced */
>>>>>>>>>>>>>>  #ifdef CONFIG_FRONTSWAP
>>>>>>>>>>>>>>  unsigned long *frontswap_map;   /* frontswap in-use, 
>>>>>>>>>>>>>> one bit per page */
>>>>>>>>>>>>>>  atomic_t frontswap_pages;   /* frontswap pages 
>>>>>>>>>>>>>> in-use counter */
>>>>>>>>>>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>>>>>>>>>>>> index 149e77454e3c..724173cd7d0c 100644
>>>>>>>>>>>>>> --- a/mm/swapfile.c
>>>

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-14 Thread Huang, Ying
Dennis Zhou  writes:

> On Wed, Apr 14, 2021 at 01:44:58PM +0800, Huang, Ying wrote:
>> Dennis Zhou  writes:
>> 
>> > On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote:
>> >> Dennis Zhou  writes:
>> >> 
>> >> > Hello,
>> >> >
>> >> > On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
>> >> >> Miaohe Lin  writes:
>> >> >> 
>> >> >> > On 2021/4/14 9:17, Huang, Ying wrote:
>> >> >> >> Miaohe Lin  writes:
>> >> >> >> 
>> >> >> >>> On 2021/4/12 15:24, Huang, Ying wrote:
>> >> >> >>>> "Huang, Ying"  writes:
>> >> >> >>>>
>> >> >> >>>>> Miaohe Lin  writes:
>> >> >> >>>>>
>> >> >> >>>>>> We will use percpu-refcount to serialize against concurrent 
>> >> >> >>>>>> swapoff. This
>> >> >> >>>>>> patch adds the percpu_ref support for later fixup.
>> >> >> >>>>>>
>> >> >> >>>>>> Signed-off-by: Miaohe Lin 
>> >> >> >>>>>> ---
>> >> >> >>>>>>  include/linux/swap.h |  2 ++
>> >> >> >>>>>>  mm/swapfile.c| 25 ++---
>> >> >> >>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
>> >> >> >>>>>>
>> >> >> >>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >> >> >>>>>> index 144727041e78..849ba5265c11 100644
>> >> >> >>>>>> --- a/include/linux/swap.h
>> >> >> >>>>>> +++ b/include/linux/swap.h
>> >> >> >>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>> >> >> >>>>>>   * The in-memory structure used to track swap areas.
>> >> >> >>>>>>   */
>> >> >> >>>>>>  struct swap_info_struct {
>> >> >> >>>>>> +   struct percpu_ref users;/* serialization 
>> >> >> >>>>>> against concurrent swapoff */
>> >> >> >>>>>> unsigned long   flags;  /* SWP_USED etc: see 
>> >> >> >>>>>> above */
>> >> >> >>>>>> signed shortprio;   /* swap priority of 
>> >> >> >>>>>> this type */
>> >> >> >>>>>> struct plist_node list; /* entry in 
>> >> >> >>>>>> swap_active_head */
>> >> >> >>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>> >> >> >>>>>> struct block_device *bdev;  /* swap device or bdev 
>> >> >> >>>>>> of swap file */
>> >> >> >>>>>> struct file *swap_file; /* seldom referenced */
>> >> >> >>>>>> unsigned int old_block_size;/* seldom referenced */
>> >> >> >>>>>> +   struct completion comp; /* seldom referenced */
>> >> >> >>>>>>  #ifdef CONFIG_FRONTSWAP
>> >> >> >>>>>> unsigned long *frontswap_map;   /* frontswap in-use, 
>> >> >> >>>>>> one bit per page */
>> >> >> >>>>>> atomic_t frontswap_pages;   /* frontswap pages 
>> >> >> >>>>>> in-use counter */
>> >> >> >>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> >> >>>>>> index 149e77454e3c..724173cd7d0c 100644
>> >> >> >>>>>> --- a/mm/swapfile.c
>> >> >> >>>>>> +++ b/mm/swapfile.c
>> >> >> >>>>>> @@ -39,6 +39,7 @@
>> >> >> >>>>>>  #include 
>> >> >> >>>>>>  #include 
>> >> >> >>>>>>  #include 
>> >> >> >>>>>> +#include <linux/completion.h>
>> >> >> >>>>>>  
>&

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-14 Thread Dennis Zhou
On Thu, Apr 15, 2021 at 11:16:42AM +0800, Miaohe Lin wrote:
> On 2021/4/14 22:53, Dennis Zhou wrote:
> > On Wed, Apr 14, 2021 at 01:44:58PM +0800, Huang, Ying wrote:
> >> Dennis Zhou  writes:
> >>
> >>> On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote:
> >>>> Dennis Zhou  writes:
> >>>>
> >>>>> Hello,
> >>>>>
> >>>>> On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
> >>>>>> Miaohe Lin  writes:
> >>>>>>
> >>>>>>> On 2021/4/14 9:17, Huang, Ying wrote:
> >>>>>>>> Miaohe Lin  writes:
> >>>>>>>>
> >>>>>>>>> On 2021/4/12 15:24, Huang, Ying wrote:
> >>>>>>>>>> "Huang, Ying"  writes:
> >>>>>>>>>>
> >>>>>>>>>>> Miaohe Lin  writes:
> >>>>>>>>>>>
> >>>>>>>>>>>> We will use percpu-refcount to serialize against concurrent 
> >>>>>>>>>>>> swapoff. This
> >>>>>>>>>>>> patch adds the percpu_ref support for later fixup.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Signed-off-by: Miaohe Lin 
> >>>>>>>>>>>> ---
> >>>>>>>>>>>>  include/linux/swap.h |  2 ++
> >>>>>>>>>>>>  mm/swapfile.c| 25 ++---
> >>>>>>>>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
> >>>>>>>>>>>>
> >>>>>>>>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >>>>>>>>>>>> index 144727041e78..849ba5265c11 100644
> >>>>>>>>>>>> --- a/include/linux/swap.h
> >>>>>>>>>>>> +++ b/include/linux/swap.h
> >>>>>>>>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
> >>>>>>>>>>>>   * The in-memory structure used to track swap areas.
> >>>>>>>>>>>>   */
> >>>>>>>>>>>>  struct swap_info_struct {
> >>>>>>>>>>>> +struct percpu_ref users;/* serialization 
> >>>>>>>>>>>> against concurrent swapoff */
> >>>>>>>>>>>>  unsigned long   flags;  /* SWP_USED etc: see 
> >>>>>>>>>>>> above */
> >>>>>>>>>>>>  signed shortprio;   /* swap priority of 
> >>>>>>>>>>>> this type */
> >>>>>>>>>>>>  struct plist_node list; /* entry in 
> >>>>>>>>>>>> swap_active_head */
> >>>>>>>>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
> >>>>>>>>>>>>  struct block_device *bdev;  /* swap device or bdev 
> >>>>>>>>>>>> of swap file */
> >>>>>>>>>>>>  struct file *swap_file; /* seldom referenced */
> >>>>>>>>>>>>  unsigned int old_block_size;/* seldom referenced */
> >>>>>>>>>>>> +struct completion comp; /* seldom referenced */
> >>>>>>>>>>>>  #ifdef CONFIG_FRONTSWAP
> >>>>>>>>>>>>  unsigned long *frontswap_map;   /* frontswap in-use, 
> >>>>>>>>>>>> one bit per page */
> >>>>>>>>>>>>  atomic_t frontswap_pages;   /* frontswap pages 
> >>>>>>>>>>>> in-use counter */
> >>>>>>>>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >>>>>>>>>>>> index 149e77454e3c..724173cd7d0c 100644
> >>>>>>>>>>>> --- a/mm/swapfile.c
> >>>>>>>>>>>> +++ b/mm/swapfile.c
> >>>>>>>>>>>> @@ -39,6 +39,7 @@
> >>>>>>>>>>>>  #include 
> >>>>>

Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-14 Thread Miaohe Lin
On 2021/4/15 0:13, Tim Chen wrote:
> 
> 
> On 4/13/21 6:04 PM, Huang, Ying wrote:
>> Tim Chen  writes:
>>
>>> On 4/12/21 6:27 PM, Huang, Ying wrote:
>>>
>>>>
>>>> This isn't the commit that introduces the race.  You can use `git blame` to
>>>> find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
>>>> swap: skip swapcache for swapin of synchronous device".
>>>>
>>>> And I suggest merging 1/5 and 2/5 to make it easy to get the full
>>>> picture.
>>>
>>> I'd suggest making the fix for the do_swap_page race with get/put_swap_device
>>> the first patch. Then the per_cpu_ref stuff in patch 1 and patch 2 can
>>> be combined together.
>>
>> The original get/put_swap_device() uses rcu_read_lock/unlock().  I don't
>> think it's good to wrap swap_read_page() with it.  After all, some
>> complex operations are done in swap_read_page(), including
>> blk_io_schedule().
>>
> 
> In that case, have the patches that make get/put_swap_device use
> percpu_ref first, and the patch to fix the race in do_swap_page
> later in another patch.
> 
> Patch 2 is mixing the two.
> 

Looks like a good way to organize this patch series. Many thanks!

> Tim
> .
> 



Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-14 Thread Miaohe Lin
On 2021/4/14 22:53, Dennis Zhou wrote:
> On Wed, Apr 14, 2021 at 01:44:58PM +0800, Huang, Ying wrote:
>> Dennis Zhou  writes:
>>
>>> On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote:
>>>> Dennis Zhou  writes:
>>>>
>>>>> Hello,
>>>>>
>>>>> On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
>>>>>> Miaohe Lin  writes:
>>>>>>
>>>>>>> On 2021/4/14 9:17, Huang, Ying wrote:
>>>>>>>> Miaohe Lin  writes:
>>>>>>>>
>>>>>>>>> On 2021/4/12 15:24, Huang, Ying wrote:
>>>>>>>>>> "Huang, Ying"  writes:
>>>>>>>>>>
>>>>>>>>>>> Miaohe Lin  writes:
>>>>>>>>>>>
>>>>>>>>>>>> We will use percpu-refcount to serialize against concurrent 
>>>>>>>>>>>> swapoff. This
>>>>>>>>>>>> patch adds the percpu_ref support for later fixup.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Miaohe Lin 
>>>>>>>>>>>> ---
>>>>>>>>>>>>  include/linux/swap.h |  2 ++
>>>>>>>>>>>>  mm/swapfile.c| 25 ++---
>>>>>>>>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>>>>>>>>> index 144727041e78..849ba5265c11 100644
>>>>>>>>>>>> --- a/include/linux/swap.h
>>>>>>>>>>>> +++ b/include/linux/swap.h
>>>>>>>>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>>>>>>>>>>>   * The in-memory structure used to track swap areas.
>>>>>>>>>>>>   */
>>>>>>>>>>>>  struct swap_info_struct {
>>>>>>>>>>>> +  struct percpu_ref users;/* serialization against 
>>>>>>>>>>>> concurrent swapoff */
>>>>>>>>>>>>unsigned long   flags;  /* SWP_USED etc: see above */
>>>>>>>>>>>>signed shortprio;   /* swap priority of this type */
>>>>>>>>>>>>struct plist_node list; /* entry in swap_active_head */
>>>>>>>>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>>>>>>>>>>>>struct block_device *bdev;  /* swap device or bdev of swap 
>>>>>>>>>>>> file */
>>>>>>>>>>>>struct file *swap_file; /* seldom referenced */
>>>>>>>>>>>>unsigned int old_block_size;/* seldom referenced */
>>>>>>>>>>>> +  struct completion comp; /* seldom referenced */
>>>>>>>>>>>>  #ifdef CONFIG_FRONTSWAP
>>>>>>>>>>>>unsigned long *frontswap_map;   /* frontswap in-use, one bit 
>>>>>>>>>>>> per page */
>>>>>>>>>>>>atomic_t frontswap_pages;   /* frontswap pages in-use 
>>>>>>>>>>>> counter */
>>>>>>>>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>>>>>>>>>> index 149e77454e3c..724173cd7d0c 100644
>>>>>>>>>>>> --- a/mm/swapfile.c
>>>>>>>>>>>> +++ b/mm/swapfile.c
>>>>>>>>>>>> @@ -39,6 +39,7 @@
>>>>>>>>>>>>  #include 
>>>>>>>>>>>>  #include 
>>>>>>>>>>>>  #include 
>>>>>>>>>>>> +#include <linux/completion.h>
>>>>>>>>>>>>  
>>>>>>>>>>>>  #include 
>>>>>>>>>>>>  #include 
>>>>>>>>>>>> @@ -511,6 +512,15 @@ static void swap_discard_work(struct work_struct *work)
>>>>>>>>>>>> 	spin_unlock(&si->lock);
>>>>>>

Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-14 Thread Tim Chen



On 4/13/21 6:04 PM, Huang, Ying wrote:
> Tim Chen  writes:
> 
>> On 4/12/21 6:27 PM, Huang, Ying wrote:
>>
>>>
>>> This isn't the commit that introduces the race.  You can use `git blame` to
>>> find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
>>> swap: skip swapcache for swapin of synchronous device".
>>>
>>> And I suggest merging 1/5 and 2/5 to make it easy to get the full
>>> picture.
>>
>> I'd suggest making the fix for the do_swap_page race with get/put_swap_device
>> the first patch. Then the per_cpu_ref stuff in patch 1 and patch 2 can
>> be combined together.
> 
> The original get/put_swap_device() uses rcu_read_lock/unlock().  I don't
> think it's good to wrap swap_read_page() with it.  After all, some
> complex operations are done in swap_read_page(), including
> blk_io_schedule().
> 

In that case, have the patches that make get/put_swap_device use
percpu_ref first, and the patch to fix the race in do_swap_page
later in another patch.

Patch 2 is mixing the two.

Tim
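
[The constraint Huang Ying raises above is that rcu_read_lock()/rcu_read_unlock()
delimit a non-sleeping critical section, while swap_readpage() can block (for
example in blk_io_schedule()), so an RCU-based get/put_swap_device() cannot
legally cover it; a percpu_ref reference, by contrast, can be held across
sleeping calls. A minimal sketch of the contrast follows; swapin_read() is a
hypothetical helper, and the users field is assumed as in this series.]

/* Hypothetical helper showing the difference; si and page come from the caller. */
static int swapin_read(struct swap_info_struct *si, struct page *page)
{
	/*
	 * Not OK: rcu_read_lock()/rcu_read_unlock() delimit a non-sleeping
	 * section, but swap_readpage() may block in blk_io_schedule():
	 *
	 *	rcu_read_lock();
	 *	swap_readpage(page, true);
	 *	rcu_read_unlock();
	 */

	/* OK: a percpu_ref reference may be held across a sleeping call. */
	if (!percpu_ref_tryget_live(&si->users))
		return -ENODEV;		/* raced with swapoff */
	swap_readpage(page, true);
	percpu_ref_put(&si->users);
	return 0;
}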


Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-14 Thread Dennis Zhou
On Wed, Apr 14, 2021 at 01:44:58PM +0800, Huang, Ying wrote:
> Dennis Zhou  writes:
> 
> > On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote:
> >> Dennis Zhou  writes:
> >> 
> >> > Hello,
> >> >
> >> > On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
> >> >> Miaohe Lin  writes:
> >> >> 
> >> >> > On 2021/4/14 9:17, Huang, Ying wrote:
> >> >> >> Miaohe Lin  writes:
> >> >> >> 
> >> >> >>> On 2021/4/12 15:24, Huang, Ying wrote:
> >> >> >>>> "Huang, Ying"  writes:
> >> >> >>>>
> >> >> >>>>> Miaohe Lin  writes:
> >> >> >>>>>
> >> >> >>>>>> We will use percpu-refcount to serialize against concurrent 
> >> >> >>>>>> swapoff. This
> >> >> >>>>>> patch adds the percpu_ref support for later fixup.
> >> >> >>>>>>
> >> >> >>>>>> Signed-off-by: Miaohe Lin 
> >> >> >>>>>> ---
> >> >> >>>>>>  include/linux/swap.h |  2 ++
> >> >> >>>>>>  mm/swapfile.c| 25 ++---
> >> >> >>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
> >> >> >>>>>>
> >> >> >>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> >> >>>>>> index 144727041e78..849ba5265c11 100644
> >> >> >>>>>> --- a/include/linux/swap.h
> >> >> >>>>>> +++ b/include/linux/swap.h
> >> >> >>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
> >> >> >>>>>>   * The in-memory structure used to track swap areas.
> >> >> >>>>>>   */
> >> >> >>>>>>  struct swap_info_struct {
> >> >> >>>>>> +struct percpu_ref users;/* serialization 
> >> >> >>>>>> against concurrent swapoff */
> >> >> >>>>>>  unsigned long   flags;  /* SWP_USED etc: see 
> >> >> >>>>>> above */
> >> >> >>>>>>  signed shortprio;   /* swap priority of 
> >> >> >>>>>> this type */
> >> >> >>>>>>  struct plist_node list; /* entry in 
> >> >> >>>>>> swap_active_head */
> >> >> >>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
> >> >> >>>>>>  struct block_device *bdev;  /* swap device or bdev 
> >> >> >>>>>> of swap file */
> >> >> >>>>>>  struct file *swap_file; /* seldom referenced */
> >> >> >>>>>>  unsigned int old_block_size;/* seldom referenced */
> >> >> >>>>>> +struct completion comp; /* seldom referenced */
> >> >> >>>>>>  #ifdef CONFIG_FRONTSWAP
> >> >> >>>>>>  unsigned long *frontswap_map;   /* frontswap in-use, 
> >> >> >>>>>> one bit per page */
> >> >> >>>>>>  atomic_t frontswap_pages;   /* frontswap pages 
> >> >> >>>>>> in-use counter */
> >> >> >>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> >> >>>>>> index 149e77454e3c..724173cd7d0c 100644
> >> >> >>>>>> --- a/mm/swapfile.c
> >> >> >>>>>> +++ b/mm/swapfile.c
> >> >> >>>>>> @@ -39,6 +39,7 @@
> >> >> >>>>>>  #include 
> >> >> >>>>>>  #include 
> >> >> >>>>>>  #include 
> >> >> >>>>>> +#include <linux/completion.h>
> >> >> >>>>>>  
> >> >> >>>>>>  #include 
> >> >> >>>>>>  #include 
> >> >> >>>>>> @@ -511,6 +512,15 @@ static void swap_discard_work(struct 
> >> >> >>>>>> work_st

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-13 Thread Huang, Ying
Dennis Zhou  writes:

> On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote:
>> Dennis Zhou  writes:
>> 
>> > Hello,
>> >
>> > On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
>> >> Miaohe Lin  writes:
>> >> 
>> >> > On 2021/4/14 9:17, Huang, Ying wrote:
>> >> >> Miaohe Lin  writes:
>> >> >> 
>> >> >>> On 2021/4/12 15:24, Huang, Ying wrote:
>> >> >>>> "Huang, Ying"  writes:
>> >> >>>>
>> >> >>>>> Miaohe Lin  writes:
>> >> >>>>>
>> >> >>>>>> We will use percpu-refcount to serialize against concurrent 
>> >> >>>>>> swapoff. This
>> >> >>>>>> patch adds the percpu_ref support for later fixup.
>> >> >>>>>>
>> >> >>>>>> Signed-off-by: Miaohe Lin 
>> >> >>>>>> ---
>> >> >>>>>>  include/linux/swap.h |  2 ++
>> >> >>>>>>  mm/swapfile.c| 25 ++---
>> >> >>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
>> >> >>>>>>
>> >> >>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >> >>>>>> index 144727041e78..849ba5265c11 100644
>> >> >>>>>> --- a/include/linux/swap.h
>> >> >>>>>> +++ b/include/linux/swap.h
>> >> >>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>> >> >>>>>>   * The in-memory structure used to track swap areas.
>> >> >>>>>>   */
>> >> >>>>>>  struct swap_info_struct {
>> >> >>>>>> +  struct percpu_ref users;/* serialization against 
>> >> >>>>>> concurrent swapoff */
>> >> >>>>>>unsigned long   flags;  /* SWP_USED etc: see above */
>> >> >>>>>>signed shortprio;   /* swap priority of this type */
>> >> >>>>>>struct plist_node list; /* entry in swap_active_head */
>> >> >>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>> >> >>>>>>struct block_device *bdev;  /* swap device or bdev of swap 
>> >> >>>>>> file */
>> >> >>>>>>struct file *swap_file; /* seldom referenced */
>> >> >>>>>>unsigned int old_block_size;/* seldom referenced */
>> >> >>>>>> +  struct completion comp; /* seldom referenced */
>> >> >>>>>>  #ifdef CONFIG_FRONTSWAP
>> >> >>>>>>unsigned long *frontswap_map;   /* frontswap in-use, one bit 
>> >> >>>>>> per page */
>> >> >>>>>>atomic_t frontswap_pages;   /* frontswap pages in-use 
>> >> >>>>>> counter */
>> >> >>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >> >>>>>> index 149e77454e3c..724173cd7d0c 100644
>> >> >>>>>> --- a/mm/swapfile.c
>> >> >>>>>> +++ b/mm/swapfile.c
>> >> >>>>>> @@ -39,6 +39,7 @@
>> >> >>>>>>  #include 
>> >> >>>>>>  #include 
>> >> >>>>>>  #include 
>> >> >>>>>> +#include <linux/completion.h>
>> >> >>>>>>  
>> >> >>>>>>  #include 
>> >> >>>>>>  #include 
>> >> >>>>>> @@ -511,6 +512,15 @@ static void swap_discard_work(struct work_struct *work)
>> >> >>>>>> 	spin_unlock(&si->lock);
>> >> >>>>>>  }
>> >> >>>>>>  
>> >> >>>>>> +static void swap_users_ref_free(struct percpu_ref *ref)
>> >> >>>>>> +{
>> >> >>>>>> +  struct swap_info_struct *si;
>> >> >>>>>> +
>> >> >>>>>> +  si = container_of(ref, struct swap_info_struct, users);
>> >> >>>>>> +   complete(&si->comp);
>> &

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-13 Thread Dennis Zhou
On Wed, Apr 14, 2021 at 11:59:03AM +0800, Huang, Ying wrote:
> Dennis Zhou  writes:
> 
> > Hello,
> >
> > On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
> >> Miaohe Lin  writes:
> >> 
> >> > On 2021/4/14 9:17, Huang, Ying wrote:
> >> >> Miaohe Lin  writes:
> >> >> 
> >> >>> On 2021/4/12 15:24, Huang, Ying wrote:
> >> >>>> "Huang, Ying"  writes:
> >> >>>>
> >> >>>>> Miaohe Lin  writes:
> >> >>>>>
> >> >>>>>> We will use percpu-refcount to serialize against concurrent 
> >> >>>>>> swapoff. This
> >> >>>>>> patch adds the percpu_ref support for later fixup.
> >> >>>>>>
> >> >>>>>> Signed-off-by: Miaohe Lin 
> >> >>>>>> ---
> >> >>>>>>  include/linux/swap.h |  2 ++
> >> >>>>>>  mm/swapfile.c| 25 ++---
> >> >>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
> >> >>>>>>
> >> >>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >> >>>>>> index 144727041e78..849ba5265c11 100644
> >> >>>>>> --- a/include/linux/swap.h
> >> >>>>>> +++ b/include/linux/swap.h
> >> >>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
> >> >>>>>>   * The in-memory structure used to track swap areas.
> >> >>>>>>   */
> >> >>>>>>  struct swap_info_struct {
> >> >>>>>> +   struct percpu_ref users;/* serialization against 
> >> >>>>>> concurrent swapoff */
> >> >>>>>> unsigned long   flags;  /* SWP_USED etc: see above */
> >> >>>>>> signed shortprio;   /* swap priority of this type */
> >> >>>>>> struct plist_node list; /* entry in swap_active_head */
> >> >>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
> >> >>>>>> struct block_device *bdev;  /* swap device or bdev of swap 
> >> >>>>>> file */
> >> >>>>>> struct file *swap_file; /* seldom referenced */
> >> >>>>>> unsigned int old_block_size;/* seldom referenced */
> >> >>>>>> +   struct completion comp; /* seldom referenced */
> >> >>>>>>  #ifdef CONFIG_FRONTSWAP
> >> >>>>>> unsigned long *frontswap_map;   /* frontswap in-use, one bit 
> >> >>>>>> per page */
> >> >>>>>> atomic_t frontswap_pages;   /* frontswap pages in-use 
> >> >>>>>> counter */
> >> >>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >> >>>>>> index 149e77454e3c..724173cd7d0c 100644
> >> >>>>>> --- a/mm/swapfile.c
> >> >>>>>> +++ b/mm/swapfile.c
> >> >>>>>> @@ -39,6 +39,7 @@
> >> >>>>>>  #include 
> >> >>>>>>  #include 
> >> >>>>>>  #include 
> >> >>>>>> +#include <linux/completion.h>
> >> >>>>>>  
> >> >>>>>>  #include 
> >> >>>>>>  #include 
> >> >>>>>> @@ -511,6 +512,15 @@ static void swap_discard_work(struct work_struct *work)
> >> >>>>>> 	spin_unlock(&si->lock);
> >> >>>>>>  }
> >> >>>>>>  
> >> >>>>>> +static void swap_users_ref_free(struct percpu_ref *ref)
> >> >>>>>> +{
> >> >>>>>> +   struct swap_info_struct *si;
> >> >>>>>> +
> >> >>>>>> +   si = container_of(ref, struct swap_info_struct, users);
> >> >>>>>> +   complete(&si->comp);
> >> >>>>>> +   percpu_ref_exit(&si->users);
> >> >>>>>
> >> >>>>> Because percpu_ref_exit() is used, we cannot use percpu_ref_tryget() 
> >> >>>>> in
> >> >>>>> get_swap_device(), better to add 

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-13 Thread Huang, Ying
Dennis Zhou  writes:

> Hello,
>
> On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
>> Miaohe Lin  writes:
>> 
>> > On 2021/4/14 9:17, Huang, Ying wrote:
>> >> Miaohe Lin  writes:
>> >> 
>> >>> On 2021/4/12 15:24, Huang, Ying wrote:
>> >>>> "Huang, Ying"  writes:
>> >>>>
>> >>>>> Miaohe Lin  writes:
>> >>>>>
>> >>>>>> We will use percpu-refcount to serialize against concurrent swapoff. 
>> >>>>>> This
>> >>>>>> patch adds the percpu_ref support for later fixup.
>> >>>>>>
>> >>>>>> Signed-off-by: Miaohe Lin 
>> >>>>>> ---
>> >>>>>>  include/linux/swap.h |  2 ++
>> >>>>>>  mm/swapfile.c| 25 ++---
>> >>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
>> >>>>>>
>> >>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >>>>>> index 144727041e78..849ba5265c11 100644
>> >>>>>> --- a/include/linux/swap.h
>> >>>>>> +++ b/include/linux/swap.h
>> >>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>> >>>>>>   * The in-memory structure used to track swap areas.
>> >>>>>>   */
>> >>>>>>  struct swap_info_struct {
>> >>>>>> + struct percpu_ref users;/* serialization against 
>> >>>>>> concurrent swapoff */
>> >>>>>>   unsigned long   flags;  /* SWP_USED etc: see above */
>> >>>>>>   signed shortprio;   /* swap priority of this type */
>> >>>>>>   struct plist_node list; /* entry in swap_active_head */
>> >>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>> >>>>>>   struct block_device *bdev;  /* swap device or bdev of swap 
>> >>>>>> file */
>> >>>>>>   struct file *swap_file; /* seldom referenced */
>> >>>>>>   unsigned int old_block_size;/* seldom referenced */
>> >>>>>> + struct completion comp; /* seldom referenced */
>> >>>>>>  #ifdef CONFIG_FRONTSWAP
>> >>>>>>   unsigned long *frontswap_map;   /* frontswap in-use, one bit 
>> >>>>>> per page */
>> >>>>>>   atomic_t frontswap_pages;   /* frontswap pages in-use 
>> >>>>>> counter */
>> >>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >>>>>> index 149e77454e3c..724173cd7d0c 100644
>> >>>>>> --- a/mm/swapfile.c
>> >>>>>> +++ b/mm/swapfile.c
>> >>>>>> @@ -39,6 +39,7 @@
>> >>>>>>  #include 
>> >>>>>>  #include 
>> >>>>>>  #include 
>> >>>>>> +#include <linux/completion.h>
>> >>>>>>  
>> >>>>>>  #include 
>> >>>>>>  #include 
>> >>>>>> @@ -511,6 +512,15 @@ static void swap_discard_work(struct work_struct *work)
>> >>>>>> 	spin_unlock(&si->lock);
>> >>>>>>  }
>> >>>>>>  
>> >>>>>> +static void swap_users_ref_free(struct percpu_ref *ref)
>> >>>>>> +{
>> >>>>>> + struct swap_info_struct *si;
>> >>>>>> +
>> >>>>>> + si = container_of(ref, struct swap_info_struct, users);
>> >>>>>> + complete(&si->comp);
>> >>>>>> + percpu_ref_exit(&si->users);
>> >>>>>
>> >>>>> Because percpu_ref_exit() is used, we cannot use percpu_ref_tryget() in
>> >>>>> get_swap_device(), better to add comments there.
>> >>>>
>> >>>> I just noticed that the comments of percpu_ref_tryget_live() say,
>> >>>>
>> >>>>  * This function is safe to call as long as @ref is between init and exit.
>> >>>>
>> >>>> Since we need to call get_swap_device() at almost any time, it's
>> >>>> b
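
[Following the advice quoted above (keep the ref "between init and exit" at all
times so percpu_ref_tryget_live() stays legal), the release callback can be
reduced to signaling the completion, leaving the percpu_ref allocated for reuse
by the next swapon. The sketch below shows that exit-free variant; it is the
direction the discussion points to, not the patch as originally posted.]

static void swap_users_ref_free(struct percpu_ref *ref)
{
	struct swap_info_struct *si;

	si = container_of(ref, struct swap_info_struct, users);
	/*
	 * Only wake up swapoff; deliberately no percpu_ref_exit() here,
	 * so @ref remains between init and exit for later tryget calls.
	 */
	complete(&si->comp);
}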

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-13 Thread Dennis Zhou
Hello,

On Wed, Apr 14, 2021 at 10:06:48AM +0800, Huang, Ying wrote:
> Miaohe Lin  writes:
> 
> > On 2021/4/14 9:17, Huang, Ying wrote:
> >> Miaohe Lin  writes:
> >> 
> >>> On 2021/4/12 15:24, Huang, Ying wrote:
> >>>> "Huang, Ying"  writes:
> >>>>
> >>>>> Miaohe Lin  writes:
> >>>>>
> >>>>>> We will use percpu-refcount to serialize against concurrent swapoff. 
> >>>>>> This
> >>>>>> patch adds the percpu_ref support for later fixup.
> >>>>>>
> >>>>>> Signed-off-by: Miaohe Lin 
> >>>>>> ---
> >>>>>>  include/linux/swap.h |  2 ++
> >>>>>>  mm/swapfile.c| 25 ++---
> >>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
> >>>>>>
> >>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >>>>>> index 144727041e78..849ba5265c11 100644
> >>>>>> --- a/include/linux/swap.h
> >>>>>> +++ b/include/linux/swap.h
> >>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
> >>>>>>   * The in-memory structure used to track swap areas.
> >>>>>>   */
> >>>>>>  struct swap_info_struct {
> >>>>>> +  struct percpu_ref users;/* serialization against 
> >>>>>> concurrent swapoff */
> >>>>>>unsigned long   flags;  /* SWP_USED etc: see above */
> >>>>>>signed shortprio;   /* swap priority of this type */
> >>>>>>struct plist_node list; /* entry in swap_active_head */
> >>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
> >>>>>>struct block_device *bdev;  /* swap device or bdev of swap 
> >>>>>> file */
> >>>>>>struct file *swap_file; /* seldom referenced */
> >>>>>>unsigned int old_block_size;/* seldom referenced */
> >>>>>> +  struct completion comp; /* seldom referenced */
> >>>>>>  #ifdef CONFIG_FRONTSWAP
> >>>>>>unsigned long *frontswap_map;   /* frontswap in-use, one bit 
> >>>>>> per page */
> >>>>>>atomic_t frontswap_pages;   /* frontswap pages in-use 
> >>>>>> counter */
> >>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >>>>>> index 149e77454e3c..724173cd7d0c 100644
> >>>>>> --- a/mm/swapfile.c
> >>>>>> +++ b/mm/swapfile.c
> >>>>>> @@ -39,6 +39,7 @@
> >>>>>>  #include 
> >>>>>>  #include 
> >>>>>>  #include 
> >>>>>> +#include <linux/completion.h>
> >>>>>>  
> >>>>>>  #include 
> >>>>>>  #include 
> >>>>>> @@ -511,6 +512,15 @@ static void swap_discard_work(struct work_struct *work)
> >>>>>> 	spin_unlock(&si->lock);
> >>>>>>  }
> >>>>>>  
> >>>>>> +static void swap_users_ref_free(struct percpu_ref *ref)
> >>>>>> +{
> >>>>>> +  struct swap_info_struct *si;
> >>>>>> +
> >>>>>> +  si = container_of(ref, struct swap_info_struct, users);
> >>>>>> +  complete(&si->comp);
> >>>>>> +  percpu_ref_exit(&si->users);
> >>>>>
> >>>>> Because percpu_ref_exit() is used, we cannot use percpu_ref_tryget() in
> >>>>> get_swap_device(), better to add comments there.
> >>>>
>> >>>> I just noticed that the comments of percpu_ref_tryget_live() say,
>> >>>>
>> >>>>  * This function is safe to call as long as @ref is between init and exit.
>> >>>>
>> >>>> Since we need to call get_swap_device() at almost any time, it's
>> >>>> better to avoid calling percpu_ref_exit() at all.  This will waste some
> >>>> memory, but we need to follow the API definition to avoid potential
> >>>> issues in the long term.
> >>>
> >>> I have to admit that I'm not really familiar with percpu_ref. So I read 
> >>

Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-13 Thread Miaohe Lin
On 2021/4/14 11:07, Huang, Ying wrote:
> Miaohe Lin  writes:
> 
>> On 2021/4/13 9:27, Huang, Ying wrote:
>>> Miaohe Lin  writes:
>>>
>>>> When I was investigating the swap code, I found the below possible race
>>>> window:
>>>>
>>>> CPU 1                             CPU 2
>>>> -----                             -----
>>>> do_swap_page
>>>>   synchronous swap_readpage
>>>>     alloc_page_vma
>>>>                                   swapoff
>>>>                                     release swap_file, bdev, or ...
>>>>   swap_readpage
>>>>     check sis->flags is ok
>>>>       access swap_file, bdev...[oops!]
>>>>                                   si->flags = 0
>>>>
>>>> Using the current get/put_swap_device() to guard against concurrent swapoff for
>>>> swap_readpage() looks terrible, because swap_readpage() may take a really long
>>>> time. And this race may not be really pernicious, because swapoff is usually
>>>> done only at system shutdown. To reduce the performance overhead on the
>>>> hot path as much as possible, it appears we can use the percpu_ref to close
>>>> this race window (as suggested by Huang, Ying).
>>>>
>>>> Fixes: 235b62176712 ("mm/swap: add cluster lock")
>>>
>>> This isn't the commit that introduces the race.  You can use `git blame` to
>>> find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
>>> swap: skip swapcache for swapin of synchronous device".
>>>
>>
>> Sorry about it! What I refer to is commit eb085574a752 ("mm, swap: fix race
>> between swapoff and some swap operations"). And I think this commit does not
>> fix the race condition completely, so I reuse the Fixes tag inside it.
>>
>>> And I suggest merging 1/5 and 2/5 to make it easy to get the full
>>> picture.
>>>
>>>> Signed-off-by: Miaohe Lin 
>>>> ---
>>>>  include/linux/swap.h |  2 +-
>>>>  mm/memory.c  | 10 ++
>>>>  mm/swapfile.c| 28 +++-
>>>>  3 files changed, 22 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index 849ba5265c11..9066addb57fd 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>>>>  
>>>>  static inline void put_swap_device(struct swap_info_struct *si)
>>>>  {
>>>> -  rcu_read_unlock();
>>>> +  percpu_ref_put(&si->users);
>>>>  }
>>>>  
>>>>  #else /* CONFIG_SWAP */
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index cc71a445c76c..8543c47b955c 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>  {
>>>>struct vm_area_struct *vma = vmf->vma;
>>>>struct page *page = NULL, *swapcache;
>>>> +  struct swap_info_struct *si = NULL;
>>>>swp_entry_t entry;
>>>>pte_t pte;
>>>>int locked;
>>>> @@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>}
>>>>  
>>>>
>>>
>>> I suggest adding comments here as follows (words copied from Matthew Wilcox):
>>>
>>> /* Prevent swapoff from happening to us */
>>
>> Ok.
>>
>>>
>>>> +  si = get_swap_device(entry);
>>>> +  /* In case we raced with swapoff. */
>>>> +  if (unlikely(!si))
>>>> +  goto out;
>>>> +
>>>
>>> Because we wrap the whole do_swap_page() with get/put_swap_device()
>>> now, we can remove several get/put_swap_device() calls from functions called
>>> by do_swap_page().  That can be another optimization patch.
>>
>> I tried to remove several get/put_swap_device() calls from functions called
>> only by do_swap_page() before I sent this series. But it seems they have
>> other callers without proper get/put_swap_device().
> 
> Then we need to revise these callers instead.  Anyway, that can be another
> series.

Yes. can be another series.
Thanks.

> 
> Best Regards,
> Huang, Ying
> 
> .
> 
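
[The shape of the fix discussed in this subthread, reduced to its control flow.
This is a sketch condensed from the patch: the elided fault-handling steps are
marked with comments, and only the get/put_swap_device() placement is the point.]

vm_fault_t do_swap_page(struct vm_fault *vmf)
{
	struct swap_info_struct *si = NULL;
	swp_entry_t entry;
	vm_fault_t ret = 0;

	/* ... decode entry from vmf->orig_pte, handle non-swap entries ... */

	/* Prevent swapoff from happening to us */
	si = get_swap_device(entry);
	if (unlikely(!si))
		goto out;	/* raced with swapoff */

	/* ... swapin, including the synchronous swap_readpage() path ... */

out:
	if (si)
		put_swap_device(si);
	return ret;
}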



Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-13 Thread Huang, Ying
Miaohe Lin  writes:

> On 2021/4/13 9:27, Huang, Ying wrote:
>> Miaohe Lin  writes:
>> 
>>> When I was investigating the swap code, I found the below possible race
>>> window:
>>>
>>> CPU 1                             CPU 2
>>> -----                             -----
>>> do_swap_page
>>>   synchronous swap_readpage
>>>     alloc_page_vma
>>>                                   swapoff
>>>                                     release swap_file, bdev, or ...
>>>   swap_readpage
>>>     check sis->flags is ok
>>>       access swap_file, bdev...[oops!]
>>>                                   si->flags = 0
>>>
>>> Using the current get/put_swap_device() to guard against concurrent swapoff for
>>> swap_readpage() looks terrible, because swap_readpage() may take a really long
>>> time. And this race may not be really pernicious, because swapoff is usually
>>> done only at system shutdown. To reduce the performance overhead on the
>>> hot path as much as possible, it appears we can use the percpu_ref to close
>>> this race window (as suggested by Huang, Ying).
>>>
>>> Fixes: 235b62176712 ("mm/swap: add cluster lock")
>> 
>> This isn't the commit that introduces the race.  You can use `git blame` to
>> find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
>> swap: skip swapcache for swapin of synchronous device".
>> 
>
> Sorry about it! What I refer to is commit eb085574a752 ("mm, swap: fix race
> between swapoff and some swap operations"). And I think this commit does not
> fix the race condition completely, so I reuse the Fixes tag inside it.
>
>> And I suggest merging 1/5 and 2/5 to make it easy to get the full
>> picture.
>> 
>>> Signed-off-by: Miaohe Lin 
>>> ---
>>>  include/linux/swap.h |  2 +-
>>>  mm/memory.c  | 10 ++
>>>  mm/swapfile.c| 28 +++-
>>>  3 files changed, 22 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 849ba5265c11..9066addb57fd 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>>>  
>>>  static inline void put_swap_device(struct swap_info_struct *si)
>>>  {
>>> -   rcu_read_unlock();
>>> +   percpu_ref_put(&si->users);
>>>  }
>>>  
>>>  #else /* CONFIG_SWAP */
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index cc71a445c76c..8543c47b955c 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>  {
>>> struct vm_area_struct *vma = vmf->vma;
>>> struct page *page = NULL, *swapcache;
>>> +   struct swap_info_struct *si = NULL;
>>> swp_entry_t entry;
>>> pte_t pte;
>>> int locked;
>>> @@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> }
>>>  
>>>
>> 
>> I suggest adding comments here as follows (words copied from Matthew Wilcox):
>> 
>>  /* Prevent swapoff from happening to us */
>
> Ok.
>
>> 
>>> +   si = get_swap_device(entry);
>>> +   /* In case we raced with swapoff. */
>>> +   if (unlikely(!si))
>>> +   goto out;
>>> +
>> 
>> Because we wrap the whole do_swap_page() with get/put_swap_device()
>> now, we can remove several get/put_swap_device() calls from functions called
>> by do_swap_page().  That can be another optimization patch.
>
I tried to remove several get/put_swap_device() calls from functions called
only by do_swap_page() before I sent this series. But it seems they have
other callers without proper get/put_swap_device().

Then we need to revise these callers instead.  Anyway, that can be another
series.

Best Regards,
Huang, Ying
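
[For context, the counterpart of the percpu_ref_put() shown in put_swap_device()
above is a get_swap_device() built on percpu_ref_tryget_live(). The following is
a simplified sketch of that shape, not a quote of the posted patch: the warning
printouts on bad entries are elided and the error labels are condensed.]

struct swap_info_struct *get_swap_device(swp_entry_t entry)
{
	struct swap_info_struct *si;

	if (!entry.val)
		return NULL;
	si = swp_swap_info(entry);
	if (!si)
		return NULL;			/* bad swap entry */
	/* Fails once swapoff has called percpu_ref_kill(&si->users). */
	if (!percpu_ref_tryget_live(&si->users))
		return NULL;
	if (swp_offset(entry) >= si->max) {	/* offset out of range */
		percpu_ref_put(&si->users);
		return NULL;
	}
	return si;
}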


Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-13 Thread Miaohe Lin
On 2021/4/13 9:27, Huang, Ying wrote:
> Miaohe Lin  writes:
> 
>> When I was investigating the swap code, I found the below possible race
>> window:
>>
>> CPU 1                             CPU 2
>> -----                             -----
>> do_swap_page
>>   synchronous swap_readpage
>>     alloc_page_vma
>>                                   swapoff
>>                                     release swap_file, bdev, or ...
>>   swap_readpage
>>     check sis->flags is ok
>>       access swap_file, bdev...[oops!]
>>                                   si->flags = 0
>>
>> Using the current get/put_swap_device() to guard against concurrent swapoff for
>> swap_readpage() looks terrible, because swap_readpage() may take a really long
>> time. And this race may not be really pernicious, because swapoff is usually
>> done only at system shutdown. To reduce the performance overhead on the
>> hot path as much as possible, it appears we can use the percpu_ref to close
>> this race window (as suggested by Huang, Ying).
>>
>> Fixes: 235b62176712 ("mm/swap: add cluster lock")
> 
> This isn't the commit that introduces the race.  You can use `git blame` to
> find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
> swap: skip swapcache for swapin of synchronous device".
> 

Sorry about it! What I refer to is commit eb085574a752 ("mm, swap: fix race
between swapoff and some swap operations"). And I think this commit does not
fix the race condition completely, so I reuse the Fixes tag inside it.

> And I suggest merging 1/5 and 2/5 to make it easy to get the full
> picture.
> 
>> Signed-off-by: Miaohe Lin 
>> ---
>>  include/linux/swap.h |  2 +-
>>  mm/memory.c  | 10 ++
>>  mm/swapfile.c| 28 +++-
>>  3 files changed, 22 insertions(+), 18 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 849ba5265c11..9066addb57fd 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>>  
>>  static inline void put_swap_device(struct swap_info_struct *si)
>>  {
>> -rcu_read_unlock();
>> +percpu_ref_put(&si->users);
>>  }
>>  
>>  #else /* CONFIG_SWAP */
>> diff --git a/mm/memory.c b/mm/memory.c
>> index cc71a445c76c..8543c47b955c 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  {
>>  struct vm_area_struct *vma = vmf->vma;
>>  struct page *page = NULL, *swapcache;
>> +struct swap_info_struct *si = NULL;
>>  swp_entry_t entry;
>>  pte_t pte;
>>  int locked;
>> @@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  }
>>  
>>
> 
> I suggest adding comments here as follows (words copied from Matthew Wilcox):
> 
>   /* Prevent swapoff from happening to us */

Ok.

> 
>> +si = get_swap_device(entry);
>> +/* In case we raced with swapoff. */
>> +if (unlikely(!si))
>> +goto out;
>> +
> 
> Because we wrap the whole do_swap_page() with get/put_swap_device()
> now, we can remove several get/put_swap_device() calls from functions called
> by do_swap_page().  That can be another optimization patch.

I tried to remove several get/put_swap_device() calls from functions called
only by do_swap_page() before I sent this series. But it seems they have
other callers without proper get/put_swap_device().

> 
>>  delayacct_set_flag(DELAYACCT_PF_SWAPIN);
>>  page = lookup_swap_cache(entry, vma, vmf->address);
>>  swapcache = page;
>> @@ -3514,6 +3520,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  unlock:
>>  pte_unmap_unlock(vmf->pte, vmf->ptl);
>>  out:
>> +if (si)
>> +put_swap_device(si);
>>  return ret;
>>  out_nomap:
>>  pte_unmap_unlock(vmf->pte, vmf->ptl);
>> @@ -3525,6 +3533,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  unlock_page(swapcache);
>>  put_page(swapcache);
>>  }
>> +if (si)
>> +put_swap_device(si);
>>  return ret;
>>  }
>>  
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 724173cd7d0c..01032c72ceae 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -1280,18 +1280,12 @@ static unsigned char __swap_entry_free_locked(struct 
>> swap_info_struc

Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-13 Thread Miaohe Lin
On 2021/4/14 9:04, Huang, Ying wrote:
> Tim Chen  writes:
> 
>> On 4/12/21 6:27 PM, Huang, Ying wrote:
>>
>>>
>>> This isn't the commit that introduces the race.  You can use `git blame` to
>>> find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
>>> swap: skip swapcache for swapin of synchronous device".
>>>
>>> And I suggest merging 1/5 and 2/5 to make it easy to get the full
>>> picture.
>>
>> I'd suggest making the fix for the do_swap_page race with get/put_swap_device
>> the first patch. Then the per_cpu_ref stuff in patch 1 and patch 2 can
>> be combined together.
> 
> The original get/put_swap_device() uses rcu_read_lock/unlock().  I don't
> think it's good to wrap swap_read_page() with it.  After all, some
> complex operations are done in swap_read_page(), including
> blk_io_schedule().
> 

The patch was originally split to make it easier to review, i.e. 1/5 introduces
the percpu_ref to swap and 2/5 uses it to fix the race between do_swap_page()
and swapoff.
Btw, I have no preference for merging 1/5 and 2/5 or not.

> Best Regards,
> Huang, Ying
> .
> 



Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-13 Thread Huang, Ying
Miaohe Lin  writes:

> On 2021/4/14 9:17, Huang, Ying wrote:
>> Miaohe Lin  writes:
>> 
>>> On 2021/4/12 15:24, Huang, Ying wrote:
>>>> "Huang, Ying"  writes:
>>>>
>>>>> Miaohe Lin  writes:
>>>>>
>>>>>> We will use percpu-refcount to serialize against concurrent swapoff. This
>>>>>> patch adds the percpu_ref support for later fixup.
>>>>>>
>>>>>> Signed-off-by: Miaohe Lin 
>>>>>> ---
>>>>>>  include/linux/swap.h |  2 ++
>>>>>>  mm/swapfile.c| 25 ++---
>>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>>> index 144727041e78..849ba5265c11 100644
>>>>>> --- a/include/linux/swap.h
>>>>>> +++ b/include/linux/swap.h
>>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>>>>>   * The in-memory structure used to track swap areas.
>>>>>>   */
>>>>>>  struct swap_info_struct {
>>>>>> +struct percpu_ref users;/* serialization against 
>>>>>> concurrent swapoff */
>>>>>>  unsigned long   flags;  /* SWP_USED etc: see above */
>>>>>>  signed shortprio;   /* swap priority of this type */
>>>>>>  struct plist_node list; /* entry in swap_active_head */
>>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>>>>>>  struct block_device *bdev;  /* swap device or bdev of swap 
>>>>>> file */
>>>>>>  struct file *swap_file; /* seldom referenced */
>>>>>>  unsigned int old_block_size;/* seldom referenced */
>>>>>> +struct completion comp; /* seldom referenced */
>>>>>>  #ifdef CONFIG_FRONTSWAP
>>>>>>  unsigned long *frontswap_map;   /* frontswap in-use, one bit 
>>>>>> per page */
>>>>>>  atomic_t frontswap_pages;   /* frontswap pages in-use 
>>>>>> counter */
>>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>>>> index 149e77454e3c..724173cd7d0c 100644
>>>>>> --- a/mm/swapfile.c
>>>>>> +++ b/mm/swapfile.c
>>>>>> @@ -39,6 +39,7 @@
>>>>>>  #include 
>>>>>>  #include 
>>>>>>  #include 
>>>>>> +#include <linux/completion.h>
>>>>>>  
>>>>>>  #include 
>>>>>>  #include 
>>>>>> @@ -511,6 +512,15 @@ static void swap_discard_work(struct work_struct *work)
>>>>>> 	spin_unlock(&si->lock);
>>>>>>  }
>>>>>>  
>>>>>> +static void swap_users_ref_free(struct percpu_ref *ref)
>>>>>> +{
>>>>>> +struct swap_info_struct *si;
>>>>>> +
>>>>>> +si = container_of(ref, struct swap_info_struct, users);
>>>>>> +complete(&si->comp);
>>>>>> +percpu_ref_exit(&si->users);
>>>>>
>>>>> Because percpu_ref_exit() is used, we cannot use percpu_ref_tryget() in
>>>>> get_swap_device(), better to add comments there.
>>>>
>>>> I just noticed that the comments of percpu_ref_tryget_live() say,
>>>>
>>>>  * This function is safe to call as long as @ref is between init and exit.
>>>>
>>>> Since we need to call get_swap_device() at almost any time, it's
>>>> better to avoid calling percpu_ref_exit() at all.  This will waste some
>>>> memory, but we need to follow the API definition to avoid potential
>>>> issues in the long term.
>>>
>>> I have to admit that I'm not really familiar with percpu_ref. So I read the
>>> implementation code of the percpu_ref and found that percpu_ref_tryget_live()
>>> could be called after exit now. But you're right that we need to follow the
>>> API definition to avoid potential issues in the long term.
>>>
>>>>
>>>> And we need to call percpu_ref_init() before inserting the swap_info_struct
>>>> into the swap_info[].
>>>
>>> If we remove the call to percpu_ref_exit(), we should not use 
>>&

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-13 Thread Miaohe Lin
On 2021/4/14 9:17, Huang, Ying wrote:
> Miaohe Lin  writes:
> 
>> On 2021/4/12 15:24, Huang, Ying wrote:
>>> "Huang, Ying"  writes:
>>>
>>>> Miaohe Lin  writes:
>>>>
>>>>> We will use percpu-refcount to serialize against concurrent swapoff. This
>>>>> patch adds the percpu_ref support for later fixup.
>>>>>
>>>>> Signed-off-by: Miaohe Lin 
>>>>> ---
>>>>>  include/linux/swap.h |  2 ++
>>>>>  mm/swapfile.c| 25 ++---
>>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>> index 144727041e78..849ba5265c11 100644
>>>>> --- a/include/linux/swap.h
>>>>> +++ b/include/linux/swap.h
>>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>>>>   * The in-memory structure used to track swap areas.
>>>>>   */
>>>>>  struct swap_info_struct {
>>>>> + struct percpu_ref users;/* serialization against concurrent 
>>>>> swapoff */
>>>>>   unsigned long   flags;  /* SWP_USED etc: see above */
>>>>>   signed shortprio;   /* swap priority of this type */
>>>>>   struct plist_node list; /* entry in swap_active_head */
>>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>>>>>   struct block_device *bdev;  /* swap device or bdev of swap file */
>>>>>   struct file *swap_file; /* seldom referenced */
>>>>>   unsigned int old_block_size;/* seldom referenced */
>>>>> + struct completion comp; /* seldom referenced */
>>>>>  #ifdef CONFIG_FRONTSWAP
>>>>>   unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
>>>>>   atomic_t frontswap_pages;   /* frontswap pages in-use counter */
>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>>> index 149e77454e3c..724173cd7d0c 100644
>>>>> --- a/mm/swapfile.c
>>>>> +++ b/mm/swapfile.c
>>>>> @@ -39,6 +39,7 @@
>>>>>  #include <linux/export.h>
>>>>>  #include <linux/swap_slots.h>
>>>>>  #include <linux/sort.h>
>>>>> +#include <linux/completion.h>
>>>>>  
>>>>>  #include <asm/tlbflush.h>
>>>>>  #include <linux/swapops.h>
>>>>> @@ -511,6 +512,15 @@ static void swap_discard_work(struct work_struct *work)
>>>>>   spin_unlock(&si->lock);
>>>>>  }
>>>>>  
>>>>> +static void swap_users_ref_free(struct percpu_ref *ref)
>>>>> +{
>>>>> + struct swap_info_struct *si;
>>>>> +
>>>>> + si = container_of(ref, struct swap_info_struct, users);
>>>>> + complete(&si->comp);
>>>>> + percpu_ref_exit(&si->users);
>>>>
>>>> Because percpu_ref_exit() is used, we cannot use percpu_ref_tryget() in
>>>> get_swap_device(), better to add comments there.
>>>
>>> I just noticed that the comments of percpu_ref_tryget_live() says,
>>>
>>>  * This function is safe to call as long as @ref is between init and exit.
>>>
>>> While we need to call get_swap_device() almost at any time, so it's
>>> better to avoid to call percpu_ref_exit() at all.  This will waste some
>>> memory, but we need to follow the API definition to avoid potential
>>> issues in the long term.
>>
>> I have to admit that I'am not really familiar with percpu_ref. So I read the
>> implementation code of the percpu_ref and found percpu_ref_tryget_live() 
>> could
>> be called after exit now. But you're right we need to follow the API 
>> definition
>> to avoid potential issues in the long term.
>>
>>>
>>> And we need to call percpu_ref_init() before insert the swap_info_struct
>>> into the swap_info[].
>>
>> If we remove the call to percpu_ref_exit(), we should not use 
>> percpu_ref_init()
>> here because *percpu_ref->data is assumed to be NULL* in percpu_ref_init() 
>> while
>> this is not the case as we do not call percpu_ref_exit(). Maybe 
>> percpu_ref_reinit()
>> or percpu_ref_resurrect() will do the work.
>>
>> One more thing, how could I distinguish the killed percpu_ref from newly 
>> allocated one?
>> It seems percpu_ref_is_dying is only safe to call when @ref is between init 
>> and exit.
>> Maybe I could do this in alloc_swap_info()?
>

Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-13 Thread Huang, Ying
Miaohe Lin  writes:

> On 2021/4/12 15:24, Huang, Ying wrote:
>> "Huang, Ying"  writes:
>> 
>>> Miaohe Lin  writes:
>>>
>>>> We will use percpu-refcount to serialize against concurrent swapoff. This
>>>> patch adds the percpu_ref support for later fixup.
>>>>
>>>> Signed-off-by: Miaohe Lin 
>>>> ---
>>>>  include/linux/swap.h |  2 ++
>>>>  mm/swapfile.c| 25 ++---
>>>>  2 files changed, 24 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index 144727041e78..849ba5265c11 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>>>   * The in-memory structure used to track swap areas.
>>>>   */
>>>>  struct swap_info_struct {
>>>> +  struct percpu_ref users;/* serialization against concurrent 
>>>> swapoff */
>>>>unsigned long   flags;  /* SWP_USED etc: see above */
>>>>signed shortprio;   /* swap priority of this type */
>>>>struct plist_node list; /* entry in swap_active_head */
>>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>>>>struct block_device *bdev;  /* swap device or bdev of swap file */
>>>>struct file *swap_file; /* seldom referenced */
>>>>unsigned int old_block_size;/* seldom referenced */
>>>> +  struct completion comp; /* seldom referenced */
>>>>  #ifdef CONFIG_FRONTSWAP
>>>>unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
>>>>atomic_t frontswap_pages;   /* frontswap pages in-use counter */
>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>> index 149e77454e3c..724173cd7d0c 100644
>>>> --- a/mm/swapfile.c
>>>> +++ b/mm/swapfile.c
>>>> @@ -39,6 +39,7 @@
>>>>  #include <linux/export.h>
>>>>  #include <linux/swap_slots.h>
>>>>  #include <linux/sort.h>
>>>> +#include <linux/completion.h>
>>>>  
>>>>  #include <asm/tlbflush.h>
>>>>  #include <linux/swapops.h>
>>>> @@ -511,6 +512,15 @@ static void swap_discard_work(struct work_struct *work)
>>>>	spin_unlock(&si->lock);
>>>>  }
>>>>  
>>>> +static void swap_users_ref_free(struct percpu_ref *ref)
>>>> +{
>>>> +	struct swap_info_struct *si;
>>>> +
>>>> +	si = container_of(ref, struct swap_info_struct, users);
>>>> +	complete(&si->comp);
>>>> +	percpu_ref_exit(&si->users);
>>>
>>> Because percpu_ref_exit() is used, we cannot use percpu_ref_tryget() in
>>> get_swap_device(), better to add comments there.
>> 
>> I just noticed that the comments of percpu_ref_tryget_live() says,
>> 
>>  * This function is safe to call as long as @ref is between init and exit.
>> 
>> While we need to call get_swap_device() almost at any time, so it's
>> better to avoid to call percpu_ref_exit() at all.  This will waste some
>> memory, but we need to follow the API definition to avoid potential
>> issues in the long term.
>
> I have to admit that I'am not really familiar with percpu_ref. So I read the
> implementation code of the percpu_ref and found percpu_ref_tryget_live() could
> be called after exit now. But you're right we need to follow the API 
> definition
> to avoid potential issues in the long term.
>
>> 
>> And we need to call percpu_ref_init() before insert the swap_info_struct
>> into the swap_info[].
>
> If we remove the call to percpu_ref_exit(), we should not use 
> percpu_ref_init()
> here because *percpu_ref->data is assumed to be NULL* in percpu_ref_init() 
> while
> this is not the case as we do not call percpu_ref_exit(). Maybe 
> percpu_ref_reinit()
> or percpu_ref_resurrect() will do the work.
>
> One more thing, how could I distinguish the killed percpu_ref from newly 
> allocated one?
> It seems percpu_ref_is_dying is only safe to call when @ref is between init 
> and exit.
> Maybe I could do this in alloc_swap_info()?

Yes.  In alloc_swap_info(), you can distinguish a newly allocated
swap_info_struct from a reused one.
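
A rough sketch of that idea (not the posted patch; error handling and
the unrelated initialization are trimmed): initialize the ref up front
for a possibly-new entry, and discard the fresh allocation after the
slot scan if an existing swap_info_struct can be reused.

static struct swap_info_struct *alloc_swap_info(void)
{
	struct swap_info_struct *p, *defer = NULL;
	int type;

	p = kvzalloc(struct_size(p, avail_lists, nr_node_ids), GFP_KERNEL);
	if (!p)
		return ERR_PTR(-ENOMEM);

	/* Start dead: get_swap_device() must fail until swapon finishes. */
	if (percpu_ref_init(&p->users, swap_users_ref_free,
			    PERCPU_REF_INIT_DEAD, GFP_KERNEL)) {
		kvfree(p);
		return ERR_PTR(-ENOMEM);
	}

	spin_lock(&swap_lock);
	for (type = 0; type < nr_swapfiles; type++)
		if (!(swap_info[type]->flags & SWP_USED))
			break;
	if (type < nr_swapfiles) {
		defer = p;			/* reuse old entry and its ref */
		p = swap_info[type];
	} else {
		swap_info[nr_swapfiles++] = p;	/* publish the new entry */
	}
	spin_unlock(&swap_lock);

	if (defer) {
		percpu_ref_exit(&defer->users);
		kvfree(defer);
	}
	return p;
}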

>> 
>>>> +}
>>>> +
>>>>  static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
>>>>  {
>>>>struct swap_cluster_info *ci = si->cluster_info;
>>>> @@ -2500,7 +2510,7 @@ static void enable_swap_info(struct swap_info_struct 
&

Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-13 Thread Huang, Ying
Tim Chen  writes:

> On 4/12/21 6:27 PM, Huang, Ying wrote:
>
>> 
>> This isn't the commit that introduces the race.  You can use `git blame`
>> find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
>> swap: skip swapcache for swapin of synchronous device".
>> 
>> And I suggest to merge 1/5 and 2/5 to make it easy to get the full
>> picture.
>
> I'll suggest make fix to do_swap_page race with get/put_swap_device
> as a first patch. Then the per_cpu_ref stuff in patch 1 and patch 2 can
> be combined together.

The original get/put_swap_device() uses rcu_read_lock/unlock().  I don't
think it's good to wrap swap_readpage() with it.  After all, some
complex operations are done in swap_readpage(), including
blk_io_schedule().
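
For illustration, a rough sketch of the percpu_ref-based guard being
discussed (simplified; the real get_swap_device() also validates the
entry): the reference is per-CPU on the hot path and, unlike
rcu_read_lock(), does not disable preemption, so the section between
get and put may sleep.

static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
{
	struct swap_info_struct *si = swp_swap_info(entry);

	/* Fails once swapoff has killed the ref. */
	if (!si || !percpu_ref_tryget_live(&si->users))
		return NULL;
	return si;
}

static inline void put_swap_device(struct swap_info_struct *si)
{
	percpu_ref_put(&si->users);
}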

Best Regards,
Huang, Ying


[PATCH net-next v2 3/3] net: phy: marvell-88x2222: swap 1G/10G modes on autoneg

2021-04-13 Thread Ivan Bornyakov
Setting 10G without autonegotiation is invalid according to
phy_ethtool_ksettings_set(). Thus, we need to set it during
autonegotiation.

If 1G autonegotiation can't complete for quite a time, but there is
signal in line, switch line interface type to 10GBase-R, if supported,
in hope for link to be established.

And vice versa. If 10GBase-R link can't be established for quite a time,
and autonegotiation is enabled, and there is signal in line, switch line
interface type to appropriate 1G mode, i.e. 1000Base-X or SGMII, if
supported.

Signed-off-by: Ivan Bornyakov 
---
 drivers/net/phy/marvell-88x2222.c | 117 +++---
 1 file changed, 89 insertions(+), 28 deletions(-)

diff --git a/drivers/net/phy/marvell-88x2222.c b/drivers/net/phy/marvell-88x2222.c
index 640b133f1371..9b9ac3ef735d 100644
--- a/drivers/net/phy/marvell-88x2222.c
+++ b/drivers/net/phy/marvell-88x2222.c
@@ -52,6 +52,8 @@
 #define MV_1GBX_PHY_STAT_SPEED100	BIT(14)
 #define MV_1GBX_PHY_STAT_SPEED1000	BIT(15)
 
+#define AUTONEG_TIMEOUT	3
 
 struct mv2222_data {
phy_interface_t line_interface;
__ETHTOOL_DECLARE_LINK_MODE_MASK(supported);
@@ -173,6 +175,24 @@ static bool mv2222_is_1gbx_capable(struct phy_device *phydev)
 priv->supported);
 }
 
+static bool mv2222_is_sgmii_capable(struct phy_device *phydev)
+{
+	struct mv2222_data *priv = phydev->priv;
+
+   return (linkmode_test_bit(ETHTOOL_LINK_MODE_1000baseT_Full_BIT,
+ priv->supported) ||
+   linkmode_test_bit(ETHTOOL_LINK_MODE_1000baseT_Half_BIT,
+ priv->supported) ||
+   linkmode_test_bit(ETHTOOL_LINK_MODE_100baseT_Full_BIT,
+ priv->supported) ||
+   linkmode_test_bit(ETHTOOL_LINK_MODE_100baseT_Half_BIT,
+ priv->supported) ||
+   linkmode_test_bit(ETHTOOL_LINK_MODE_10baseT_Full_BIT,
+ priv->supported) ||
+   linkmode_test_bit(ETHTOOL_LINK_MODE_10baseT_Half_BIT,
+ priv->supported));
+}
+
 static int mv2222_config_line(struct phy_device *phydev)
 {
	struct mv2222_data *priv = phydev->priv;
@@ -192,7 +212,8 @@ static int mv2222_config_line(struct phy_device *phydev)
}
 }
 
-static int mv2222_setup_forced(struct phy_device *phydev)
+/* Switch between 1G (1000Base-X/SGMII) and 10G (10GBase-R) modes */
+static int mv2222_swap_line_type(struct phy_device *phydev)
 {
	struct mv2222_data *priv = phydev->priv;
	bool changed = false;
@@ -200,25 +221,23 @@ static int mv2222_setup_forced(struct phy_device *phydev)
 
switch (priv->line_interface) {
case PHY_INTERFACE_MODE_10GBASER:
-		if (phydev->speed == SPEED_1000 &&
-		    mv2222_is_1gbx_capable(phydev)) {
+		if (mv2222_is_1gbx_capable(phydev)) {
priv->line_interface = PHY_INTERFACE_MODE_1000BASEX;
changed = true;
}
 
-   break;
-   case PHY_INTERFACE_MODE_1000BASEX:
-		if (phydev->speed == SPEED_10000 &&
-		    mv2222_is_10g_capable(phydev)) {
-			priv->line_interface = PHY_INTERFACE_MODE_10GBASER;
+		if (mv2222_is_sgmii_capable(phydev)) {
+   priv->line_interface = PHY_INTERFACE_MODE_SGMII;
changed = true;
}
 
break;
+   case PHY_INTERFACE_MODE_1000BASEX:
case PHY_INTERFACE_MODE_SGMII:
-		ret = mv2222_set_sgmii_speed(phydev);
-   if (ret < 0)
-   return ret;
+		if (mv2222_is_10g_capable(phydev)) {
+   priv->line_interface = PHY_INTERFACE_MODE_10GBASER;
+   changed = true;
+   }
 
break;
default:
@@ -231,6 +250,29 @@ static int mv2222_setup_forced(struct phy_device *phydev)
return ret;
}
 
+   return 0;
+}
+
+static int mv2222_setup_forced(struct phy_device *phydev)
+{
+	struct mv2222_data *priv = phydev->priv;
+   int ret;
+
+   if (priv->line_interface == PHY_INTERFACE_MODE_10GBASER) {
+		if (phydev->speed < SPEED_10000 &&
+   phydev->speed != SPEED_UNKNOWN) {
+			ret = mv2222_swap_line_type(phydev);
+   if (ret < 0)
+   return ret;
+   }
+   }
+
+   if (priv->line_interface == PHY_INTERFACE_MODE_SGMII) {
+		ret = mv2222_set_sgmii_speed(phydev);
+   if (ret < 0)
+   return ret;
+   }
+
	return mv2222_disable_aneg(phydev);
 }
 
@@ -244,17 +286,9 @@ static int mv2222_config_aneg(struct phy_device *phydev)
return 0;
 
 

Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-13 Thread Tim Chen



On 4/12/21 6:27 PM, Huang, Ying wrote:

> 
> This isn't the commit that introduces the race.  You can use `git blame`
> find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
> swap: skip swapcache for swapin of synchronous device".
> 
> And I suggest to merge 1/5 and 2/5 to make it easy to get the full
> picture.

I'll suggest make fix to do_swap_page race with get/put_swap_device
as a first patch. Then the per_cpu_ref stuff in patch 1 and patch 2 can
be combined together.

Tim


Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-13 Thread Miaohe Lin
On 2021/4/12 15:24, Huang, Ying wrote:
> "Huang, Ying"  writes:
> 
>> Miaohe Lin  writes:
>>
>>> We will use percpu-refcount to serialize against concurrent swapoff. This
>>> patch adds the percpu_ref support for later fixup.
>>>
>>> Signed-off-by: Miaohe Lin 
>>> ---
>>>  include/linux/swap.h |  2 ++
>>>  mm/swapfile.c| 25 ++---
>>>  2 files changed, 24 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 144727041e78..849ba5265c11 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>>   * The in-memory structure used to track swap areas.
>>>   */
>>>  struct swap_info_struct {
>>> +   struct percpu_ref users;/* serialization against concurrent 
>>> swapoff */
>>> unsigned long   flags;  /* SWP_USED etc: see above */
>>> signed short    prio;   /* swap priority of this type */
>>> struct plist_node list; /* entry in swap_active_head */
>>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>>> struct block_device *bdev;  /* swap device or bdev of swap file */
>>> struct file *swap_file; /* seldom referenced */
>>> unsigned int old_block_size;/* seldom referenced */
>>> +   struct completion comp; /* seldom referenced */
>>>  #ifdef CONFIG_FRONTSWAP
>>> unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
>>> atomic_t frontswap_pages;   /* frontswap pages in-use counter */
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index 149e77454e3c..724173cd7d0c 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -39,6 +39,7 @@
>>>  #include <linux/export.h>
>>>  #include <linux/swap_slots.h>
>>>  #include <linux/sort.h>
>>> +#include <linux/completion.h>
>>>  
>>>  #include <asm/tlbflush.h>
>>>  #include <linux/swapops.h>
>>> @@ -511,6 +512,15 @@ static void swap_discard_work(struct work_struct *work)
>>> 	spin_unlock(&si->lock);
>>>  }
>>>  
>>> +static void swap_users_ref_free(struct percpu_ref *ref)
>>> +{
>>> +	struct swap_info_struct *si;
>>> +
>>> +	si = container_of(ref, struct swap_info_struct, users);
>>> +	complete(&si->comp);
>>> +	percpu_ref_exit(&si->users);
>>
>> Because percpu_ref_exit() is used, we cannot use percpu_ref_tryget() in
>> get_swap_device(), better to add comments there.
> 
> I just noticed that the comments of percpu_ref_tryget_live() says,
> 
>  * This function is safe to call as long as @ref is between init and exit.
> 
> While we need to call get_swap_device() almost at any time, so it's
> better to avoid to call percpu_ref_exit() at all.  This will waste some
> memory, but we need to follow the API definition to avoid potential
> issues in the long term.

I have to admit that I'm not really familiar with percpu_ref. So I read the
implementation code of percpu_ref and found that, as it stands,
percpu_ref_tryget_live() could be called after exit. But you're right that
we need to follow the API definition to avoid potential issues in the long
term.

> 
> And we need to call percpu_ref_init() before insert the swap_info_struct
> into the swap_info[].

If we remove the call to percpu_ref_exit(), we should not use
percpu_ref_init() here, because *percpu_ref->data is assumed to be NULL* in
percpu_ref_init(), while this is not the case since we do not call
percpu_ref_exit(). Maybe percpu_ref_reinit() or percpu_ref_resurrect() will
do the work.

One more thing: how could I distinguish a killed percpu_ref from a newly
allocated one? It seems percpu_ref_is_dying() is only safe to call when
@ref is between init and exit. Maybe I could do this in alloc_swap_info()?

> 
>>> +}
>>> +
>>>  static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
>>>  {
>>> struct swap_cluster_info *ci = si->cluster_info;
>>> @@ -2500,7 +2510,7 @@ static void enable_swap_info(struct swap_info_struct 
>>> *p, int prio,
>>>  * Guarantee swap_map, cluster_info, etc. fields are valid
>>>  * between get/put_swap_device() if SWP_VALID bit is set
>>>  */
>>> -   synchronize_rcu();
>>> +	percpu_ref_reinit(&p->users);
>>
>> Although the effect is same, I think it's better to use
>> percpu_ref_resurrect() here to improve code readability.
> 
> Check the original commit description for commit eb085574a752 "mm, swap:
> fix race between swapoff 

Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-12 Thread Huang, Ying
Miaohe Lin  writes:

> When I was investigating the swap code, I found the below possible race
> window:
>
> CPU 1                            CPU 2
> -----                            -----
> do_swap_page
>   synchronous swap_readpage
>     alloc_page_vma
>                                  swapoff
>                                    release swap_file, bdev, or ...
>   swap_readpage
>     check sis->flags is ok
>       access swap_file, bdev...[oops!]
>                                  si->flags = 0
>
> Using current get/put_swap_device() to guard against concurrent swapoff for
> swap_readpage() looks terrible because swap_readpage() may take really long
> time. And this race may not be really pernicious because swapoff is usually
> done when system shutdown only. To reduce the performance overhead on the
> hot-path as much as possible, it appears we can use the percpu_ref to close
> this race window(as suggested by Huang, Ying).
>
> Fixes: 235b62176712 ("mm/swap: add cluster lock")

This isn't the commit that introduces the race.  You can use `git blame` to
find out the correct commit.  For this it's commit 0bcac06f27d7 "mm,
swap: skip swapcache for swapin of synchronous device".

And I suggest merging 1/5 and 2/5 to make it easy to get the full
picture.

> Signed-off-by: Miaohe Lin 
> ---
>  include/linux/swap.h |  2 +-
>  mm/memory.c  | 10 ++
>  mm/swapfile.c| 28 +++-
>  3 files changed, 22 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 849ba5265c11..9066addb57fd 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>  
>  static inline void put_swap_device(struct swap_info_struct *si)
>  {
> - rcu_read_unlock();
> +	percpu_ref_put(&si->users);
>  }
>  
>  #else /* CONFIG_SWAP */
> diff --git a/mm/memory.c b/mm/memory.c
> index cc71a445c76c..8543c47b955c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  {
>   struct vm_area_struct *vma = vmf->vma;
>   struct page *page = NULL, *swapcache;
> + struct swap_info_struct *si = NULL;
>   swp_entry_t entry;
>   pte_t pte;
>   int locked;
> @@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   }
>  
>

I suggest adding a comment here as follows (wording copied from Matthew Wilcox):

/* Prevent swapoff from happening to us */

> + si = get_swap_device(entry);
> + /* In case we raced with swapoff. */
> + if (unlikely(!si))
> + goto out;
> +

Because we wrap the whole do_swap_page() with get/put_swap_device()
now, we can remove several get/put_swap_device() calls from functions
called by do_swap_page().  That can be another optimization patch.
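
A sketch of that follow-up (hypothetical shape and function name, not a
posted patch): with the device pinned for the whole fault, helpers called
from do_swap_page() can rely on the caller's reference instead of taking
their own.

/* Hypothetical outline of a do_swap_page()-style caller. */
vm_fault_t swapin_fault_outline(struct vm_fault *vmf, swp_entry_t entry)
{
	struct swap_info_struct *si;
	vm_fault_t ret = 0;

	si = get_swap_device(entry);
	if (unlikely(!si))
		return 0;	/* raced with swapoff: retry the fault */

	/*
	 * lookup_swap_cache(), swap_readpage(), pte mapping, etc. need
	 * no extra get/put_swap_device() of their own here.
	 */

	put_swap_device(si);
	return ret;
}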

>   delayacct_set_flag(DELAYACCT_PF_SWAPIN);
>   page = lookup_swap_cache(entry, vma, vmf->address);
>   swapcache = page;
> @@ -3514,6 +3520,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  unlock:
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
>  out:
> + if (si)
> + put_swap_device(si);
>   return ret;
>  out_nomap:
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -3525,6 +3533,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   unlock_page(swapcache);
>   put_page(swapcache);
>   }
> + if (si)
> + put_swap_device(si);
>   return ret;
>  }
>  
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 724173cd7d0c..01032c72ceae 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1280,18 +1280,12 @@ static unsigned char __swap_entry_free_locked(struct 
> swap_info_struct *p,
>   * via preventing the swap device from being swapoff, until
>   * put_swap_device() is called.  Otherwise return NULL.
>   *
> - * The entirety of the RCU read critical section must come before the
> - * return from or after the call to synchronize_rcu() in
> - * enable_swap_info() or swapoff().  So if "si->flags & SWP_VALID" is
> - * true, the si->map, si->cluster_info, etc. must be valid in the
> - * critical section.
> - *
>   * Notice that swapoff or swapoff+swapon can still happen before the
> - * rcu_read_lock() in get_swap_device() or after the rcu_read_unlock()
> - * in put_swap_device() if there isn't any other way to prevent
> - * swapoff, such as page lock, page table lock, etc.  The caller must
> - * be prepared for that.  For example, the following situation is
> - * possible.
> + * percpu_ref_tryget_live() in get_swap_device() or after the
> + * percpu_ref_put() in put_swap_device() if there isn't any other way
> + * to prevent swapoff, such as page lock, page table lock, etc.  The
> + * caller must be prepared for that.

[PATCH net-next 2/2] net: phy: marvell-88x2222: swap 1G/10G modes on autoneg

2021-04-12 Thread Ivan Bornyakov
Setting 10G without autonegotiation is invalid according to
phy_ethtool_ksettings_set(). Thus, if autonegotiation can't complete for
quite a time, but there is signal in line, switch line interface type to
10GBase-R, if supported, in hope for link to be established. And vice
versa.

Signed-off-by: Ivan Bornyakov 
---
 drivers/net/phy/marvell-88x2222.c | 276 ++
 1 file changed, 168 insertions(+), 108 deletions(-)

diff --git a/drivers/net/phy/marvell-88x2222.c b/drivers/net/phy/marvell-88x2222.c
index fb285ac741b2..d16c81f08334 100644
--- a/drivers/net/phy/marvell-88x2222.c
+++ b/drivers/net/phy/marvell-88x2222.c
@@ -48,6 +48,8 @@
 #define MV_1GBX_PHY_STAT_SPEED100	BIT(14)
 #define MV_1GBX_PHY_STAT_SPEED1000	BIT(15)
 
+#define AUTONEG_TIMEOUT	3
 
 struct mv2222_data {
phy_interface_t line_interface;
__ETHTOOL_DECLARE_LINK_MODE_MASK(supported);
@@ -82,89 +84,6 @@ static int mv2222_soft_reset(struct phy_device *phydev)
 				 5000, 1000000, true);
 }
 
-/* Returns negative on error, 0 if link is down, 1 if link is up */
-static int mv2222_read_status_10g(struct phy_device *phydev)
-{
-   int val, link = 0;
-
-   val = phy_read_mmd(phydev, MDIO_MMD_PCS, MDIO_STAT1);
-   if (val < 0)
-   return val;
-
-   if (val & MDIO_STAT1_LSTATUS) {
-   link = 1;
-
-   /* 10GBASE-R do not support auto-negotiation */
-   phydev->autoneg = AUTONEG_DISABLE;
-	phydev->speed = SPEED_10000;
-   phydev->duplex = DUPLEX_FULL;
-   }
-
-   return link;
-}
-
-/* Returns negative on error, 0 if link is down, 1 if link is up */
-static int mv2222_read_status_1g(struct phy_device *phydev)
-{
-   int val, link = 0;
-
-   val = phy_read_mmd(phydev, MDIO_MMD_PCS, MV_1GBX_STAT);
-   if (val < 0)
-   return val;
-
-   if (!(val & BMSR_LSTATUS) ||
-   (phydev->autoneg == AUTONEG_ENABLE &&
-!(val & BMSR_ANEGCOMPLETE)))
-   return 0;
-
-   link = 1;
-
-   val = phy_read_mmd(phydev, MDIO_MMD_PCS, MV_1GBX_PHY_STAT);
-   if (val < 0)
-   return val;
-
-   if (val & MV_1GBX_PHY_STAT_AN_RESOLVED) {
-   if (val & MV_1GBX_PHY_STAT_DUPLEX)
-   phydev->duplex = DUPLEX_FULL;
-   else
-   phydev->duplex = DUPLEX_HALF;
-
-   if (val & MV_1GBX_PHY_STAT_SPEED1000)
-   phydev->speed = SPEED_1000;
-   else if (val & MV_1GBX_PHY_STAT_SPEED100)
-   phydev->speed = SPEED_100;
-   else
-   phydev->speed = SPEED_10;
-   }
-
-   return link;
-}
-
-static int mv2222_read_status(struct phy_device *phydev)
-{
-	struct mv2222_data *priv = phydev->priv;
-   int link;
-
-   phydev->link = 0;
-   phydev->speed = SPEED_UNKNOWN;
-   phydev->duplex = DUPLEX_UNKNOWN;
-
-   if (!priv->sfp_link)
-   return 0;
-
-	if (priv->line_interface == PHY_INTERFACE_MODE_10GBASER)
-		link = mv2222_read_status_10g(phydev);
-	else
-		link = mv2222_read_status_1g(phydev);
-
-   if (link < 0)
-   return link;
-
-   phydev->link = link;
-
-   return 0;
-}
-
 static int mv2222_disable_aneg(struct phy_device *phydev)
 {
int ret = phy_clear_bits_mmd(phydev, MDIO_MMD_PCS, MV_1GBX_CTRL,
@@ -252,6 +171,24 @@ static bool mv2222_is_1gbx_capable(struct phy_device *phydev)
 priv->supported);
 }
 
+static bool mv2222_is_sgmii_capable(struct phy_device *phydev)
+{
+	struct mv2222_data *priv = phydev->priv;
+
+   return (linkmode_test_bit(ETHTOOL_LINK_MODE_1000baseT_Full_BIT,
+ priv->supported) ||
+   linkmode_test_bit(ETHTOOL_LINK_MODE_1000baseT_Half_BIT,
+ priv->supported) ||
+   linkmode_test_bit(ETHTOOL_LINK_MODE_100baseT_Full_BIT,
+ priv->supported) ||
+   linkmode_test_bit(ETHTOOL_LINK_MODE_100baseT_Half_BIT,
+ priv->supported) ||
+   linkmode_test_bit(ETHTOOL_LINK_MODE_10baseT_Full_BIT,
+ priv->supported) ||
+   linkmode_test_bit(ETHTOOL_LINK_MODE_10baseT_Half_BIT,
+ priv->supported));
+}
+
 static int mv2222_config_line(struct phy_device *phydev)
 {
	struct mv2222_data *priv = phydev->priv;
@@ -271,7 +208,7 @@ static int mv2222_config_line(struct phy_device *phydev)
}
 }
 
-static int mv2222_setup_forced(struct phy_device *phydev)
+static int mv2222_swap_line_type(struct phy_device *phydev)
 {
	struct mv2222_data *priv = phydev->priv;
	bool changed = false;
@@ -279,25 +216,23 @@ static int mv2222_setup_forced(struct phy_device *phydev)

[PATCH v1 2/3] perf session: Add swap operation for event TIME_CONV

2021-04-12 Thread Leo Yan
Since commit d110162cafc8 ("perf tsc: Support cap_user_time_short for
event TIME_CONV"), the event PERF_RECORD_TIME_CONV has extended the data
structure for clock parameters.

To be backwards-compatible, this patch adds a dedicated swap operation
for the event PERF_RECORD_TIME_CONV, based on checking the event size,
it can support both for the old and new event formats.

Fixes: d110162cafc8 ("perf tsc: Support cap_user_time_short for event 
TIME_CONV")
Signed-off-by: Leo Yan 
---
 tools/perf/util/session.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 9a8808507bd9..afca3d5fc851 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -949,6 +949,26 @@ static void perf_event__stat_round_swap(union perf_event 
*event,
event->stat_round.time = bswap_64(event->stat_round.time);
 }
 
+static void perf_event__time_conv_swap(union perf_event *event,
+  bool sample_id_all __maybe_unused)
+{
+   size_t time_zero_size;
+
+   event->time_conv.time_shift = bswap_64(event->time_conv.time_shift);
+   event->time_conv.time_mult  = bswap_64(event->time_conv.time_mult);
+   event->time_conv.time_zero  = bswap_64(event->time_conv.time_zero);
+
+	time_zero_size = (void *)&event->time_conv.time_cycles - (void *)event;
+   if (event->header.size > time_zero_size) {
+   event->time_conv.time_cycles = 
bswap_64(event->time_conv.time_cycles);
+   event->time_conv.time_mask = 
bswap_64(event->time_conv.time_mask);
+   event->time_conv.cap_user_time_zero =
+   bswap_32(event->time_conv.cap_user_time_zero);
+   event->time_conv.cap_user_time_short =
+   bswap_32(event->time_conv.cap_user_time_short);
+   }
+}
+
 typedef void (*perf_event__swap_op)(union perf_event *event,
bool sample_id_all);
 
@@ -985,7 +1005,7 @@ static perf_event__swap_op perf_event__swap_ops[] = {
[PERF_RECORD_STAT]= perf_event__stat_swap,
[PERF_RECORD_STAT_ROUND]  = perf_event__stat_round_swap,
[PERF_RECORD_EVENT_UPDATE]= perf_event__event_update_swap,
-   [PERF_RECORD_TIME_CONV]   = perf_event__all64_swap,
+   [PERF_RECORD_TIME_CONV]   = perf_event__time_conv_swap,
[PERF_RECORD_HEADER_MAX]  = NULL,
 };
 
-- 
2.25.1



Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-12 Thread Huang, Ying
"Huang, Ying"  writes:

> Miaohe Lin  writes:
>
>> We will use percpu-refcount to serialize against concurrent swapoff. This
>> patch adds the percpu_ref support for later fixup.
>>
>> Signed-off-by: Miaohe Lin 
>> ---
>>  include/linux/swap.h |  2 ++
>>  mm/swapfile.c| 25 ++---
>>  2 files changed, 24 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 144727041e78..849ba5265c11 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>   * The in-memory structure used to track swap areas.
>>   */
>>  struct swap_info_struct {
>> +struct percpu_ref users;/* serialization against concurrent 
>> swapoff */
>>  unsigned long   flags;  /* SWP_USED etc: see above */
>>  signed shortprio;   /* swap priority of this type */
>>  struct plist_node list; /* entry in swap_active_head */
>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>>  struct block_device *bdev;  /* swap device or bdev of swap file */
>>  struct file *swap_file; /* seldom referenced */
>>  unsigned int old_block_size;/* seldom referenced */
>> +struct completion comp; /* seldom referenced */
>>  #ifdef CONFIG_FRONTSWAP
>>  unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
>>  atomic_t frontswap_pages;   /* frontswap pages in-use counter */
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 149e77454e3c..724173cd7d0c 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -39,6 +39,7 @@
>>  #include <linux/export.h>
>>  #include <linux/swap_slots.h>
>>  #include <linux/sort.h>
>> +#include <linux/completion.h>
>>  
>>  #include <asm/tlbflush.h>
>>  #include <linux/swapops.h>
>> @@ -511,6 +512,15 @@ static void swap_discard_work(struct work_struct *work)
>> 	spin_unlock(&si->lock);
>>  }
>>  
>> +static void swap_users_ref_free(struct percpu_ref *ref)
>> +{
>> +	struct swap_info_struct *si;
>> +
>> +	si = container_of(ref, struct swap_info_struct, users);
>> +	complete(&si->comp);
>> +	percpu_ref_exit(&si->users);
>
> Because percpu_ref_exit() is used, we cannot use percpu_ref_tryget() in
> get_swap_device(), better to add comments there.

I just noticed that the comment for percpu_ref_tryget_live() says,

 * This function is safe to call as long as @ref is between init and exit.

Since we need to call get_swap_device() at almost any time, it's
better to avoid calling percpu_ref_exit() at all.  This will waste some
memory, but we need to follow the API definition to avoid potential
issues in the long term.

And we need to call percpu_ref_init() before inserting the swap_info_struct
into swap_info[].

>> +}
>> +
>>  static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
>>  {
>>  struct swap_cluster_info *ci = si->cluster_info;
>> @@ -2500,7 +2510,7 @@ static void enable_swap_info(struct swap_info_struct 
>> *p, int prio,
>>   * Guarantee swap_map, cluster_info, etc. fields are valid
>>   * between get/put_swap_device() if SWP_VALID bit is set
>>   */
>> -synchronize_rcu();
>> +	percpu_ref_reinit(&p->users);
>
> Although the effect is same, I think it's better to use
> percpu_ref_resurrect() here to improve code readability.

Check the original commit description for commit eb085574a752 "mm, swap:
fix race between swapoff and some swap operations" and discussion email
thread as follows again,

https://lore.kernel.org/linux-mm/20171219053650.gb7...@linux.vnet.ibm.com/

I found that the synchronize_rcu() here is to avoid calling smp_rmb() or
smp_load_acquire() in get_swap_device().  Now we will use
percpu_ref_tryget_live() in get_swap_device(), so we will need to add
the necessary memory barrier, or make sure percpu_ref_tryget_live() has
ACQUIRE semantics.  Per my understanding, we need to change
percpu_ref_tryget_live() for that.
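
An illustrative sketch of the requirement (hypothetical helper name, only
to make the ordering explicit; it assumes reinit/resurrect publish the
fields with release semantics): if percpu_ref_tryget_live() does not
guarantee ACQUIRE semantics, the reader needs a barrier so that loads of
si->swap_map, si->cluster_info, etc. cannot be reordered before the
liveness check.

/* Hypothetical wrapper around percpu_ref_tryget_live(). */
static inline bool swap_users_tryget_acquire(struct percpu_ref *ref)
{
	if (!percpu_ref_tryget_live(ref))
		return false;
	smp_rmb();	/* order subsequent si->... loads after the tryget */
	return true;
}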

>> 	spin_lock(&swap_lock);
>> 	spin_lock(&p->lock);
>> 	_enable_swap_info(p);
>> @@ -2621,11 +2631,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, 
>> specialfile)
>> 	p->flags &= ~SWP_VALID;	/* mark swap device as invalid */
>> 	spin_unlock(&p->lock);
>> 	spin_unlock(&swap_lock);
>> +
>> +	percpu_ref_kill(&p->users);
>> 	/*
>> 	 * wait for swap operations protected by get/put_swap_device()
>> 	 * to complete
>> 	 */
>> -	synchronize_rcu();
>> +	wait_for_completion(&p->comp);
>
> Better to move percpu_ref_kill() after the comments.  And maybe revise
> the comments.

After reading the original commit description as above, I found that we
need synchronize_rcu() here to protect access to the swap cache data
structures.  Because there's a call_rcu() during percpu_ref_kill(), it
appears OK to keep the synchronize_rcu() here.  And we need to revise
the comments to make it clear what is protected by which operation.
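
Sketched out (hypothetical helper, just to restate the sequence under
discussion): kill the ref so new get_swap_device() calls fail, wait for
existing users to drain via the completion fired from
swap_users_ref_free(), and keep synchronize_rcu() for whatever remains
protected by RCU proper.

static void swapoff_wait_for_users(struct swap_info_struct *p)
{
	percpu_ref_kill(&p->users);	/* percpu_ref_tryget_live() now fails */
	wait_for_completion(&p->comp);	/* drain get/put_swap_device() users */
	synchronize_rcu();		/* cover remaining RCU read sides */
}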

Best Regards,
Huang, Ying

[snip]


Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-12 Thread Miaohe Lin
On 2021/4/12 11:30, Huang, Ying wrote:
> Miaohe Lin  writes:
> 
>> We will use percpu-refcount to serialize against concurrent swapoff. This
>> patch adds the percpu_ref support for later fixup.
>>
>> Signed-off-by: Miaohe Lin 
>> ---
>>  include/linux/swap.h |  2 ++
>>  mm/swapfile.c| 25 ++---
>>  2 files changed, 24 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 144727041e78..849ba5265c11 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>>   * The in-memory structure used to track swap areas.
>>   */
>>  struct swap_info_struct {
>> +struct percpu_ref users;/* serialization against concurrent 
>> swapoff */
>>  unsigned long   flags;  /* SWP_USED etc: see above */
>>  signed shortprio;   /* swap priority of this type */
>>  struct plist_node list; /* entry in swap_active_head */
>> @@ -260,6 +261,7 @@ struct swap_info_struct {
>>  struct block_device *bdev;  /* swap device or bdev of swap file */
>>  struct file *swap_file; /* seldom referenced */
>>  unsigned int old_block_size;/* seldom referenced */
>> +struct completion comp; /* seldom referenced */
>>  #ifdef CONFIG_FRONTSWAP
>>  unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
>>  atomic_t frontswap_pages;   /* frontswap pages in-use counter */
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 149e77454e3c..724173cd7d0c 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -39,6 +39,7 @@
>>  #include <linux/export.h>
>>  #include <linux/swap_slots.h>
>>  #include <linux/sort.h>
>> +#include <linux/completion.h>
>>  
>>  #include <asm/tlbflush.h>
>>  #include <linux/swapops.h>
>> @@ -511,6 +512,15 @@ static void swap_discard_work(struct work_struct *work)
>> 	spin_unlock(&si->lock);
>>  }
>>  
>> +static void swap_users_ref_free(struct percpu_ref *ref)
>> +{
>> +	struct swap_info_struct *si;
>> +
>> +	si = container_of(ref, struct swap_info_struct, users);
>> +	complete(&si->comp);
>> +	percpu_ref_exit(&si->users);
> 
> Because percpu_ref_exit() is used, we cannot use percpu_ref_tryget() in
> get_swap_device(), better to add comments there.

Will do.

> 
>> +}
>> +
>>  static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
>>  {
>>  struct swap_cluster_info *ci = si->cluster_info;
>> @@ -2500,7 +2510,7 @@ static void enable_swap_info(struct swap_info_struct 
>> *p, int prio,
>>   * Guarantee swap_map, cluster_info, etc. fields are valid
>>   * between get/put_swap_device() if SWP_VALID bit is set
>>   */
>> -synchronize_rcu();
>> +	percpu_ref_reinit(&p->users);
> 
> Although the effect is same, I think it's better to use
> percpu_ref_resurrect() here to improve code readability.
> 

Agree.

>> 	spin_lock(&swap_lock);
>> 	spin_lock(&p->lock);
>> 	_enable_swap_info(p);
>> @@ -2621,11 +2631,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, 
>> specialfile)
>> 	p->flags &= ~SWP_VALID;	/* mark swap device as invalid */
>> 	spin_unlock(&p->lock);
>> 	spin_unlock(&swap_lock);
>> +
>> +	percpu_ref_kill(&p->users);
>> 	/*
>> 	 * wait for swap operations protected by get/put_swap_device()
>> 	 * to complete
>> 	 */
>> -	synchronize_rcu();
>> +	wait_for_completion(&p->comp);
> 
> Better to move percpu_ref_kill() after the comments.  And maybe revise
> the comments.

Will do.

> 
>>  
>> 	flush_work(&p->discard_work);
>>  
>> @@ -3132,7 +3144,7 @@ static bool swap_discardable(struct swap_info_struct 
>> *si)
>>  SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>>  {
>>  struct swap_info_struct *p;
>> -struct filename *name;
>> +struct filename *name = NULL;
>>  struct file *swap_file = NULL;
>>  struct address_space *mapping;
>>  int prio;
>> @@ -3163,6 +3175,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, 
>> specialfile, int, swap_flags)
>>  
>> 	INIT_WORK(&p->discard_work, swap_discard_work);
>>  
>> +	init_completion(&p->comp);
>> +	error = percpu_ref_init(&p->users, swap_users_ref_free,
>> +				PERCPU_REF_INIT_DEAD, GFP_KERNEL);
>> +if (unlikely(error))
>> +goto bad_swap;
>> +
>>  name = getname(specialfile);
>>  if (IS_ERR(name)) {
>>  error = PTR_ERR(name);
>> @@ -3356,6 +3374,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, 
>> specialfile, int, swap_flags)
>>  bad_swap_unlock_inode:
>>  inode_unlock(inode);
>>  bad_swap:
>> +	percpu_ref_exit(&p->users);
> 
> Usually the resource freeing order matches their allocating order
> reversely.  So, if there's no special reason, please follow that rule.
> 

My oversight. Will fix it in V2.

> Best Regards,
> Huang, Ying
> 
>>  free_percpu(p->percpu_cluster);
>>  p->percpu_cluster = NULL;
>>  free_percpu(p->cluster_next_cpu);
> .
> 

Many thanks for review and nice suggestion! :)


Re: [PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-11 Thread Huang, Ying
Miaohe Lin  writes:

> We will use percpu-refcount to serialize against concurrent swapoff. This
> patch adds the percpu_ref support for later fixup.
>
> Signed-off-by: Miaohe Lin 
> ---
>  include/linux/swap.h |  2 ++
>  mm/swapfile.c| 25 ++---
>  2 files changed, 24 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 144727041e78..849ba5265c11 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -240,6 +240,7 @@ struct swap_cluster_list {
>   * The in-memory structure used to track swap areas.
>   */
>  struct swap_info_struct {
> + struct percpu_ref users;/* serialization against concurrent 
> swapoff */
>   unsigned long   flags;  /* SWP_USED etc: see above */
>   signed shortprio;   /* swap priority of this type */
>   struct plist_node list; /* entry in swap_active_head */
> @@ -260,6 +261,7 @@ struct swap_info_struct {
>   struct block_device *bdev;  /* swap device or bdev of swap file */
>   struct file *swap_file; /* seldom referenced */
>   unsigned int old_block_size;/* seldom referenced */
> + struct completion comp; /* seldom referenced */
>  #ifdef CONFIG_FRONTSWAP
>   unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
>   atomic_t frontswap_pages;   /* frontswap pages in-use counter */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 149e77454e3c..724173cd7d0c 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -39,6 +39,7 @@
>  #include <linux/export.h>
>  #include <linux/swap_slots.h>
>  #include <linux/sort.h>
> +#include <linux/completion.h>
>  
>  #include <asm/tlbflush.h>
>  #include <linux/swapops.h>
> @@ -511,6 +512,15 @@ static void swap_discard_work(struct work_struct *work)
> 	spin_unlock(&si->lock);
>  }
>  
> +static void swap_users_ref_free(struct percpu_ref *ref)
> +{
> +	struct swap_info_struct *si;
> +
> +	si = container_of(ref, struct swap_info_struct, users);
> +	complete(&si->comp);
> +	percpu_ref_exit(&si->users);

Because percpu_ref_exit() is used, we cannot use percpu_ref_tryget() in
get_swap_device(); better to add a comment there.

> +}
> +
>  static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
>  {
>   struct swap_cluster_info *ci = si->cluster_info;
> @@ -2500,7 +2510,7 @@ static void enable_swap_info(struct swap_info_struct 
> *p, int prio,
>* Guarantee swap_map, cluster_info, etc. fields are valid
>* between get/put_swap_device() if SWP_VALID bit is set
>*/
> - synchronize_rcu();
> +	percpu_ref_reinit(&p->users);

Although the effect is the same, I think it's better to use
percpu_ref_resurrect() here to improve code readability.

> 	spin_lock(&swap_lock);
> 	spin_lock(&p->lock);
> 	_enable_swap_info(p);
> @@ -2621,11 +2631,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, 
> specialfile)
> 	p->flags &= ~SWP_VALID;	/* mark swap device as invalid */
> 	spin_unlock(&p->lock);
> 	spin_unlock(&swap_lock);
> +
> +	percpu_ref_kill(&p->users);
> 	/*
> 	 * wait for swap operations protected by get/put_swap_device()
> 	 * to complete
> 	 */
> -	synchronize_rcu();
> +	wait_for_completion(&p->comp);

Better to move percpu_ref_kill() after the comments.  And maybe revise
the comments.

>  
> 	flush_work(&p->discard_work);
>  
> @@ -3132,7 +3144,7 @@ static bool swap_discardable(struct swap_info_struct 
> *si)
>  SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  {
>   struct swap_info_struct *p;
> - struct filename *name;
> + struct filename *name = NULL;
>   struct file *swap_file = NULL;
>   struct address_space *mapping;
>   int prio;
> @@ -3163,6 +3175,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, 
> specialfile, int, swap_flags)
>  
> 	INIT_WORK(&p->discard_work, swap_discard_work);
>  
> +	init_completion(&p->comp);
> +	error = percpu_ref_init(&p->users, swap_users_ref_free,
> +				PERCPU_REF_INIT_DEAD, GFP_KERNEL);
> + if (unlikely(error))
> + goto bad_swap;
> +
>   name = getname(specialfile);
>   if (IS_ERR(name)) {
>   error = PTR_ERR(name);
> @@ -3356,6 +3374,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, 
> specialfile, int, swap_flags)
>  bad_swap_unlock_inode:
>   inode_unlock(inode);
>  bad_swap:
> +	percpu_ref_exit(&p->users);

Usually the resource freeing order is the reverse of the allocation
order.  So, if there's no special reason, please follow that rule.

Best Regards,
Huang, Ying

>   free_percpu(p->percpu_cluster);
>   p->percpu_cluster = NULL;
>   free_percpu(p->cluster_next_cpu);


Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-11 Thread Miaohe Lin
On 2021/4/12 9:44, Huang, Ying wrote:
> Miaohe Lin  writes:
> 
>> On 2021/4/10 1:17, Tim Chen wrote:
>>>
>>>
>>> On 4/9/21 1:42 AM, Miaohe Lin wrote:
>>>> On 2021/4/9 5:34, Tim Chen wrote:
>>>>>
>>>>>
>>>>> On 4/8/21 6:08 AM, Miaohe Lin wrote:
>>>>>> When I was investigating the swap code, I found the below possible race
>>>>>> window:
>>>>>>
>>>>>> CPU 1                            CPU 2
>>>>>> -----                            -----
>>>>>> do_swap_page
>>>>>>   synchronous swap_readpage
>>>>>>     alloc_page_vma
>>>>>>                                  swapoff
>>>>>>                                    release swap_file, bdev, or ...
>>>>>
>>>>
>>>> Many thanks for quick review and reply!
>>>>
>>>>> Perhaps I'm missing something.  The release of swap_file, bdev etc
>>>>> happens after we have cleared the SWP_VALID bit in si->flags in 
>>>>> destroy_swap_extents
>>>>> if I read the swapoff code correctly.
>>>> Agree. Let's look this more close:
>>>> CPU1                             CPU2
>>>> -----                            -----
>>>> swap_readpage
>>>>   if (data_race(sis->flags & SWP_FS_OPS)) {
>>>>                                  swapoff
>>>>                                    p->swap_file = NULL;
>>>>     struct file *swap_file = sis->swap_file;
>>>>     struct address_space *mapping = swap_file->f_mapping;[oops!]
>>>>                                    ...
>>>>                                    p->flags = 0;
>>>>     ...
>>>>
>>>> Does this make sense for you?
>>>
>>> p->swapfile = NULL happens after the 
>>> p->flags &= ~SWP_VALID, synchronize_rcu(), destroy_swap_extents() sequence 
>>> in swapoff().
>>>
>>> So I don't think the sequence you illustrated on CPU2 is in the right order.
>>> That said, without get_swap_device/put_swap_device in swap_readpage, you 
>>> could
>>> potentially blow pass synchronize_rcu() on CPU2 and causes a problem.  so I 
>>> think
>>> the problematic race looks something like the following:
>>>
>>>
>>> CPU1                             CPU2
>>> -----                            -----
>>> swap_readpage
>>>   if (data_race(sis->flags & SWP_FS_OPS)) {
>>>                                  swapoff
>>>                                    p->flags &= ~SWP_VALID;
>>>                                    ..
>>>                                    synchronize_rcu();
>>>                                    ..
>>>                                    p->swap_file = NULL;
>>>     struct file *swap_file = sis->swap_file;
>>>     struct address_space *mapping = swap_file->f_mapping;[oops!]
>>>                                    ...
>>>     ...
>>>
>>
>> Agree. This is also what I meant to illustrate. And you provide a better 
>> one. Many thanks!
> 
> For the pages that are swapped in through swap cache.  That isn't an
> issue.  Because the page is locked, the swap entry will be marked with
> SWAP_HAS_CACHE, so swapoff() cannot proceed until the page has been
> unlocked.
> 
> So the race is for the fast path as follows,
> 
>   if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>   __swap_count(entry) == 1)
> 
> I found it in your original patch description.  But please make it more
> explicit to reduce the potential confusing.

Sure. Should I rephrase the commit log to clarify this or add a comment in the 
code?

Thanks.

> 
> Best Regards,
> Huang, Ying
> .
> 



Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-11 Thread Huang, Ying
Miaohe Lin  writes:

> On 2021/4/10 1:17, Tim Chen wrote:
>> 
>> 
>> On 4/9/21 1:42 AM, Miaohe Lin wrote:
>>> On 2021/4/9 5:34, Tim Chen wrote:
>>>>
>>>>
>>>> On 4/8/21 6:08 AM, Miaohe Lin wrote:
>>>>> When I was investigating the swap code, I found the below possible race
>>>>> window:
>>>>>
>>>>> CPU 1                            CPU 2
>>>>> -----                            -----
>>>>> do_swap_page
>>>>>   synchronous swap_readpage
>>>>>     alloc_page_vma
>>>>>                                  swapoff
>>>>>                                    release swap_file, bdev, or ...
>>>>
>>>
>>> Many thanks for quick review and reply!
>>>
>>>> Perhaps I'm missing something.  The release of swap_file, bdev etc
>>>> happens after we have cleared the SWP_VALID bit in si->flags in 
>>>> destroy_swap_extents
>>>> if I read the swapoff code correctly.
>>> Agree. Let's look this more close:
>>> CPU1                             CPU2
>>> -----                            -----
>>> swap_readpage
>>>   if (data_race(sis->flags & SWP_FS_OPS)) {
>>>                                  swapoff
>>>                                    p->swap_file = NULL;
>>>     struct file *swap_file = sis->swap_file;
>>>     struct address_space *mapping = swap_file->f_mapping;[oops!]
>>>                                    ...
>>>                                    p->flags = 0;
>>>     ...
>>>
>>> Does this make sense for you?
>> 
>> p->swapfile = NULL happens after the 
>> p->flags &= ~SWP_VALID, synchronize_rcu(), destroy_swap_extents() sequence 
>> in swapoff().
>> 
>> So I don't think the sequence you illustrated on CPU2 is in the right order.
>> That said, without get_swap_device/put_swap_device in swap_readpage, you 
>> could
>> potentially blow pass synchronize_rcu() on CPU2 and causes a problem.  so I 
>> think
>> the problematic race looks something like the following:
>> 
>> 
>> CPU1                             CPU2
>> -----                            -----
>> swap_readpage
>>   if (data_race(sis->flags & SWP_FS_OPS)) {
>>                                  swapoff
>>                                    p->flags &= ~SWP_VALID;
>>                                    ..
>>                                    synchronize_rcu();
>>                                    ..
>>                                    p->swap_file = NULL;
>>     struct file *swap_file = sis->swap_file;
>>     struct address_space *mapping = swap_file->f_mapping;[oops!]
>>                                    ...
>>     ...
>> 
>
> Agree. This is also what I meant to illustrate. And you provide a better one. 
> Many thanks!

For the pages that are swapped in through the swap cache, that isn't an
issue: because the page is locked, the swap entry will be marked with
SWAP_HAS_CACHE, so swapoff() cannot proceed until the page has been
unlocked.

So the race is for the fast path as follows,

if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1)

I found it in your original patch description.  But please make it more
explicit to reduce the potential confusion.

Best Regards,
Huang, Ying


Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-09 Thread Miaohe Lin
On 2021/4/10 1:17, Tim Chen wrote:
> 
> 
> On 4/9/21 1:42 AM, Miaohe Lin wrote:
>> On 2021/4/9 5:34, Tim Chen wrote:
>>>
>>>
>>> On 4/8/21 6:08 AM, Miaohe Lin wrote:
>>>> When I was investigating the swap code, I found the below possible race
>>>> window:
>>>>
>>>> CPU 1                            CPU 2
>>>> -----                            -----
>>>> do_swap_page
>>>>   synchronous swap_readpage
>>>>     alloc_page_vma
>>>>                                  swapoff
>>>>                                    release swap_file, bdev, or ...
>>>
>>
>> Many thanks for quick review and reply!
>>
>>> Perhaps I'm missing something.  The release of swap_file, bdev etc
>>> happens after we have cleared the SWP_VALID bit in si->flags in 
>>> destroy_swap_extents
>>> if I read the swapoff code correctly.
>> Agree. Let's look this more close:
>> CPU1                             CPU2
>> -----                            -----
>> swap_readpage
>>   if (data_race(sis->flags & SWP_FS_OPS)) {
>>                                  swapoff
>>                                    p->swap_file = NULL;
>>     struct file *swap_file = sis->swap_file;
>>     struct address_space *mapping = swap_file->f_mapping;[oops!]
>>                                    ...
>>                                    p->flags = 0;
>>     ...
>>
>> Does this make sense for you?
> 
> p->swapfile = NULL happens after the 
> p->flags &= ~SWP_VALID, synchronize_rcu(), destroy_swap_extents() sequence in 
> swapoff().
> 
> So I don't think the sequence you illustrated on CPU2 is in the right order.
> That said, without get_swap_device/put_swap_device in swap_readpage, you could
> potentially blow pass synchronize_rcu() on CPU2 and causes a problem.  so I 
> think
> the problematic race looks something like the following:
> 
> 
> CPU1                             CPU2
> -----                            -----
> swap_readpage
>   if (data_race(sis->flags & SWP_FS_OPS)) {
>                                  swapoff
>                                    p->flags &= ~SWP_VALID;
>                                    ..
>                                    synchronize_rcu();
>                                    ..
>                                    p->swap_file = NULL;
>     struct file *swap_file = sis->swap_file;
>     struct address_space *mapping = swap_file->f_mapping;[oops!]
>                                    ...
>     ...
> 

Agree. This is also what I meant to illustrate. And you provide a better one. 
Many thanks!

> By adding get_swap_device/put_swap_device, then the race is fixed.
> 
> 
> CPU1                             CPU2
> -----                            -----
> swap_readpage
>   get_swap_device()
>   ..
>   if (data_race(sis->flags & SWP_FS_OPS)) {
>                                  swapoff
>                                    p->flags &= ~SWP_VALID;
>                                    ..
>     struct file *swap_file = sis->swap_file;
>     struct address_space *mapping = swap_file->f_mapping;[valid value]
>   ..
>   put_swap_device()
>                                    synchronize_rcu();
>                                    ..
>                                    p->swap_file = NULL;
> 
> 
>>
>>>>
>>>>   swap_readpage
>>>>     check sis->flags is ok
>>>>       access swap_file, bdev...[oops!]
>>>>                                  si->flags = 0
>>>
>>> This happens after we clear the si->flags
>>>                                  synchronize_rcu()
>>>                                  release swap_file, bdev, in
>>>                                  destroy_swap_extents()
>>>

Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-09 Thread Tim Chen



On 4/9/21 1:42 AM, Miaohe Lin wrote:
> On 2021/4/9 5:34, Tim Chen wrote:
>>
>>
>> On 4/8/21 6:08 AM, Miaohe Lin wrote:
>>> When I was investigating the swap code, I found the below possible race
>>> window:
>>>
>>> CPU 1                            CPU 2
>>> -----                            -----
>>> do_swap_page
>>>   synchronous swap_readpage
>>>     alloc_page_vma
>>>                                  swapoff
>>>                                    release swap_file, bdev, or ...
>>
> 
> Many thanks for quick review and reply!
> 
>> Perhaps I'm missing something.  The release of swap_file, bdev etc
>> happens after we have cleared the SWP_VALID bit in si->flags in 
>> destroy_swap_extents
>> if I read the swapoff code correctly.
> Agree. Let's look this more close:
> CPU1                             CPU2
> -----                            -----
> swap_readpage
>   if (data_race(sis->flags & SWP_FS_OPS)) {
>                                  swapoff
>                                    p->swap_file = NULL;
>     struct file *swap_file = sis->swap_file;
>     struct address_space *mapping = swap_file->f_mapping;[oops!]
>                                    ...
>                                    p->flags = 0;
>     ...
> 
> Does this make sense for you?

p->swapfile = NULL happens after the 
p->flags &= ~SWP_VALID, synchronize_rcu(), destroy_swap_extents() sequence in 
swapoff().

So I don't think the sequence you illustrated on CPU2 is in the right order.
That said, without get_swap_device/put_swap_device in swap_readpage, you could
potentially blow past synchronize_rcu() on CPU2 and cause a problem.  So I
think the problematic race looks something like the following:


CPU1                             CPU2
-----                            -----
swap_readpage
  if (data_race(sis->flags & SWP_FS_OPS)) {
                                 swapoff
                                   p->flags &= ~SWP_VALID;
                                   ..
                                   synchronize_rcu();
                                   ..
                                   p->swap_file = NULL;
    struct file *swap_file = sis->swap_file;
    struct address_space *mapping = swap_file->f_mapping;[oops!]
                                   ...
    ...

By adding get_swap_device/put_swap_device, then the race is fixed.


CPU1                             CPU2
-----                            -----
swap_readpage
  get_swap_device()
  ..
  if (data_race(sis->flags & SWP_FS_OPS)) {
                                 swapoff
                                   p->flags &= ~SWP_VALID;
                                   ..
    struct file *swap_file = sis->swap_file;
    struct address_space *mapping = swap_file->f_mapping;[valid value]
  ..
  put_swap_device()
                                   synchronize_rcu();
                                   ..
                                   p->swap_file = NULL;


> 
>>>
>>>   swap_readpage
>>>     check sis->flags is ok
>>>       access swap_file, bdev...[oops!]
>>>                                  si->flags = 0
>>
>> This happens after we clear the si->flags
>>                                  synchronize_rcu()
>>                                  release swap_file, bdev, in
>>                                  destroy_swap_extents()
>>
>> So I think if we have get_swap_device/put_swap_device in do_swap_page,
>> it should fix the race you've pointed out here.  
>> Then synchronize_rcu() will wait till we have completed do_swap_page and
>> call put_swap_device.
> 
> Right, get_swap_device/put_swap_device could fix this race. __But__ 
> rcu_read_lock()
> in get_swap_device() could disable preempt and do_swap_page() may take a 
> really long
> time because it involves I/O. It may not be acceptable to disable preempt for 
> such a
> long time. :(

I can see that it is not a good idea to hold the RCU read lock for a long
time over a slow file I/O operation, which would be the side effect of
introducing get/put_swap_device() into swap_readpage().  So using percpu_ref
will be preferable for synchronization once we introduce
get/put_swap_device() into swap_readpage().
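
A usage sketch of that direction (hypothetical caller and function name,
not a posted patch): the device is pinned across the whole read, and since
the pin is a percpu_ref rather than rcu_read_lock(), sleeping in
swap_readpage() is fine.

int swapin_readpage_outline(swp_entry_t entry, struct page *page)
{
	struct swap_info_struct *si;

	si = get_swap_device(entry);
	if (unlikely(!si))
		return -ENOENT;		/* raced with swapoff */

	swap_readpage(page, true);	/* may block, e.g. blk_io_schedule() */

	put_swap_device(si);
	return 0;
}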

Tim


[PATCH net-next 2/3] net: ethernet: add property "nvmem_macaddr_swap" to swap macaddr bytes order

2021-04-09 Thread Joakim Zhang
From: Fugang Duan 

The ethernet controller driver calls .of_get_mac_address() to get
the MAC address from the device tree; if the relevant properties are
not present, it then tries to read it from nvmem.

For example, reading the MAC address from nvmem:
of_get_mac_address()
of_get_mac_addr_nvmem()
nvmem_get_mac_address()

On i.MX6x/7D/8MQ/8MM platforms the ethernet MAC address is read from
the nvmem OCOTP eFuses, but the six bytes need to be swapped.

This patch adds the optional property "nvmem_macaddr_swap" to swap
the macaddr byte order.

Signed-off-by: Fugang Duan 
Signed-off-by: Joakim Zhang 
---
 net/ethernet/eth.c | 25 -
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 4106373180c6..11057671a9d6 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -534,8 +534,10 @@ EXPORT_SYMBOL(eth_platform_get_mac_address);
 int nvmem_get_mac_address(struct device *dev, void *addrbuf)
 {
struct nvmem_cell *cell;
-   const void *mac;
+   const unsigned char *mac;
+   unsigned char macaddr[ETH_ALEN];
size_t len;
+   int i = 0;
 
cell = nvmem_cell_get(dev, "mac-address");
if (IS_ERR(cell))
@@ -547,14 +549,27 @@ int nvmem_get_mac_address(struct device *dev, void 
*addrbuf)
if (IS_ERR(mac))
return PTR_ERR(mac);
 
-   if (len != ETH_ALEN || !is_valid_ether_addr(mac)) {
-   kfree(mac);
-   return -EINVAL;
+   if (len != ETH_ALEN)
+   goto invalid_addr;
+
+   if (dev->of_node &&
+   of_property_read_bool(dev->of_node, "nvmem_macaddr_swap")) {
+   for (i = 0; i < ETH_ALEN; i++)
+   macaddr[i] = mac[ETH_ALEN - i - 1];
+   } else {
+   ether_addr_copy(macaddr, mac);
}
 
-   ether_addr_copy(addrbuf, mac);
+   if (!is_valid_ether_addr(macaddr))
+   goto invalid_addr;
+
+   ether_addr_copy(addrbuf, macaddr);
kfree(mac);
 
return 0;
+
+invalid_addr:
+   kfree(mac);
+   return -EINVAL;
 }
 EXPORT_SYMBOL(nvmem_get_mac_address);
-- 
2.17.1



Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-09 Thread Miaohe Lin
On 2021/4/9 5:34, Tim Chen wrote:
> 
> 
> On 4/8/21 6:08 AM, Miaohe Lin wrote:
>> When I was investigating the swap code, I found the below possible race
>> window:
>>
>> CPU 1                            CPU 2
>> -----                            -----
>> do_swap_page
>>   synchronous swap_readpage
>>     alloc_page_vma
>>                                  swapoff
>>                                    release swap_file, bdev, or ...
> 

Many thanks for quick review and reply!

> Perhaps I'm missing something.  The release of swap_file, bdev etc
> happens after we have cleared the SWP_VALID bit in si->flags in 
> destroy_swap_extents
> if I read the swapoff code correctly.
Agree. Let's look at this more closely:
CPU1                             CPU2
-----                            -----
swap_readpage
  if (data_race(sis->flags & SWP_FS_OPS)) {
                                 swapoff
                                   p->swap_file = NULL;
    struct file *swap_file = sis->swap_file;
    struct address_space *mapping = swap_file->f_mapping;[oops!]
                                   ...
                                   p->flags = 0;
    ...

Does this make sense for you?

> >
>>   swap_readpage
>>     check sis->flags is ok
>>       access swap_file, bdev...[oops!]
>>                                  si->flags = 0
> 
> This happens after we clear the si->flags
>                                  synchronize_rcu()
>                                  release swap_file, bdev, in
>                                  destroy_swap_extents()
> 
> So I think if we have get_swap_device/put_swap_device in do_swap_page,
> it should fix the race you've pointed out here.  
> Then synchronize_rcu() will wait till we have completed do_swap_page and
> call put_swap_device.

Right, get_swap_device/put_swap_device could fix this race. __But__
rcu_read_lock() in get_swap_device() disables preemption, and
do_swap_page() may take a really long time because it involves I/O. It may
not be acceptable to disable preemption for such a long time. :(

>   
>>
>> Using current get/put_swap_device() to guard against concurrent swapoff for
>> swap_readpage() looks terrible because swap_readpage() may take really long
>> time. And this race may not be really pernicious because swapoff is usually
>> done when system shutdown only. To reduce the performance overhead on the
>> hot-path as much as possible, it appears we can use the percpu_ref to close
>> this race window(as suggested by Huang, Ying).
> 
> I think it is better to break this patch into two.
> One patch is to fix the race in do_swap_page and swapoff
> by adding get_swap_device/put_swap_device in do_swap_page.
> 
> The second patch is to modify get_swap_device and put_swap_device
> with percpu_ref. But swapoff is a relatively rare events.  

Sounds reasonable. Will do it.

> 
> I am not sure making the percpu_ref change for performance is really beneficial.
> Did you encounter a real use case where you see a problem with swapoff?
> The delay in swapoff is primarily in try_to_unuse to bring all
> the swapped off pages back into memory.  Synchronizing with other
> CPUs for paging in is probably a small component in the overall scheme
> of things.
> 

I can't find a simpler and more stable way to fix this potential and
*theoretical* issue. This could happen in the real world, but the race
window should be very small. Since swapoff is usually done only at system
shutdown, I'm not really sure this effort is worthwhile.

But IMO, we should eliminate any potential trouble. :)

> Thanks.
> 

Thanks again.

> Tim
> 
> .
> 


[PATCH net-next 2/3] net: ethernet: add property "nvmem_macaddr_swap" to swap macaddr bytes order

2021-04-09 Thread Joakim Zhang
From: Fugang Duan 

Ethernet controller drivers call of_get_mac_address() to get
the MAC address from the device tree; if the relevant properties
are not present, they then try to read it from nvmem.

For example, reading the MAC address from nvmem:
of_get_mac_address()
of_get_mac_addr_nvmem()
nvmem_get_mac_address()

On i.MX6x/7D/8MQ/8MM platforms the Ethernet MAC address is read
from the nvmem ocotp eFuses, but the six bytes need to be swapped
into the reverse order.

This patch adds the optional property "nvmem_macaddr_swap" to swap
the MAC address byte order.
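
For illustration, a minimal standalone sketch of the byte-order swap
(ETH_ALEN is 6, as in <linux/if_ether.h>; the eFuse values below are
made up for the example):

#include <stdio.h>

#define ETH_ALEN 6

int main(void)
{
	/* bytes as they come out of the eFuses (example values only) */
	const unsigned char mac[ETH_ALEN] = { 0x06, 0x05, 0x04, 0x03, 0x02, 0x01 };
	unsigned char macaddr[ETH_ALEN];
	int i;

	/* reverse the six bytes, as nvmem_get_mac_address() does when
	 * "nvmem_macaddr_swap" is present */
	for (i = 0; i < ETH_ALEN; i++)
		macaddr[i] = mac[ETH_ALEN - i - 1];

	for (i = 0; i < ETH_ALEN; i++)
		printf("%02x%c", macaddr[i], i == ETH_ALEN - 1 ? '\n' : ':');
	return 0;
}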

Signed-off-by: Fugang Duan 
Signed-off-by: Joakim Zhang 
---
 net/ethernet/eth.c | 25 -
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 4106373180c6..11057671a9d6 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -534,8 +534,10 @@ EXPORT_SYMBOL(eth_platform_get_mac_address);
 int nvmem_get_mac_address(struct device *dev, void *addrbuf)
 {
struct nvmem_cell *cell;
-   const void *mac;
+   const unsigned char *mac;
+   unsigned char macaddr[ETH_ALEN];
size_t len;
+   int i = 0;
 
cell = nvmem_cell_get(dev, "mac-address");
if (IS_ERR(cell))
@@ -547,14 +549,27 @@ int nvmem_get_mac_address(struct device *dev, void *addrbuf)
if (IS_ERR(mac))
return PTR_ERR(mac);
 
-   if (len != ETH_ALEN || !is_valid_ether_addr(mac)) {
-   kfree(mac);
-   return -EINVAL;
+   if (len != ETH_ALEN)
+   goto invalid_addr;
+
+   if (dev->of_node &&
+   of_property_read_bool(dev->of_node, "nvmem_macaddr_swap")) {
+   for (i = 0; i < ETH_ALEN; i++)
+   macaddr[i] = mac[ETH_ALEN - i - 1];
+   } else {
+   ether_addr_copy(macaddr, mac);
}
 
-   ether_addr_copy(addrbuf, mac);
+   if (!is_valid_ether_addr(macaddr))
+   goto invalid_addr;
+
+   ether_addr_copy(addrbuf, macaddr);
kfree(mac);
 
return 0;
+
+invalid_addr:
+   kfree(mac);
+   return -EINVAL;
 }
 EXPORT_SYMBOL(nvmem_get_mac_address);
-- 
2.17.1



Re: [PATCH 0/5] close various race windows for swap

2021-04-09 Thread Miaohe Lin
On 2021/4/8 22:55, riteshh wrote:
> On 21/04/08 09:08AM, Miaohe Lin wrote:
>> Hi all,
>> When I was investigating the swap code, I found some possible race
>> windows. This series aims to fix all these races. But using the current
>> get/put_swap_device() to guard against concurrent swapoff for
>> swap_readpage() looks terrible because swap_readpage() may take a really
>> long time. And to reduce the performance overhead on the hot-path as
>> much as possible, it appears we can use the percpu_ref to close this
>> race window (as suggested by Huang, Ying). Patch 1 adds percpu_ref
>> support for swap and the rest of the patches use this to close various
>> race windows. More details can be found in the respective changelogs.
>> Thanks!
>>
>> Miaohe Lin (5):
>>   mm/swapfile: add percpu_ref support for swap
>>   swap: fix do_swap_page() race with swapoff
>>   mm/swap_state: fix get_shadow_from_swap_cache() race with swapoff
>>   mm/swap_state: fix potential faulted in race in swap_ra_info()
>>   mm/swap_state: fix swap_cluster_readahead() race with swapoff
> 

Many thanks for the quick response.

> Somehow I see Patch-1 and Patch-2 are missing on linux-mm[1].

I have no idea why Patch-1 and Patch-2 are missing, but they can be found at:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2542188.html

> Also I wanted to ask if you have a way to trigger this in a more controlled
> environment (consistently)?
> 

This is a *theoretical* issue. The race window is very small but not impossible.
Please see the discussion:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2530094.html

> [1]: 
> https://patchwork.kernel.org/project/linux-mm/cover/20210408130820.48233-1-linmia...@huawei.com/
> 

Thanks again.

> -ritesh
> 
>>
>>  include/linux/swap.h |  4 +++-
>>  mm/memory.c  | 10 +
>>  mm/swap_state.c  | 33 +
>>  mm/swapfile.c| 50 +++-
>>  4 files changed, 68 insertions(+), 29 deletions(-)
>>
>> --
>> 2.19.1
>>
>>
> .
> 



Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-08 Thread Tim Chen



On 4/8/21 6:08 AM, Miaohe Lin wrote:
> When I was investigating the swap code, I found the below possible race
> window:
> 
> CPU 1					CPU 2
> -					-
> do_swap_page
>   synchronous swap_readpage
>     alloc_page_vma
>					swapoff
>					  release swap_file, bdev, or ...

Perhaps I'm missing something.  The release of swap_file, bdev etc
happens after we have cleared the SWP_VALID bit in si->flags in
destroy_swap_extents, if I read the swapoff code correctly.
 

>   swap_readpage
>     check sis->flags is ok
>       access swap_file, bdev...[oops!]
>					si->flags = 0

This happens after we clear the si->flags
	synchronize_rcu()
	release swap_file, bdev, in destroy_swap_extents()

So I think if we have get_swap_device/put_swap_device in do_swap_page,
it should fix the race you've pointed out here.  
Then synchronize_rcu() will wait till we have completed do_swap_page and
called put_swap_device.

> 
> Using the current get/put_swap_device() to guard against concurrent swapoff
> for swap_readpage() looks terrible because swap_readpage() may take a really
> long time. And this race may not be really pernicious because swapoff is
> usually done only at system shutdown. To reduce the performance overhead on
> the hot-path as much as possible, it appears we can use the percpu_ref to
> close this race window (as suggested by Huang, Ying).

I think it is better to break this patch into two.

One patch is to fix the race in do_swap_page and swapoff
by adding get_swap_device/put_swap_device in do_swap_page.

The second patch is to modify get_swap_device and put_swap_device
with percpu_ref. But swapoff is a relatively rare event.

I am not sure making the percpu_ref change for performance is really beneficial.
Did you encounter a real use case where you see a problem with swapoff?
The delay in swapoff is primarily in try_to_unuse to bring all
the swapped off pages back into memory.  Synchronizing with other
CPUs for paging in is probably a small component in the overall scheme
of things.

Thanks.

Tim



Re: [PATCH 0/5] close various race windows for swap

2021-04-08 Thread riteshh
On 21/04/08 09:08AM, Miaohe Lin wrote:
> Hi all,
> When I was investigating the swap code, I found some possible race
> windows. This series aims to fix all these races. But using the current
> get/put_swap_device() to guard against concurrent swapoff for
> swap_readpage() looks terrible because swap_readpage() may take a really
> long time. And to reduce the performance overhead on the hot-path as
> much as possible, it appears we can use the percpu_ref to close this
> race window (as suggested by Huang, Ying). Patch 1 adds percpu_ref
> support for swap and the rest of the patches use this to close various
> race windows. More details can be found in the respective changelogs.
> Thanks!
>
> Miaohe Lin (5):
>   mm/swapfile: add percpu_ref support for swap
>   swap: fix do_swap_page() race with swapoff
>   mm/swap_state: fix get_shadow_from_swap_cache() race with swapoff
>   mm/swap_state: fix potential faulted in race in swap_ra_info()
>   mm/swap_state: fix swap_cluster_readahead() race with swapoff

Somehow I see Patch-1 and Patch-2 are missing on linux-mm[1].
Also I wanted to ask if you have a way to trigger this in a more controlled
environment (consistently)?

[1]: 
https://patchwork.kernel.org/project/linux-mm/cover/20210408130820.48233-1-linmia...@huawei.com/

-ritesh

>
>  include/linux/swap.h |  4 +++-
>  mm/memory.c  | 10 +
>  mm/swap_state.c  | 33 +
>  mm/swapfile.c| 50 +++-
>  4 files changed, 68 insertions(+), 29 deletions(-)
>
> --
> 2.19.1
>
>


[PATCH 2/5] swap: fix do_swap_page() race with swapoff

2021-04-08 Thread Miaohe Lin
When I was investigating the swap code, I found the below possible race
window:

CPU 1					CPU 2
-					-
do_swap_page
  synchronous swap_readpage
    alloc_page_vma
					swapoff
					  release swap_file, bdev, or ...
  swap_readpage
    check sis->flags is ok
      access swap_file, bdev...[oops!]
					si->flags = 0

Using the current get/put_swap_device() to guard against concurrent swapoff
for swap_readpage() looks terrible because swap_readpage() may take a really
long time. And this race may not be really pernicious because swapoff is
usually done only at system shutdown. To reduce the performance overhead on
the hot-path as much as possible, it appears we can use the percpu_ref to
close this race window (as suggested by Huang, Ying).

Fixes: 235b62176712 ("mm/swap: add cluster lock")
Signed-off-by: Miaohe Lin 
---
 include/linux/swap.h |  2 +-
 mm/memory.c  | 10 ++
 mm/swapfile.c| 28 +++-
 3 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 849ba5265c11..9066addb57fd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
 
 static inline void put_swap_device(struct swap_info_struct *si)
 {
-   rcu_read_unlock();
+   percpu_ref_put(&si->users);
 }
 
 #else /* CONFIG_SWAP */
diff --git a/mm/memory.c b/mm/memory.c
index cc71a445c76c..8543c47b955c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 {
struct vm_area_struct *vma = vmf->vma;
struct page *page = NULL, *swapcache;
+   struct swap_info_struct *si = NULL;
swp_entry_t entry;
pte_t pte;
int locked;
@@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}
 
 
+   si = get_swap_device(entry);
+   /* In case we raced with swapoff. */
+   if (unlikely(!si))
+   goto out;
+
delayacct_set_flag(DELAYACCT_PF_SWAPIN);
page = lookup_swap_cache(entry, vma, vmf->address);
swapcache = page;
@@ -3514,6 +3520,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
 out:
+   if (si)
+   put_swap_device(si);
return ret;
 out_nomap:
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -3525,6 +3533,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
unlock_page(swapcache);
put_page(swapcache);
}
+   if (si)
+   put_swap_device(si);
return ret;
 }
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 724173cd7d0c..01032c72ceae 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1280,18 +1280,12 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
  * via preventing the swap device from being swapoff, until
  * put_swap_device() is called.  Otherwise return NULL.
  *
- * The entirety of the RCU read critical section must come before the
- * return from or after the call to synchronize_rcu() in
- * enable_swap_info() or swapoff().  So if "si->flags & SWP_VALID" is
- * true, the si->map, si->cluster_info, etc. must be valid in the
- * critical section.
- *
  * Notice that swapoff or swapoff+swapon can still happen before the
- * rcu_read_lock() in get_swap_device() or after the rcu_read_unlock()
- * in put_swap_device() if there isn't any other way to prevent
- * swapoff, such as page lock, page table lock, etc.  The caller must
- * be prepared for that.  For example, the following situation is
- * possible.
+ * percpu_ref_tryget_live() in get_swap_device() or after the
+ * percpu_ref_put() in put_swap_device() if there isn't any other way
+ * to prevent swapoff, such as page lock, page table lock, etc.  The
+ * caller must be prepared for that.  For example, the following
+ * situation is possible.
  *
  *   CPU1  CPU2
  *   do_swap_page()
@@ -1319,21 +1313,21 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
si = swp_swap_info(entry);
if (!si)
goto bad_nofile;
-
-   rcu_read_lock();
if (data_race(!(si->flags & SWP_VALID)))
-   goto unlock_out;
+   goto out;
+   if (!percpu_ref_tryget_live(&si->users))
+   goto out;
offset = swp_offset(entry);
if (offset >= si->max)
-   goto unlock_out;
+   goto put_out;
 
return si;
 bad_nofile:
pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
 out:
return NULL;
-unlock_out:
-   rcu_read_unlock();
+put_out:
+   percpu_ref_put(&si->users);
return NULL;
 }
 
-- 
2.19.1



[PATCH 1/5] mm/swapfile: add percpu_ref support for swap

2021-04-08 Thread Miaohe Lin
We will use percpu-refcount to serialize against concurrent swapoff. This
patch adds the percpu_ref support for later fixup.
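
For orientation, the lifecycle the series relies on, sketched as
pseudocode (simplified from the hunks below; error paths omitted):

	/* swapon: start dead so readers fail until setup completes */
	init_completion(&p->comp);
	percpu_ref_init(&p->users, swap_users_ref_free,
			PERCPU_REF_INIT_DEAD, GFP_KERNEL);
	/* ... set up the swap device ... */
	percpu_ref_reinit(&p->users);		/* readers may now enter */

	/* reader, e.g. get_swap_device(): */
	if (!percpu_ref_tryget_live(&si->users))
		return NULL;			/* raced with swapoff */
	/* ... use si->swap_file, si->bdev, si->swap_map ... */
	percpu_ref_put(&si->users);

	/* swapoff: */
	percpu_ref_kill(&p->users);		/* tryget_live() now fails */
	wait_for_completion(&p->comp);		/* drain in-flight readers */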

Signed-off-by: Miaohe Lin 
---
 include/linux/swap.h |  2 ++
 mm/swapfile.c| 25 ++---
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 144727041e78..849ba5265c11 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -240,6 +240,7 @@ struct swap_cluster_list {
  * The in-memory structure used to track swap areas.
  */
 struct swap_info_struct {
+   struct percpu_ref users;	/* serialization against concurrent swapoff */
unsigned long   flags;  /* SWP_USED etc: see above */
signed shortprio;   /* swap priority of this type */
struct plist_node list; /* entry in swap_active_head */
@@ -260,6 +261,7 @@ struct swap_info_struct {
struct block_device *bdev;  /* swap device or bdev of swap file */
struct file *swap_file; /* seldom referenced */
unsigned int old_block_size;/* seldom referenced */
+   struct completion comp; /* seldom referenced */
 #ifdef CONFIG_FRONTSWAP
unsigned long *frontswap_map;   /* frontswap in-use, one bit per page */
atomic_t frontswap_pages;   /* frontswap pages in-use counter */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 149e77454e3c..724173cd7d0c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include <linux/completion.h>
 
 #include 
 #include 
@@ -511,6 +512,15 @@ static void swap_discard_work(struct work_struct *work)
	spin_unlock(&si->lock);
 }
 
+static void swap_users_ref_free(struct percpu_ref *ref)
+{
+   struct swap_info_struct *si;
+
+   si = container_of(ref, struct swap_info_struct, users);
+   complete(&si->comp);
+   percpu_ref_exit(&si->users);
+}
+
 static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
 {
struct swap_cluster_info *ci = si->cluster_info;
@@ -2500,7 +2510,7 @@ static void enable_swap_info(struct swap_info_struct *p, int prio,
 * Guarantee swap_map, cluster_info, etc. fields are valid
 * between get/put_swap_device() if SWP_VALID bit is set
 */
-   synchronize_rcu();
+   percpu_ref_reinit(&p->users);
	spin_lock(&swap_lock);
	spin_lock(&p->lock);
_enable_swap_info(p);
@@ -2621,11 +2631,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
p->flags &= ~SWP_VALID; /* mark swap device as invalid */
	spin_unlock(&p->lock);
	spin_unlock(&swap_lock);
+
+   percpu_ref_kill(&p->users);
/*
 * wait for swap operations protected by get/put_swap_device()
 * to complete
 */
-   synchronize_rcu();
+   wait_for_completion(&p->comp);
 
	flush_work(&p->discard_work);
 
@@ -3132,7 +3144,7 @@ static bool swap_discardable(struct swap_info_struct *si)
 SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 {
struct swap_info_struct *p;
-   struct filename *name;
+   struct filename *name = NULL;
struct file *swap_file = NULL;
struct address_space *mapping;
int prio;
@@ -3163,6 +3175,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 
	INIT_WORK(&p->discard_work, swap_discard_work);
 
+   init_completion(&p->comp);
+   error = percpu_ref_init(&p->users, swap_users_ref_free,
+   PERCPU_REF_INIT_DEAD, GFP_KERNEL);
+   if (unlikely(error))
+   goto bad_swap;
+
name = getname(specialfile);
if (IS_ERR(name)) {
error = PTR_ERR(name);
@@ -3356,6 +3374,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 bad_swap_unlock_inode:
inode_unlock(inode);
 bad_swap:
+   percpu_ref_exit(&p->users);
free_percpu(p->percpu_cluster);
p->percpu_cluster = NULL;
free_percpu(p->cluster_next_cpu);
-- 
2.19.1



[PATCH 0/5] close various race windows for swap

2021-04-08 Thread Miaohe Lin
Hi all,
When I was investigating the swap code, I found some possible race
windows. This series aims to fix all these races. But using the current
get/put_swap_device() to guard against concurrent swapoff for
swap_readpage() looks terrible because swap_readpage() may take a really
long time. And to reduce the performance overhead on the hot-path as
much as possible, it appears we can use the percpu_ref to close this
race window (as suggested by Huang, Ying). Patch 1 adds percpu_ref
support for swap and the rest of the patches use this to close various
race windows. More details can be found in the respective changelogs.
Thanks!

Miaohe Lin (5):
  mm/swapfile: add percpu_ref support for swap
  swap: fix do_swap_page() race with swapoff
  mm/swap_state: fix get_shadow_from_swap_cache() race with swapoff
  mm/swap_state: fix potential faulted in race in swap_ra_info()
  mm/swap_state: fix swap_cluster_readahead() race with swapoff

 include/linux/swap.h |  4 +++-
 mm/memory.c  | 10 +
 mm/swap_state.c  | 33 +
 mm/swapfile.c| 50 +++-
 4 files changed, 68 insertions(+), 29 deletions(-)

-- 
2.19.1



[PATCH v8 2/8] mm/swapops: Rework swap entry manipulation code

2021-04-07 Thread Alistair Popple
Both migration and device private pages use special swap entries that
are manipulated by a range of inline functions. The arguments to these
are somewhat inconsistent, so rework them to remove flag-type arguments
and to make the arguments similar for both read and write entry
creation.
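
For illustration, the shape of the change at call sites (a sketch based
on the hunks below):

	/* before: a boolean flag selects read vs. write */
	entry = make_migration_entry(page, /* write = */ 1);

	/* after: explicit constructors taking the pfn as an offset */
	entry = make_writable_migration_entry(page_to_pfn(page));
	entry = make_readable_migration_entry(page_to_pfn(page));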

Signed-off-by: Alistair Popple 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ralph Campbell 
---
 include/linux/swapops.h | 56 ++---
 mm/debug_vm_pgtable.c   | 12 -
 mm/hmm.c|  2 +-
 mm/huge_memory.c| 26 +--
 mm/hugetlb.c| 10 +---
 mm/memory.c | 10 +---
 mm/migrate.c| 26 ++-
 mm/mprotect.c   | 10 +---
 mm/rmap.c   | 10 +---
 9 files changed, 100 insertions(+), 62 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 139be8235ad2..4dfd807ae52a 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -100,35 +100,35 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
 }
 
 #if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
-static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
+static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
-   return swp_entry(write ? SWP_DEVICE_WRITE : SWP_DEVICE_READ,
-page_to_pfn(page));
+   return swp_entry(SWP_DEVICE_READ, offset);
 }
 
-static inline bool is_device_private_entry(swp_entry_t entry)
+static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset)
 {
-   int type = swp_type(entry);
-   return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
+   return swp_entry(SWP_DEVICE_WRITE, offset);
 }
 
-static inline void make_device_private_entry_read(swp_entry_t *entry)
+static inline bool is_device_private_entry(swp_entry_t entry)
 {
-   *entry = swp_entry(SWP_DEVICE_READ, swp_offset(*entry));
+   int type = swp_type(entry);
+   return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
 }
 
-static inline bool is_write_device_private_entry(swp_entry_t entry)
+static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
 }
 #else /* CONFIG_DEVICE_PRIVATE */
-static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
+static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
return swp_entry(0, 0);
 }
 
-static inline void make_device_private_entry_read(swp_entry_t *entry)
+static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset)
 {
+   return swp_entry(0, 0);
 }
 
 static inline bool is_device_private_entry(swp_entry_t entry)
@@ -136,35 +136,32 @@ static inline bool is_device_private_entry(swp_entry_t entry)
return false;
 }
 
-static inline bool is_write_device_private_entry(swp_entry_t entry)
+static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
return false;
 }
 #endif /* CONFIG_DEVICE_PRIVATE */
 
 #ifdef CONFIG_MIGRATION
-static inline swp_entry_t make_migration_entry(struct page *page, int write)
-{
-   BUG_ON(!PageLocked(compound_head(page)));
-
-   return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ,
-   page_to_pfn(page));
-}
-
 static inline int is_migration_entry(swp_entry_t entry)
 {
return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
-static inline int is_write_migration_entry(swp_entry_t entry)
+static inline int is_writable_migration_entry(swp_entry_t entry)
 {
return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
-static inline void make_migration_entry_read(swp_entry_t *entry)
+static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
 {
-   *entry = swp_entry(SWP_MIGRATION_READ, swp_offset(*entry));
+   return swp_entry(SWP_MIGRATION_READ, offset);
+}
+
+static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
+{
+   return swp_entry(SWP_MIGRATION_WRITE, offset);
 }
 
 extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
@@ -174,21 +171,28 @@ extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 extern void migration_entry_wait_huge(struct vm_area_struct *vma,
struct mm_struct *mm, pte_t *pte);
 #else
+static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
+{
+   return swp_entry(0, 0);
+}
+
+static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
+{
+   return swp_entry(0, 0);
+}
 
-#define make_migration_entry(page, write) swp_entry(0, 0)
 static inline int is_migration_entry(swp_entry_t swp)
 {
return 0;
 }
 
-static inline void make_migration_entry_read(swp_entry_t *entryp) { }
 static inline void __migration_entry_wait(struct mm_struct *mm

[PATCH v8 1/8] mm: Remove special swap entry functions

2021-04-07 Thread Alistair Popple
Remove multiple similar inline functions for dealing with different
types of special swap entries.

Both migration and device private swap entries use the swap offset to
store a pfn. Instead of multiple inline functions to obtain a struct
page for each swap entry type use a common function
pfn_swap_entry_to_page(). Also open-code the various entry_to_pfn()
functions as this results in shorter code that is easier to understand.

Signed-off-by: Alistair Popple 
Reviewed-by: Ralph Campbell 
Reviewed-by: Christoph Hellwig 

---

v7:
* Reworded commit message to include pfn_swap_entry_to_page()
* Added Christoph's Reviewed-by

v6:
* Removed redundant compound_page() call from inside PageLocked()
* Fixed a minor build issue for s390 reported by kernel test bot

v4:
* Added pfn_swap_entry_to_page()
* Reinstated check that migration entries point to locked pages
* Removed #define swapcache_prepare which isn't needed for CONFIG_SWAP=0
  builds
---
 arch/s390/mm/pgtable.c  |  2 +-
 fs/proc/task_mmu.c  | 23 +-
 include/linux/swap.h|  4 +--
 include/linux/swapops.h | 69 ++---
 mm/hmm.c|  5 ++-
 mm/huge_memory.c|  4 +--
 mm/memcontrol.c |  2 +-
 mm/memory.c | 10 +++---
 mm/migrate.c|  6 ++--
 mm/page_vma_mapped.c|  6 ++--
 10 files changed, 50 insertions(+), 81 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 18205f851c24..eec3a9d7176e 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -691,7 +691,7 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry)
if (!non_swap_entry(entry))
dec_mm_counter(mm, MM_SWAPENTS);
else if (is_migration_entry(entry)) {
-   struct page *page = migration_entry_to_page(entry);
+   struct page *page = pfn_swap_entry_to_page(entry);
 
dec_mm_counter(mm, mm_counter(page));
}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3cec6fbef725..08ee59d945c0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -514,10 +514,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
} else {
mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
}
-   } else if (is_migration_entry(swpent))
-   page = migration_entry_to_page(swpent);
-   else if (is_device_private_entry(swpent))
-   page = device_private_entry_to_page(swpent);
+   } else if (is_pfn_swap_entry(swpent))
+   page = pfn_swap_entry_to_page(swpent);
} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
&& pte_none(*pte))) {
		page = xa_load(&vma->vm_file->f_mapping->i_pages,
@@ -549,7 +547,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
swp_entry_t entry = pmd_to_swp_entry(*pmd);
 
if (is_migration_entry(entry))
-   page = migration_entry_to_page(entry);
+   page = pfn_swap_entry_to_page(entry);
}
if (IS_ERR_OR_NULL(page))
return;
@@ -691,10 +689,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
} else if (is_swap_pte(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);
 
-   if (is_migration_entry(swpent))
-   page = migration_entry_to_page(swpent);
-   else if (is_device_private_entry(swpent))
-   page = device_private_entry_to_page(swpent);
+   if (is_pfn_swap_entry(swpent))
+   page = pfn_swap_entry_to_page(swpent);
}
if (page) {
int mapcount = page_mapcount(page);
@@ -1383,11 +1379,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
frame = swp_type(entry) |
(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
flags |= PM_SWAP;
-   if (is_migration_entry(entry))
-   page = migration_entry_to_page(entry);
-
-   if (is_device_private_entry(entry))
-   page = device_private_entry_to_page(entry);
+   if (is_pfn_swap_entry(entry))
+   page = pfn_swap_entry_to_page(entry);
}
 
if (page && !PageAnon(page))
@@ -1444,7 +1437,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
if (pmd_swp_soft_dirty(pmd))
flags |= PM_SOFT_DIRTY;
VM_BUG_ON(!is_pmd_migration_entry(pmd));
-   page = migration_entry_to_page(entry);
+   page = pfn_s

Re: [RFC PATCH] mm/swap: fix system stuck due to infinite loop

2021-04-05 Thread Alexey Avramov
> In the case of high system memory and load pressure, we ran ltp test
> and found that the system was stuck, the direct memory reclaim was
> all stuck in io_schedule

> For the first time involving the swap part, there is no good way to fix
> the problem

The solution is protecting the clean file pages.

Look at this:

> On ChromiumOS, we do not use swap. When memory is low, the only 
> way to free memory is to reclaim pages from the file list. This 
> results in a lot of thrashing under low memory conditions. We see 
> the system become unresponsive for minutes before it eventually OOMs. 
> We also see very slow browser tab switching under low memory. Instead 
> of an unresponsive system, we'd really like the kernel to OOM as soon 
> as it starts to thrash. If it can't keep the working set in memory, 
> then OOM. Losing one of many tabs is a better behaviour for the user 
> than an unresponsive system.

> This patch creates a new sysctl, min_filelist_kbytes, which disables
> reclaim of file-backed pages when there are less than min_filelist_kbytes
> worth of such pages in the cache. This tunable is handy for low memory
> systems using solid-state storage where interactive response is more
> important than not OOMing.

> With this patch and min_filelist_kbytes set to 5, I see very little block 
> layer activity during low memory. The system stays responsive under low 
> memory and browser tab switching is fast. Eventually, a process gets killed 
> by OOM. Without this patch, the system gets wedged for minutes before it 
> eventually OOMs.

— https://lore.kernel.org/patchwork/patch/222042/

This patch can almost completely eliminate thrashing under memory pressure.

Effects
- Improving system responsiveness under low-memory conditions;
- Improving performance in I/O-bound tasks under memory pressure;
- OOM killer comes faster (with hard protection);
- Fast system reclaiming after OOM.

Read more: https://github.com/hakavlad/le9-patch

The patch:

From 371e3e5290652e97d5279d8cd215cd356c1fb47b Mon Sep 17 00:00:00 2001
From: Alexey Avramov 
Date: Mon, 5 Apr 2021 01:53:26 +0900
Subject: [PATCH] mm/vmscan: add sysctl knobs for protecting the specified
 amount of clean file cache

The kernel does not have a mechanism for targeted protection of clean
file pages (CFP). A certain amount of CFP is required by userspace for
normal operation. First of all, you need a cache of shared libraries
and executable files. If the volume of the CFP cache falls below a
certain level, thrashing and even livelock occur.

Protection of CFP may be used to prevent thrashing and reduce I/O under
memory pressure. Hard protection of CFP may be used to avoid high
latency and prevent livelock in near-OOM conditions. The patch provides
sysctl knobs for protecting the specified amount of clean file cache
under memory pressure.

The vm.clean_low_kbytes sysctl knob provides *best-effort* protection of
CFP. The CFP on the current node won't be reclaimed under memory pressure
when their volume is below vm.clean_low_kbytes *unless* we threaten to OOM,
have no swap space, or vm.swappiness=0. Setting it to a high value may
result in an early eviction of anonymous pages into the swap space by
attempting to hold the protected amount of clean file pages in memory. The
default value is defined by CONFIG_CLEAN_LOW_KBYTES (suggested 0 in
Kconfig).

The vm.clean_min_kbytes sysctl knob provides *hard* protection of CFP. The
CFP on the current node won't be reclaimed under memory pressure when their
volume is below vm.clean_min_kbytes. Setting it to a high value may result
in an early out-of-memory condition due to the inability to reclaim the
protected amount of CFP when other types of pages cannot be reclaimed. The
default value is defined by CONFIG_CLEAN_MIN_KBYTES (suggested 0 in
Kconfig).

Reported-by: Artem S. Tashkinov 
Signed-off-by: Alexey Avramov 
---
 Documentation/admin-guide/sysctl/vm.rst | 37 +
 include/linux/mm.h  |  3 ++
 kernel/sysctl.c | 14 
 mm/Kconfig  | 35 +++
 mm/vmscan.c | 59 +
 5 files changed, 148 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index f455fa00c..5d5ddfc85 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -26,6 +26,8 @@ Currently, these files are in /proc/sys/vm:

 - admin_reserve_kbytes
 - block_dump
+- clean_low_kbytes
+- clean_min_kbytes
 - compact_memory
 - compaction_proactiveness
 - compact_unevictable_allowed
@@ -113,6 +115,41 @@ block_dump enables block I/O debugging when set to a nonzero value. More
 information on block I/O debugging is in Documentation/admin-guide/laptops/laptop-mode.rst.


+clean_low_kbytes
+===

[PATCH 8/8] drm/msm: Support evicting GEM objects to swap

2021-04-05 Thread Rob Clark
From: Rob Clark 

Now that tracking is wired up for potentially evictable GEM objects,
wire up shrinker and the remaining GEM bits for unpinning backing pages
of inactive objects.
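
The resulting scan order, sketched (assuming the purge pass scans
priv->inactive_dontneed, which the hunks below do not show):

	/* purge dontneed objects first; if swap is available and quota
	 * remains, evict willneed objects as well */
	freed = scan(priv, sc->nr_to_scan, &priv->inactive_dontneed, purge);
	if (can_swap() && freed < sc->nr_to_scan)
		freed += scan(priv, sc->nr_to_scan - freed,
			      &priv->inactive_willneed, evict);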

Signed-off-by: Rob Clark 
---
 drivers/gpu/drm/msm/msm_gem.c  | 23 
 drivers/gpu/drm/msm/msm_gem_shrinker.c | 37 +-
 drivers/gpu/drm/msm/msm_gpu_trace.h| 13 +
 3 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/msm_gem.c b/drivers/gpu/drm/msm/msm_gem.c
index 163a1d30b5c9..2b731cf42294 100644
--- a/drivers/gpu/drm/msm/msm_gem.c
+++ b/drivers/gpu/drm/msm/msm_gem.c
@@ -759,6 +759,29 @@ void msm_gem_purge(struct drm_gem_object *obj)
0, (loff_t)-1);
 }
 
+/**
+ * Unpin the backing pages and make them available to be swapped out.
+ */
+void msm_gem_evict(struct drm_gem_object *obj)
+{
+   struct drm_device *dev = obj->dev;
+   struct msm_gem_object *msm_obj = to_msm_bo(obj);
+
+   GEM_WARN_ON(!msm_gem_is_locked(obj));
+   GEM_WARN_ON(is_unevictable(msm_obj));
+   GEM_WARN_ON(!msm_obj->evictable);
+   GEM_WARN_ON(msm_obj->active_count);
+
+   /* Get rid of any iommu mapping(s): */
+   put_iova_spaces(obj, false);
+
+   drm_vma_node_unmap(&obj->vma_node, dev->anon_inode->i_mapping);
+
+   put_pages(obj);
+
+   update_inactive(msm_obj);
+}
+
 void msm_gem_vunmap(struct drm_gem_object *obj)
 {
struct msm_gem_object *msm_obj = to_msm_bo(obj);
diff --git a/drivers/gpu/drm/msm/msm_gem_shrinker.c 
b/drivers/gpu/drm/msm/msm_gem_shrinker.c
index 38bf919f8508..52828028b9d4 100644
--- a/drivers/gpu/drm/msm/msm_gem_shrinker.c
+++ b/drivers/gpu/drm/msm/msm_gem_shrinker.c
@@ -9,12 +9,26 @@
 #include "msm_gpu.h"
 #include "msm_gpu_trace.h"
 
+bool enable_swap = true;
+MODULE_PARM_DESC(enable_swap, "Enable swappable GEM buffers");
+module_param(enable_swap, bool, 0600);
+
+static bool can_swap(void)
+{
+   return enable_swap && get_nr_swap_pages() > 0;
+}
+
 static unsigned long
 msm_gem_shrinker_count(struct shrinker *shrinker, struct shrink_control *sc)
 {
struct msm_drm_private *priv =
container_of(shrinker, struct msm_drm_private, shrinker);
-   return priv->shrinkable_count;
+   unsigned count = priv->shrinkable_count;
+
+   if (can_swap())
+   count += priv->evictable_count;
+
+   return count;
 }
 
 static bool
@@ -32,6 +46,17 @@ purge(struct msm_gem_object *msm_obj)
return true;
 }
 
+static bool
+evict(struct msm_gem_object *msm_obj)
+{
+   if (is_unevictable(msm_obj))
+   return false;
+
+   msm_gem_evict(&msm_obj->base);
+
+   return true;
+}
+
 static unsigned long
 scan(struct msm_drm_private *priv, unsigned nr_to_scan, struct list_head *list,
bool (*shrink)(struct msm_gem_object *msm_obj))
@@ -104,6 +129,16 @@ msm_gem_shrinker_scan(struct shrinker *shrinker, struct shrink_control *sc)
if (freed > 0)
trace_msm_gem_purge(freed << PAGE_SHIFT);
 
+   if (can_swap() && freed < sc->nr_to_scan) {
+   int evicted = scan(priv, sc->nr_to_scan - freed,
+			&priv->inactive_willneed, evict);
+
+   if (evicted > 0)
+   trace_msm_gem_evict(evicted << PAGE_SHIFT);
+
+   freed += evicted;
+   }
+
return (freed > 0) ? freed : SHRINK_STOP;
 }
 
diff --git a/drivers/gpu/drm/msm/msm_gpu_trace.h b/drivers/gpu/drm/msm/msm_gpu_trace.h
index 03e0c2536b94..ca0b08d7875b 100644
--- a/drivers/gpu/drm/msm/msm_gpu_trace.h
+++ b/drivers/gpu/drm/msm/msm_gpu_trace.h
@@ -128,6 +128,19 @@ TRACE_EVENT(msm_gem_purge,
 );
 
 
+TRACE_EVENT(msm_gem_evict,
+   TP_PROTO(u32 bytes),
+   TP_ARGS(bytes),
+   TP_STRUCT__entry(
+   __field(u32, bytes)
+   ),
+   TP_fast_assign(
+   __entry->bytes = bytes;
+   ),
+   TP_printk("Evicting %u bytes", __entry->bytes)
+);
+
+
 TRACE_EVENT(msm_gem_purge_vmaps,
TP_PROTO(u32 unmapped),
TP_ARGS(unmapped),
-- 
2.30.2



Re: [PATCH v3] coccinelle: misc: add swap script

2021-04-04 Thread Julia Lawall



On Sun, 28 Mar 2021, Denis Efremov wrote:

> Ping?

Applied.  Thanks.

>
> On 3/5/21 1:09 PM, Denis Efremov wrote:
> > Check for opencoded swap() implementation.
> >
> > Signed-off-by: Denis Efremov 
> > ---
> > Changes in v2:
> >   - additional patch rule to drop excessive {}
> >   - fix indentation in patch mode by anchoring ;
> > Changes in v3:
> >   - Rule added for simple (without var init) swap highlighting in !patch
> > mode
> >   - "depends on patch && (rpvar || rp)" fixed
> >
> >   scripts/coccinelle/misc/swap.cocci | 122 +
> >   1 file changed, 122 insertions(+)
> >   create mode 100644 scripts/coccinelle/misc/swap.cocci
> >
> > diff --git a/scripts/coccinelle/misc/swap.cocci b/scripts/coccinelle/misc/swap.cocci
> > new file mode 100644
> > index ..c5e71b7ef7f5
> > --- /dev/null
> > +++ b/scripts/coccinelle/misc/swap.cocci
> > @@ -0,0 +1,122 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +///
> > +/// Check for opencoded swap() implementation.
> > +///
> > +// Confidence: High
> > +// Copyright: (C) 2021 Denis Efremov ISPRAS
> > +// Options: --no-includes --include-headers
> > +//
> > +// Keywords: swap
> > +//
> > +
> > +virtual patch
> > +virtual org
> > +virtual report
> > +virtual context
> > +
> > +@rvar depends on !patch@
> > +identifier tmp;
> > +expression a, b;
> > +type T;
> > +position p;
> > +@@
> > +
> > +(
> > +* T tmp;
> > +|
> > +* T tmp = 0;
> > +|
> > +* T *tmp = NULL;
> > +)
> > +... when != tmp
> > +* tmp = a;
> > +* a = b;@p
> > +* b = tmp;
> > +... when != tmp
> > +
> > +@r depends on !patch@
> > +identifier tmp;
> > +expression a, b;
> > +position p != rvar.p;
> > +@@
> > +
> > +* tmp = a;
> > +* a = b;@p
> > +* b = tmp;
> > +
> > +@rpvar depends on patch@
> > +identifier tmp;
> > +expression a, b;
> > +type T;
> > +@@
> > +
> > +(
> > +- T tmp;
> > +|
> > +- T tmp = 0;
> > +|
> > +- T *tmp = NULL;
> > +)
> > +... when != tmp
> > +- tmp = a;
> > +- a = b;
> > +- b = tmp
> > ++ swap(a, b)
> > +  ;
> > +... when != tmp
> > +
> > +@rp depends on patch@
> > +identifier tmp;
> > +expression a, b;
> > +@@
> > +
> > +- tmp = a;
> > +- a = b;
> > +- b = tmp
> > ++ swap(a, b)
> > +  ;
> > +
> > +@depends on patch && (rpvar || rp)@
> > +@@
> > +
> > +(
> > +  for (...;...;...)
> > +- {
> > +   swap(...);
> > +- }
> > +|
> > +  while (...)
> > +- {
> > +   swap(...);
> > +- }
> > +|
> > +  if (...)
> > +- {
> > +   swap(...);
> > +- }
> > +)
> > +
> > +
> > +@script:python depends on report@
> > +p << r.p;
> > +@@
> > +
> > +coccilib.report.print_report(p[0], "WARNING opportunity for swap()")
> > +
> > +@script:python depends on org@
> > +p << r.p;
> > +@@
> > +
> > +coccilib.org.print_todo(p[0], "WARNING opportunity for swap()")
> > +
> > +@script:python depends on report@
> > +p << rvar.p;
> > +@@
> > +
> > +coccilib.report.print_report(p[0], "WARNING opportunity for swap()")
> > +
> > +@script:python depends on org@
> > +p << rvar.p;
> > +@@
> > +
> > +coccilib.org.print_todo(p[0], "WARNING opportunity for swap()")
> >
>


Re: [RFC PATCH] mm/swap: fix system stuck due to infinite loop

2021-04-02 Thread Andrew Morton
On Fri, 2 Apr 2021 15:03:37 +0800 Stillinux  wrote:

> Under high system memory and load pressure, we ran the ltp test and
> found that the system got stuck: direct memory reclaim was all stuck
> in io_schedule, the waiting request was stuck in the blk_plug flow of
> one process, and that process fell into an infinite loop, never
> flushing out the request.
> 
> The call flow of this process is swap_cluster_readahead, which uses
> blk_start/finish_plug for the blk_plug operation, with the flow
> swap_cluster_readahead->__read_swap_cache_async->swapcache_prepare.
> When swapcache_prepare returns -EEXIST, the process falls into an
> infinite loop even though cond_resched is called, because on schedule
> sched_submit_work decides based on tsk->state and will not flush out
> the blk_plug request, so the I/O hangs and the overall system hangs.
> 
> Since this is our first time working in the swap code, we have no good
> way to fix the problem at its root. To resolve the engineering
> situation, we chose to make swap_cluster_readahead aware of memory
> pressure as soon as possible and do io_schedule to flush out the
> blk_plug request, by changing the allocation flag in swap_readpage to
> GFP_NOIO so that the allocation no longer does memory reclaim that
> flushes I/O. With this the system operates normally, although it is
> not the most fundamental fix.
> 

Thanks.

I'm not understanding why swapcache_prepare() repeatedly returns
-EEXIST in this situation?

And how does the switch to GFP_NOIO fix this?  Simply by avoiding
direct reclaim altogether?
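
(For reference, a sketch of what the flag change means in allocation
context terms; GFP_NOIO does not forbid direct reclaim, it only forbids
starting new I/O from within the allocation:)

	bio = bio_alloc(GFP_KERNEL, 1);	/* may enter direct reclaim and
					 * itself issue I/O to make progress */
	bio = bio_alloc(GFP_NOIO, 1);	/* may still reclaim, but must not
					 * start new I/O while doing so */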

> ---
>  mm/page_io.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_io.c b/mm/page_io.c
> index c493ce9ebcf5..87392ffabb12 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -403,7 +403,7 @@ int swap_readpage(struct page *page, bool synchronous)
>   }
> 
>   ret = 0;
> - bio = bio_alloc(GFP_KERNEL, 1);
> + bio = bio_alloc(GFP_NOIO, 1);
>   bio_set_dev(bio, sis->bdev);
>   bio->bi_opf = REQ_OP_READ;
>   bio->bi_iter.bi_sector = swap_page_sector(page);



Re: [PATCH 08/10] mm/vmscan: Consider anonymous pages without swap

2021-04-01 Thread Wei Xu
On Thu, Apr 1, 2021 at 11:35 AM Dave Hansen  wrote:
>
>
> From: Keith Busch 
>
> Reclaim anonymous pages if a migration path is available now that
> demotion provides a non-swap recourse for reclaiming anon pages.
>
> Note that this check is subtly different from the
> anon_should_be_aged() checks.  This mechanism checks whether a
> specific page in a specific context *can* actually be reclaimed, given
> current swap space and cgroup limits.
>
> anon_should_be_aged() is a much simpler and more preliminary check
> which just says whether there is a possibility of future reclaim.
>
> #Signed-off-by: Keith Busch 
> Cc: Keith Busch 
> Signed-off-by: Dave Hansen 
> Reviewed-by: Yang Shi 
> Cc: Wei Xu 
> Cc: David Rientjes 
> Cc: Huang Ying 
> Cc: Dan Williams 
> Cc: David Hildenbrand 
> Cc: osalvador 
>
> --
>
> Changes from Dave 10/2020:
>  * remove 'total_swap_pages' modification
>
> Changes from Dave 06/2020:
>  * rename reclaim_anon_pages()->can_reclaim_anon_pages()
>
> Note: Keith's Intel SoB is commented out because he is no
> longer at Intel and his @intel.com mail will bounce.
> ---
>
>  b/mm/vmscan.c |   35 ---
>  1 file changed, 32 insertions(+), 3 deletions(-)
>
> diff -puN mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap mm/vmscan.c
> --- a/mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap	2021-03-31 15:17:19.388000242 -0700
> +++ b/mm/vmscan.c	2021-03-31 15:17:19.407000242 -0700
> @@ -287,6 +287,34 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>  }
>  #endif
>
> +static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
> +     int node_id)
> +{
> +   if (memcg == NULL) {
> +   /*
> +* For non-memcg reclaim, is there
> +* space in any swap device?
> +*/
> +   if (get_nr_swap_pages() > 0)
> +   return true;
> +   } else {
> +   /* Is the memcg below its swap limit? */
> +   if (mem_cgroup_get_nr_swap_pages(memcg) > 0)
> +   return true;
> +   }
> +
> +   /*
> +* The page can not be swapped.
> +*
> +* Can it be reclaimed from this node via demotion?
> +*/
> +   if (next_demotion_node(node_id) >= 0)
> +   return true;

When neither swap space nor RECLAIM_MIGRATE is enabled, but
next_demotion_node() is configured, inactive pages can be neither swapped
out nor demoted.  However, this check can still cause these pages to be
sent to shrink_page_list() (e.g., when can_reclaim_anon_pages() is called
by get_scan_count()) and cause THP pages to be unnecessarily split there.

One fix would be to guard this next_demotion_node() check with the
RECLAIM_MIGRATE node_reclaim_mode check.  This RECLAIM_MIGRATE
check needs to be applied to other calls to next_demotion_node() in
vmscan.c as well.
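
Roughly, a sketch of the suggested guard (assuming the RECLAIM_MIGRATE
bit and node_reclaim_mode as used in mm/vmscan.c):

	/* only treat demotion as a reclaim path when opted in */
	if ((node_reclaim_mode & RECLAIM_MIGRATE) &&
	    next_demotion_node(node_id) >= 0)
		return true;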

> +
> +   /* No way to reclaim anon pages */
> +   return false;
> +}
> +
>  /*
>   * This misses isolated pages which are not accounted for to save counters.
>   * As the data only determines if reclaim or compaction continues, it is
> @@ -298,7 +326,7 @@ unsigned long zone_reclaimable_pages(str
>
> nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
> zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
> -   if (get_nr_swap_pages() > 0)
> +   if (can_reclaim_anon_pages(NULL, zone_to_nid(zone)))
> nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
> zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
>
> @@ -2323,6 +2351,7 @@ enum scan_balance {
>  static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>			   unsigned long *nr)
>  {
> +   struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> unsigned long anon_cost, file_cost, total_cost;
> int swappiness = mem_cgroup_swappiness(memcg);
> @@ -2333,7 +2362,7 @@ static void get_scan_count(struct lruvec
> enum lru_list lru;
>
> /* If we have no swap space, do not bother scanning anon pages. */
> -   if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
> +   if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) {

Demotion of anon pages still depends on sc->may_swap.  Any thoughts on
decoupling demotion from swapping more completely?

> scan_balance = SCAN_FILE;
> goto out;
> }
> @@ -2708,7 +2737,7 @@ static inlin

[PATCH 08/10] mm/vmscan: Consider anonymous pages without swap

2021-04-01 Thread Dave Hansen


From: Keith Busch 

Reclaim anonymous pages if a migration path is available now that
demotion provides a non-swap recourse for reclaiming anon pages.

Note that this check is subtly different from the
anon_should_be_aged() checks.  This mechanism checks whether a
specific page in a specific context *can* actually be reclaimed, given
current swap space and cgroup limits.

anon_should_be_aged() is a much simpler and more preliminary check
which just says whether there is a possibility of future reclaim.

#Signed-off-by: Keith Busch 
Cc: Keith Busch 
Signed-off-by: Dave Hansen 
Reviewed-by: Yang Shi 
Cc: Wei Xu 
Cc: David Rientjes 
Cc: Huang Ying 
Cc: Dan Williams 
Cc: David Hildenbrand 
Cc: osalvador 

--

Changes from Dave 10/2020:
 * remove 'total_swap_pages' modification

Changes from Dave 06/2020:
 * rename reclaim_anon_pages()->can_reclaim_anon_pages()

Note: Keith's Intel SoB is commented out because he is no
longer at Intel and his @intel.com mail will bounce.
---

 b/mm/vmscan.c |   35 ---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff -puN mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap mm/vmscan.c
--- a/mm/vmscan.c~0009-mm-vmscan-Consider-anonymous-pages-without-swap	2021-03-31 15:17:19.388000242 -0700
+++ b/mm/vmscan.c	2021-03-31 15:17:19.407000242 -0700
@@ -287,6 +287,34 @@ static bool writeback_throttling_sane(struct scan_control *sc)
 }
 #endif
 
+static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
+ int node_id)
+{
+   if (memcg == NULL) {
+   /*
+* For non-memcg reclaim, is there
+* space in any swap device?
+*/
+   if (get_nr_swap_pages() > 0)
+   return true;
+   } else {
+   /* Is the memcg below its swap limit? */
+   if (mem_cgroup_get_nr_swap_pages(memcg) > 0)
+   return true;
+   }
+
+   /*
+* The page can not be swapped.
+*
+* Can it be reclaimed from this node via demotion?
+*/
+   if (next_demotion_node(node_id) >= 0)
+   return true;
+
+   /* No way to reclaim anon pages */
+   return false;
+}
+
 /*
  * This misses isolated pages which are not accounted for to save counters.
  * As the data only determines if reclaim or compaction continues, it is
@@ -298,7 +326,7 @@ unsigned long zone_reclaimable_pages(str
 
nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
-   if (get_nr_swap_pages() > 0)
+   if (can_reclaim_anon_pages(NULL, zone_to_nid(zone)))
nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
 
@@ -2323,6 +2351,7 @@ enum scan_balance {
 static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
   unsigned long *nr)
 {
+   struct pglist_data *pgdat = lruvec_pgdat(lruvec);
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
unsigned long anon_cost, file_cost, total_cost;
int swappiness = mem_cgroup_swappiness(memcg);
@@ -2333,7 +2362,7 @@ static void get_scan_count(struct lruvec
enum lru_list lru;
 
/* If we have no swap space, do not bother scanning anon pages. */
-   if (!sc->may_swap || mem_cgroup_get_nr_swap_pages(memcg) <= 0) {
+   if (!sc->may_swap || !can_reclaim_anon_pages(memcg, pgdat->node_id)) {
scan_balance = SCAN_FILE;
goto out;
}
@@ -2708,7 +2737,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 */
pages_for_compaction = compact_gap(sc->order);
inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
-   if (get_nr_swap_pages() > 0)
+   if (can_reclaim_anon_pages(NULL, pgdat->node_id))
inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
 
return inactive_lru_pages > pages_for_compaction;
_


Re: [PATCH v7 1/8] mm: Remove special swap entry functions

2021-03-30 Thread Jason Gunthorpe
On Fri, Mar 26, 2021 at 11:07:58AM +1100, Alistair Popple wrote:
> Remove multiple similar inline functions for dealing with different
> types of special swap entries.
> 
> Both migration and device private swap entries use the swap offset to
> store a pfn. Instead of multiple inline functions to obtain a struct
> page for each swap entry type use a common function
> pfn_swap_entry_to_page(). Also open-code the various entry_to_pfn()
> functions as this results in shorter code that is easier to understand.
> 
> Signed-off-by: Alistair Popple 
> Reviewed-by: Ralph Campbell 
> Reviewed-by: Christoph Hellwig 
> 
> ---
> 
> v7:
> * Reworded commit message to include pfn_swap_entry_to_page()
> * Added Christoph's Reviewed-by
> 
> v6:
> * Removed redundant compound_page() call from inside PageLocked()
> * Fixed a minor build issue for s390 reported by kernel test bot
> 
> v4:
> * Added pfn_swap_entry_to_page()
> * Reinstated check that migration entries point to locked pages
> * Removed #define swapcache_prepare which isn't needed for CONFIG_SWAP=0
>   builds
> ---
>  arch/s390/mm/pgtable.c  |  2 +-
>  fs/proc/task_mmu.c  | 23 +-
>  include/linux/swap.h|  4 +--
>  include/linux/swapops.h | 69 ++---
>  mm/hmm.c|  5 ++-
>  mm/huge_memory.c|  4 +--
>  mm/memcontrol.c |  2 +-
>  mm/memory.c | 10 +++---
>  mm/migrate.c|  6 ++--
>  mm/page_vma_mapped.c|  6 ++--
>  10 files changed, 50 insertions(+), 81 deletions(-)

Looks good

Reviewed-by: Jason Gunthorpe 

> diff --git a/mm/hmm.c b/mm/hmm.c
> index 943cb2ba4442..3b2dda71d0ed 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -214,7 +214,7 @@ static inline bool hmm_is_device_private_entry(struct hmm_range *range,
>   swp_entry_t entry)
>  {
>   return is_device_private_entry(entry) &&
> - device_private_entry_to_page(entry)->pgmap->owner ==
> + pfn_swap_entry_to_page(entry)->pgmap->owner ==
>   range->dev_private_owner;
>  }
>  
> @@ -257,8 +257,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>   cpu_flags = HMM_PFN_VALID;
>   if (is_write_device_private_entry(entry))
>   cpu_flags |= HMM_PFN_WRITE;
> - *hmm_pfn = device_private_entry_to_pfn(entry) |
> - cpu_flags;
> + *hmm_pfn = swp_offset(entry) | cpu_flags;

Though swp_offset() seems poor here

Something like this seems nicer, maybe as an additional patch in this
series?

diff --git a/mm/hmm.c b/mm/hmm.c
index 943cb2ba444232..c06cbc4e3981b7 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -210,14 +210,6 @@ int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr,
unsigned long end, unsigned long hmm_pfns[], pmd_t pmd);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-static inline bool hmm_is_device_private_entry(struct hmm_range *range,
-   swp_entry_t entry)
-{
-   return is_device_private_entry(entry) &&
-   device_private_entry_to_page(entry)->pgmap->owner ==
-   range->dev_private_owner;
-}
-
 static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range *range,
 pte_t pte)
 {
@@ -226,6 +218,32 @@ static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range *range,
return pte_write(pte) ? (HMM_PFN_VALID | HMM_PFN_WRITE) : HMM_PFN_VALID;
 }
 
+static bool hmm_pte_handle_device_private(struct hmm_range *range, pte_t pte,
+ unsigned long *hmm_pfn)
+{
+   swp_entry_t entry = pte_to_swp_entry(pte);
+   struct page *device_page;
+   unsigned long cpu_flags;
+
+   if (!is_device_private_entry(entry))
+   return false;
+
+   /*
+* If the device private page matches the device the caller understands
+* then return the private pfn directly. The caller must know what to do
+* with it.
+*/
+   device_page = pfn_swap_entry_to_page(entry);
+   if (device_page->pgmap->owner != range->dev_private_owner)
+   return false;
+
+   cpu_flags = HMM_PFN_VALID;
+   if (is_write_device_private_entry(entry))
+   cpu_flags |= HMM_PFN_WRITE;
+   *hmm_pfn = page_to_pfn(device_page) | cpu_flags;
+   return true;
+}
+
 static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
  unsigned long end, pmd_t *pmdp, pte_t *ptep,
  unsigned long *hmm_pfn)
@@ -247,20 +265,8 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long

[PATCH v8 1/2] dt-bindings: serial: Add rx-tx-swap to stm32-usart

2021-03-28 Thread Martin Devera
Add new rx-tx-swap property to allow for RX & TX pin swapping.

Signed-off-by: Martin Devera 
Acked-by: Fabrice Gasnier 
Reviewed-by: Rob Herring 
---
v8:
  - rebase to the latest tty-next
v7:
  - fix yaml linter warning
v6:
  - add version changelog
v5:
  - yaml fixes based on Rob Herring comments
- add serial.yaml reference
- move compatible from 'then' to 'if'
v3:
  - don't allow rx-tx-swap for st,stm32-uart (suggested
by Fabrice Gasnier)
v2:
  - change st,swap to rx-tx-swap (suggested by Rob Herring)
---
 .../devicetree/bindings/serial/st,stm32-uart.yaml  | 29 ++
 1 file changed, 19 insertions(+), 10 deletions(-)
---
 .../devicetree/bindings/serial/st,stm32-uart.yaml  | 29 ++
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/Documentation/devicetree/bindings/serial/st,stm32-uart.yaml b/Documentation/devicetree/bindings/serial/st,stm32-uart.yaml
index 8631678283f9..126e07566965 100644
--- a/Documentation/devicetree/bindings/serial/st,stm32-uart.yaml
+++ b/Documentation/devicetree/bindings/serial/st,stm32-uart.yaml
@@ -9,9 +9,6 @@ maintainers:
 
 title: STMicroelectronics STM32 USART bindings
 
-allOf:
-  - $ref: rs485.yaml
-
 properties:
   compatible:
 enum:
@@ -40,6 +37,8 @@ properties:
 
   uart-has-rtscts: true
 
+  rx-tx-swap: true
+
   dmas:
 minItems: 1
 maxItems: 2
@@ -66,13 +65,23 @@ properties:
   linux,rs485-enabled-at-boot-time: true
   rs485-rx-during-tx: true
 
-if:
-  required:
-- st,hw-flow-ctrl
-then:
-  properties:
-cts-gpios: false
-rts-gpios: false
+allOf:
+  - $ref: rs485.yaml#
+  - $ref: serial.yaml#
+  - if:
+  required:
+- st,hw-flow-ctrl
+then:
+  properties:
+cts-gpios: false
+rts-gpios: false
+  - if:
+  properties:
+compatible:
+  const: st,stm32-uart
+then:
+  properties:
+rx-tx-swap: false
 
 required:
   - compatible
-- 
2.11.0



[PATCH v8 2/2] tty/serial: Add rx-tx-swap OF option to stm32-usart

2021-03-28 Thread Martin Devera
STM32 F7/H7 USARTs support RX & TX pin swapping.
Add an option to turn it on.
Tested on STM32MP157.

Signed-off-by: Martin Devera 
Acked-by: Fabrice Gasnier 
---
v8:
  - rebase to the latest tty-next
v6:
  - add version changelog
v4:
  - delete superfluous has_swap=false
v3:
  - add has_swap to stm32_usart_info (because F4 line
doesn't support swapping)
  - move swap variable init from stm32_usart_of_get_port
to stm32_usart_init_port because info struct is not
initialized in stm32_usart_of_get_port yet
  - set USART_CR2_SWAP in stm32_usart_startup too
v2:
  - change st,swap to rx-tx-swap (pointed out by Rob Herring)
  - rebase patches as suggested by Greg Kroah-Hartman
---
 drivers/tty/serial/stm32-usart.c | 11 ++-
 drivers/tty/serial/stm32-usart.h |  4 
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/tty/serial/stm32-usart.c b/drivers/tty/serial/stm32-usart.c
index cba4f4ddf164..4d277804c63e 100644
--- a/drivers/tty/serial/stm32-usart.c
+++ b/drivers/tty/serial/stm32-usart.c
@@ -671,6 +671,12 @@ static int stm32_usart_startup(struct uart_port *port)
if (ret)
return ret;
 
+   if (stm32_port->swap) {
+   val = readl_relaxed(port->membase + ofs->cr2);
+   val |= USART_CR2_SWAP;
+   writel_relaxed(val, port->membase + ofs->cr2);
+   }
+
/* RX FIFO Flush */
if (ofs->rqr != UNDEF_REG)
writel_relaxed(USART_RQR_RXFRQ, port->membase + ofs->rqr);
@@ -789,7 +795,7 @@ static void stm32_usart_set_termios(struct uart_port *port,
cr1 = USART_CR1_TE | USART_CR1_RE;
if (stm32_port->fifoen)
cr1 |= USART_CR1_FIFOEN;
-   cr2 = 0;
+   cr2 = stm32_port->swap ? USART_CR2_SWAP : 0;
 
/* Tx and RX FIFO configuration */
cr3 = readl_relaxed(port->membase + ofs->cr3);
@@ -1047,6 +1053,9 @@ static int stm32_usart_init_port(struct stm32_port *stm32port,
stm32port->wakeup_src = stm32port->info->cfg.has_wakeup &&
of_property_read_bool(pdev->dev.of_node, "wakeup-source");
 
+   stm32port->swap = stm32port->info->cfg.has_swap &&
+   of_property_read_bool(pdev->dev.of_node, "rx-tx-swap");
+
stm32port->fifoen = stm32port->info->cfg.has_fifo;
 
res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
diff --git a/drivers/tty/serial/stm32-usart.h b/drivers/tty/serial/stm32-usart.h
index a86773f1a4c4..77d1ac082e89 100644
--- a/drivers/tty/serial/stm32-usart.h
+++ b/drivers/tty/serial/stm32-usart.h
@@ -25,6 +25,7 @@ struct stm32_usart_offsets {
 struct stm32_usart_config {
u8 uart_enable_bit; /* USART_CR1_UE */
bool has_7bits_data;
+   bool has_swap;
bool has_wakeup;
bool has_fifo;
int fifosize;
@@ -76,6 +77,7 @@ struct stm32_usart_info stm32f7_info = {
.cfg = {
.uart_enable_bit = 0,
.has_7bits_data = true,
+   .has_swap = true,
.fifosize = 1,
}
 };
@@ -97,6 +99,7 @@ struct stm32_usart_info stm32h7_info = {
.cfg = {
.uart_enable_bit = 0,
.has_7bits_data = true,
+   .has_swap = true,
.has_wakeup = true,
.has_fifo = true,
.fifosize = 16,
@@ -268,6 +271,7 @@ struct stm32_port {
int last_res;
	bool tx_dma_busy;	/* dma tx busy */
bool hw_flow_control;
+   bool swap;   /* swap RX & TX pins */
bool fifoen;
bool wakeup_src;
int rdr_mask;   /* receive data register mask */
-- 
2.11.0



Re: [PATCH v3] coccinelle: misc: add swap script

2021-03-28 Thread Denis Efremov

Ping?

On 3/5/21 1:09 PM, Denis Efremov wrote:

Check for opencoded swap() implementation.

Signed-off-by: Denis Efremov 
---
Changes in v2:
  - additional patch rule to drop excessive {}
  - fix indentation in patch mode by anchoring ;
Changes in v3:
  - Rule added for simple (without var init) swap highlighting in !patch mode
  - "depends on patch && (rpvar || rp)" fixed

  scripts/coccinelle/misc/swap.cocci | 122 +
  1 file changed, 122 insertions(+)
  create mode 100644 scripts/coccinelle/misc/swap.cocci
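
For reference, the rewrite the patch-mode rules below perform on a match,
as a minimal C sketch (swap() is the kernel's macro, which expands to the
same three-assignment exchange through a local temporary):

	/* before: open-coded swap of a and b */
	int tmp;

	tmp = a;
	a = b;
	b = tmp;

	/* after: */
	swap(a, b);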

diff --git a/scripts/coccinelle/misc/swap.cocci b/scripts/coccinelle/misc/swap.cocci
new file mode 100644
index 000000000000..c5e71b7ef7f5
--- /dev/null
+++ b/scripts/coccinelle/misc/swap.cocci
@@ -0,0 +1,122 @@
+// SPDX-License-Identifier: GPL-2.0-only
+///
+/// Check for opencoded swap() implementation.
+///
+// Confidence: High
+// Copyright: (C) 2021 Denis Efremov ISPRAS
+// Options: --no-includes --include-headers
+//
+// Keywords: swap
+//
+
+virtual patch
+virtual org
+virtual report
+virtual context
+
+@rvar depends on !patch@
+identifier tmp;
+expression a, b;
+type T;
+position p;
+@@
+
+(
+* T tmp;
+|
+* T tmp = 0;
+|
+* T *tmp = NULL;
+)
+... when != tmp
+* tmp = a;
+* a = b;@p
+* b = tmp;
+... when != tmp
+
+@r depends on !patch@
+identifier tmp;
+expression a, b;
+position p != rvar.p;
+@@
+
+* tmp = a;
+* a = b;@p
+* b = tmp;
+
+@rpvar depends on patch@
+identifier tmp;
+expression a, b;
+type T;
+@@
+
+(
+- T tmp;
+|
+- T tmp = 0;
+|
+- T *tmp = NULL;
+)
+... when != tmp
+- tmp = a;
+- a = b;
+- b = tmp
++ swap(a, b)
+  ;
+... when != tmp
+
+@rp depends on patch@
+identifier tmp;
+expression a, b;
+@@
+
+- tmp = a;
+- a = b;
+- b = tmp
++ swap(a, b)
+  ;
+
+@depends on patch && (rpvar || rp)@
+@@
+
+(
+  for (...;...;...)
+- {
+   swap(...);
+- }
+|
+  while (...)
+- {
+   swap(...);
+- }
+|
+  if (...)
+- {
+   swap(...);
+- }
+)
+
+
+@script:python depends on report@
+p << r.p;
+@@
+
+coccilib.report.print_report(p[0], "WARNING opportunity for swap()")
+
+@script:python depends on org@
+p << r.p;
+@@
+
+coccilib.org.print_todo(p[0], "WARNING opportunity for swap()")
+
+@script:python depends on report@
+p << rvar.p;
+@@
+
+coccilib.report.print_report(p[0], "WARNING opportunity for swap()")
+
+@script:python depends on org@
+p << rvar.p;
+@@
+
+coccilib.org.print_todo(p[0], "WARNING opportunity for swap()")



Re: [PATCH v7 2/2] tty/serial: Add rx-tx-swap OF option to stm32-usart

2021-03-26 Thread Greg Kroah-Hartman
On Fri, Mar 12, 2021 at 04:37:02PM +0100, Martin Devera wrote:
> STM32 F7/H7 usarts supports RX & TX pin swapping.
> Add option to turn it on.
> Tested on STM32MP157.
> 
> Signed-off-by: Martin Devera 
> Acked-by: Fabrice Gasnier 

This does not apply to my tty-next branch at all.  Can you please rebase
this series (and keep Rob's ack of patch 1) and resend?

thanks,

greg k-h


[PATCH v7 2/8] mm/swapops: Rework swap entry manipulation code

2021-03-25 Thread Alistair Popple
Both migration and device private pages use special swap entries that
are manipulated by a range of inline functions. The arguments to these
are somewhat inconsistent, so rework them to remove flag-type arguments
and to make the arguments similar for both read and write entry
creation.
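
A before/after sketch of the new calling convention (the helper names are
taken from the hunks below; the call site itself is illustrative):

	/* old: a flag argument selects read vs. write */
	entry = make_migration_entry(page, is_write);

	/* new: the intent is in the function name, and the argument is
	 * a pfn-based offset rather than a struct page
	 */
	if (is_write)
		entry = make_writable_migration_entry(page_to_pfn(page));
	else
		entry = make_readable_migration_entry(page_to_pfn(page));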

Signed-off-by: Alistair Popple 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Ralph Campbell 
---
 include/linux/swapops.h | 56 ++---
 mm/debug_vm_pgtable.c   | 12 -
 mm/hmm.c|  2 +-
 mm/huge_memory.c| 26 +--
 mm/hugetlb.c| 10 +---
 mm/memory.c | 10 +---
 mm/migrate.c| 26 ++-
 mm/mprotect.c   | 10 +---
 mm/rmap.c   | 10 +---
 9 files changed, 100 insertions(+), 62 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 139be8235ad2..4dfd807ae52a 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -100,35 +100,35 @@ static inline void *swp_to_radix_entry(swp_entry_t entry)
 }
 
 #if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
-static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
+static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
-   return swp_entry(write ? SWP_DEVICE_WRITE : SWP_DEVICE_READ,
-page_to_pfn(page));
+   return swp_entry(SWP_DEVICE_READ, offset);
 }
 
-static inline bool is_device_private_entry(swp_entry_t entry)
+static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset)
 {
-   int type = swp_type(entry);
-   return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
+   return swp_entry(SWP_DEVICE_WRITE, offset);
 }
 
-static inline void make_device_private_entry_read(swp_entry_t *entry)
+static inline bool is_device_private_entry(swp_entry_t entry)
 {
-   *entry = swp_entry(SWP_DEVICE_READ, swp_offset(*entry));
+   int type = swp_type(entry);
+   return type == SWP_DEVICE_READ || type == SWP_DEVICE_WRITE;
 }
 
-static inline bool is_write_device_private_entry(swp_entry_t entry)
+static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
 }
 #else /* CONFIG_DEVICE_PRIVATE */
-static inline swp_entry_t make_device_private_entry(struct page *page, bool write)
+static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
return swp_entry(0, 0);
 }
 
-static inline void make_device_private_entry_read(swp_entry_t *entry)
+static inline swp_entry_t make_writable_device_private_entry(pgoff_t offset)
 {
+   return swp_entry(0, 0);
 }
 
 static inline bool is_device_private_entry(swp_entry_t entry)
@@ -136,35 +136,32 @@ static inline bool is_device_private_entry(swp_entry_t entry)
return false;
 }
 
-static inline bool is_write_device_private_entry(swp_entry_t entry)
+static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
return false;
 }
 #endif /* CONFIG_DEVICE_PRIVATE */
 
 #ifdef CONFIG_MIGRATION
-static inline swp_entry_t make_migration_entry(struct page *page, int write)
-{
-   BUG_ON(!PageLocked(compound_head(page)));
-
-   return swp_entry(write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ,
-   page_to_pfn(page));
-}
-
 static inline int is_migration_entry(swp_entry_t entry)
 {
return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
-static inline int is_write_migration_entry(swp_entry_t entry)
+static inline int is_writable_migration_entry(swp_entry_t entry)
 {
return unlikely(swp_type(entry) == SWP_MIGRATION_WRITE);
 }
 
-static inline void make_migration_entry_read(swp_entry_t *entry)
+static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
 {
-   *entry = swp_entry(SWP_MIGRATION_READ, swp_offset(*entry));
+   return swp_entry(SWP_MIGRATION_READ, offset);
+}
+
+static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
+{
+   return swp_entry(SWP_MIGRATION_WRITE, offset);
 }
 
 extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
@@ -174,21 +171,28 @@ extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
 extern void migration_entry_wait_huge(struct vm_area_struct *vma,
struct mm_struct *mm, pte_t *pte);
 #else
+static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
+{
+   return swp_entry(0, 0);
+}
+
+static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
+{
+   return swp_entry(0, 0);
+}
 
-#define make_migration_entry(page, write) swp_entry(0, 0)
 static inline int is_migration_entry(swp_entry_t swp)
 {
return 0;
 }
 
-static inline void make_migration_entry_read(swp_entry_t *entryp) { }
 static inline void __migration_entry_wait(struct mm_struct *mm

[PATCH v7 1/8] mm: Remove special swap entry functions

2021-03-25 Thread Alistair Popple
Remove multiple similar inline functions for dealing with different
types of special swap entries.

Both migration and device private swap entries use the swap offset to
store a pfn. Instead of multiple inline functions to obtain a struct
page for each swap entry type, use a common function,
pfn_swap_entry_to_page(). Also open-code the various entry_to_pfn()
functions, as this results in shorter code that is easier to understand.
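
A minimal before/after sketch of a converted call site (it mirrors the
fs/proc/task_mmu.c hunks below):

	/* before: one accessor per special swap entry type */
	if (is_migration_entry(swpent))
		page = migration_entry_to_page(swpent);
	else if (is_device_private_entry(swpent))
		page = device_private_entry_to_page(swpent);

	/* after: both entry types store a pfn in the swap offset,
	 * so a single helper covers them
	 */
	if (is_pfn_swap_entry(swpent))
		page = pfn_swap_entry_to_page(swpent);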

Signed-off-by: Alistair Popple 
Reviewed-by: Ralph Campbell 
Reviewed-by: Christoph Hellwig 

---

v7:
* Reworded commit message to include pfn_swap_entry_to_page()
* Added Christoph's Reviewed-by

v6:
* Removed redundant compound_page() call from inside PageLocked()
* Fixed a minor build issue for s390 reported by kernel test bot

v4:
* Added pfn_swap_entry_to_page()
* Reinstated check that migration entries point to locked pages
* Removed #define swapcache_prepare which isn't needed for CONFIG_SWAP=0
  builds
---
 arch/s390/mm/pgtable.c  |  2 +-
 fs/proc/task_mmu.c  | 23 +-
 include/linux/swap.h|  4 +--
 include/linux/swapops.h | 69 ++---
 mm/hmm.c|  5 ++-
 mm/huge_memory.c|  4 +--
 mm/memcontrol.c |  2 +-
 mm/memory.c | 10 +++---
 mm/migrate.c|  6 ++--
 mm/page_vma_mapped.c|  6 ++--
 10 files changed, 50 insertions(+), 81 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 18205f851c24..eec3a9d7176e 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -691,7 +691,7 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry)
if (!non_swap_entry(entry))
dec_mm_counter(mm, MM_SWAPENTS);
else if (is_migration_entry(entry)) {
-   struct page *page = migration_entry_to_page(entry);
+   struct page *page = pfn_swap_entry_to_page(entry);
 
dec_mm_counter(mm, mm_counter(page));
}
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 3cec6fbef725..08ee59d945c0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -514,10 +514,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
} else {
mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
}
-   } else if (is_migration_entry(swpent))
-   page = migration_entry_to_page(swpent);
-   else if (is_device_private_entry(swpent))
-   page = device_private_entry_to_page(swpent);
+   } else if (is_pfn_swap_entry(swpent))
+   page = pfn_swap_entry_to_page(swpent);
} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
&& pte_none(*pte))) {
		page = xa_load(&vma->vm_file->f_mapping->i_pages,
@@ -549,7 +547,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
swp_entry_t entry = pmd_to_swp_entry(*pmd);
 
if (is_migration_entry(entry))
-   page = migration_entry_to_page(entry);
+   page = pfn_swap_entry_to_page(entry);
}
if (IS_ERR_OR_NULL(page))
return;
@@ -691,10 +689,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
} else if (is_swap_pte(*pte)) {
swp_entry_t swpent = pte_to_swp_entry(*pte);
 
-   if (is_migration_entry(swpent))
-   page = migration_entry_to_page(swpent);
-   else if (is_device_private_entry(swpent))
-   page = device_private_entry_to_page(swpent);
+   if (is_pfn_swap_entry(swpent))
+   page = pfn_swap_entry_to_page(swpent);
}
if (page) {
int mapcount = page_mapcount(page);
@@ -1383,11 +1379,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
frame = swp_type(entry) |
(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
flags |= PM_SWAP;
-   if (is_migration_entry(entry))
-   page = migration_entry_to_page(entry);
-
-   if (is_device_private_entry(entry))
-   page = device_private_entry_to_page(entry);
+   if (is_pfn_swap_entry(entry))
+   page = pfn_swap_entry_to_page(entry);
}
 
if (page && !PageAnon(page))
@@ -1444,7 +1437,7 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
if (pmd_swp_soft_dirty(pmd))
flags |= PM_SOFT_DIRTY;
VM_BUG_ON(!is_pmd_migration_entry(pmd));
-   page = migration_entry_to_page(entry);
		page = pfn_swap_entry_to_page(entry);

Re: [PATCH v7 1/2] dt-bindings: serial: Add rx-tx-swap to stm32-usart

2021-03-23 Thread Rob Herring
On Fri, 12 Mar 2021 16:37:01 +0100, Martin Devera wrote:
> Add new rx-tx-swap property to allow for RX & TX pin swapping.
> 
> Signed-off-by: Martin Devera 
> Acked-by: Fabrice Gasnier 
> ---
> v7:
>   - fix yaml linter warning
> v6:
>   - add version changelog
> v5:
>   - yaml fixes based on Rob Herring comments
> - add serial.yaml reference
> - move compatible from 'then' to 'if'
> v3:
>   - don't allow rx-tx-swap for st,stm32-uart (suggested
> by Fabrice Gasnier)
> v2:
>   - change st,swap to rx-tx-swap (suggested by Rob Herring)
> ---
>  .../devicetree/bindings/serial/st,stm32-uart.yaml | 29 +++++++++++++++++++----------
>  1 file changed, 19 insertions(+), 10 deletions(-)
> 

Reviewed-by: Rob Herring 


[PATCH 18/23] mm/hugetlb: Introduce huge version of special swap pte helpers

2021-03-22 Thread Peter Xu
This prepares hugetlbfs to also recognize swap special ptes, just like
the uffd-wp special swap ptes.
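
Sketched with the helper names from the hunk below: is_huge_swap_pte()
is the huge counterpart of is_swap_pte(), and huge_pte_has_swap_entry()
additionally filters out the swap special ptes, so a check such as
is_hugetlb_entry_migration() turns from

	if (huge_pte_none(pte) || pte_present(pte))
		return false;

into

	if (!huge_pte_has_swap_entry(pte))
		return false;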

Signed-off-by: Peter Xu 
---
 mm/hugetlb.c | 23 +--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index fd3e87517e10..64e424b03774 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -93,6 +93,25 @@ static inline bool subpool_is_free(struct hugepage_subpool 
*spool)
return true;
 }
 
+/*
+ * These are sister versions of is_swap_pte() and pte_has_swap_entry().  We
+ * need standalone ones because huge_pte_none() is handled differently from
+ * pte_none().  For more information, please refer to comments above
+ * is_swap_pte() and pte_has_swap_entry().
+ *
+ * Here we directly reuse the pte level of swap special ptes, for example, the
+ * pte_swp_uffd_wp_special().  It just stands for a huge page rather than a
+ * small page for hugetlbfs pages.
+ */
+static inline bool is_huge_swap_pte(pte_t pte)
+{
+   return !huge_pte_none(pte) && !pte_present(pte);
+}
+static inline bool huge_pte_has_swap_entry(pte_t pte)
+{
+   return is_huge_swap_pte(pte) && !is_swap_special_pte(pte);
+}
+
 static inline void unlock_or_release_subpool(struct hugepage_subpool *spool)
 {
	spin_unlock(&spool->lock);
@@ -3726,7 +3745,7 @@ bool is_hugetlb_entry_migration(pte_t pte)
 {
swp_entry_t swp;
 
-   if (huge_pte_none(pte) || pte_present(pte))
+   if (!huge_pte_has_swap_entry(pte))
return false;
swp = pte_to_swp_entry(pte);
if (is_migration_entry(swp))
@@ -3739,7 +3758,7 @@ static bool is_hugetlb_entry_hwpoisoned(pte_t pte)
 {
swp_entry_t swp;
 
-   if (huge_pte_none(pte) || pte_present(pte))
+   if (!huge_pte_has_swap_entry(pte))
return false;
swp = pte_to_swp_entry(pte);
if (is_hwpoison_entry(swp))
-- 
2.26.2


