Re: [mm/swap] aae466b005: vm-scalability.throughput -2.7% regression

2021-01-03 Thread Joonsoo Kim
On Wed, Dec 30, 2020 at 02:24:12PM +0800, kernel test robot wrote:
> 
> Greeting,
> 
> FYI, we noticed a -2.7% regression of vm-scalability.throughput due to commit:
> 
> 
> commit: aae466b0052e1888edd1d7f473d4310d64936196 ("mm/swap: implement 
> workingset detection for anonymous LRU")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

Hello,

Thanks for reporting.

This is a known issue to me. As you can see below, a slight regression
is found only on the random access workload. For a pure random access
pattern, detecting the workingset would not guarantee a good result,
since these pages are not accessed again in the near future. Even worse,
it adds a small overhead to the system, since we need to demote these
activated pages back to the inactive list before reclaiming them. So the
result below looks natural to me.

For other access patterns, this patchset should have a positive
effect.

Thanks.



Re: [PATCH] mm, slub: Consider rest of partial list if acquire_slab() fails

2020-12-28 Thread Joonsoo Kim
On Mon, Dec 28, 2020 at 10:10 PM Jann Horn wrote:
>
> acquire_slab() fails if there is contention on the freelist of the page
> (probably because some other CPU is concurrently freeing an object from the
> page). In that case, it might make sense to look for a different page
> (since there might be more remote frees to the page from other CPUs, and we
> don't want contention on struct page).
>
> However, the current code accidentally stops looking at the partial list
> completely in that case. Especially on kernels without CONFIG_NUMA set,
> this means that get_partial() fails and new_slab_objects() falls back to
> new_slab(), allocating new pages. This could lead to an unnecessary
> increase in memory fragmentation.
>
> Fixes: 7ced37197196 ("slub: Acquire_slab() avoid loop")
> Signed-off-by: Jann Horn 

Acked-by: Joonsoo Kim 

Thanks.
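
For context, a minimal sketch of what the fix amounts to in
get_partial_node(), assuming the loop shape of that era's mm/slub.c
(field and helper names may differ slightly; this is not the literal diff):

	list_for_each_entry_safe(page, page2, &n->partial, slab_list) {
		void *t;

		if (!pfmemalloc_match(page, flags))
			continue;

		t = acquire_slab(s, n, page, object == NULL, &objects);
		if (!t)
			continue;	/* freelist contended: try the next
					 * partial page instead of breaking out
					 * and abandoning the whole list */

		/* ... track the acquired page as before ... */
	}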


Re: [PATCH v3 sl-b 1/6] mm: Add mem_dump_obj() to print source of memory block

2020-12-10 Thread Joonsoo Kim
On Thu, Dec 10, 2020 at 07:42:27PM -0800, Paul E. McKenney wrote:
> On Thu, Dec 10, 2020 at 07:33:59PM -0800, Paul E. McKenney wrote:
> > On Fri, Dec 11, 2020 at 11:22:10AM +0900, Joonsoo Kim wrote:
> > > On Thu, Dec 10, 2020 at 05:19:58PM -0800, paul...@kernel.org wrote:
> > > > From: "Paul E. McKenney" 
> 
> [ . . . ]
> 
> > > We can get some infos even if CONFIG_SLUB_DEBUG isn't defined.
> > > Please move them out.
> > 
> > I guess since I worry about CONFIG_MMU=n it only makes sense to also
> > worry about CONFIG_SLUB_DEBUG=n.  Fix update.
> 
> Like this?  (Patch on top of the series, to be folded into the first one.)

Yes!

Acked-by: Joonsoo Kim 

Thanks.


Re: [PATCH v3 sl-b 1/6] mm: Add mem_dump_obj() to print source of memory block

2020-12-10 Thread Joonsoo Kim
On Thu, Dec 10, 2020 at 07:33:59PM -0800, Paul E. McKenney wrote:
> On Fri, Dec 11, 2020 at 11:22:10AM +0900, Joonsoo Kim wrote:
> > On Thu, Dec 10, 2020 at 05:19:58PM -0800, paul...@kernel.org wrote:
> > > From: "Paul E. McKenney" 
> > > 
> > > There are kernel facilities such as per-CPU reference counts that give
> > > error messages in generic handlers or callbacks, whose messages are
> > > unenlightening.  In the case of per-CPU reference-count underflow, this
> > > is not a problem when creating a new use of this facility because in that
> > > case the bug is almost certainly in the code implementing that new use.
> > > However, trouble arises when deploying across many systems, which might
> > > exercise corner cases that were not seen during development and testing.
> > > Here, it would be really nice to get some kind of hint as to which of
> > > several uses the underflow was caused by.
> > > 
> > > This commit therefore exposes a mem_dump_obj() function that takes
> > > a pointer to memory (which must still be allocated if it has been
> > > dynamically allocated) and prints available information on where that
> > > memory came from.  This pointer can reference the middle of the block as
> > > well as the beginning of the block, as needed by things like RCU callback
> > > functions and timer handlers that might not know where the beginning of
> > > the memory block is.  These functions and handlers can use mem_dump_obj()
> > > to print out better hints as to where the problem might lie.
> > > 
> > > The information printed can depend on kernel configuration.  For example,
> > > the allocation return address can be printed only for slab and slub,
> > > and even then only when the necessary debug has been enabled.  For slab,
> > > build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
> > > to the next power of two or use the SLAB_STORE_USER when creating the
> > > kmem_cache structure.  For slub, build with CONFIG_SLUB_DEBUG=y and
> > > boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
> > > if more focused use is desired.  Also for slub, use CONFIG_STACKTRACE
> > > to enable printing of the allocation-time stack trace.
> > > 
> > > Cc: Christoph Lameter 
> > > Cc: Pekka Enberg 
> > > Cc: David Rientjes 
> > > Cc: Joonsoo Kim 
> > > Cc: Andrew Morton 
> > > Cc: 
> > > Reported-by: Andrii Nakryiko 
> > > [ paulmck: Convert to printing and change names per Joonsoo Kim. ]
> > > [ paulmck: Move slab definition per Stephen Rothwell and kbuild test 
> > > robot. ]
> > > [ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
> > > [ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
> > > Signed-off-by: Paul E. McKenney 
> > > ---
> > >  include/linux/mm.h   |  2 ++
> > >  include/linux/slab.h |  2 ++
> > >  mm/slab.c| 20 ++
> > >  mm/slab.h| 12 +
> > >  mm/slab_common.c | 74 
> > > 
> > >  mm/slob.c|  6 +
> > >  mm/slub.c| 36 +
> > >  mm/util.c| 24 +
> > >  8 files changed, 176 insertions(+)
> > > 
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index ef360fe..1eea266 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -3153,5 +3153,7 @@ unsigned long wp_shared_mapping_range(struct 
> > > address_space *mapping,
> > >  
> > >  extern int sysctl_nr_trim_pages;
> > >  
> > > +void mem_dump_obj(void *object);
> > > +
> > >  #endif /* __KERNEL__ */
> > >  #endif /* _LINUX_MM_H */
> > > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > > index dd6897f..169b511 100644
> > > --- a/include/linux/slab.h
> > > +++ b/include/linux/slab.h
> > > @@ -186,6 +186,8 @@ void kfree(const void *);
> > >  void kfree_sensitive(const void *);
> > >  size_t __ksize(const void *);
> > >  size_t ksize(const void *);
> > > +bool kmem_valid_obj(void *object);
> > > +void kmem_dump_obj(void *object);
> > >  
> > >  #ifdef CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR
> > >  void __check_heap_object(const void *ptr, unsigned long n, struct page 
> > > *page,
> > > diff --git a/

Re: [PATCH v3] mm/page_owner: Record timestamp and pid

2020-12-10 Thread Joonsoo Kim
On Fri, Dec 11, 2020 at 1:04 AM Georgi Djakov wrote:
>
> From: Liam Mark 
>
> Collect the time for each allocation recorded in page owner so that
> allocation "surges" can be measured.
>
> Record the pid for each allocation recorded in page owner so that the
> source of allocation "surges" can be better identified.
>
> The above is very useful when doing memory analysis.  On a crash for
> example, we can get this information from kdump (or ramdump) and parse it
> to figure out memory allocation problems.
>
> Please note that on x86_64 this increases the size of struct page_owner
> from 16 bytes to 32.
>
> Vlastimil: it's not a functionality intended for production, so unless
> somebody says they need to enable page_owner for debugging and this
> increase prevents them from fitting into available memory, let's not
> complicate things with making this optional.
>
> Signed-off-by: Liam Mark 
> Signed-off-by: Georgi Djakov 
> Acked-by: Vlastimil Babka 
> Cc: Jonathan Corbet 

Acked-by: Joonsoo Kim 

This is useful. Our company has already had an in-house patch to store
the pid for a few years.

Thanks.


Re: [PATCH v3 sl-b 1/6] mm: Add mem_dump_obj() to print source of memory block

2020-12-10 Thread Joonsoo Kim
On Thu, Dec 10, 2020 at 05:19:58PM -0800, paul...@kernel.org wrote:
> From: "Paul E. McKenney" 
> 
> There are kernel facilities such as per-CPU reference counts that give
> error messages in generic handlers or callbacks, whose messages are
> unenlightening.  In the case of per-CPU reference-count underflow, this
> is not a problem when creating a new use of this facility because in that
> case the bug is almost certainly in the code implementing that new use.
> However, trouble arises when deploying across many systems, which might
> exercise corner cases that were not seen during development and testing.
> Here, it would be really nice to get some kind of hint as to which of
> several uses the underflow was caused by.
> 
> This commit therefore exposes a mem_dump_obj() function that takes
> a pointer to memory (which must still be allocated if it has been
> dynamically allocated) and prints available information on where that
> memory came from.  This pointer can reference the middle of the block as
> well as the beginning of the block, as needed by things like RCU callback
> functions and timer handlers that might not know where the beginning of
> the memory block is.  These functions and handlers can use mem_dump_obj()
> to print out better hints as to where the problem might lie.
> 
> The information printed can depend on kernel configuration.  For example,
> the allocation return address can be printed only for slab and slub,
> and even then only when the necessary debug has been enabled.  For slab,
> build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
> to the next power of two or use the SLAB_STORE_USER when creating the
> kmem_cache structure.  For slub, build with CONFIG_SLUB_DEBUG=y and
> boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
> if more focused use is desired.  Also for slub, use CONFIG_STACKTRACE
> to enable printing of the allocation-time stack trace.
> 
> Cc: Christoph Lameter 
> Cc: Pekka Enberg 
> Cc: David Rientjes 
> Cc: Joonsoo Kim 
> Cc: Andrew Morton 
> Cc: 
> Reported-by: Andrii Nakryiko 
> [ paulmck: Convert to printing and change names per Joonsoo Kim. ]
> [ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
> [ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
> [ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
> Signed-off-by: Paul E. McKenney 
> ---
>  include/linux/mm.h   |  2 ++
>  include/linux/slab.h |  2 ++
>  mm/slab.c| 20 ++
>  mm/slab.h| 12 +
>  mm/slab_common.c | 74 
> 
>  mm/slob.c|  6 +
>  mm/slub.c| 36 +
>  mm/util.c| 24 +
>  8 files changed, 176 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ef360fe..1eea266 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3153,5 +3153,7 @@ unsigned long wp_shared_mapping_range(struct 
> address_space *mapping,
>  
>  extern int sysctl_nr_trim_pages;
>  
> +void mem_dump_obj(void *object);
> +
>  #endif /* __KERNEL__ */
>  #endif /* _LINUX_MM_H */
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index dd6897f..169b511 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -186,6 +186,8 @@ void kfree(const void *);
>  void kfree_sensitive(const void *);
>  size_t __ksize(const void *);
>  size_t ksize(const void *);
> +bool kmem_valid_obj(void *object);
> +void kmem_dump_obj(void *object);
>  
>  #ifdef CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR
>  void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
> diff --git a/mm/slab.c b/mm/slab.c
> index b111356..66f00ad 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3633,6 +3633,26 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t 
> flags,
>  EXPORT_SYMBOL(__kmalloc_node_track_caller);
>  #endif /* CONFIG_NUMA */
>  
> +void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page 
> *page)
> +{
> + struct kmem_cache *cachep;
> + unsigned int objnr;
> + void *objp;
> +
> + kpp->kp_ptr = object;
> + kpp->kp_page = page;
> + cachep = page->slab_cache;
> + kpp->kp_slab_cache = cachep;
> + objp = object - obj_offset(cachep);
> + kpp->kp_data_offset = obj_offset(cachep);
> + page = virt_to_head_page(objp);
> + objnr = obj_to_index(cachep, page, objp);
> + objp = index_to_obj(cachep, page, objnr);
> + kpp->kp_objp = objp;
> + if (DEBUG && cachep->flags & SLAB_STORE_USER)
> +

Re: [PATCH] mm/slab: Perform init_on_free earlier

2020-12-10 Thread Joonsoo Kim
On Fri, Dec 11, 2020 at 3:37 AM Alexander Popov wrote:
>
> Currently in CONFIG_SLAB init_on_free happens too late, and heap
> objects go to the heap quarantine not being erased.
>
> Lets move init_on_free clearing before calling kasan_slab_free().
> In that case heap quarantine will store erased objects, similarly
> to CONFIG_SLUB=y behavior.
>
> Signed-off-by: Alexander Popov 
> Reviewed-by: Alexander Potapenko 

Acked-by: Joonsoo Kim 


Re: [PATCH] mm, slab, slub: clear the slab_cache field when freeing page

2020-12-10 Thread Joonsoo Kim
On Fri, Dec 11, 2020 at 1:00 AM Vlastimil Babka wrote:
>
> The page allocator expects that page->mapping is NULL for a page being freed.
> SLAB and SLUB use the slab_cache field which is in union with mapping, but
> before freeing the page, the field is referenced with the "mapping" name when
> set to NULL.
>
> It's IMHO more correct (albeit functionally the same) to use the slab_cache
> name as that's the field we use in SL*B, and document why we clear it in a
> comment (we don't clear fields such as s_mem or freelist, as page allocator
> doesn't care about those). While using the 'mapping' name would automagically
> keep the code correct if the unions in struct page changed, such changes 
> should
> be done consciously and needed changes evaluated - the comment should help 
> with
> that.
>
> Signed-off-by: Vlastimil Babka 

Acked-by: Joonsoo Kim 


Re: [PATCH v2 sl-b 1/5] mm: Add mem_dump_obj() to print source of memory block

2020-12-10 Thread Joonsoo Kim
On Tue, Dec 08, 2020 at 05:12:59PM -0800, paul...@kernel.org wrote:
> From: "Paul E. McKenney" 
> 
> There are kernel facilities such as per-CPU reference counts that give
> error messages in generic handlers or callbacks, whose messages are
> unenlightening.  In the case of per-CPU reference-count underflow, this
> is not a problem when creating a new use of this facility because in that
> case the bug is almost certainly in the code implementing that new use.
> However, trouble arises when deploying across many systems, which might
> exercise corner cases that were not seen during development and testing.
> Here, it would be really nice to get some kind of hint as to which of
> several uses the underflow was caused by.
> 
> This commit therefore exposes a mem_dump_obj() function that takes
> a pointer to memory (which must still be allocated if it has been
> dynamically allocated) and prints available information on where that
> memory came from.  This pointer can reference the middle of the block as
> well as the beginning of the block, as needed by things like RCU callback
> functions and timer handlers that might not know where the beginning of
> the memory block is.  These functions and handlers can use mem_dump_obj()
> to print out better hints as to where the problem might lie.
> 
> The information printed can depend on kernel configuration.  For example,
> the allocation return address can be printed only for slab and slub,
> and even then only when the necessary debug has been enabled.  For slab,
> build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
> to the next power of two or use the SLAB_STORE_USER when creating the
> kmem_cache structure.  For slub, build with CONFIG_SLUB_DEBUG=y and
> boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
> if more focused use is desired.  Also for slub, use CONFIG_STACKTRACE
> to enable printing of the allocation-time stack trace.
> 
> Cc: Christoph Lameter 
> Cc: Pekka Enberg 
> Cc: David Rientjes 
> Cc: Joonsoo Kim 
> Cc: Andrew Morton 
> Cc: 
> Reported-by: Andrii Nakryiko 
> [ paulmck: Convert to printing and change names per Joonsoo Kim. ]
> [ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
> Signed-off-by: Paul E. McKenney 

Introducing the three functions kmem_valid_obj(), kmem_provenance(),
and mem_dump_obj() looks better than patchset v1. Nice work. A few
comments below.

> ---
>  include/linux/mm.h   |  2 ++
>  include/linux/slab.h |  2 ++
>  mm/slab.c| 28 +
>  mm/slab.h| 11 +
>  mm/slab_common.c | 69 
> 
>  mm/slob.c|  7 ++
>  mm/slub.c| 40 ++
>  mm/util.c| 25 +++
>  8 files changed, 184 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ef360fe..1eea266 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3153,5 +3153,7 @@ unsigned long wp_shared_mapping_range(struct 
> address_space *mapping,
>  
>  extern int sysctl_nr_trim_pages;
>  
> +void mem_dump_obj(void *object);
> +
>  #endif /* __KERNEL__ */
>  #endif /* _LINUX_MM_H */
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index dd6897f..169b511 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -186,6 +186,8 @@ void kfree(const void *);
>  void kfree_sensitive(const void *);
>  size_t __ksize(const void *);
>  size_t ksize(const void *);
> +bool kmem_valid_obj(void *object);
> +void kmem_dump_obj(void *object);
>  
>  #ifdef CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR
>  void __check_heap_object(const void *ptr, unsigned long n, struct page *page,
> diff --git a/mm/slab.c b/mm/slab.c
> index b111356..72b6743 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3602,6 +3602,34 @@ void *kmem_cache_alloc_node_trace(struct kmem_cache 
> *cachep,
>  EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
>  #endif
>  
> +void kmem_provenance(struct kmem_provenance *kpp)

To open up the possibility of future enhancement, the name
'provenance' doesn't look good to me. This function could be used to
extract various kinds of object information, so something like
kmem_obj_info() looks better to me. Any thoughts?

> +{
> +#ifdef DEBUG
> + struct kmem_cache *cachep;
> + void *object = kpp->kp_ptr;
> + unsigned int objnr;
> + void *objp;
> + struct page *page = kpp->kp_page;
> +
> + cachep = page->slab_cache;
> + if (!(cachep->flags & SLAB_STORE_USER)) {
> + kpp->kp_ret = NULL;
> + goto nodebug;
> + }
> + objp = object - obj_offset(cachep);

Re: [PATCH sl-b 1/6] mm: Add kmem_last_alloc() to return last allocation for memory block

2020-12-08 Thread Joonsoo Kim
On Mon, Dec 07, 2020 at 09:25:54AM -0800, Paul E. McKenney wrote:
> On Mon, Dec 07, 2020 at 06:02:53PM +0900, Joonsoo Kim wrote:
> > Hello, Paul.
> > 
> > On Fri, Dec 04, 2020 at 04:40:52PM -0800, paul...@kernel.org wrote:
> > > From: "Paul E. McKenney" 
> > > 
> > > There are kernel facilities such as per-CPU reference counts that give
> > > error messages in generic handlers or callbacks, whose messages are
> > > unenlightening.  In the case of per-CPU reference-count underflow, this
> > > is not a problem when creating a new use of this facility because in that
> > > case the bug is almost certainly in the code implementing that new use.
> > > However, trouble arises when deploying across many systems, which might
> > > exercise corner cases that were not seen during development and testing.
> > > Here, it would be really nice to get some kind of hint as to which of
> > > several uses the underflow was caused by.
> > > 
> > > This commit therefore exposes a new kmem_last_alloc() function that
> > > takes a pointer to dynamically allocated memory and returns the return
> > > address of the call that allocated it.  This pointer can reference the
> > > middle of the block as well as the beginning of the block, as needed
> > > by things like RCU callback functions and timer handlers that might not
> > > know where the beginning of the memory block is.  These functions and
> > > handlers can use the return value from kmem_last_alloc() to give the
> > > kernel hacker a better hint as to where the problem might lie.
> > 
> > I agree with exposing allocation caller information to the other
> > subsystem to help the debugging. Some suggestions...
> 
> Good to hear!  ;-)
> 
> > 1. It's better to separate a slab object check (validity check) and
> > retrieving the allocation caller. Someone else would want to check
> > only a validity. And, it doesn't depend on the debug configuration so
> > it's not good to bind it to the debug function.
> > 
> > kmem_cache_valid_(obj|ptr)
> > kmalloc_valid_(obj|ptr)
> 
> Here both functions would say "true" for a pointer from kmalloc()?
> Or do I need to add a third function that is happy with a pointer from
> either source?

I focused on the separation and missed the case where the user
sometimes cannot know the object source (kmalloc/kmem_cache). As a
first step, just checking whether it is a slab object or not looks
like enough.

int kmem_valid_obj()

> 
> I do understand that people who don't want to distinguish could just do
> "kmem_cache_valid_ptr(p) || kmalloc_valid_ptr(p)".  However, the two
> use cases in the series have no idea whether the pointer they have came
> from kmalloc(), kmem_cache_alloc(), or somewhere else entirely, even an
> on-stack variable.
> 
> Are you asking me to choose between the _obj() and _ptr() suffixes?

Yes, I prefer _obj().

> If not, please help me understand the distinction.
> 
> Do we want "debug" in these names as well?

I don't think so since it can be called without enabling the debug
option.

> 
> > 2. rename kmem_last_alloc to ...
> > 
> > int kmem_cache_debug_alloc_caller(cache, obj, _addr)
> > int kmalloc_debug_alloc_caller(obj, _addr)
> > 
> > or debug_kmem_cache_alloc_caller()
> > 
> > I think that function name need to include the keyword 'debug' to show
> > itself as a debugging facility (enabled at the debugging). And, return
> > errno and get caller address by pointer argument.
> 
> I am quite happy to add the "debug", but my use cases have no idea
> how the pointer was allocated.  In fact, the next version of the
> patch will also handle allocator return addresses from vmalloc().
> 
> And for kernels without sufficient debug enabled, I need to provide
> the name of the slab cache, and this also is to be in the next version.

Okay. So, your code would be...

if (kmem_valid_obj(ptr))
kmalloc_debug_print_provenance(ptr)
else if (vmalloc_valid_obj(ptr))


> > 3. If concrete error message is needed, please introduce more functions.
> > 
> > void *kmalloc_debug_error(errno)
> 
> Agreed, in fact, I was planning to have a function that printed out
> a suitable error-message continuation to the console for ease-of-use
> reasons.  For example, why is the caller deciding how deep the stack
> frame is?  ;-)
> 
> So something like this?
> 
>   void kmalloc_debug_print_provenance(void *ptr);
> 
> With the understanding that it will print something helpful regardless
> of where ptr came from, within the constraints of the kernel build and
> boot options?

Looks like a good idea. I suggest the name kmem_dump_obj() for this function.
In this case, I don't think the "debug" keyword is needed, since it shows
something useful (slab cache info) even if the debug option isn't enabled.

So, in summary, we need to introduce two functions to accomplish your
purpose. Please correct me if I'm wrong.

int kmem_valid_obj(ptr)
void kmem_dump_obj(ptr)

Thanks.
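
For illustration, a rough usage sketch of the two functions summarized
above (the caller and the message are made up; only the call order matters):

	static void report_suspect_pointer(void *ptr)
	{
		if (kmem_valid_obj(ptr))
			kmem_dump_obj(ptr);	/* prints the slab cache name and,
						 * with debug enabled, the
						 * allocation caller */
		else
			pr_info("%px is not a slab object\n", ptr);
	}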


Re: [PATCH sl-b 1/6] mm: Add kmem_last_alloc() to return last allocation for memory block

2020-12-07 Thread Joonsoo Kim
Hello, Paul.

On Fri, Dec 04, 2020 at 04:40:52PM -0800, paul...@kernel.org wrote:
> From: "Paul E. McKenney" 
> 
> There are kernel facilities such as per-CPU reference counts that give
> error messages in generic handlers or callbacks, whose messages are
> unenlightening.  In the case of per-CPU reference-count underflow, this
> is not a problem when creating a new use of this facility because in that
> case the bug is almost certainly in the code implementing that new use.
> However, trouble arises when deploying across many systems, which might
> exercise corner cases that were not seen during development and testing.
> Here, it would be really nice to get some kind of hint as to which of
> several uses the underflow was caused by.
> 
> This commit therefore exposes a new kmem_last_alloc() function that
> takes a pointer to dynamically allocated memory and returns the return
> address of the call that allocated it.  This pointer can reference the
> middle of the block as well as the beginning of the block, as needed
> by things like RCU callback functions and timer handlers that might not
> know where the beginning of the memory block is.  These functions and
> handlers can use the return value from kmem_last_alloc() to give the
> kernel hacker a better hint as to where the problem might lie.

I agree with exposing the allocation caller information to other
subsystems to help debugging. Some suggestions...

1. It's better to separate the slab object check (validity check) from
retrieving the allocation caller. Someone else might want to check
only the validity. Also, it doesn't depend on the debug configuration,
so it's not good to bind it to the debug function.

kmem_cache_valid_(obj|ptr)
kmalloc_valid_(obj|ptr)

2. rename kmem_last_alloc to ...

int kmem_cache_debug_alloc_caller(cache, obj, _addr)
int kmalloc_debug_alloc_caller(obj, _addr)

or debug_kmem_cache_alloc_caller()

I think the function name needs to include the keyword 'debug' to mark
it as a debugging facility (enabled only when debugging). Also, return
an errno and pass the caller address back via a pointer argument.

3. If a concrete error message is needed, please introduce more functions.

void *kmalloc_debug_error(errno)

Thanks.


Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

2020-12-06 Thread Joonsoo Kim
On Fri, Dec 04, 2020 at 12:43:29PM -0500, Pavel Tatashin wrote:
> On Thu, Dec 3, 2020 at 11:14 PM Joonsoo Kim  wrote:
> >
> > On Wed, Dec 02, 2020 at 12:23:30AM -0500, Pavel Tatashin wrote:
> > > We do not allocate pin pages in ZONE_MOVABLE, but if pages were already
> > > allocated before pinning they need to migrated to a different zone.
> > > Currently, we migrate movable CMA pages only. Generalize the function
> > > that migrates CMA pages to migrate all movable pages.
> > >
> > > Signed-off-by: Pavel Tatashin 
> > > ---
> > >  include/linux/migrate.h|  1 +
> > >  include/trace/events/migrate.h |  3 +-
> > >  mm/gup.c   | 56 +-
> > >  3 files changed, 24 insertions(+), 36 deletions(-)
> > >
> > > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > > index 0f8d1583fa8e..00bab23d1ee5 100644
> > > --- a/include/linux/migrate.h
> > > +++ b/include/linux/migrate.h
> > > @@ -27,6 +27,7 @@ enum migrate_reason {
> > >   MR_MEMPOLICY_MBIND,
> > >   MR_NUMA_MISPLACED,
> > >   MR_CONTIG_RANGE,
> > > + MR_LONGTERM_PIN,
> > >   MR_TYPES
> > >  };
> > >
> > > diff --git a/include/trace/events/migrate.h 
> > > b/include/trace/events/migrate.h
> > > index 4d434398d64d..363b54ce104c 100644
> > > --- a/include/trace/events/migrate.h
> > > +++ b/include/trace/events/migrate.h
> > > @@ -20,7 +20,8 @@
> > >   EM( MR_SYSCALL, "syscall_or_cpuset")\
> > >   EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind")  \
> > >   EM( MR_NUMA_MISPLACED,  "numa_misplaced")   \
> > > - EMe(MR_CONTIG_RANGE,"contig_range")
> > > + EM( MR_CONTIG_RANGE,"contig_range") \
> > > + EMe(MR_LONGTERM_PIN,"longterm_pin")
> > >
> > >  /*
> > >   * First define the enums in the above macros to be exported to userspace
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index 724d8a65e1df..1d511f65f8a7 100644
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -1593,19 +1593,18 @@ static bool check_dax_vmas(struct vm_area_struct 
> > > **vmas, long nr_pages)
> > >  }
> > >  #endif
> > >
> > > -#ifdef CONFIG_CMA
> > > -static long check_and_migrate_cma_pages(struct mm_struct *mm,
> > > - unsigned long start,
> > > - unsigned long nr_pages,
> > > - struct page **pages,
> > > - struct vm_area_struct **vmas,
> > > - unsigned int gup_flags)
> > > +static long check_and_migrate_movable_pages(struct mm_struct *mm,
> > > + unsigned long start,
> > > + unsigned long nr_pages,
> > > + struct page **pages,
> > > + struct vm_area_struct **vmas,
> > > + unsigned int gup_flags)
> > >  {
> > >   unsigned long i;
> > >   unsigned long step;
> > >   bool drain_allow = true;
> > >   bool migrate_allow = true;
> > > - LIST_HEAD(cma_page_list);
> > > + LIST_HEAD(page_list);
> > >   long ret = nr_pages;
> > >   struct migration_target_control mtc = {
> > >   .nid = NUMA_NO_NODE,
> > > @@ -1623,13 +1622,12 @@ static long check_and_migrate_cma_pages(struct 
> > > mm_struct *mm,
> > >*/
> > >   step = compound_nr(head) - (pages[i] - head);
> > >   /*
> > > -  * If we get a page from the CMA zone, since we are going to
> > > -  * be pinning these entries, we might as well move them out
> > > -  * of the CMA zone if possible.
> > > +  * If we get a movable page, since we are going to be 
> > > pinning
> > > +  * these entries, try to move them out if possible.
> > >*/
> > > - if (is_migrate_cma_page(head)) {
> > > + if (is_migrate_movable(get_pageblock_migratetype(head))) {
> >
> > is_migrate_movable() isn't a check for the ZONE. It's a check for the
> > MIGRATE_TYPE. MIGRATE_TYPE doesn't require hard guarantee for
> > migration, and, most of memory, including ZONE_NORMAL, is
> > MIGRATE_MOVABLE. With this code, long-term gup would always fail due
> > to not enough memory. I think that the correct change would be
> > "is_migrate_cma_page(head) && zone == ZONE_MOVABLE".
> 
> Good point. The above should be OR not AND.
> 
> zone_idx(page_zone(head)) == ZONE_MOVABLE || is_migrate_cma_page(head)

Yep!

Thanks.
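
For clarity, the agreed check wrapped in an illustrative helper (the
helper name is made up; zone_idx(), page_zone() and is_migrate_cma_page()
are existing kernel helpers):

	/* true if the page must be migrated out before long-term pinning */
	static bool movable_or_cma_page(struct page *head)
	{
		return zone_idx(page_zone(head)) == ZONE_MOVABLE ||
		       is_migrate_cma_page(head);
	}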


Re: [PATCH 0/6] prohibit pinning pages in ZONE_MOVABLE

2020-12-06 Thread Joonsoo Kim
On Fri, Dec 04, 2020 at 12:50:56PM -0500, Pavel Tatashin wrote:
> > > Yes, this indeed could be a problem for some configurations. I will
> > > add your comment to the commit log of one of the patches.
> >
> > It sounds like there is some inherent tension here, breaking THP's
> > when doing pin_user_pages() is a really nasty thing to do. DMA
> > benefits greatly from THP.
> >
> > I know nothing about ZONE_MOVABLE, is this auto-setup or an admin
> > option? If the result of this patch is standard systems can no longer
> > pin > 80% of their memory I have some regression concerns..
> 
> ZONE_MOVABLE can be configured via kernel parameter, or when memory
> nodes are onlined after hot-add; so this is something that admins
> configure. ZONE_MOVABLE is designed to gurantee memory hot-plug

Just a note: the origin of ZONE_MOVABLE is to provide availability of
huge pages, especially hugetlb pages, not, AFAIK, to guarantee memory
hot-plug. See the following commit that introduced ZONE_MOVABLE.

2a1e274 Create the ZONE_MOVABLE zone

> functionality, and not availability of THP, however, I did not know
> about the use case where some admins might configure ZONE_MOVABLE to

The use case was lightly mentioned in a previous discussion.

http://lkml.kernel.org/r/alpine.deb.2.23.453.2011221300100.2830...@chino.kir.corp.google.com

Anyway, I agree with your other arguments and this patchset.

Thanks.


Re: [PATCH 6/6] mm/gup: migrate pinned pages out of movable zone

2020-12-03 Thread Joonsoo Kim
On Wed, Dec 02, 2020 at 12:23:30AM -0500, Pavel Tatashin wrote:
> We do not allocate pin pages in ZONE_MOVABLE, but if pages were already
> allocated before pinning they need to migrated to a different zone.
> Currently, we migrate movable CMA pages only. Generalize the function
> that migrates CMA pages to migrate all movable pages.
> 
> Signed-off-by: Pavel Tatashin 
> ---
>  include/linux/migrate.h|  1 +
>  include/trace/events/migrate.h |  3 +-
>  mm/gup.c   | 56 +-
>  3 files changed, 24 insertions(+), 36 deletions(-)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 0f8d1583fa8e..00bab23d1ee5 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -27,6 +27,7 @@ enum migrate_reason {
>   MR_MEMPOLICY_MBIND,
>   MR_NUMA_MISPLACED,
>   MR_CONTIG_RANGE,
> + MR_LONGTERM_PIN,
>   MR_TYPES
>  };
>  
> diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
> index 4d434398d64d..363b54ce104c 100644
> --- a/include/trace/events/migrate.h
> +++ b/include/trace/events/migrate.h
> @@ -20,7 +20,8 @@
>   EM( MR_SYSCALL, "syscall_or_cpuset")\
>   EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind")  \
>   EM( MR_NUMA_MISPLACED,  "numa_misplaced")   \
> - EMe(MR_CONTIG_RANGE,"contig_range")
> + EM( MR_CONTIG_RANGE,"contig_range") \
> + EMe(MR_LONGTERM_PIN,"longterm_pin")
>  
>  /*
>   * First define the enums in the above macros to be exported to userspace
> diff --git a/mm/gup.c b/mm/gup.c
> index 724d8a65e1df..1d511f65f8a7 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1593,19 +1593,18 @@ static bool check_dax_vmas(struct vm_area_struct 
> **vmas, long nr_pages)
>  }
>  #endif
>  
> -#ifdef CONFIG_CMA
> -static long check_and_migrate_cma_pages(struct mm_struct *mm,
> - unsigned long start,
> - unsigned long nr_pages,
> - struct page **pages,
> - struct vm_area_struct **vmas,
> - unsigned int gup_flags)
> +static long check_and_migrate_movable_pages(struct mm_struct *mm,
> + unsigned long start,
> + unsigned long nr_pages,
> + struct page **pages,
> + struct vm_area_struct **vmas,
> + unsigned int gup_flags)
>  {
>   unsigned long i;
>   unsigned long step;
>   bool drain_allow = true;
>   bool migrate_allow = true;
> - LIST_HEAD(cma_page_list);
> + LIST_HEAD(page_list);
>   long ret = nr_pages;
>   struct migration_target_control mtc = {
>   .nid = NUMA_NO_NODE,
> @@ -1623,13 +1622,12 @@ static long check_and_migrate_cma_pages(struct 
> mm_struct *mm,
>*/
>   step = compound_nr(head) - (pages[i] - head);
>   /*
> -  * If we get a page from the CMA zone, since we are going to
> -  * be pinning these entries, we might as well move them out
> -  * of the CMA zone if possible.
> +  * If we get a movable page, since we are going to be pinning
> +  * these entries, try to move them out if possible.
>*/
> - if (is_migrate_cma_page(head)) {
> + if (is_migrate_movable(get_pageblock_migratetype(head))) {

is_migrate_movable() isn't a check for the ZONE. It's a check for the
MIGRATE_TYPE. MIGRATE_TYPE doesn't require hard guarantee for
migration, and, most of memory, including ZONE_NORMAL, is
MIGRATE_MOVABLE. With this code, long-term gup would always fail due
to not enough memory. I think that the correct change would be
"is_migrate_cma_page(head) && zone == ZONE_MOVABLE".

Patch #5 also has this problem. Please fix it too.

Thanks.


Re: [PATCH 0/6] prohibit pinning pages in ZONE_MOVABLE

2020-12-03 Thread Joonsoo Kim
Hello,

On Wed, Dec 02, 2020 at 12:23:24AM -0500, Pavel Tatashin wrote:
> When page is pinned it cannot be moved and its physical address stays
> the same until pages is unpinned.
> 
> This is useful functionality to allows userland to implementation DMA
> access. For example, it is used by vfio in vfio_pin_pages().
> 
> However, this functionality breaks memory hotplug/hotremove assumptions
> that pages in ZONE_MOVABLE can always be migrated.
> 
> This patch series fixes this issue by forcing new allocations during
> page pinning to omit ZONE_MOVABLE, and also to migrate any existing
> pages from ZONE_MOVABLE during pinning.

I love what this patchset does, but, at least, it's better to consider
the side effect of this patchset and document it somewhere. IIUC,
ZONE_MOVABLE exists for two purposes.

1) increasing availability of THP
2) memory hot-unplug

The potential issue would come from case 1). Those users rely on
ZONE_MOVABLE for THP availability, and a hard guarantee of migration
hasn't been required until now. So there could be a system with the
following configuration.

- memory layout: ZONE_NORMAL-512MB, ZONE_MOVABLE-512MB
- memory usage: unmovable-256MB, movable pinned-256MB, movable
  unpinned-512MB

With this patchset, the movable pinned memory would have to be placed
in ZONE_NORMAL, so 512MB would be required for ZONE_NORMAL. ZONE_NORMAL
would be exhausted and system performance could be highly affected,
depending on the memory usage pattern.

I'm not sure whether such a configuration exists or not, but, at
least, it's better to write down this risk in the commit message or
somewhere else.

Thanks.
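
For reference, ZONE_MOVABLE in a layout like the one above is typically
carved out at boot; illustrative command-line options only (sizes chosen
to match the example, not a recommendation):

	movablecore=512M	(reserve 512MB as ZONE_MOVABLE)
	kernelcore=512M		(equivalent on a 1GB machine: keep 512MB non-movable)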


Re: [RFC][PATCH] kexec: Teach indirect pages how to live in high memory

2020-10-08 Thread Joonsoo Kim
On Tue, May 05, 2020 at 01:39:16PM -0500, Eric W. Biederman wrote:
> Hari Bathini  writes:
> 
> > On 05/05/20 3:29 am, Eric W. Biederman wrote:
> >> 
> >> Recently a patch was proposed to kimage_alloc_page to slightly alter
> >> the logic of how pages allocated with incompatible flags were
> >> detected.  The logic was being altered because the semantics of the
> >> page alloctor were changing yet again.
> >> 
> >> Looking at that case I realized that there is no reason for it to even
> >> exist.  Either the indirect page allocations and the source page
> >> allocations could be separated out, or I could do as I am doing now
> >> and simply teach the indirect pages to live in high memory.
> >> 
> >> This patch replaced pointers of type kimage_entry_t * with a new type
> >> kimage_entry_pos_t.  This new type holds the physical address of the
> >> indirect page and the offset within that page of the next indirect
> >> entry to write.  A special constant KIMAGE_ENTRY_POS_INVALID is added
> >> that kimage_image_pos_t variables that don't currently have a valid
> >> may be set to.
> >> 
> >> Two new functions kimage_read_entry and kimage_write_entry have been
> >> provided to write entries in way that works if they live in high
> >> memory.
> >> 
> >> The now unnecessary checks to see if a destination entry is non-zero
> >> and to increment it if so have been removed.  For safety new indrect
> >> pages are now cleared so we have a guarantee everything that has not
> >> been used yet is zero.  Along with this writing an extra trailing 0
> >> entry has been removed, as it is known all trailing entries are now 0.
> >> 
> >> With highmem support implemented for indirect pages
> >> kimage_image_alloc_page has been updated to always allocate
> >> GFP_HIGHUSER pages, and handling of pages with different
> >> gfp flags has been removed.
> >> 
> >> Signed-off-by: "Eric W. Biederman" 
> >
> > Eric, the patch failed with data access exception on ppc64. Using the below 
> > patch on top
> > got me going...
> 
> Doh!  Somehow I thought I had put that logic or something equivalent
> into kimage_write_entry and it appears I did not.  I will see if I can
> respin the patch.
> 
> Thank you very much for testing.

Hello, Eric.

It seems that this patch hasn't been upstreamed yet.
Could you respin the patch?
I've tested this one on x86_32 (with highmem enabled) and it works well.

Thanks.


Re: [PATCH v2 for v5.9] mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs

2020-09-29 Thread Joonsoo Kim
On Tue, Sep 29, 2020 at 5:08 PM Michal Hocko wrote:
>
> On Mon 28-09-20 17:50:46, Joonsoo Kim wrote:
> > From: Joonsoo Kim 
> >
> > memalloc_nocma_{save/restore} APIs can be used to skip page allocation
> > on CMA area, but, there is a missing case and the page on CMA area could
> > be allocated even if APIs are used. This patch handles this case to fix
> > the potential issue.
> >
> > Missing case is an allocation from the pcplist. MIGRATE_MOVABLE pcplist
> > could have the pages on CMA area so we need to skip it if ALLOC_CMA isn't
> > specified.
> >
> > Fixes: 8510e69c8efe (mm/page_alloc: fix memalloc_nocma_{save/restore} APIs)
> > Signed-off-by: Joonsoo Kim 
> > ---
> >  mm/page_alloc.c | 13 ++---
> >  1 file changed, 10 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index fab5e97..104d2e1 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3367,9 +3367,16 @@ struct page *rmqueue(struct zone *preferred_zone,
> >   struct page *page;
> >
> >   if (likely(order == 0)) {
> > - page = rmqueue_pcplist(preferred_zone, zone, gfp_flags,
> > + /*
> > +  * MIGRATE_MOVABLE pcplist could have the pages on CMA area 
> > and
> > +  * we need to skip it when CMA area isn't allowed.
> > +  */
> > + if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
> > + migratetype != MIGRATE_MOVABLE) {
> > + page = rmqueue_pcplist(preferred_zone, zone, 
> > gfp_flags,
> >   migratetype, alloc_flags);
> > - goto out;
> > + goto out;
> > + }
> >   }
>
> This approach looks definitely better than the previous version.

Thanks!

> >
> >   /*
> > @@ -3381,7 +3388,7 @@ struct page *rmqueue(struct zone *preferred_zone,
> >
> >   do {
> >   page = NULL;
> > - if (alloc_flags & ALLOC_HARDER) {
> > + if (order > 0 && alloc_flags & ALLOC_HARDER) {
> >   page = __rmqueue_smallest(zone, order, 
> > MIGRATE_HIGHATOMIC);
> >   if (page)
> >   trace_mm_page_alloc_zone_locked(page, order, 
> > migratetype);
>
> But this condition is not clear to me. __rmqueue_smallest doesn't access
> pcp lists. Maybe I have missed the point in the original discussion but
> this deserves a comment at least.

Before the pcplist skipping is applied, an order-0 request cannot reach here.
But now an order-0 request can reach here. Free memory on
MIGRATE_HIGHATOMIC is reserved for high-order atomic allocations,
so an order-0 request should skip it.

I will add a code comment in the next version.

Thanks.


Re: [PATCH v2 for v5.9] mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs

2020-09-29 Thread Joonsoo Kim
On Tue, Sep 29, 2020 at 1:50 PM Andrew Morton wrote:
>
> On Tue, 29 Sep 2020 10:28:05 +0900 Joonsoo Kim  wrote:
>
> > > What about manually emptying the pcplists beforehand?
> >
> > It also increases the probability. schedule() or interrupt after emptying 
> > but
> > before the allocation could invalidate the effect.
>
> Keep local interrupts disabled across the pcp drain and the allocation
> attempt.

As I said before, it's an allocation context API and the actual
allocation happens later.
Doing such things there is not an easy job.

> > > Or byassing the pcplists for this caller and calling __rmqueue() directly?
> >
> > What this patch does is this one.
>
> I meant via a different function rather than by adding overhead to the
> existing commonly-used function.

Got it. One idea could be disabling/enabling the pcp lists in these APIs, but
its overhead would not be appropriate for these APIs. I don't have
another idea that doesn't touch the allocation path.

Thanks.


Re: [PATCH v2 for v5.9] mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs

2020-09-28 Thread Joonsoo Kim
On Tue, Sep 29, 2020 at 8:52 AM Andrew Morton wrote:
>
> On Mon, 28 Sep 2020 17:50:46 +0900 js1...@gmail.com wrote:
>
> > From: Joonsoo Kim 
> >
> > memalloc_nocma_{save/restore} APIs can be used to skip page allocation
> > on CMA area, but, there is a missing case and the page on CMA area could
> > be allocated even if APIs are used. This patch handles this case to fix
> > the potential issue.
> >
> > Missing case is an allocation from the pcplist. MIGRATE_MOVABLE pcplist
> > could have the pages on CMA area so we need to skip it if ALLOC_CMA isn't
> > specified.
>
> Changelog doesn't describe the end-user visible effects of the bug.
> Please send this description?

How about this one?

memalloc_nocma_{save/restore} APIs can be used to skip page allocation
on CMA area, but, there is a missing case and the page on CMA area could
be allocated even if APIs are used. This patch handles this case to fix
the potential issue.

For now, these APIs are used to prevent long-term pinning of CMA pages.
When long-term pinning is requested on a CMA page, it is migrated to
a non-CMA page before pinning. This non-CMA page is allocated by using the
memalloc_nocma_{save/restore} APIs. If the APIs don't work as intended,
a CMA page is allocated and pinned for a long time. This long-term pin
on a CMA page causes cma_alloc() failures and could result in wrong
behaviour in a device driver that uses cma_alloc().

Missing case is an allocation from the pcplist. MIGRATE_MOVABLE pcplist
could have the pages on CMA area so we need to skip it if ALLOC_CMA isn't
specified.
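
For readers unfamiliar with the APIs, the intended usage pattern is roughly
the following sketch (the real caller is the long-term GUP migration path,
not this simplified snippet):

	unsigned int flags;

	flags = memalloc_nocma_save();
	/*
	 * Allocations in this window are supposed to skip CMA pageblocks.
	 * The missing case above is an order-0 allocation being served a
	 * CMA page straight from the pcplist.
	 */
	page = alloc_page(GFP_HIGHUSER_MOVABLE);
	memalloc_nocma_restore(flags);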

> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3367,9 +3367,16 @@ struct page *rmqueue(struct zone *preferred_zone,
> >   struct page *page;
> >
> >   if (likely(order == 0)) {
> > - page = rmqueue_pcplist(preferred_zone, zone, gfp_flags,
> > + /*
> > +  * MIGRATE_MOVABLE pcplist could have the pages on CMA area 
> > and
> > +  * we need to skip it when CMA area isn't allowed.
> > +  */
> > + if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
> > + migratetype != MIGRATE_MOVABLE) {
> > + page = rmqueue_pcplist(preferred_zone, zone, 
> > gfp_flags,
> >   migratetype, alloc_flags);
> > - goto out;
> > + goto out;
> > + }
> >   }
> >
> >   /*
>
> We still really really don't want to be adding overhead to the page
> allocation hotpath for a really really obscure feature from a single
> callsite.

In fact, if we clear the __GFP_MOVABLE flag when initializing the
allocation context, we can avoid CMA page allocation without
adding this overhead to the page allocation hotpath. But I think
this is not good practice since it cheats about the allocation type. There
would be a side effect; for example, we could not use the memory in
ZONE_MOVABLE in this case.

> Do we have an understanding of how many people's kernels are enabling
> CONFIG_CMA?

AFAIK, almost all embedded systems enable CONFIG_CMA, for example,
handsets, TVs, and in-car AVN systems. Recently, Roman made CMA usable for huge
page allocation, so the number of users is continuously increasing.

> I previously suggested retrying the allocation in
> __gup_longterm_locked() but you said "it cannot ensure that we
> eventually get the non-CMA page".  Please explain why?

To avoid allocating a CMA page, the pcplist should be empty. Retrying doesn't
guarantee that we will hit the case where the pcplist is empty. It increases
the probability of that case, but I think that relying on the
probability is not a good practice.

> What about manually emptying the pcplists beforehand?

It also only increases the probability. A schedule() or an interrupt after
emptying but before the allocation could invalidate the effect.

> Or byassing the pcplists for this caller and calling __rmqueue() directly?

What this patch does is this one.

> > @@ -3381,7 +3388,7 @@ struct page *rmqueue(struct zone *preferred_zone,
> >
> >   do {
> >   page = NULL;
> > - if (alloc_flags & ALLOC_HARDER) {
> > + if (order > 0 && alloc_flags & ALLOC_HARDER) {
> >   page = __rmqueue_smallest(zone, order, 
> > MIGRATE_HIGHATOMIC);
> >   if (page)
> >   trace_mm_page_alloc_zone_locked(page, order, 
> > migratetype);
>
> What does this hunk do?

MIGRATE_HIGHATOMIC is a reserved area for high-order atomic allocations.
This hunk makes order-0 allocations skip this area.

Thanks.


Re: [PATCH for v5.9] mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs

2020-09-25 Thread Joonsoo Kim
On Fri, Sep 25, 2020 at 5:55 PM Vlastimil Babka wrote:
>
> On 9/25/20 6:59 AM, Joonsoo Kim wrote:
> > > On Fri, Aug 28, 2020 at 8:54 AM Joonsoo Kim wrote:
> >
> > Hello, Andrew and Vlastimil.
> >
> > It's better to fix this possible bug introduced in v5.9-rc1 before
> > v5.9 is released.
> > Which approach do you prefer?
> > If it is determined, I will immediately send a patch as you suggested.
>
> Hmm both Mel and I preferred the bypass approach and nobody else weighted in, 
> so
> if you don't mind, you can use my suggestion. Hmm maybe alloc_flags & 
> ALLOC_CMA
> check should precede migratetype check in the if () to optimize for userspace
> allocations?

Okay!
I will implement a bypass approach and send it early next week.

Thanks.


Re: [PATCH for v5.9] mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs

2020-09-24 Thread Joonsoo Kim
On Fri, Aug 28, 2020 at 8:54 AM Joonsoo Kim wrote:
>
> On Thu, Aug 27, 2020 at 10:35 PM Mel Gorman wrote:
> >
> > On Wed, Aug 26, 2020 at 02:12:44PM +0900, Joonsoo Kim wrote:
> > > > > And, it requires to break current code
> > > > > layering that order-0 page is always handled by the pcplist. I'd 
> > > > > prefer
> > > > > to avoid it so this patch uses different way to skip CMA page 
> > > > > allocation
> > > > > from the pcplist.
> > > >
> > > > Well it would be much simpler and won't affect most of allocations. 
> > > > Better than
> > > > flushing pcplists IMHO.
> > >
> > > Hmm...Still, I'd prefer my approach.
> >
> > I prefer the pcp bypass approach. It's simpler and it does not incur a
> > pcp drain/refill penalty.
> >
> > > There are two reasons. First,
> > > layering problem
> > > mentioned above. In rmqueue(), there is a code for MIGRATE_HIGHATOMIC.
> > > As the name shows, it's for high order atomic allocation. But, after
> > > skipping pcplist
> > > allocation as you suggested, we could get there with order 0 request.
> >
> > I guess your concern is that under some circumstances that a request that
> > passes a watermark check could fail due to a highatomic reserve and to
> > an extent this is true. However, in that case the system is already low
> > on memory depending on the allocation context, the pcp lists may get
> > flushed anyway.
>
> My concern is that non-highorder (order-0) allocation could pollute/use the
> MIGRATE_HIGHATOMIC pageblock. It's reserved for highorder atomic
> allocation so it's not good if an order-0 request could get there. It would
> cause more fragmentation on that pageblock.
>
> > > We can also
> > > change this code, but, I'd hope to maintain current layering. Second,
> > > a performance
> > > reason. After the flag for nocma is up, a burst of nocma allocation
> > > could come. After
> > > flushing the pcplist one times, we can use the free page on the
> > > pcplist as usual until
> > > the context is changed.
> >
> > It's not guaranteed because CMA pages could be freed between the nocma save
> > and restore triggering further drains due to a reschedule.  Similarly,
> > a CMA allocation in parallel could refill with CMA pages on the per-cpu
> > list. While both cases are unlikely, it's more unpredictable than a
> > straight-forward pcp bypass.
>
> Agreed that it's unpredictable than the pcp bypass. But, as you said,
> those cases
> would be rare.
>
> > I don't really see it as a layering violation of the API because all
> > order-0 pages go through the PCP lists. The fact that order-0 is serviced
> > from the pcp list is an internal implementation detail, the API doesn't
> > care.
>
> What I mean is an internal implementation layering violation. We could make
> a rule even in internal implementation to make code simpler and maintainable.
> I guess that order-0 is serviced from the pcp list is one of those.
>
> Anyway, although I prefer my approach, I'm okay with using pcp bypass.

Hello, Andrew and Vlastimil.

It's better to fix this possible bug, introduced in v5.9-rc1, before
v5.9 is released.
Which approach do you prefer?
Once it is decided, I will immediately send a patch as you suggested.

Thanks.


Re: [PATCH for v5.9] mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs

2020-08-27 Thread Joonsoo Kim
On Thu, Aug 27, 2020 at 10:35 PM Mel Gorman wrote:
>
> On Wed, Aug 26, 2020 at 02:12:44PM +0900, Joonsoo Kim wrote:
> > > > And, it requires to break current code
> > > > layering that order-0 page is always handled by the pcplist. I'd prefer
> > > > to avoid it so this patch uses different way to skip CMA page allocation
> > > > from the pcplist.
> > >
> > > Well it would be much simpler and won't affect most of allocations. 
> > > Better than
> > > flushing pcplists IMHO.
> >
> > Hmm...Still, I'd prefer my approach.
>
> I prefer the pcp bypass approach. It's simpler and it does not incur a
> pcp drain/refill penalty.
>
> > There are two reasons. First,
> > layering problem
> > mentioned above. In rmqueue(), there is a code for MIGRATE_HIGHATOMIC.
> > As the name shows, it's for high order atomic allocation. But, after
> > skipping pcplist
> > allocation as you suggested, we could get there with order 0 request.
>
> I guess your concern is that under some circumstances that a request that
> passes a watermark check could fail due to a highatomic reserve and to
> an extent this is true. However, in that case the system is already low
> on memory depending on the allocation context, the pcp lists may get
> flushed anyway.

My concern is that a non-highorder (order-0) allocation could pollute/use the
MIGRATE_HIGHATOMIC pageblock. It's reserved for high-order atomic
allocations, so it's not good if an order-0 request can get there. It would
cause more fragmentation in that pageblock.

> > We can also
> > change this code, but, I'd hope to maintain current layering. Second,
> > a performance
> > reason. After the flag for nocma is up, a burst of nocma allocation
> > could come. After
> > flushing the pcplist one times, we can use the free page on the
> > pcplist as usual until
> > the context is changed.
>
> It's not guaranteed because CMA pages could be freed between the nocma save
> and restore triggering further drains due to a reschedule.  Similarly,
> a CMA allocation in parallel could refill with CMA pages on the per-cpu
> list. While both cases are unlikely, it's more unpredictable than a
> straight-forward pcp bypass.

Agreed that it's more unpredictable than the pcp bypass. But, as you said,
those cases would be rare.

> I don't really see it as a layering violation of the API because all
> order-0 pages go through the PCP lists. The fact that order-0 is serviced
> from the pcp list is an internal implementation detail, the API doesn't
> care.

What I mean is a layering violation within the internal implementation. We can
make rules even in the internal implementation to keep the code simpler and
maintainable. I guess that order-0 being serviced from the pcp list is one of
those.

Anyway, although I prefer my approach, I'm okay with using the pcp bypass.

Thanks.


Re: [PATCH for v5.9] mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs

2020-08-25 Thread Joonsoo Kim
On Wed, Aug 26, 2020 at 9:42 AM Andrew Morton wrote:
>
> On Tue, 25 Aug 2020 14:34:32 +0900 Joonsoo Kim  wrote:
>
> > >
> > > That's a bunch more code on a very hot path to serve an obscure feature
> > > which has a single obscure callsite.
> > >
> > > Can we instead put the burden on that callsite rather than upon
> > > everyone?  For (dumb) example, teach __gup_longterm_locked() to put the
> > > page back if it's CMA and go get another one?
> >
> > Hmm... Unfortunately, it cannot ensure that we eventually get the non-CMA 
> > page.
> > I think that the only way to ensure it is to implement the
> > functionality here. We can
> > use 'unlikely' or 'static branch' to reduce the overhead for a really
> > rare case but
> > for now I have no idea how to completely remove the overhead.
>
> Gee, there must be something?  Provide the gup code with a special
> entry point which takes the page straight from __rmqueue() and bypasses
> the pcp lists?

Hmm... it seems not to be possible. It's an allocation context API, and the
actual allocation happens in handle_mm_fault() or its successors. We cannot
use a special entry point for allocation there since it's a general function.

And, IMHO, making a special allocation function that bypasses the pcp list
would not be good practice.

Thanks.


Re: [PATCH for v5.9] mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs

2020-08-25 Thread Joonsoo Kim
On Tue, Aug 25, 2020 at 6:43 PM Vlastimil Babka wrote:
>
>
> On 8/25/20 6:59 AM, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> >
> > memalloc_nocma_{save/restore} APIs can be used to skip page allocation
> > on CMA area, but, there is a missing case and the page on CMA area could
> > be allocated even if APIs are used. This patch handles this case to fix
> > the potential issue.
> >
> > Missing case is an allocation from the pcplist. MIGRATE_MOVABLE pcplist
> > could have the pages on CMA area so we need to skip it if ALLOC_CMA isn't
> > specified.
> >
> > This patch implements this behaviour by checking allocated page from
> > the pcplist rather than skipping an allocation from the pcplist entirely.
> > Skipping the pcplist entirely would result in a mismatch between watermark
> > check and actual page allocation.
>
> Are you sure? I think a mismatch exists already. Pages can be on the pcplist 
> but
> they are not considered as free in the watermark check. So passing watermark
> check means there should be also pages on free lists. So skipping pcplists 
> would
> be safe, no?

You are right.

> > And, it requires to break current code
> > layering that order-0 page is always handled by the pcplist. I'd prefer
> > to avoid it so this patch uses different way to skip CMA page allocation
> > from the pcplist.
>
> Well it would be much simpler and won't affect most of allocations. Better 
> than
> flushing pcplists IMHO.

Hmm... Still, I'd prefer my approach, for two reasons. First, the layering
problem mentioned above. In rmqueue(), there is code for MIGRATE_HIGHATOMIC;
as the name shows, it's for high-order atomic allocation. But, after skipping
the pcplist allocation as you suggested, we could get there with an order-0
request. We could also change that code, but I'd prefer to maintain the
current layering. Second, a performance reason. After the nocma flag is set,
a burst of nocma allocations could come. After flushing the pcplist once, we
can keep using the free pages on the pcplist as usual until the context is
changed.

How about my reasoning?
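
For reference, a minimal sketch of the pcp bypass alternative could look
like the below (the exact condition and its placement in rmqueue() are
illustrative only, not a final patch):

    /* in rmqueue(): skip the pcplist for an order-0 request that must
     * avoid CMA, and fall through to the spin-locked buddy path */
    if (likely(order == 0)) {
            if (!IS_ENABLED(CONFIG_CMA) || (alloc_flags & ALLOC_CMA) ||
                migratetype != MIGRATE_MOVABLE) {
                    page = rmqueue_pcplist(preferred_zone, zone, gfp_flags,
                                           migratetype, alloc_flags);
                    goto out;
            }
    }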

Thanks.


Re: [PATCH for v5.9] mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs

2020-08-24 Thread Joonsoo Kim
On Tue, Aug 25, 2020 at 2:10 PM, Andrew Morton wrote:
>
> On Tue, 25 Aug 2020 13:59:42 +0900 js1...@gmail.com wrote:
>
> > From: Joonsoo Kim 
> >
> > memalloc_nocma_{save/restore} APIs can be used to skip page allocation
> > on CMA area, but, there is a missing case and the page on CMA area could
> > be allocated even if APIs are used. This patch handles this case to fix
> > the potential issue.
> >
> > Missing case is an allocation from the pcplist. MIGRATE_MOVABLE pcplist
> > could have the pages on CMA area so we need to skip it if ALLOC_CMA isn't
> > specified.
> >
> > This patch implements this behaviour by checking allocated page from
> > the pcplist rather than skipping an allocation from the pcplist entirely.
> > Skipping the pcplist entirely would result in a mismatch between watermark
> > check and actual page allocation. And, it requires to break current code
> > layering that order-0 page is always handled by the pcplist. I'd prefer
> > to avoid it so this patch uses different way to skip CMA page allocation
> > from the pcplist.
> >
> > ...
> >
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3341,6 +3341,22 @@ static struct page *rmqueue_pcplist(struct zone 
> > *preferred_zone,
> >   pcp = &this_cpu_ptr(zone->pageset)->pcp;
> >   list = &pcp->lists[migratetype];
> >   page = __rmqueue_pcplist(zone,  migratetype, alloc_flags, pcp, list);
> > +#ifdef CONFIG_CMA
> > + if (page) {
> > + int mt = get_pcppage_migratetype(page);
> > +
> > + /*
> > +  * pcp could have the pages on CMA area and we need to skip it
> > +  * when !ALLOC_CMA. Free all pcplist and retry allocation.
> > +  */
> > + if (is_migrate_cma(mt) && !(alloc_flags & ALLOC_CMA)) {
> > + list_add(&page->lru, &pcp->lists[migratetype]);
> > + pcp->count++;
> > + free_pcppages_bulk(zone, pcp->count, pcp);
> > + page = __rmqueue_pcplist(zone, migratetype, 
> > alloc_flags, pcp, list);
> > + }
> > + }
> > +#endif
> >   if (page) {
> >   __count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
> >   zone_statistics(preferred_zone, zone);
>
> That's a bunch more code on a very hot path to serve an obscure feature
> which has a single obscure callsite.
>
> Can we instead put the burden on that callsite rather than upon
> everyone?  For (dumb) example, teach __gup_longterm_locked() to put the
> page back if it's CMA and go get another one?

Hmm... Unfortunately, that cannot ensure that we eventually get a non-CMA
page. I think that the only way to ensure it is to implement the
functionality here. We can use 'unlikely' or a 'static branch' to reduce the
overhead for this really rare case, but for now I have no idea how to
completely remove the overhead.
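
Just to illustrate the 'static branch' idea (purely hypothetical, not a
tested patch): a global key flipped in memalloc_nocma_{save,restore} would
keep the check patched out of the hot path while no task is inside a nocma
scope.

    /* hypothetical: inc/dec this key in memalloc_nocma_save()/restore() */
    DEFINE_STATIC_KEY_FALSE(nocma_scope_active);

    /* in rmqueue_pcplist(), the CMA check then costs only a patched-out
     * branch unless some task has actually entered the nocma scope */
    if (static_branch_unlikely(&nocma_scope_active) && page &&
        is_migrate_cma(get_pcppage_migratetype(page)) &&
        !(alloc_flags & ALLOC_CMA)) {
            /* flush the pcplist and retry, as in the patch above */
    }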

Thanks.


Re: [PATCH v2] mm/page_alloc: fix memalloc_nocma_{save/restore} APIs

2020-07-23 Thread Joonsoo Kim
On Fri, Jul 24, 2020 at 12:14 PM, Andrew Morton wrote:
>
> On Fri, 24 Jul 2020 12:04:02 +0900 Joonsoo Kim  wrote:
>
> > On Fri, Jul 24, 2020 at 11:36 AM, Andrew Morton wrote:
> > >
> > > On Fri, 24 Jul 2020 11:23:52 +0900 Joonsoo Kim  wrote:
> > >
> > > > > > Second, clearing __GFP_MOVABLE in current_gfp_context() has a side 
> > > > > > effect
> > > > > > to exclude the memory on the ZONE_MOVABLE for allocation target.
> > > > >
> > > > > More whoops.
> > > > >
> > > > > Could we please have a description of the end-user-visible effects of
> > > > > this change?  Very much needed when proposing a -stable backport, I 
> > > > > think.
> > > >
> > > > In fact, there is no noticeable end-user-visible effect since the 
> > > > fallback would
> > > > cover the problematic case. It's mentioned in the commit description. 
> > > > Perhap,
> > > > performance would be improved due to reduced retry and more available 
> > > > memory
> > > > (we can use ZONE_MOVABLE with this patch) but it would be neglectable.
> > > >
> > > > > d7fefcc8de9147c is over a year old.  Why did we only just discover
> > > > > this?  This makes one wonder how serious those end-user-visible 
> > > > > effects
> > > > > are?
> > > >
> > > > As mentioned above, there is no visible problem to the end-user.
> > >
> > > OK, thanks.  In that case, I don't believe that a stable backport is
> > > appropriate?
> > >
> > > (Documentation/process/stable-kernel-rules.rst)
> >
> > Thanks for the pointer!
> >
> > Hmm... I'm not sure the correct way to handle this patch. I thought that
> > memalloc_nocma_{save,restore} is an API that is callable from the module.
> > If it is true, it's better to regard this patch as the stable candidate 
> > since
> > out-of-tree modules could use it without the fallback and it would cause
> > a problem. But, yes, there is no visible problem to the end-user, at least,
> > within the mainline so it is possibly not a stable candidate.
> >
> > Please share your opinion about this situation.
>
> We tend not to care much about out-of-tree modules.  I don't think a
> theoretical concern for out-of-tree code justifies risking the
> stability of -stable kernels.

Okay. It would be appreciated if you remove the stable tag. Otherwise, I
will send it again without the stable tag.

Thanks.


Re: [PATCH v2] mm/page_alloc: fix memalloc_nocma_{save/restore} APIs

2020-07-23 Thread Joonsoo Kim
On Fri, Jul 24, 2020 at 11:36 AM, Andrew Morton wrote:
>
> On Fri, 24 Jul 2020 11:23:52 +0900 Joonsoo Kim  wrote:
>
> > > > Second, clearing __GFP_MOVABLE in current_gfp_context() has a side 
> > > > effect
> > > > to exclude the memory on the ZONE_MOVABLE for allocation target.
> > >
> > > More whoops.
> > >
> > > Could we please have a description of the end-user-visible effects of
> > > this change?  Very much needed when proposing a -stable backport, I think.
> >
> > In fact, there is no noticeable end-user-visible effect since the fallback 
> > would
> > cover the problematic case. It's mentioned in the commit description. 
> > Perhap,
> > performance would be improved due to reduced retry and more available memory
> > (we can use ZONE_MOVABLE with this patch) but it would be neglectable.
> >
> > > d7fefcc8de9147c is over a year old.  Why did we only just discover
> > > this?  This makes one wonder how serious those end-user-visible effects
> > > are?
> >
> > As mentioned above, there is no visible problem to the end-user.
>
> OK, thanks.  In that case, I don't believe that a stable backport is
> appropriate?
>
> (Documentation/process/stable-kernel-rules.rst)

Thanks for the pointer!

Hmm... I'm not sure of the correct way to handle this patch. I thought that
memalloc_nocma_{save,restore} is an API that is callable from modules. If
that is true, it's better to regard this patch as a stable candidate, since
out-of-tree modules could use it without the fallback and that would cause
a problem. But, yes, there is no visible problem to the end-user, at least
within mainline, so it is possibly not a stable candidate.

Please share your opinion about this situation.

Thanks.


Re: [PATCH v2] mm/page_alloc: fix memalloc_nocma_{save/restore} APIs

2020-07-23 Thread Joonsoo Kim
On Fri, Jul 24, 2020 at 10:08 AM, Andrew Morton wrote:
>
> On Thu, 23 Jul 2020 10:49:02 +0900 js1...@gmail.com wrote:
>
> > From: Joonsoo Kim 
> >
> > Currently, memalloc_nocma_{save/restore} API that prevents CMA area
> > in page allocation is implemented by using current_gfp_context(). However,
> > there are two problems of this implementation.
> >
> > First, this doesn't work for allocation fastpath. In the fastpath,
> > original gfp_mask is used since current_gfp_context() is introduced in
> > order to control reclaim and it is on slowpath. So, CMA area can be
> > allocated through the allocation fastpath even if
> > memalloc_nocma_{save/restore} APIs are used.
>
> Whoops.
>
> > Currently, there is just
> > one user for these APIs and it has a fallback method to prevent actual
> > problem.
>
> Shouldn't the patch remove the fallback method?

It's not just a fallback; it also has its own functionality. So, we should
not remove it.

> > Second, clearing __GFP_MOVABLE in current_gfp_context() has a side effect
> > to exclude the memory on the ZONE_MOVABLE for allocation target.
>
> More whoops.
>
> Could we please have a description of the end-user-visible effects of
> this change?  Very much needed when proposing a -stable backport, I think.

In fact, there is no noticeable end-user-visible effect since the fallback
would cover the problematic case. It's mentioned in the commit description.
Perhaps performance would be improved due to reduced retries and more
available memory (we can use ZONE_MOVABLE with this patch), but it would be
negligible.

> d7fefcc8de9147c is over a year old.  Why did we only just discover
> this?  This makes one wonder how serious those end-user-visible effects
> are?

As mentioned above, there is no visible problem to the end-user.

Thanks.


Re: [PATCH] mm/page_alloc: fix memalloc_nocma_{save/restore} APIs

2020-07-22 Thread Joonsoo Kim
On Tue, Jul 21, 2020 at 9:43 PM, Matthew Wilcox wrote:
>
> On Tue, Jul 21, 2020 at 02:38:56PM +0200, Vlastimil Babka wrote:
> > On 7/21/20 2:05 PM, Matthew Wilcox wrote:
> > > On Tue, Jul 21, 2020 at 12:28:49PM +0900, js1...@gmail.com wrote:
> > >> @@ -4619,8 +4631,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned 
> > >> int order,
> > >>wake_all_kswapds(order, gfp_mask, ac);
> > >>
> > >>reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
> > >> -  if (reserve_flags)
> > >> +  if (reserve_flags) {
> > >>alloc_flags = reserve_flags;
> > >> +  alloc_flags = current_alloc_flags(gfp_mask, alloc_flags);
> > >> +  }
> > >
> > > Is this right?  Shouldn't you be passing reserve_flags to
> > > current_alloc_flags() here?  Also, there's no need to add { } here.
> >
> > Note the "alloc_flags = reserve_flags;" line is not being deleted here. But 
> > if
> > it was, your points would be true, and yeah it would be a bit more tidy.
>
> Oh ... I should wake up a little more.
>
> Yeah, I'd recommend just doing this:
>
> -   alloc_flags = reserve_flags;
> +   alloc_flags = current_alloc_flags(gfp_mask, reserve_flags);

Okay. I will change it. Just note that the reason I added it separately is
that I think the separation is more readable, since we can easily notice that
alloc_flags is changed to reserve_flags without inspecting the
current_alloc_flags() function.

Thanks.


Re: [PATCH] mm/page_alloc: fix memalloc_nocma_{save/restore} APIs

2020-07-22 Thread Joonsoo Kim
On Tue, Jul 21, 2020 at 9:39 PM, Vlastimil Babka wrote:
>
> On 7/21/20 2:05 PM, Matthew Wilcox wrote:
> > On Tue, Jul 21, 2020 at 12:28:49PM +0900, js1...@gmail.com wrote:
> >> +static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
> >> +unsigned int alloc_flags)
> >> +{
> >> +#ifdef CONFIG_CMA
> >> +unsigned int pflags = current->flags;
> >> +
> >> +if (!(pflags & PF_MEMALLOC_NOCMA) &&
> >> +gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> >> +alloc_flags |= ALLOC_CMA;
> >
> > Please don't indent by one tab when splitting a line because it looks like
> > the second line and third line are part of the same block.  Either do
> > this:
> >
> >   if (!(pflags & PF_MEMALLOC_NOCMA) &&
> >   gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> >   alloc_flags |= ALLOC_CMA;
> >
> > or this:
> >   if (!(pflags & PF_MEMALLOC_NOCMA) &&
> >   gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> >   alloc_flags |= ALLOC_CMA;
>
> Ah, good point.

Will change it.

Thanks.


Re: [PATCH v2 1/4] mm/page_alloc: fix non cma alloc context

2020-07-20 Thread Joonsoo Kim
On Tue, Jul 21, 2020 at 8:23 AM, Andrew Morton wrote:
>
> On Mon, 20 Jul 2020 13:56:15 +0900 js1...@gmail.com wrote:
>
> > Currently, preventing cma area in page allocation is implemented by using
> > current_gfp_context(). However, there are two problems of this
> > implementation.
> >
> > First, this doesn't work for allocation fastpath. In the fastpath,
> > original gfp_mask is used since current_gfp_context() is introduced in
> > order to control reclaim and it is on slowpath.
> > Second, clearing __GFP_MOVABLE has a side effect to exclude the memory
> > on the ZONE_MOVABLE for allocation target.
> >
> > To fix these problems, this patch changes the implementation to exclude
> > cma area in page allocation. Main point of this change is using the
> > alloc_flags. alloc_flags is mainly used to control allocation so it fits
> > for excluding cma area in allocation.
> >
> > Fixes: d7fefcc8de91 (mm/cma: add PF flag to force non cma alloc)
> > Cc: 
>
> This patch is against linux-next (or -mm) and has a lot of issues
> applying to mainline.  If we indeed wish to backport it to -stable, it
> should be against mainline, please.

I sent a revised patch against the mainline a minute ago. The subject and
commit description are updated.

Thanks.


Re: [PATCH v6 2/6] mm/vmscan: protect the workingset on anonymous LRU

2020-07-20 Thread Joonsoo Kim
On Fri, Jul 17, 2020 at 10:59 PM, Johannes Weiner wrote:
>
> On Wed, Jun 17, 2020 at 02:26:19PM +0900, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> >
> > In current implementation, newly created or swap-in anonymous page
> > is started on active list. Growing active list results in rebalancing
> > active/inactive list so old pages on active list are demoted to inactive
> > list. Hence, the page on active list isn't protected at all.
> >
> > Following is an example of this situation.
> >
> > Assume that 50 hot pages on active list. Numbers denote the number of
> > pages on active/inactive list (active | inactive).
> >
> > 1. 50 hot pages on active list
> > 50(h) | 0
> >
> > 2. workload: 50 newly created (used-once) pages
> > 50(uo) | 50(h)
> >
> > 3. workload: another 50 newly created (used-once) pages
> > 50(uo) | 50(uo), swap-out 50(h)
> >
> > This patch tries to fix this issue.
> > Like as file LRU, newly created or swap-in anonymous pages will be
> > inserted to the inactive list. They are promoted to active list if
> > enough reference happens. This simple modification changes the above
> > example as following.
> >
> > 1. 50 hot pages on active list
> > 50(h) | 0
> >
> > 2. workload: 50 newly created (used-once) pages
> > 50(h) | 50(uo)
> >
> > 3. workload: another 50 newly created (used-once) pages
> > 50(h) | 50(uo), swap-out 50(uo)
> >
> > As you can see, hot pages on active list would be protected.
> >
> > Note that, this implementation has a drawback that the page cannot
> > be promoted and will be swapped-out if re-access interval is greater than
> > the size of inactive list but less than the size of total(active+inactive).
> > To solve this potential issue, following patch will apply workingset
> > detection that is applied to file LRU some day before.
> >
> > v6: Before this patch, all anon pages (inactive + active) are considered
> > as workingset. However, with this patch, only active pages are considered
> > as workingset. So, file refault formula which uses the number of all
> > anon pages is changed to use only the number of active anon pages.
>
> I can see that also from the code, but it doesn't explain why.
>
> And I'm not sure this is correct. I can see two problems with it.
>
> After your patch series, there is still one difference between anon
> and file: cache trim mode. If the "use-once" anon dominate most of
> memory and you have a small set of heavily thrashing files, it would
> not get recognized. File refaults *have* to compare their distance to
> the *entire* anon set, or we could get trapped in cache trimming mode
> even as file pages with access frequencies <= RAM are thrashing.
>
> On the anon side, there is no cache trimming mode. But even if we're
> not in cache trimming mode and active file is already being reclaimed,
> we have to recognize thrashing on the anon side when reuse frequencies
> are within available RAM. Otherwise we treat an inactive file that is
> not being reused as having the same value as an anon page that is
> being reused. And then we may reclaim file and anon at the same rate
> even as anon is thrashing and file is not. That's not right.
>
> We need to activate everything with a reuse frequency <= RAM. Reuse
> frequency is refault distance plus size of the inactive list the page
> was on. This means anon distances should be compared to active anon +
> inactive file + active file, and file distances should be compared to
> active file + inactive_anon + active anon.

You're right. Maybe I was confused about something at that time. I will
change it as you suggested.
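
Roughly, the comparison would become something like the below (a sketch
only; lruvec, refault_distance, file and page come from the surrounding
workingset_refault() code, and the final form may differ):

    unsigned long workingset_size;

    if (file)
            workingset_size = lruvec_page_state(lruvec, NR_ACTIVE_FILE) +
                              lruvec_page_state(lruvec, NR_INACTIVE_ANON) +
                              lruvec_page_state(lruvec, NR_ACTIVE_ANON);
    else
            workingset_size = lruvec_page_state(lruvec, NR_ACTIVE_ANON) +
                              lruvec_page_state(lruvec, NR_INACTIVE_FILE) +
                              lruvec_page_state(lruvec, NR_ACTIVE_FILE);

    if (refault_distance <= workingset_size)
            SetPageWorkingset(page);        /* activate on this refault */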

Thanks.


Re: [PATCH 2/4] mm/gup: restrict CMA region by using allocation scope API

2020-07-17 Thread Joonsoo Kim
On Fri, Jul 17, 2020 at 5:26 PM, Michal Hocko wrote:
>
> On Fri 17-07-20 16:46:38, Joonsoo Kim wrote:
> > On Wed, Jul 15, 2020 at 5:24 PM, Michal Hocko wrote:
> > >
> > > On Wed 15-07-20 14:05:27, Joonsoo Kim wrote:
> > > > From: Joonsoo Kim 
> > > >
> > > > We have well defined scope API to exclude CMA region.
> > > > Use it rather than manipulating gfp_mask manually. With this change,
> > > > we can now use __GFP_MOVABLE for gfp_mask and the ZONE_MOVABLE is also
> > > > searched by page allocator. For hugetlb, gfp_mask is redefined since
> > > > it has a regular allocation mask filter for migration target.
> > > >
> > > > Note that this can be considered as a fix for the commit 9a4e9f3b2d73
> > > > ("mm: update get_user_pages_longterm to migrate pages allocated from
> > > > CMA region"). However, "Fixes" tag isn't added here since it is just
> > > > suboptimal but it doesn't cause any problem.
> > >
> > > But it is breaking the contract that the longterm pins never end up in a
> > > cma managed memory. So I think Fixes tag is really due. I am not sure
> > > about stable backport. If the patch was the trivial move of
> >
> > Previous implementation is correct since longterm pins never end up in a CMA
> > managed memory with that implementation. It's just a different and 
> > suboptimal
> > implementation to exclude the CMA area. This is why I don't add the "Fixes"A
> > tag on the patch.
>
> But the current implementation calls memalloc_nocma_restore too early so
> __gu_longterm_locked will migrate pages possibly to CMA ranges as there
> is no GFP_MOVABLE restriction in place. Or am I missing something?

IIUC, calling memalloc_nocma_restore() too early doesn't cause an actual
problem.

The final check is done by check_and_migrate_cma_pages(), which is outside of
the scope API. If we find a CMA page within the gup range there, the page is
migrated to a migration target page, and this target page is allocated by
new_non_cma_page().

new_non_cma_page() tries to allocate the page without __GFP_MOVABLE, so the
returned page will not be CMA area memory. Therefore, even if
memalloc_nocma_restore() is called early, there is no actual problem.
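
For reference, the relevant part of new_non_cma_page() is roughly the
following (simplified; the highmem/THP/hugetlb branches are omitted):

    static struct page *new_non_cma_page(struct page *page, unsigned long private)
    {
            int nid = page_to_nid(page);
            /* no __GFP_MOVABLE, so the target cannot come from CMA pageblocks */
            gfp_t gfp_mask = GFP_USER | __GFP_NOWARN;

            return __alloc_pages_node(nid, gfp_mask, 0);
    }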

Thanks.


Re: [PATCH 1/4] mm/page_alloc: fix non cma alloc context

2020-07-17 Thread Joonsoo Kim
On Fri, Jul 17, 2020 at 5:32 PM, David Laight wrote:
>
> From: js1...@gmail.com
> > Sent: 15 July 2020 06:05
> > From: Joonsoo Kim 
> >
> > Currently, preventing cma area in page allocation is implemented by using
> > current_gfp_context(). However, there are two problems of this
> > implementation.
> ...
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 6416d08..cd53894 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> ...
> > @@ -3693,6 +3691,16 @@ alloc_flags_nofragment(struct zone *zone, gfp_t 
> > gfp_mask)
> >   return alloc_flags;
> >  }
> >
> > +static inline void current_alloc_flags(gfp_t gfp_mask,
> > + unsigned int *alloc_flags)
> > +{
> > + unsigned int pflags = READ_ONCE(current->flags);
> > +
> > + if (!(pflags & PF_MEMALLOC_NOCMA) &&
> > + gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> > + *alloc_flags |= ALLOC_CMA;
> > +}
> > +
>
> I'd guess this would be easier to understand and probably generate
> better code if renamed and used as:
> alloc_flags |= can_alloc_cma(gpf_mask);
>
> Given it is a 'static inline' the compiler might end up
> generating the same code.
> If you needed to clear a flag doing:
> alloc_flags = current_alloc_flags(gpf_mask, alloc_flags);
> is much better for non-inlined function.

Vlastimil suggested this way and I agreed with that. Anyway, thanks for the
suggestion.

Thanks.


Re: [PATCH 1/4] mm/page_alloc: fix non cma alloc context

2020-07-17 Thread Joonsoo Kim
On Fri, Jul 17, 2020 at 5:15 PM, Vlastimil Babka wrote:
>
> On 7/17/20 10:10 AM, Vlastimil Babka wrote:
> > On 7/17/20 9:29 AM, Joonsoo Kim wrote:
> >> On Thu, Jul 16, 2020 at 4:45 PM, Vlastimil Babka wrote:
> >>>
> >>> On 7/16/20 9:27 AM, Joonsoo Kim wrote:
> >>> > On Wed, Jul 15, 2020 at 5:24 PM, Vlastimil Babka wrote:
> >>> >> >  /*
> >>> >> >   * get_page_from_freelist goes through the zonelist trying to 
> >>> >> > allocate
> >>> >> >   * a page.
> >>> >> > @@ -3706,6 +3714,8 @@ get_page_from_freelist(gfp_t gfp_mask, 
> >>> >> > unsigned int order, int alloc_flags,
> >>> >> >   struct pglist_data *last_pgdat_dirty_limit = NULL;
> >>> >> >   bool no_fallback;
> >>> >> >
> >>> >> > + current_alloc_flags(gfp_mask, &alloc_flags);
> >>> >>
> >>> >> I don't see why to move the test here? It will still be executed in the
> >>> >> fastpath, if that's what you wanted to avoid.
> >>> >
> >>> > I want to execute it on the fastpath, too. Reason that I moved it here
> >>> > is that alloc_flags could be reset on slowpath. See the code where
> >>> > __gfp_pfmemalloc_flags() is on. This is the only place that I can apply
> >>> > this option to all the allocation paths at once.
> >>>
> >>> But get_page_from_freelist() might be called multiple times in the 
> >>> slowpath, and
> >>> also anyone looking for gfp and alloc flags setup will likely not examine 
> >>> this
> >>> function. I don't see a problem in having it in two places that already 
> >>> deal
> >>> with alloc_flags setup, as it is now.
> >>
> >> I agree that anyone looking alloc flags will miss that function easily. 
> >> Okay.
> >> I will place it on its original place, although we now need to add one
> >> more place.
> >> *Three places* are gfp_to_alloc_flags(), prepare_alloc_pages() and
> >> __gfp_pfmemalloc_flags().
> >
> > Hm the check below should also work for ALLOC_OOM|ALLOC_NOCMA then.
> >
> > /* Avoid allocations with no watermarks from looping endlessly */
> >if (tsk_is_oom_victim(current) &&
> > (alloc_flags == ALLOC_OOM ||
> >  (gfp_mask & __GFP_NOMEMALLOC)))
> > goto nopage;
> >
> > Maybe it's simpler to change get_page_from_freelist() then. But document 
> > well.
>
> But then we have e.g. should_reclaim_retry() which calls __zone_watermark_ok()
> where ALLOC_CMA plays a role too, so that means we should have alloc_mask set 
> up
> correctly wrt ALLOC_CMA at the __alloc_pages_slowpath() level...

Good catch! Hmm... Okay. It needs to be handled in three places.
I will fix it in the next version. Anyway, we need some clean-up of the
alloc_flags handling since it is not good for maintainability.

Thanks.


Re: [PATCH 2/4] mm/gup: restrict CMA region by using allocation scope API

2020-07-17 Thread Joonsoo Kim
On Wed, Jul 15, 2020 at 5:24 PM, Michal Hocko wrote:
>
> On Wed 15-07-20 14:05:27, Joonsoo Kim wrote:
> > From: Joonsoo Kim 
> >
> > We have well defined scope API to exclude CMA region.
> > Use it rather than manipulating gfp_mask manually. With this change,
> > we can now use __GFP_MOVABLE for gfp_mask and the ZONE_MOVABLE is also
> > searched by page allocator. For hugetlb, gfp_mask is redefined since
> > it has a regular allocation mask filter for migration target.
> >
> > Note that this can be considered as a fix for the commit 9a4e9f3b2d73
> > ("mm: update get_user_pages_longterm to migrate pages allocated from
> > CMA region"). However, "Fixes" tag isn't added here since it is just
> > suboptimal but it doesn't cause any problem.
>
> But it is breaking the contract that the longterm pins never end up in a
> cma managed memory. So I think Fixes tag is really due. I am not sure
> about stable backport. If the patch was the trivial move of

The previous implementation is correct since longterm pins never end up in
CMA managed memory with that implementation. It's just a different and
suboptimal way to exclude the CMA area. This is why I didn't add the "Fixes"
tag on the patch.

> memalloc_nocma_restore then it would be probably worth it because it is
> trivial to review and backport. I suspect that longterm pins in CMA
> regions would cause hard to debug issues where CMA memory will not be
> available. But I am not really sure this is a real problem considering
> how many long term pin users we have and I have also no idea whether
> those are usually used along with CMA users.
>
> Anyway I think it would really be much better to isolate the
> memalloc_nocma_restore and have it first in the series. The reword of

Unfortunately, it's not possible to place it first in the series since the
memalloc_nocma_XXX API has a bug that could return CMA area memory even if
the scope is set up. It is fixed by the first patch in this series.

> the __GFP_MOVABLE functionality is orthogonal.

My logic is that we basically assume that using __GFP_MOVABLE is possible
in migration target allocation. Previously, it had to be cleared in order to
exclude the CMA area. Now we use another method to exclude the CMA area, so
__GFP_MOVABLE is added as usual. If you think that it deserves to be a
separate patch, I will do it in the next version.
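
For completeness, the scope API usage I'm referring to is simply the
following pattern (the allocation in between is illustrative):

    unsigned int flags = memalloc_nocma_save();

    /* any allocation in this scope must not return CMA pageblock memory,
     * even though __GFP_MOVABLE is set */
    struct page *page = alloc_pages(GFP_HIGHUSER_MOVABLE, 0);

    memalloc_nocma_restore(flags);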

> Btw __GFP_NOWARN change is not documented.

Will document it.

Thanks.


Re: [PATCH 1/4] mm/page_alloc: fix non cma alloc context

2020-07-17 Thread Joonsoo Kim
On Thu, Jul 16, 2020 at 4:45 PM, Vlastimil Babka wrote:
>
> On 7/16/20 9:27 AM, Joonsoo Kim wrote:
> > On Wed, Jul 15, 2020 at 5:24 PM, Vlastimil Babka wrote:
> >> >  /*
> >> >   * get_page_from_freelist goes through the zonelist trying to allocate
> >> >   * a page.
> >> > @@ -3706,6 +3714,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned 
> >> > int order, int alloc_flags,
> >> >   struct pglist_data *last_pgdat_dirty_limit = NULL;
> >> >   bool no_fallback;
> >> >
> >> > + current_alloc_flags(gfp_mask, &alloc_flags);
> >>
> >> I don't see why to move the test here? It will still be executed in the
> >> fastpath, if that's what you wanted to avoid.
> >
> > I want to execute it on the fastpath, too. Reason that I moved it here
> > is that alloc_flags could be reset on slowpath. See the code where
> > __gfp_pfmemalloc_flags() is on. This is the only place that I can apply
> > this option to all the allocation paths at once.
>
> But get_page_from_freelist() might be called multiple times in the slowpath, 
> and
> also anyone looking for gfp and alloc flags setup will likely not examine this
> function. I don't see a problem in having it in two places that already deal
> with alloc_flags setup, as it is now.

I agree that anyone looking at the alloc flags setup would easily miss that
function. Okay, I will put it back in its original place, although we now
need to add one more place. The *three places* are gfp_to_alloc_flags(),
prepare_alloc_pages() and __gfp_pfmemalloc_flags().
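
So, roughly, the check would be applied like the below (the placement is my
assumption for the next version):

    /* in gfp_to_alloc_flags() and prepare_alloc_pages() */
    alloc_flags = current_alloc_flags(gfp_mask, alloc_flags);

    /* and in __alloc_pages_slowpath(), after __gfp_pfmemalloc_flags() */
    reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
    if (reserve_flags)
            alloc_flags = current_alloc_flags(gfp_mask, reserve_flags);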

Thanks.


Re: [PATCH 1/4] mm/page_alloc: fix non cma alloc context

2020-07-16 Thread Joonsoo Kim
On Wed, Jul 15, 2020 at 5:24 PM, Vlastimil Babka wrote:
>
> On 7/15/20 7:05 AM, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> >
> > Currently, preventing cma area in page allocation is implemented by using
> > current_gfp_context(). However, there are two problems of this
> > implementation.
> >
> > First, this doesn't work for allocation fastpath. In the fastpath,
> > original gfp_mask is used since current_gfp_context() is introduced in
> > order to control reclaim and it is on slowpath.
> > Second, clearing __GFP_MOVABLE has a side effect to exclude the memory
> > on the ZONE_MOVABLE for allocation target.
> >
> > To fix these problems, this patch changes the implementation to exclude
> > cma area in page allocation. Main point of this change is using the
> > alloc_flags. alloc_flags is mainly used to control allocation so it fits
> > for excluding cma area in allocation.
>
> Agreed, should have been done with ALLOC_CMA since the beginning.
>
> > Fixes: d7fefcc (mm/cma: add PF flag to force non cma alloc)
>
> More digits please.
> Fixes: d7fefcc8de91 ("mm/cma: add PF flag to force non cma alloc")

Will do.

> > Cc: 
> > Signed-off-by: Joonsoo Kim 
> > ---
> >  include/linux/sched/mm.h |  4 
> >  mm/page_alloc.c  | 27 +++
> >  2 files changed, 15 insertions(+), 16 deletions(-)
> >
> > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > index 44ad5b7..a73847a 100644
> > --- a/include/linux/sched/mm.h
> > +++ b/include/linux/sched/mm.h
> > @@ -191,10 +191,6 @@ static inline gfp_t current_gfp_context(gfp_t flags)
> >   flags &= ~(__GFP_IO | __GFP_FS);
> >   else if (pflags & PF_MEMALLOC_NOFS)
> >   flags &= ~__GFP_FS;
>
> Above this hunk you should also remove PF_MEMALLOC_NOCMA from the test.

Will do.

> > -#ifdef CONFIG_CMA
> > - if (pflags & PF_MEMALLOC_NOCMA)
> > - flags &= ~__GFP_MOVABLE;
> > -#endif
> >   }
> >   return flags;
> >  }
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 6416d08..cd53894 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2791,7 +2791,7 @@ __rmqueue(struct zone *zone, unsigned int order, int 
> > migratetype,
> >* allocating from CMA when over half of the zone's free memory
> >* is in the CMA area.
> >*/
> > - if (migratetype == MIGRATE_MOVABLE &&
> > + if (alloc_flags & ALLOC_CMA &&
> >   zone_page_state(zone, NR_FREE_CMA_PAGES) >
> >   zone_page_state(zone, NR_FREE_PAGES) / 2) {
> >   page = __rmqueue_cma_fallback(zone, order);
> > @@ -2802,7 +2802,7 @@ __rmqueue(struct zone *zone, unsigned int order, int 
> > migratetype,
> >  retry:
> >   page = __rmqueue_smallest(zone, order, migratetype);
> >   if (unlikely(!page)) {
> > - if (migratetype == MIGRATE_MOVABLE)
> > + if (alloc_flags & ALLOC_CMA)
> >   page = __rmqueue_cma_fallback(zone, order);
> >
> >   if (!page && __rmqueue_fallback(zone, order, migratetype,
> > @@ -3502,11 +3502,9 @@ static inline long 
> > __zone_watermark_unusable_free(struct zone *z,
> >   if (likely(!alloc_harder))
> >   unusable_free += z->nr_reserved_highatomic;
> >
> > -#ifdef CONFIG_CMA
> >   /* If allocation can't use CMA areas don't use free CMA pages */
> > - if (!(alloc_flags & ALLOC_CMA))
> > + if (IS_ENABLED(CONFIG_CMA) && !(alloc_flags & ALLOC_CMA))
> >   unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
> > -#endif
> >
> >   return unusable_free;
> >  }
> > @@ -3693,6 +3691,16 @@ alloc_flags_nofragment(struct zone *zone, gfp_t 
> > gfp_mask)
> >   return alloc_flags;
> >  }
> >
> > +static inline void current_alloc_flags(gfp_t gfp_mask,
> > + unsigned int *alloc_flags)
> > +{
> > + unsigned int pflags = READ_ONCE(current->flags);
> > +
> > + if (!(pflags & PF_MEMALLOC_NOCMA) &&
> > + gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> > + *alloc_flags |= ALLOC_CMA;
> > +}
>
> I don't like the modification through parameter, would just do what
> current_gfp_context() does and return the modified value.
> Also make it a no-op (including no READ_ONC

Re: [PATCH v5 5/9] mm/migrate: make a standard migration target allocation function

2020-07-13 Thread Joonsoo Kim
On Mon, Jul 13, 2020 at 09:53:40AM +0200, Vlastimil Babka wrote:
> On 7/13/20 8:41 AM, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> 
> Nit: s/make/introduce/ in the subject, is a more common verb in this context.
> 

Thanks for correcting!
The following is the fixed version.

Thanks.

->8---
From e8e4533bbc56fff3a77d5bc9a40dda7a9efc83e8 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim 
Date: Wed, 1 Jul 2020 16:19:18 +1000
Subject: [PATCH v5 5/9] mm/migrate: introduce a standard migration target
 allocation function

There are some similar functions for migration target allocation.  Since
there is no fundamental difference, it's better to keep just one rather
than keeping all variants.  This patch implements base migration target
allocation function.  In the following patches, variants will be converted
to use this function.

Changes should be mechanical, but, unfortunately, there are some
differences. First, some callers' nodemask is assigned to NULL since a NULL
nodemask will be considered as all available nodes, that is,
node_states[N_MEMORY]. Second, for hugetlb page allocation, gfp_mask is
redefined as the regular hugetlb allocation gfp_mask plus __GFP_THISNODE if
the user-provided gfp_mask has it. This is because a future caller of this
function needs to set this node constraint. Lastly, if the provided nodeid
is NUMA_NO_NODE, the nodeid is set to the node where the migration source
lives. It helps to remove simple wrappers for setting up the nodeid.

Note that the PageHighmem() call in the previous function is changed to
open-coded "is_highmem_idx()" since it provides more readability.

Acked-by: Vlastimil Babka 
Acked-by: Michal Hocko 
Signed-off-by: Joonsoo Kim 
---
 include/linux/hugetlb.h | 15 +++
 include/linux/migrate.h |  9 +
 mm/internal.h   |  7 +++
 mm/memory-failure.c |  7 +--
 mm/memory_hotplug.c | 12 
 mm/migrate.c| 26 --
 mm/page_isolation.c |  7 +--
 7 files changed, 61 insertions(+), 22 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index bb93e95..6b9508d 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -701,6 +701,16 @@ static inline gfp_t htlb_alloc_mask(struct hstate *h)
return GFP_HIGHUSER;
 }
 
+static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
+{
+   gfp_t modified_mask = htlb_alloc_mask(h);
+
+   /* Some callers might want to enforce node */
+   modified_mask |= (gfp_mask & __GFP_THISNODE);
+
+   return modified_mask;
+}
+
 static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
   struct mm_struct *mm, pte_t *pte)
 {
@@ -888,6 +898,11 @@ static inline gfp_t htlb_alloc_mask(struct hstate *h)
return 0;
 }
 
+static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask)
+{
+   return 0;
+}
+
 static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
   struct mm_struct *mm, pte_t *pte)
 {
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 1d70b4a..cc56f0d 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -10,6 +10,8 @@
 typedef struct page *new_page_t(struct page *page, unsigned long private);
 typedef void free_page_t(struct page *page, unsigned long private);
 
+struct migration_target_control;
+
 /*
  * Return values from addresss_space_operations.migratepage():
  * - negative errno on page migration failure;
@@ -39,8 +41,7 @@ extern int migrate_page(struct address_space *mapping,
enum migrate_mode mode);
 extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
unsigned long private, enum migrate_mode mode, int reason);
-extern struct page *new_page_nodemask(struct page *page,
-   int preferred_nid, nodemask_t *nodemask);
+extern struct page *alloc_migration_target(struct page *page, unsigned long 
private);
 extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
 extern void putback_movable_page(struct page *page);
 
@@ -59,8 +60,8 @@ static inline int migrate_pages(struct list_head *l, 
new_page_t new,
free_page_t free, unsigned long private, enum migrate_mode mode,
int reason)
{ return -ENOSYS; }
-static inline struct page *new_page_nodemask(struct page *page,
-   int preferred_nid, nodemask_t *nodemask)
+static inline struct page *alloc_migration_target(struct page *page,
+   unsigned long private)
{ return NULL; }
 static inline int isolate_movable_page(struct page *page, isolate_mode_t mode)
{ return -EBUSY; }
diff --git a/mm/internal.h b/mm/internal.h
index dd14c53..0beacf3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -614,4 +614,11 @@ static inline bool is_migrate_highatomic_page(struct page 
*page)
 
 voi

Re: [PATCH v5 4/9] mm/migrate: clear __GFP_RECLAIM to make the migration callback consistent with regular THP allocations

2020-07-13 Thread Joonsoo Kim
On Mon, Jul 13, 2020 at 09:52:20AM +0200, Vlastimil Babka wrote:
> On 7/13/20 8:41 AM, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> > 
> > new_page_nodemask is a migration callback and it tries to use a common
> > gfp flags for the target page allocation whether it is a base page or a
> > THP. The later only adds GFP_TRANSHUGE to the given mask. This results
> > in the allocation being slightly more aggressive than necessary because
> > the resulting gfp mask will contain also __GFP_RECLAIM_KSWAPD. THP
> > allocations usually exclude this flag to reduce over eager background
> > reclaim during a high THP allocation load which has been seen during
> > large mmaps initialization. There is no indication that this is a
> > problem for migration as well but theoretically the same might happen
> > when migrating large mappings to a different node. Make the migration
> > callback consistent with regular THP allocations.
> > 
> > Signed-off-by: Joonsoo Kim 
> 
> Acked-by: Vlastimil Babka 
> 
> Thanks!
> 
> Typo below (I assume Andrew will fix it)
> 
> > ---
> >  mm/migrate.c | 5 +
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 3b3d918..1cfc965 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1547,6 +1547,11 @@ struct page *new_page_nodemask(struct page *page,
> > }
> >  
> > if (PageTransHuge(page)) {
> > +   /*
> > +* clear __GFP_RECALIM to make the migration callback
> 
>      __GFP_RECLAIM
> 

Okay. Here goes a fixed version.

Thanks!


-->8-
From 6273f02fd8b8ef066c10c4a8ba54ea9efe6e70cd Mon Sep 17 00:00:00 2001
From: Joonsoo Kim 
Date: Mon, 6 Jul 2020 14:34:04 +0900
Subject: [PATCH v5 4/9] mm/migrate: clear __GFP_RECLAIM to make the migration
 callback consistent with regular THP allocations

new_page_nodemask is a migration callback and it tries to use a common
gfp flags for the target page allocation whether it is a base page or a
THP. The later only adds GFP_TRANSHUGE to the given mask. This results
in the allocation being slightly more aggressive than necessary because
the resulting gfp mask will contain also __GFP_RECLAIM_KSWAPD. THP
allocations usually exclude this flag to reduce over eager background
reclaim during a high THP allocation load which has been seen during
large mmaps initialization. There is no indication that this is a
problem for migration as well but theoretically the same might happen
when migrating large mappings to a different node. Make the migration
callback consistent with regular THP allocations.

Acked-by: Michal Hocko 
Acked-by: Vlastimil Babka 
Signed-off-by: Joonsoo Kim 
---
 mm/migrate.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index 3b3d918..faabb2e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1547,6 +1547,11 @@ struct page *new_page_nodemask(struct page *page,
}
 
if (PageTransHuge(page)) {
+   /*
+* clear __GFP_RECLAIM to make the migration callback
+* consistent with regular THP allocations.
+*/
+   gfp_mask &= ~__GFP_RECLAIM;
gfp_mask |= GFP_TRANSHUGE;
order = HPAGE_PMD_ORDER;
}
-- 
2.7.4



Re: [PATCH v4 05/11] mm/migrate: clear __GFP_RECLAIM for THP allocation for migration

2020-07-09 Thread Joonsoo Kim
On Tue, Jul 7, 2020 at 9:17 PM, Vlastimil Babka wrote:
>
> On 7/7/20 9:44 AM, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> >
> > In mm/migrate.c, THP allocation for migration is called with the provided
> > gfp_mask | GFP_TRANSHUGE. This gfp_mask contains __GFP_RECLAIM and it
> > would be conflict with the intention of the GFP_TRANSHUGE.
> >
> > GFP_TRANSHUGE/GFP_TRANSHUGE_LIGHT is introduced to control the reclaim
> > behaviour by well defined manner since overhead of THP allocation is
> > quite large and the whole system could suffer from it. So, they deals
> > with __GFP_RECLAIM mask deliberately. If gfp_mask contains __GFP_RECLAIM
> > and uses gfp_mask | GFP_TRANSHUGE(_LIGHT) for THP allocation, it means
> > that it breaks the purpose of the GFP_TRANSHUGE(_LIGHT).
> >
> > This patch fixes this situation by clearing __GFP_RECLAIM in provided
> > gfp_mask. Note that there are some other THP allocations for migration
> > and they just uses GFP_TRANSHUGE(_LIGHT) directly. This patch would make
> > all THP allocation for migration consistent.
> >
> > Signed-off-by: Joonsoo Kim 
> > ---
> >  mm/migrate.c | 5 +
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 02b31fe..ecd7615 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1547,6 +1547,11 @@ struct page *new_page_nodemask(struct page *page,
> >   }
> >
> >   if (PageTransHuge(page)) {
> > + /*
> > +  * clear __GFP_RECALIM since GFP_TRANSHUGE is the gfp_mask
> > +  * that chooses the reclaim masks deliberately.
> > +  */
> > + gfp_mask &= ~__GFP_RECLAIM;
> >   gfp_mask |= GFP_TRANSHUGE;
>
> In addition to what Michal said...
>
> The mask is not passed to this function, so I would just redefine it, as is 
> done
> in the hugetlb case. We probably don't even need the __GFP_RETRY_MAYFAIL for 
> the
> THP case asi it's just there to prevent OOM kill (per commit 0f55685627d6d ) 
> and
> the costly order of THP is enough for that.

As I said in another reply, the provided __GFP_THISNODE should be handled,
so just redefining it would not work.

Thanks.


Re: [PATCH v4 06/11] mm/migrate: make a standard migration target allocation function

2020-07-09 Thread Joonsoo Kim
On Wed, Jul 8, 2020 at 4:00 AM, Michal Hocko wrote:
>
> On Tue 07-07-20 16:49:51, Vlastimil Babka wrote:
> > On 7/7/20 9:44 AM, js1...@gmail.com wrote:
> > > From: Joonsoo Kim 
> > >
> > > There are some similar functions for migration target allocation.  Since
> > > there is no fundamental difference, it's better to keep just one rather
> > > than keeping all variants.  This patch implements base migration target
> > > allocation function.  In the following patches, variants will be converted
> > > to use this function.
> > >
> > > Changes should be mechanical but there are some differences. First, Some
> > > callers' nodemask is assgined to NULL since NULL nodemask will be
> > > considered as all available nodes, that is, node_states[N_MEMORY].
> > > Second, for hugetlb page allocation, gfp_mask is ORed since a user could
> > > provide a gfp_mask from now on.
> >
> > I think that's wrong. See how htlb_alloc_mask() determines between
> > GFP_HIGHUSER_MOVABLE and GFP_HIGHUSER, but then you OR it with 
> > __GFP_MOVABLE so
> > it's always GFP_HIGHUSER_MOVABLE.

Indeed.

> Right you are! Not that it would make any real difference because only
> migrateable hugetlb pages will get __GFP_MOVABLE and so we shouldn't
> really end up here for !movable pages in the first place (not sure about
> soft offlining at this moment). But yeah it would be simply better to
> override gfp mask for hugetlb which we have been doing anyway.

Overriding the gfp mask doesn't work since some users will call this
function with __GFP_THISNODE. I will use hugepage_movable_supported() here
and clear __GFP_MOVABLE if needed.
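
Something like the below is what I have in mind (user_gfp_mask is a
placeholder for the caller-provided mask; the final form may differ):

    gfp_t gfp_mask = htlb_alloc_mask(h);    /* GFP_HIGHUSER_MOVABLE or GFP_HIGHUSER */

    /* keep only the node constraint from the caller's mask */
    gfp_mask |= (user_gfp_mask & __GFP_THISNODE);

    /* htlb_alloc_mask() already consults hugepage_movable_supported(), so
     * this explicit clear only matters if we start from the caller's mask */
    if (!hugepage_movable_supported(h))
            gfp_mask &= ~__GFP_MOVABLE;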

Thanks.


Re: [PATCH v4 04/11] mm/hugetlb: make hugetlb migration callback CMA aware

2020-07-09 Thread Joonsoo Kim
On Thu, Jul 9, 2020 at 3:43 PM, Michal Hocko wrote:
>
> On Wed 08-07-20 09:41:06, Michal Hocko wrote:
> > On Wed 08-07-20 16:16:02, Joonsoo Kim wrote:
> > > On Tue, Jul 07, 2020 at 01:22:31PM +0200, Vlastimil Babka wrote:
> > > > On 7/7/20 9:44 AM, js1...@gmail.com wrote:
> > > > > From: Joonsoo Kim 
> > > > >
> > > > > new_non_cma_page() in gup.c which try to allocate migration target 
> > > > > page
> > > > > requires to allocate the new page that is not on the CMA area.
> > > > > new_non_cma_page() implements it by removing __GFP_MOVABLE flag.  
> > > > > This way
> > > > > works well for THP page or normal page but not for hugetlb page.
> > > > >
> > > > > hugetlb page allocation process consists of two steps.  First is 
> > > > > dequeing
> > > > > from the pool.  Second is, if there is no available page on the queue,
> > > > > allocating from the page allocator.
> > > > >
> > > > > new_non_cma_page() can control allocation from the page allocator by
> > > > > specifying correct gfp flag.  However, dequeing cannot be controlled 
> > > > > until
> > > > > now, so, new_non_cma_page() skips dequeing completely.  It is a 
> > > > > suboptimal
> > > > > since new_non_cma_page() cannot utilize hugetlb pages on the queue so 
> > > > > this
> > > > > patch tries to fix this situation.
> > > > >
> > > > > This patch makes the deque function on hugetlb CMA aware and skip CMA
> > > > > pages if newly added skip_cma argument is passed as true.
> > > >
> > > > Hmm, can't you instead change dequeue_huge_page_node_exact() to test 
> > > > the PF_
> > > > flag and avoid adding bool skip_cma everywhere?
> > >
> > > Okay! Please check following patch.
> > > >
> > > > I think that's what Michal suggested [1] except he said "the code 
> > > > already does
> > > > by memalloc_nocma_{save,restore} API". It needs extending a bit though, 
> > > > AFAICS.
> > > > __gup_longterm_locked() indeed does the save/restore, but restore comes 
> > > > before
> > > > check_and_migrate_cma_pages() and thus new_non_cma_page() is called, so 
> > > > an
> > > > adjustment is needed there, but that's all?
> > > >
> > > > Hm the adjustment should be also done because save/restore is done 
> > > > around
> > > > __get_user_pages_locked(), but check_and_migrate_cma_pages() also calls
> > > > __get_user_pages_locked(), and that call not being between nocma save 
> > > > and
> > > > restore is thus also a correctness issue?
> > >
> > > Simply, I call memalloc_nocma_{save,restore} in new_non_cma_page(). It
> > > would not cause any problem.
> >
> > I believe a proper fix is the following. The scope is really defined for
> > FOLL_LONGTERM pins and pushing it inside check_and_migrate_cma_pages
> > will solve the problem as well but it imho makes more sense to do it in
> > the caller the same way we do for any others.
> >
> > Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages 
> > allocated from CMA region")
> >
> > I am not sure this is worth backporting to stable yet.
>
> Should I post it as a separate patch do you plan to include this into your 
> next version?

It's better to include it in my next version since this patch would cause a
conflict with the next tree, which includes v3 of this patchset.

Thanks.


Re: [PATCH v4 05/11] mm/migrate: clear __GFP_RECLAIM for THP allocation for migration

2020-07-08 Thread Joonsoo Kim
On Wed, Jul 8, 2020 at 4:48 PM, Michal Hocko wrote:
>
> On Wed 08-07-20 16:19:17, Joonsoo Kim wrote:
> > On Tue, Jul 07, 2020 at 01:40:19PM +0200, Michal Hocko wrote:
> [...]
> > Subject: [PATCH] mm/migrate: clear __GFP_RECLAIM for THP allocation for
> >  migration
> >
> > In migration target allocation functions, THP allocations uses different
> > gfp_mask, especially, in regard to the reclaim gfp_mask. There is no
> > reason to use different reclaim gfp_mask for each cases and it is
> > an obstacle to make a common function in order to clean-up migration
> > target allocation functions. This patch fixes this situation by using
> > common reclaim gfp_mask for THP allocation.
>
> I would find the following more understandable, feel free to reuse parts
> that you like:
> "
> new_page_nodemask is a migration callback and it tries to use a common
> gfp flags for the target page allocation whether it is a base page or a
> THP. The later only adds GFP_TRANSHUGE to the given mask. This results
> in the allocation being slightly more aggressive than necessary because
> the resulting gfp mask will contain also __GFP_RECLAIM_KSWAPD. THP
> allocations usually exclude this flag to reduce over eager background
> reclaim during a high THP allocation load which has been seen during
> large mmaps initialization. There is no indication that this is a
> problem for migration as well but theoretically the same might happen
> when migrating large mappings to a different node. Make the migration
> callback consistent with regular THP allocations.
> "

Looks good!
I will use this description.

Thanks.


Re: [PATCH v4 11/11] mm/memory_hotplug: remove a wrapper for alloc_migration_target()

2020-07-08 Thread Joonsoo Kim
On Wed, Jul 8, 2020 at 1:34 AM, Vlastimil Babka wrote:
>
> On 7/7/20 9:44 AM, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> >
> > To calculate the correct node to migrate the page for hotplug, we need
> > to check node id of the page. Wrapper for alloc_migration_target() exists
> > for this purpose.
> >
> > However, Vlastimil informs that all migration source pages come from
> > a single node. In this case, we don't need to check the node id for each
> > page and we don't need to re-set the target nodemask for each page by
> > using the wrapper. Set up the migration_target_control once and use it for
> > all pages.
> >
> > Signed-off-by: Joonsoo Kim 
>
> Acked-by: Vlastimil Babka 
>
> Thanks! Nitpick below.
>
> > @@ -1345,9 +1324,28 @@ do_migrate_range(unsigned long start_pfn, unsigned 
> > long end_pfn)
> >   put_page(page);
> >   }
> >   if (!list_empty(&source)) {
> > - /* Allocate a new page from the nearest neighbor node */
> > - ret = migrate_pages(&source, new_node_page, NULL, 0,
> > - MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
> > + nodemask_t nmask = node_states[N_MEMORY];
> > + struct migration_target_control mtc = {
> > + .nmask = &nmask,
> > + .gfp_mask = GFP_USER | __GFP_MOVABLE | 
> > __GFP_RETRY_MAYFAIL,
> > + };
> > +
> > + /*
> > +  * We have checked that migration range is on a single zone so
> > +  * we can use the nid of the first page to all the others.
> > +  */
> > + mtc.nid = page_to_nid(list_first_entry(&source, struct page, lru));
> > +
> > + /*
> > +  * try to allocate from a different node but reuse this node
> > +  * if there are no other online nodes to be used (e.g. we are
> > +  * offlining a part of the only existing node)
> > +  */
> > + node_clear(mtc.nid, *mtc.nmask);
> > + if (nodes_empty(*mtc.nmask))
> > + node_set(mtc.nid, *mtc.nmask);
>
> You could have kept using 'nmask' instead of '*mtc.nmask'. Actually that 
> applies
> to patch 6 too, for less churn.

You are right. I will change it.

Thanks.


Re: [PATCH v4 07/11] mm/gup: use a standard migration target allocation callback

2020-07-08 Thread Joonsoo Kim
On Tue, Jul 07, 2020 at 01:46:14PM +0200, Michal Hocko wrote:
> On Tue 07-07-20 16:44:45, Joonsoo Kim wrote:
> [...]
> > @@ -1551,9 +1552,12 @@ struct page *alloc_migration_target(struct page 
> > *page, unsigned long private)
> >  
> > gfp_mask |= htlb_alloc_mask(h);
> > return alloc_huge_page_nodemask(h, nid, mtc->nmask,
> > -   gfp_mask, false);
> > +   gfp_mask, mtc->skip_cma);
> > }
> >  
> > +   if (mtc->skip_cma)
> > +   flags = memalloc_nocma_save();
> > +
> 
> As already mentioned in previous email this is a completely wrong usage
> of the scope API. The scope should be defined by the caller and this
> should be all transparent by the allocator layer.

Okay. As with the newly sent patch for 04/11, this patch will also be changed.

Thanks.


Re: [PATCH v4 05/11] mm/migrate: clear __GFP_RECLAIM for THP allocation for migration

2020-07-08 Thread Joonsoo Kim
On Tue, Jul 07, 2020 at 01:40:19PM +0200, Michal Hocko wrote:
> On Tue 07-07-20 16:44:43, Joonsoo Kim wrote:
> > From: Joonsoo Kim 
> > 
> > In mm/migrate.c, THP allocation for migration is called with the provided
> > gfp_mask | GFP_TRANSHUGE. This gfp_mask contains __GFP_RECLAIM and it
> > would be conflict with the intention of the GFP_TRANSHUGE.
> > 
> > GFP_TRANSHUGE/GFP_TRANSHUGE_LIGHT is introduced to control the reclaim
> > behaviour by well defined manner since overhead of THP allocation is
> > quite large and the whole system could suffer from it. So, they deals
> > with __GFP_RECLAIM mask deliberately. If gfp_mask contains __GFP_RECLAIM
> > and uses gfp_mask | GFP_TRANSHUGE(_LIGHT) for THP allocation, it means
> > that it breaks the purpose of the GFP_TRANSHUGE(_LIGHT).
> 
> GFP_TRANSHUGE* is not a carved in stone design. Their primary reason to
> exist is to control how hard to try for different allocation
> paths/configurations because their latency expectations might be
> largerly different. It is mostly the #PF path which aims to be as
> lightweight as possible I believe nobody simply considered migration to be
> very significant to even care. And I am still not sure it matters but
> I would tend to agree that a consistency here is probably a very minor
> plus.
> 
> Your changelog is slightly misleading in that regard because it suggests
> that this is a real problem while it doesn't present any actual data.
> It would be really nice to make the effective change really stand out.
> We are only talking about __GFP_RECLAIM_KSWAPD here. So the only
> difference is that the migration won't wake up kswapd now.
> 
> All that being said the changelog should be probably more explicit about
> the fact that this is solely done for consistency and be honest that the
> runtime effect is not really clear. This would help people reading it in
> future.

Okay. How about the following changelog?

Thanks.

--->8
Subject: [PATCH] mm/migrate: clear __GFP_RECLAIM for THP allocation for
 migration

In migration target allocation functions, THP allocations uses different
gfp_mask, especially, in regard to the reclaim gfp_mask. There is no
reason to use different reclaim gfp_mask for each cases and it is
an obstacle to make a common function in order to clean-up migration
target allocation functions. This patch fixes this situation by using
common reclaim gfp_mask for THP allocation.


Re: [PATCH v4 05/11] mm/migrate: clear __GFP_RECLAIM for THP allocation for migration

2020-07-08 Thread Joonsoo Kim
On Tue, Jul 07, 2020 at 02:17:55PM +0200, Vlastimil Babka wrote:
> On 7/7/20 9:44 AM, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> > 
> > In mm/migrate.c, THP allocation for migration is called with the provided
> > gfp_mask | GFP_TRANSHUGE. This gfp_mask contains __GFP_RECLAIM and it
> > would be conflict with the intention of the GFP_TRANSHUGE.
> > 
> > GFP_TRANSHUGE/GFP_TRANSHUGE_LIGHT is introduced to control the reclaim
> > behaviour by well defined manner since overhead of THP allocation is
> > quite large and the whole system could suffer from it. So, they deals
> > with __GFP_RECLAIM mask deliberately. If gfp_mask contains __GFP_RECLAIM
> > and uses gfp_mask | GFP_TRANSHUGE(_LIGHT) for THP allocation, it means
> > that it breaks the purpose of the GFP_TRANSHUGE(_LIGHT).
> > 
> > This patch fixes this situation by clearing __GFP_RECLAIM in provided
> > gfp_mask. Note that there are some other THP allocations for migration
> > and they just uses GFP_TRANSHUGE(_LIGHT) directly. This patch would make
> > all THP allocation for migration consistent.
> > 
> > Signed-off-by: Joonsoo Kim 
> > ---
> >  mm/migrate.c | 5 +
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 02b31fe..ecd7615 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1547,6 +1547,11 @@ struct page *new_page_nodemask(struct page *page,
> > }
> >  
> > if (PageTransHuge(page)) {
> > +   /*
> > +* clear __GFP_RECALIM since GFP_TRANSHUGE is the gfp_mask
> > +* that chooses the reclaim masks deliberately.
> > +*/
> > +   gfp_mask &= ~__GFP_RECLAIM;
> > gfp_mask |= GFP_TRANSHUGE;
> 
> In addition to what Michal said...
> 
> The mask is not passed to this function, so I would just redefine it, as is 
> done
> in the hugetlb case. We probably don't even need the __GFP_RETRY_MAYFAIL for 
> the
> THP case asi it's just there to prevent OOM kill (per commit 0f55685627d6d ) 
> and
> the costly order of THP is enough for that.

Will check.

Thanks.



Re: [PATCH v4 04/11] mm/hugetlb: make hugetlb migration callback CMA aware

2020-07-08 Thread Joonsoo Kim
On Tue, Jul 07, 2020 at 01:22:31PM +0200, Vlastimil Babka wrote:
> On 7/7/20 9:44 AM, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> > 
> > new_non_cma_page() in gup.c which try to allocate migration target page
> > requires to allocate the new page that is not on the CMA area.
> > new_non_cma_page() implements it by removing __GFP_MOVABLE flag.  This way
> > works well for THP page or normal page but not for hugetlb page.
> > 
> > hugetlb page allocation process consists of two steps.  First is dequeing
> > from the pool.  Second is, if there is no available page on the queue,
> > allocating from the page allocator.
> > 
> > new_non_cma_page() can control allocation from the page allocator by
> > specifying correct gfp flag.  However, dequeing cannot be controlled until
> > now, so, new_non_cma_page() skips dequeing completely.  It is a suboptimal
> > since new_non_cma_page() cannot utilize hugetlb pages on the queue so this
> > patch tries to fix this situation.
> > 
> > This patch makes the deque function on hugetlb CMA aware and skip CMA
> > pages if newly added skip_cma argument is passed as true.
> 
> Hmm, can't you instead change dequeue_huge_page_node_exact() to test the PF_
> flag and avoid adding bool skip_cma everywhere?

Okay! Please check the following patch.
> 
> I think that's what Michal suggested [1] except he said "the code already does
> by memalloc_nocma_{save,restore} API". It needs extending a bit though, 
> AFAICS.
> __gup_longterm_locked() indeed does the save/restore, but restore comes before
> check_and_migrate_cma_pages() and thus new_non_cma_page() is called, so an
> adjustment is needed there, but that's all?
> 
> Hm the adjustment should be also done because save/restore is done around
> __get_user_pages_locked(), but check_and_migrate_cma_pages() also calls
> __get_user_pages_locked(), and that call not being between nocma save and
> restore is thus also a correctness issue?

I simply call memalloc_nocma_{save,restore} in new_non_cma_page(). It
should not cause any problem.

-->8---
From bcfc57e3c6f2df1ad2940308b89d740cd3f0fba8 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim 
Date: Wed, 8 Jul 2020 14:39:26 +0900
Subject: [PATCH] mm/hugetlb: make hugetlb migration callback CMA aware

new_non_cma_page() in gup.c, which tries to allocate a migration target
page, needs to allocate a new page that is not in the CMA area.
new_non_cma_page() implements this by removing the __GFP_MOVABLE flag.
This works well for THP and normal pages but not for hugetlb pages.

The hugetlb page allocation process consists of two steps. The first is
dequeuing from the pool. The second, if no page is available in the pool,
is allocating from the page allocator.

new_non_cma_page() can control allocation from the page allocator by
specifying the correct gfp flags. However, dequeuing could not be
controlled until now, so new_non_cma_page() skips dequeuing completely.
This is suboptimal since new_non_cma_page() cannot utilize hugetlb pages
already in the pool, and this patch fixes that situation.

This patch makes new_non_cma_page() use memalloc_nocma_{save,restore}
to exclude CMA memory rather than manually clearing __GFP_MOVABLE. It
also makes the dequeue function in hugetlb CMA aware: there, CMA pages
are skipped if the PF_MEMALLOC_NOCMA flag is set by
memalloc_nocma_{save,restore}.

Acked-by: Mike Kravetz 
Signed-off-by: Joonsoo Kim 
---
 include/linux/hugetlb.h |  2 --
 mm/gup.c| 32 +++-
 mm/hugetlb.c| 11 +--
 3 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index bb93e95..34a10e5 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -509,8 +509,6 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int 
preferred_nid,
nodemask_t *nmask, gfp_t gfp_mask);
 struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
unsigned long address);
-struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
-int nid, nodemask_t *nmask);
 int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
pgoff_t idx);
 
diff --git a/mm/gup.c b/mm/gup.c
index 5daadae..79142a9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1623,6 +1623,8 @@ static struct page *new_non_cma_page(struct page *page, 
unsigned long private)
 * allocation memory.
 */
gfp_t gfp_mask = GFP_USER | __GFP_NOWARN;
+   unsigned int flags = memalloc_nocma_save();
+   struct page *new_page = NULL;
 
if (PageHighMem(page))
gfp_mask |= __GFP_HIGHMEM;
@@ -1630,33 +163
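
For the base-page case, the resulting shape of new_non_cma_page() is
roughly the following; this is a simplified sketch only (the hugetlb and
THP branches are left out), not the patch itself:

static struct page *new_non_cma_page(struct page *page, unsigned long private)
{
        gfp_t gfp_mask = GFP_USER | __GFP_NOWARN;
        /* open a nocma scope instead of clearing __GFP_MOVABLE */
        unsigned int flags = memalloc_nocma_save();
        struct page *new_page;

        if (PageHighMem(page))
                gfp_mask |= __GFP_HIGHMEM;

        new_page = __alloc_pages_node(page_to_nid(page), gfp_mask, 0);

        memalloc_nocma_restore(flags);
        return new_page;
}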

Re: [PATCH v4 04/11] mm/hugetlb: make hugetlb migration callback CMA aware

2020-07-08 Thread Joonsoo Kim
On Tue, Jul 07, 2020 at 01:31:16PM +0200, Michal Hocko wrote:
> On Tue 07-07-20 16:44:42, Joonsoo Kim wrote:
> > From: Joonsoo Kim 
> > 
> > new_non_cma_page() in gup.c which try to allocate migration target page
> > requires to allocate the new page that is not on the CMA area.
> > new_non_cma_page() implements it by removing __GFP_MOVABLE flag.  This way
> > works well for THP page or normal page but not for hugetlb page.
> > 
> > hugetlb page allocation process consists of two steps.  First is dequeing
> > from the pool.  Second is, if there is no available page on the queue,
> > allocating from the page allocator.
> > 
> > new_non_cma_page() can control allocation from the page allocator by
> > specifying correct gfp flag.  However, dequeing cannot be controlled until
> > now, so, new_non_cma_page() skips dequeing completely.  It is a suboptimal
> > since new_non_cma_page() cannot utilize hugetlb pages on the queue so this
> > patch tries to fix this situation.
> > 
> > This patch makes the deque function on hugetlb CMA aware and skip CMA
> > pages if newly added skip_cma argument is passed as true.
> 
> I really dislike this as already mentioned in the previous version of
> the patch. You are making hugetlb and only one part of its allocator a
> special snowflake which is almost always a bad idea. Your changelog
> lacks any real justification for this inconsistency.
> 
> Also by doing so you are keeping an existing bug that the hugetlb
> allocator doesn't respect scope gfp flags as I have mentioned when
> reviewing the previous version. That bug likely doesn't matter now but
> it might in future and as soon as it is fixed all this is just a
> pointless exercise.
> 
> I do not have energy and time to burn to repeat that argumentation to I
> will let Mike to have a final word. Btw. you are keeping his acks even
> after considerable changes to patches which I am not really sure he is
> ok with.

As you replied a minute ago, Mike acked.

> > Acked-by: Mike Kravetz 
> > Signed-off-by: Joonsoo Kim 
> 
> To this particular patch.
> [...]
> 
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 5daadae..2c3dab4 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1630,11 +1630,12 @@ static struct page *new_non_cma_page(struct page 
> > *page, unsigned long private)
> >  #ifdef CONFIG_HUGETLB_PAGE
> > if (PageHuge(page)) {
> > struct hstate *h = page_hstate(page);
> > +
> > /*
> >  * We don't want to dequeue from the pool because pool pages 
> > will
> >  * mostly be from the CMA region.
> >  */
> > -   return alloc_migrate_huge_page(h, gfp_mask, nid, NULL);
> > +   return alloc_huge_page_nodemask(h, nid, NULL, gfp_mask, true);
> 
> Let me repeat that this whole thing is running under
> memalloc_nocma_save. So additional parameter is bogus.

As Vlastimil said in another reply, we are not under
memalloc_nocma_save(). Anyway, I now also think that an additional
parameter isn't needed.

> [...]
> > -static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
> > +static struct page *dequeue_huge_page_node_exact(struct hstate *h, int 
> > nid, bool skip_cma)
> 
> If you really insist on an additional parameter at this layer than it
> should be checking for the PF_MEMALLOC_NOCMA instead.

I will change the patch to check PF_MEMALLOC_NOCMA instead of
introducing and checking skip_cma.

> [...]
> > @@ -1971,21 +1977,29 @@ struct page *alloc_buddy_huge_page_with_mpol(struct 
> > hstate *h,
> >  
> >  /* page migration callback function */
> >  struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> > -   nodemask_t *nmask, gfp_t gfp_mask)
> > +   nodemask_t *nmask, gfp_t gfp_mask, bool skip_cma)
> >  {
> > +   unsigned int flags = 0;
> > +   struct page *page = NULL;
> > +
> > +   if (skip_cma)
> > +   flags = memalloc_nocma_save();
> 
> This is pointless for a scope that is already defined up in the call
> chain and fundamentally this is breaking the expected use of the scope
> API. The primary reason for that API to exist is to define the scope and
> have it sticky for _all_ allocation contexts. So if you have to use it
> deep in the allocator then you are doing something wrong.

As mentioned above, we are not under memalloc_nocma_save(). Anyway, I
will rework the patch and attach it to Vlastimil's reply. I'd appreciate
it if you could check it again.

Thanks.
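
A rough sketch of the dequeue-side check, with the flag tested directly
instead of a new skip_cma argument (illustrative only; helper names are
assumed from the hugetlb code of that time):

static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
{
        struct page *page;
        bool nocma = !!(current->flags & PF_MEMALLOC_NOCMA);

        list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
                if (nocma && is_migrate_cma_page(page))
                        continue;       /* honour the nocma scope */
                if (PageHWPoison(page))
                        continue;

                list_move(&page->lru, &h->hugepage_activelist);
                set_page_refcounted(page);
                h->free_huge_pages--;
                h->free_huge_pages_node[nid]--;
                return page;
        }
        return NULL;
}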


Re: linux-next: Tree for Jul 6 (mm/memory_failure.c)

2020-07-06 Thread Joonsoo Kim
On Mon, Jul 06, 2020 at 09:59:06AM -0700, Randy Dunlap wrote:
> On 7/6/20 12:40 AM, Stephen Rothwell wrote:
> > Hi all,
> > 
> > Changes since 20200703:
> > 
> 
> on i386:
> 
> when CONFIG_MIGRATION is not set/enabled:
> 
> ../mm/memory-failure.c: In function ‘new_page’:
> ../mm/memory-failure.c:1688:9: error: implicit declaration of function 
> ‘alloc_migration_target’; did you mean ‘alloc_migrate_target’? 
> [-Werror=implicit-function-declaration]
>   return alloc_migration_target(p, (unsigned long)&mtc);
>  ^~
> 
> 
> -- 
> ~Randy
> Reported-by: Randy Dunlap 

Hello,

Thanks for reporting.

Below is the fix for this error.
Andrew, could you squash this fix into the patch,
"mm-migrate-make-a-standard-target-allocation-function.patch"?

Thanks.


--->8-------
From 5fac269125dfb2d03e38a75319305e0e70b23a4b Mon Sep 17 00:00:00 2001
From: Joonsoo Kim 
Date: Tue, 7 Jul 2020 09:16:58 +0900
Subject: [PATCH] mm/migrate: fix for
 mm-migrate-make-a-standard-target-allocation-function.patch in mm tree

new_page_nodemask() is renamed to alloc_migration_target() in
mm-migrate-make-a-standard-target-allocation-function.patch, but
one declaration for the !CONFIG_MIGRATION case was missed. This patch fixes it.

Reported-by: Randy Dunlap 
Signed-off-by: Joonsoo Kim 
---
 include/linux/migrate.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 5e9c866..cc56f0d 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -60,8 +60,8 @@ static inline int migrate_pages(struct list_head *l, 
new_page_t new,
free_page_t free, unsigned long private, enum migrate_mode mode,
int reason)
{ return -ENOSYS; }
-static inline struct page *new_page_nodemask(struct page *page,
-   int preferred_nid, nodemask_t *nodemask)
+static inline struct page *alloc_migration_target(struct page *page,
+   unsigned long private)
{ return NULL; }
 static inline int isolate_movable_page(struct page *page, isolate_mode_t mode)
{ return -EBUSY; }
-- 
2.7.4



Re: [PATCH v3 8/8] mm/page_alloc: remove a wrapper for alloc_migration_target()

2020-07-06 Thread Joonsoo Kim
On Sat, Jul 4, 2020 at 1:18 AM, Vlastimil Babka wrote:
>
> On 6/23/20 8:13 AM, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> >
> > There is a well-defined standard migration target callback.
> > Use it directly.
> >
> > Signed-off-by: Joonsoo Kim 
>
> Acked-by: Vlastimil Babka 
>
> But you could move this to patch 5/8 to reduce churn. And do the same with

Yes, I now realize that it is possible to make this change earlier.
However, reordering the patches would cause additional churn, so I will
not change the order in the next version. The result would be the same. :)

> mm/memory-failure.c new_page() there really, to drop the simple wrappers. Only

Okay. As you suggested below, with NUMA_NO_NODE handling we can remove
more of the wrappers. I will do it.

> new_node_page() is complex enough.
> Hm wait, new_node_page() is only called by do_migrate_range() which is only
> called by __offline_pages() with explicit test that all pages are from a 
> single
> zone, so the nmask could also be setup just once and not per each page, making
> it possible to remove the wrapper.

I have tried this suggestion and found that the code is not simpler than
before. However, there would be a minor performance benefit, so I will
include this change, too.

> But for new_page() you would have to define that mtc->nid == NUMA_NO_NODE 
> means
> alloc_migrate_target() does page_to_nid(page) by itself.

Yes, I will use this suggestion.

Thanks.
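
The NUMA_NO_NODE handling would amount to a small fragment at the top of
alloc_migration_target(), roughly (sketch only, not the final code):

        struct migration_target_control *mtc;
        int nid;

        mtc = (struct migration_target_control *)private;
        nid = mtc->nid;
        if (nid == NUMA_NO_NODE)
                nid = page_to_nid(page);
        /* the rest of the function then uses "nid" throughout */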


Re: [PATCH v3 6/8] mm/gup: use a standard migration target allocation callback

2020-07-06 Thread Joonsoo Kim
On Sat, Jul 4, 2020 at 12:56 AM, Vlastimil Babka wrote:
>
> On 6/23/20 8:13 AM, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> >
> > There is a well-defined migration target allocation callback.
> > It's mostly similar with new_non_cma_page() except considering CMA pages.
> >
> > This patch adds a CMA consideration to the standard migration target
> > allocation callback and use it on gup.c.
> >
> > Signed-off-by: Joonsoo Kim 
>
> Acked-by: Vlastimil Babka 
>
> But a suggestion below.
>
> > ---
> >  mm/gup.c  | 57 
> > -
> >  mm/internal.h |  1 +
> >  mm/migrate.c  |  4 +++-
> >  3 files changed, 12 insertions(+), 50 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 15be281..f6124e3 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1608,56 +1608,15 @@ static bool check_dax_vmas(struct vm_area_struct 
> > **vmas, long nr_pages)
> >  }
> >
> >  #ifdef CONFIG_CMA
> > -static struct page *new_non_cma_page(struct page *page, unsigned long 
> > private)
> > +static struct page *alloc_migration_target_non_cma(struct page *page, 
> > unsigned long private)
> >  {
>
> ...
>
> > + struct migration_target_control mtc = {
> > + .nid = page_to_nid(page),
> > + .gfp_mask = GFP_USER | __GFP_NOWARN,
> > + .skip_cma = true,
> > + };
> >
> > - return __alloc_pages_node(nid, gfp_mask, 0);
> > + return alloc_migration_target(page, (unsigned long)&mtc);
>
> Do we really need this wrapper? The only user is check_and_migrate_cma_pages 
> so
> just opencode it?

This wrapper exists for setting up a different nid for each page.
However, as you suggested in the next reply, we can remove this wrapper if
NUMA_NO_NODE handling is added to the standard function. I will add NUMA_NO_NODE
handling to the standard function and remove this wrapper.

Thanks.


Re: [PATCH v3 3/8] mm/hugetlb: unify migration callbacks

2020-07-02 Thread Joonsoo Kim
On Fri, Jul 3, 2020 at 1:13 AM, Vlastimil Babka wrote:
>
> On 6/26/20 6:02 AM, Joonsoo Kim wrote:
> > On Thu, Jun 25, 2020 at 8:26 PM, Michal Hocko wrote:
> >>
> >> On Tue 23-06-20 15:13:43, Joonsoo Kim wrote:
> >> > From: Joonsoo Kim 
> >> >
> >> > There is no difference between two migration callback functions,
> >> > alloc_huge_page_node() and alloc_huge_page_nodemask(), except
> >> > __GFP_THISNODE handling. This patch adds an argument, gfp_mask, on
> >> > alloc_huge_page_nodemask() and replace the callsite for
> >> > alloc_huge_page_node() with the call to
> >> > alloc_huge_page_nodemask(..., __GFP_THISNODE).
> >> >
> >> > It's safe to remove a node id check in alloc_huge_page_node() since
> >> > there is no caller passing NUMA_NO_NODE as a node id.
> >>
> >> Yes this is indeed safe. alloc_huge_page_node used to be called from
> >> other internal hugetlb allocation layer and that allowed NUMA_NO_NODE as
> >> well. Now it is called only from the mempolicy migration callback and
> >> that always specifies a node and want to stick with that node.
> >>
> >> But I have to say I really dislike the gfp semantic because it is
> >> different from any other allocation function I can think of. It
> >> specifies what to be added rather than what should be used.
> >>
> >> Removing the function is ok but please use the full gfp mask instead
> >> or if that is impractical for some reason (wich shouldn't be the case
> >> as htlb_alloc_mask should be trivial to make static inline) make it
> >> explicit that this is not a gfp_mask but a gfp modifier and explicitly
> >> state which modifiers are allowed.
> >
> > Okay. I will try to solve your concern. Concrete solution is not yet 
> > prepared
> > but perhaps I will use full gfp_mask by using htlb_alloc_mask() in caller 
> > sites.
>
> Yeah, that should be feasible. alloc_huge_page_vma() already does
> htlb_alloc_mask(h). In alloc_new_node_page() and new_page_nodemask() it would 
> be
> consistent with the other cases handled there (THP and base).

Okay. Will check it.

Thanks.
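
For the migration callers, passing the full mask would then look roughly
like this (a sketch that assumes the gfp_mask argument discussed above is
added to alloc_huge_page_nodemask(); details may differ):

        if (PageHuge(page)) {
                struct hstate *h = page_hstate(compound_head(page));
                /* build the complete mask instead of a gfp modifier */
                gfp_t gfp_mask = htlb_alloc_mask(h);

                return alloc_huge_page_nodemask(h, preferred_nid,
                                                nodemask, gfp_mask);
        }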


Re: [PATCH v6 6/6] mm/vmscan: restore active/inactive ratio for anonymous LRU

2020-07-02 Thread Joonsoo Kim
On Thu, Jul 2, 2020 at 10:45 PM, Vlastimil Babka wrote:
>
> On 6/17/20 7:26 AM, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> >
> > Now, workingset detection is implemented for anonymous LRU.
> > We don't have to worry about the misfound for workingset due to
> > the ratio of active/inactive. Let's restore the ratio.
>
> How about:
>
> Now that workingset detection is implemented for anonymous LRU, we don't need
> large inactive list to allow detecting frequently accessed pages before they 
> are
> reclaimed, anymore. This effectively reverts the temporary measure put in by
> commit "mm/vmscan: make active/inactive ratio as 1:1 for anon lru".

Much better! I will use the comment you suggested. Thanks.

> > Acked-by: Johannes Weiner 
> > Signed-off-by: Joonsoo Kim 
>
> Acked-by: Vlastimil Babka 
>
> Thanks!
> I still hope Matthew can review updated patch 4/6 (I'm not really familiar 
> with
> proper xarray handling), and Johannes patch 5/6.

Okay, I hope so, too. :)

> And then we just need a nice Documentation file describing how reclaim really
> works after all the recent changes :)

Agreed.

Thanks.


Re: [PATCH v6 5/6] mm/swap: implement workingset detection for anonymous LRU

2020-07-02 Thread Joonsoo Kim
On Thu, Jul 2, 2020 at 10:37 PM, Vlastimil Babka wrote:
>
> On 6/17/20 7:26 AM, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> >
> > This patch implements workingset detection for anonymous LRU.
> > All the infrastructure is implemented by the previous patches so this patch
> > just activates the workingset detection by installing/retrieving
> > the shadow entry.
> >
> > Signed-off-by: Joonsoo Kim 
>
> Acked-by: Vlastimil Babka 
>
> > diff --git a/mm/workingset.c b/mm/workingset.c
> > index 8395e60..3769ae6 100644
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -353,8 +353,9 @@ void workingset_refault(struct page *page, void *shadow)
> >   /*
> >* Compare the distance to the existing workingset size. We
> >* don't activate pages that couldn't stay resident even if
> > -  * all the memory was available to the page cache. Whether
> > -  * cache can compete with anon or not depends on having swap.
> > +  * all the memory was available to the workingset. Whether
> > +  * workingset competetion need to consider anon or not depends
>
>   competition needs

Will fix it.

Thanks.


Re: [PATCH v6 3/6] mm/workingset: extend the workingset detection for anon LRU

2020-07-02 Thread Joonsoo Kim
On Thu, Jul 2, 2020 at 6:25 AM, Vlastimil Babka wrote:
>
> On 6/17/20 7:26 AM, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
>
> Hi,
>
> I would adjust the subject, as it sounds like the patch does the whole
> workingset detection, not just preparation.
> How about:
>
> mm/workingset: prepare the workingset infrastructure for anon LRU

Looks good. I will use it.

> > In the following patch, workingset detection will be applied to
> > anonymous LRU. To prepare it, this patch adds some code to
> > distinguish/handle the both LRUs.
>
> How about:
> To prepare for this, this patch splits workingset event counters for refault,
> activate and restore into anon and file variants, as well as the refaults
> counter in struct lruvec.

Will do.

> > v6: do not introduce a new nonresident_age for anon LRU since
> > we need to use *unified* nonresident_age to implement workingset
> > detection for anon LRU.
>
> Again, v6 update info shouldn't go to changelog. In this case I think it 
> doesn't
> need mentioning at all, at least not in this patch.

Okay. I agree that this should not be included in the changelog. I just
wanted to let anyone who reviewed the previous versions know that there is
an important change in this version.

> > Acked-by: Johannes Weiner 
> > Signed-off-by: Joonsoo Kim 
>
> Acked-by: Vlastimil Babka 

Thanks.


Re: [PATCH v6 2/6] mm/vmscan: protect the workingset on anonymous LRU

2020-07-02 Thread Joonsoo Kim
On Thu, Jul 2, 2020 at 3:02 AM, Vlastimil Babka wrote:
>
> On 6/17/20 7:26 AM, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
>
> Hi, how about a more descriptive subject, such as

Hello,

> mm/vmscan: add new anonymous pages to inactive LRU list

This patch does two things to implement workingset protection.

- add new anonymous pages to the inactive LRU list
- require two references before a page is activated

So, I think that the current subject is more descriptive for this patch.

> > In current implementation, newly created or swap-in anonymous page
> > is started on active list. Growing active list results in rebalancing
> > active/inactive list so old pages on active list are demoted to inactive
> > list. Hence, the page on active list isn't protected at all.
> >
> > Following is an example of this situation.
> >
> > Assume that 50 hot pages on active list. Numbers denote the number of
> > pages on active/inactive list (active | inactive).
> >
> > 1. 50 hot pages on active list
> > 50(h) | 0
> >
> > 2. workload: 50 newly created (used-once) pages
> > 50(uo) | 50(h)
> >
> > 3. workload: another 50 newly created (used-once) pages
> > 50(uo) | 50(uo), swap-out 50(h)
> >
> > This patch tries to fix this issue.
> > Like as file LRU, newly created or swap-in anonymous pages will be
> > inserted to the inactive list. They are promoted to active list if
> > enough reference happens. This simple modification changes the above
> > example as following.
> >
> > 1. 50 hot pages on active list
> > 50(h) | 0
> >
> > 2. workload: 50 newly created (used-once) pages
> > 50(h) | 50(uo)
> >
> > 3. workload: another 50 newly created (used-once) pages
> > 50(h) | 50(uo), swap-out 50(uo)
> >
> > As you can see, hot pages on active list would be protected.
> >
> > Note that, this implementation has a drawback that the page cannot
> > be promoted and will be swapped-out if re-access interval is greater than
> > the size of inactive list but less than the size of total(active+inactive).
> > To solve this potential issue, following patch will apply workingset
> > detection that is applied to file LRU some day before.
>
> detection similar to the one that's already applied to file LRU.

Will change!

> > v6: Before this patch, all anon pages (inactive + active) are considered
> > as workingset. However, with this patch, only active pages are considered
> > as workingset. So, file refault formula which uses the number of all
> > anon pages is changed to use only the number of active anon pages.
>
> a "v6" note is more suitable for a diffstat area than commit log, but it's 
> good
> to mention this so drop the 'v6:'?

Okay.

> > Acked-by: Johannes Weiner 
> > Signed-off-by: Joonsoo Kim 
>
> Acked-by: Vlastimil Babka 

Thanks!

> One more nit below.
>
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -476,23 +476,24 @@ void lru_cache_add(struct page *page)
> >  EXPORT_SYMBOL(lru_cache_add);
> >
> >  /**
> > - * lru_cache_add_active_or_unevictable
> > + * lru_cache_add_inactive_or_unevictable
> >   * @page:  the page to be added to LRU
> >   * @vma:   vma in which page is mapped for determining reclaimability
> >   *
> > - * Place @page on the active or unevictable LRU list, depending on its
> > + * Place @page on the inactive or unevictable LRU list, depending on its
> >   * evictability.  Note that if the page is not evictable, it goes
> >   * directly back onto it's zone's unevictable list, it does NOT use a
> >   * per cpu pagevec.
> >   */
> > -void lru_cache_add_active_or_unevictable(struct page *page,
> > +void lru_cache_add_inactive_or_unevictable(struct page *page,
> >struct vm_area_struct *vma)
> >  {
> > + bool unevictable;
> > +
> >   VM_BUG_ON_PAGE(PageLRU(page), page);
> >
> > - if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED))
> > - SetPageActive(page);
> > - else if (!TestSetPageMlocked(page)) {
> > + unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED;
> > + if (unevictable && !TestSetPageMlocked(page)) {
>
> I guess this could be "if (unlikely(unevictable) && ..." to match the previous
> if (likely(evictable)) else ...

I will fix it, too.

Thanks.
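
For context on the two-references point above, mark_page_accessed()
behaves roughly like this, greatly simplified (pagevec, compound-head and
idle-page handling omitted):

void mark_page_accessed(struct page *page)
{
        if (!PageReferenced(page)) {
                /* first reference: only mark the page as referenced */
                SetPageReferenced(page);
        } else if (!PageUnevictable(page) && !PageActive(page)) {
                /* second reference: promote the page to the active list */
                activate_page(page);
                ClearPageReferenced(page);
        }
}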


Re: [PATCH v6 1/6] mm/vmscan: make active/inactive ratio as 1:1 for anon lru

2020-07-01 Thread Joonsoo Kim
On Wed, Jul 1, 2020 at 2:27 AM, Vlastimil Babka wrote:
>
> On 6/17/20 7:26 AM, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> >
> > Current implementation of LRU management for anonymous page has some
> > problems. Most important one is that it doesn't protect the workingset,
> > that is, pages on the active LRU list. Although, this problem will be
> > fixed in the following patchset, the preparation is required and
> > this patch does it.
> >
> > What following patchset does is to restore workingset protection. In this
>
> "Restore" sounds as if the protection used to be there and then it was 
> removed.
> If it's the case, it should be said what commit did that. Otherwise I would 
> say
> "implement", not "restore"?
>
> > case, newly created or swap-in pages are started their lifetime on the
>
> I would rephrase it: "After the following patch, newly created or swap-in 
> pages
> will start their lifetime... "
>
> > inactive list. If inactive list is too small, there is not enough chance
> > to be referenced and the page cannot become the workingset.
> >
> > In order to provide enough chance to the newly anonymous pages, this patch
>
> "In order to provide the newly anonymous pages enough chance to be referenced
> again..."
>
> > makes active/inactive LRU ratio as 1:1.
>
> Here I would add:
>
> This is just a temporary measure. Later patch in the series introduces
> workingset detection for anonymous LRU that will be used to better decide if
> pages should start on the active and inactive list. Afterwards this patch is
> effectively reverted.
>
> > Acked-by: Johannes Weiner 
> > Signed-off-by: Joonsoo Kim 
>
> Acked-by: Vlastimil Babka 

Thanks!

I will change the commit description as you suggested.

Thanks.


Re: [PATCH v3 4/8] mm/hugetlb: make hugetlb migration callback CMA aware

2020-06-30 Thread Joonsoo Kim
On Tue, Jun 30, 2020 at 3:42 PM, Michal Hocko wrote:
>
> On Tue 30-06-20 15:30:04, Joonsoo Kim wrote:
> > On Mon, Jun 29, 2020 at 4:55 PM, Michal Hocko wrote:
> [...]
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 57ece74e3aae..c1595b1d36f3 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -1092,10 +1092,14 @@ static struct page 
> > > *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask,
> > >  /* Movability of hugepages depends on migration support. */
> > >  static inline gfp_t htlb_alloc_mask(struct hstate *h)
> > >  {
> > > +   gfp_t gfp;
> > > +
> > > if (hugepage_movable_supported(h))
> > > -   return GFP_HIGHUSER_MOVABLE;
> > > +   gfp = GFP_HIGHUSER_MOVABLE;
> > > else
> > > -   return GFP_HIGHUSER;
> > > +   gfp = GFP_HIGHUSER;
> > > +
> > > +   return current_gfp_context(gfp);
> > >  }
> > >
> > >  static struct page *dequeue_huge_page_vma(struct hstate *h,
> > >
> > > If we even fix this general issue for other allocations and allow a
> > > better CMA exclusion then it would be implemented consistently for
> > > everybody.
> >
> > Yes, I have reviewed the memalloc_nocma_{} APIs and found the better way
> > for CMA exclusion. I will do it after this patch is finished.
> >
> > > Does this make more sense to you are we still not on the same page wrt
> > > to the actual problem?
> >
> > Yes, but we have different opinions about it. As said above, I will make
> > a patch for better CMA exclusion after this patchset. It will make
> > code consistent.
> > I'd really appreciate it if you wait until then.
>
> As I've said I would _prefer_ simplicity over "correctness" if it is only
> partial and hard to reason about from the userspace experience but this
> is not something I would _insist_ on. If Mike as a maintainer of the
> code is ok with that then I will not stand in the way.

Okay.

> But please note that a missing current_gfp_context inside
> htlb_alloc_mask is a subtle bug. I do not think it matters right now but
> with a growing use of scoped apis this might actually hit some day so I
> believe we want to have it in place.

Okay. I will keep that in mind and consider it when fixing CMA exclusion
in the other patchset.

Thanks.
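
For reference, extending current_gfp_context() to also honour the nocma
scope could look something like the following sketch (an assumption about
a possible future change, not code that exists at this point):

static inline gfp_t current_gfp_context(gfp_t flags)
{
        if (unlikely(current->flags &
                     (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA))) {
                if (current->flags & PF_MEMALLOC_NOIO)
                        flags &= ~(__GFP_IO | __GFP_FS);
                else if (current->flags & PF_MEMALLOC_NOFS)
                        flags &= ~__GFP_FS;
#ifdef CONFIG_CMA
                /* the new part: strip the CMA selector inside a nocma scope */
                if (current->flags & PF_MEMALLOC_NOCMA)
                        flags &= ~__GFP_MOVABLE;
#endif
        }
        return flags;
}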


Re: [PATCH v3 5/8] mm/migrate: make a standard migration target allocation function

2020-06-30 Thread Joonsoo Kim
On Mon, Jun 29, 2020 at 5:03 PM, Michal Hocko wrote:
>
> On Mon 29-06-20 15:41:37, Joonsoo Kim wrote:
> > On Fri, Jun 26, 2020 at 4:33 PM, Michal Hocko wrote:
> > >
> > > On Fri 26-06-20 14:02:49, Joonsoo Kim wrote:
> > > > On Thu, Jun 25, 2020 at 9:05 PM, Michal Hocko wrote:
> > > > >
> > > > > On Tue 23-06-20 15:13:45, Joonsoo Kim wrote:
> > > [...]
> > > > > > -struct page *new_page_nodemask(struct page *page,
> > > > > > - int preferred_nid, nodemask_t 
> > > > > > *nodemask)
> > > > > > +struct page *alloc_migration_target(struct page *page, unsigned 
> > > > > > long private)
> > > > > >  {
> > > > > > - gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | 
> > > > > > __GFP_RETRY_MAYFAIL;
> > > > > > + struct migration_target_control *mtc;
> > > > > > + gfp_t gfp_mask;
> > > > > >   unsigned int order = 0;
> > > > > >   struct page *new_page = NULL;
> > > > > > + int zidx;
> > > > > > +
> > > > > > + mtc = (struct migration_target_control *)private;
> > > > > > + gfp_mask = mtc->gfp_mask;
> > > > > >
> > > > > >   if (PageHuge(page)) {
> > > > > >   return alloc_huge_page_nodemask(
> > > > > > - page_hstate(compound_head(page)),
> > > > > > - preferred_nid, nodemask, 0, false);
> > > > > > + page_hstate(compound_head(page)), 
> > > > > > mtc->nid,
> > > > > > + mtc->nmask, gfp_mask, false);
> > > > > >   }
> > > > > >
> > > > > >   if (PageTransHuge(page)) {
> > > > > > + gfp_mask &= ~__GFP_RECLAIM;
> > > > >
> > > > > What's up with this gfp_mask modification?
> > > >
> > > > THP page allocation uses a standard gfp masks, GFP_TRANSHUGE_LIGHT and
> > > > GFP_TRANHUGE. __GFP_RECLAIM flags is a big part of this standard mask 
> > > > design.
> > > > So, I clear it here so as not to disrupt the THP gfp mask.
> > >
> > > Why this wasn't really needed before? I guess I must be missing
> > > something here. This patch should be mostly mechanical convergence of
> > > existing migration callbacks but this change adds a new behavior AFAICS.
> >
> > Before this patch, a user cannot specify a gfp_mask and THP allocation
> > uses GFP_TRANSHUGE
> > statically.
>
> Unless I am misreading there are code paths (e.g.new_page_nodemask) which 
> simply use
> add GFP_TRANSHUGE to GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL. And
> this goes all the way to thp migration introduction.

Ahh... Indeed, I missed that. There are multiple THP migration target
allocation functions; some use GFP_TRANSHUGE plus an extra mask and so
don't include __GFP_KSWAPD_RECLAIM, but the others include
__GFP_KSWAPD_RECLAIM due to the original GFP_USER.
Thanks for clarifying.

> > After this patch, a user can specify a gfp_mask and it
> > could conflict with GFP_TRANSHUGE.
> > This code tries to avoid this conflict.
> >
> > > It would effectively drop __GFP_RETRY_MAYFAIL and __GFP_KSWAPD_RECLAIM
> >
> > __GFP_RETRY_MAYFAIL isn't dropped. __GFP_RECLAIM is
> > "___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM".
> > So, __GFP_KSWAPD_RECLAIM would be dropped for THP allocation.
> > IIUC, THP allocation doesn't use __GFP_KSWAPD_RECLAIM since it's
> > overhead is too large and this overhead should be given to the caller
> > rather than system thread (kswapd) and so on.
>
> Yes, there is a reason why KSWAPD is excluded from THP allocations in
> the page fault path. Maybe we want to extend that behavior to the
> migration as well. I do not have a strong opinion on that because I
> haven't seen excessive kswapd reclaim due to THP migrations. They are
> likely too rare.
>
> But as I've said in my previous email. Make this a separate patch with
> an explanation why we want this.

Okay. I will make a separate patch that clears __GFP_RECLAIM from the
passed gfp_mask to extend that behavior. It will make THP migration target
allocation consistent. :)

Thanks.


Re: [PATCH v3 4/8] mm/hugetlb: make hugetlb migration callback CMA aware

2020-06-30 Thread Joonsoo Kim
On Mon, Jun 29, 2020 at 4:55 PM, Michal Hocko wrote:
>
> On Mon 29-06-20 15:27:25, Joonsoo Kim wrote:
> [...]
> > Solution that Introduces a new
> > argument doesn't cause this problem while avoiding CMA regions.
>
> My primary argument is that there is no real reason to treat hugetlb
> dequeing somehow differently. So if we simply exclude __GFP_MOVABLE for
> _any_ other allocation then this certainly has some drawbacks on the
> usable memory for the migration target and it can lead to allocation
> failures (especially on movable_node setups where the amount of movable
> memory might be really high) and therefore longterm gup failures. And
> yes those failures might be premature. But my point is that the behavior
> would be _consistent_. So a user wouldn't see random failures for some
> types of pages while a success for others.

Hmm... I don't agree with your argument. Excluding __GFP_MOVABLE is
a *work-around* for excluding CMA regions; the dequeuing implementation
in this patch is the right way to exclude them. Why should we use a
work-around in this case? Consistency is important, but it's only
meaningful if the behavior is correct, and it should not get in the way
of better code. Also, dequeuing is already a special process that only
exists for hugetlb, so I think that using a different (correct)
implementation there doesn't break any consistency.

> Let's have a look at this patch. It is simply working that around the
> restriction for a very limited types of pages - only hugetlb pages
> which have reserves in non-cma movable pools. I would claim that many
> setups will simply not have many (if any) spare hugetlb pages in the
> pool except for temporary time periods when a workload is (re)starting
> because this would be effectively a wasted memory.

This cannot be a reason to block the correct code.

> The patch is adding a special case flag to claim what the code already
> does by memalloc_nocma_{save,restore} API so the information is already
> there. Sorry I didn't bring this up earlier but I have completely forgot
> about its existence. With that one in place I do agree that dequeing
> needs a fixup but that should be something like the following instead.

Thanks for letting me know. I didn't know about it until now. It looks
like it's better to use this API rather than introducing a new argument.

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 57ece74e3aae..c1595b1d36f3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1092,10 +1092,14 @@ static struct page *dequeue_huge_page_nodemask(struct 
> hstate *h, gfp_t gfp_mask,
>  /* Movability of hugepages depends on migration support. */
>  static inline gfp_t htlb_alloc_mask(struct hstate *h)
>  {
> +   gfp_t gfp;
> +
> if (hugepage_movable_supported(h))
> -   return GFP_HIGHUSER_MOVABLE;
> +   gfp = GFP_HIGHUSER_MOVABLE;
> else
> -   return GFP_HIGHUSER;
> +   gfp = GFP_HIGHUSER;
> +
> +   return current_gfp_context(gfp);
>  }
>
>  static struct page *dequeue_huge_page_vma(struct hstate *h,
>
> If we even fix this general issue for other allocations and allow a
> better CMA exclusion then it would be implemented consistently for
> everybody.

Yes, I have reviewed the memalloc_nocma_{} APIs and found a better way
for CMA exclusion. I will do it after this patch is finished.

> Does this make more sense to you are we still not on the same page wrt
> to the actual problem?

Yes, but we have different opinions about it. As said above, I will make
a patch for better CMA exclusion after this patchset. It will make the
code consistent.
I'd really appreciate it if you could wait until then.

Thanks.


Re: [PATCH v3 5/8] mm/migrate: make a standard migration target allocation function

2020-06-29 Thread Joonsoo Kim
On Fri, Jun 26, 2020 at 4:33 PM, Michal Hocko wrote:
>
> On Fri 26-06-20 14:02:49, Joonsoo Kim wrote:
> > On Thu, Jun 25, 2020 at 9:05 PM, Michal Hocko wrote:
> > >
> > > On Tue 23-06-20 15:13:45, Joonsoo Kim wrote:
> [...]
> > > > -struct page *new_page_nodemask(struct page *page,
> > > > - int preferred_nid, nodemask_t *nodemask)
> > > > +struct page *alloc_migration_target(struct page *page, unsigned long 
> > > > private)
> > > >  {
> > > > - gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL;
> > > > + struct migration_target_control *mtc;
> > > > + gfp_t gfp_mask;
> > > >   unsigned int order = 0;
> > > >   struct page *new_page = NULL;
> > > > + int zidx;
> > > > +
> > > > + mtc = (struct migration_target_control *)private;
> > > > + gfp_mask = mtc->gfp_mask;
> > > >
> > > >   if (PageHuge(page)) {
> > > >   return alloc_huge_page_nodemask(
> > > > - page_hstate(compound_head(page)),
> > > > - preferred_nid, nodemask, 0, false);
> > > > + page_hstate(compound_head(page)), 
> > > > mtc->nid,
> > > > + mtc->nmask, gfp_mask, false);
> > > >   }
> > > >
> > > >   if (PageTransHuge(page)) {
> > > > + gfp_mask &= ~__GFP_RECLAIM;
> > >
> > > What's up with this gfp_mask modification?
> >
> > THP page allocation uses a standard gfp masks, GFP_TRANSHUGE_LIGHT and
> > GFP_TRANHUGE. __GFP_RECLAIM flags is a big part of this standard mask 
> > design.
> > So, I clear it here so as not to disrupt the THP gfp mask.
>
> Why this wasn't really needed before? I guess I must be missing
> something here. This patch should be mostly mechanical convergence of
> existing migration callbacks but this change adds a new behavior AFAICS.

Before this patch, a user could not specify a gfp_mask, and THP
allocation used GFP_TRANSHUGE statically. After this patch, a user can
specify a gfp_mask and it could conflict with GFP_TRANSHUGE.
This code tries to avoid that conflict.

> It would effectively drop __GFP_RETRY_MAYFAIL and __GFP_KSWAPD_RECLAIM

__GFP_RETRY_MAYFAIL isn't dropped. __GFP_RECLAIM is
"___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM",
so only __GFP_KSWAPD_RECLAIM would be dropped for THP allocation.
IIUC, THP allocation doesn't use __GFP_KSWAPD_RECLAIM since its overhead
is too large, and that overhead should be taken by the caller rather than
by a system thread (kswapd).

Thanks.


Re: [PATCH v3 4/8] mm/hugetlb: make hugetlb migration callback CMA aware

2020-06-29 Thread Joonsoo Kim
On Fri, Jun 26, 2020 at 4:23 PM, Michal Hocko wrote:
>
> On Fri 26-06-20 13:49:15, Joonsoo Kim wrote:
> > On Thu, Jun 25, 2020 at 8:54 PM, Michal Hocko wrote:
> > >
> > > On Tue 23-06-20 15:13:44, Joonsoo Kim wrote:
> > > > From: Joonsoo Kim 
> > > >
> > > > new_non_cma_page() in gup.c which try to allocate migration target page
> > > > requires to allocate the new page that is not on the CMA area.
> > > > new_non_cma_page() implements it by removing __GFP_MOVABLE flag. This 
> > > > way
> > > > works well for THP page or normal page but not for hugetlb page.
> > >
> > > Could you explain why? I mean why cannot you simply remove __GFP_MOVABLE
> > > flag when calling alloc_huge_page_nodemask and check for it in dequeue
> > > path?
> >
> > If we remove __GFP_MOVABLE when calling alloc_huge_page_nodemask, we cannot
> > use the page in ZONE_MOVABLE on dequeing.
> >
> > __GFP_MOVABLE is not only used for CMA selector but also used for zone
> > selector.  If we clear it, we cannot use the page in the ZONE_MOVABLE
> > even if it's not CMA pages.  For THP page or normal page allocation,
> > there is no way to avoid this weakness without introducing another
> > flag or argument. For me, introducing another flag or argument for
> > these functions looks over-engineering so I don't change them and
> > leave them as they are (removing __GFP_MOVABLE).
> >
> > But, for alloc_huge_page_nodemask(), introducing a new argument
> > doesn't seem to be a problem since it is not a general function but
> > just a migration target allocation function.
>
> I really do not see why hugetlb and only the dequeing part should be
> special. This just leads to a confusion. From the code point of view it
> makes perfect sense to opt out CMA regions for !__GFP_MOVABLE when
> dequeing. So I would rather see a consistent behavior than a special
> case deep in the hugetlb allocator layer.

It seems that there is a misunderstanding. It's possible to opt out of CMA
regions for !__GFP_MOVABLE when dequeuing, and that is reasonable. But for
!__GFP_MOVABLE we don't search for hugetlb pages in ZONE_MOVABLE when
dequeuing, since the zones searched are limited by gfp_zone(gfp_mask). A
solution that introduces a new argument doesn't have this problem while
still avoiding CMA regions.

Thanks.
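
To illustrate the gfp_zone() point, the dequeue path only walks the zones
permitted by the mask, roughly like this simplified sketch (cpuset and
node deduplication details omitted):

static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask,
                                               int nid, nodemask_t *nmask)
{
        struct zonelist *zonelist = node_zonelist(nid, gfp_mask);
        struct zoneref *z;
        struct zone *zone;
        struct page *page;

        /*
         * Without __GFP_MOVABLE, gfp_zone(gfp_mask) stops below
         * ZONE_MOVABLE, so free hugetlb pages sitting there are never
         * even considered - that is the limitation described above.
         */
        for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                        gfp_zone(gfp_mask), nmask) {
                page = dequeue_huge_page_node_exact(h, zone_to_nid(zone));
                if (page)
                        return page;
        }
        return NULL;
}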


Re: [PATCH v6 0/6] workingset protection/detection on the anonymous LRU list

2020-06-25 Thread Joonsoo Kim
On Wed, Jun 17, 2020 at 2:26 PM, wrote:
>
> From: Joonsoo Kim 
>
> Hello,
>
> This patchset implements workingset protection and detection on
> the anonymous LRU list.
>
> * Changes on v6
> - rework to reflect a new LRU balance model
> - remove memcg charge timing stuff on v5 since alternative is already
> merged on mainline
> - remove readahead stuff on v5 (reason is the same with above)
> - clear shadow entry if corresponding swap entry is deleted
> (mm/swapcache: support to handle the exceptional entries in swapcache)
> - change experiment environment
> (from ssd swap to ram swap, for fast evaluation and for reducing side-effect 
> of I/O)
> - update performance number

Hello, Johannes.

Could you review the v6 patchset?

Some minor things have changed, so it would be really welcome if you could
review the patchset again. In particular, "mm/swap: implement workingset
detection for anonymous LRU" hasn't received any ack yet. :)

Thanks.


Re: [PATCH v6 4/6] mm/swapcache: support to handle the exceptional entries in swapcache

2020-06-25 Thread Joonsoo Kim
On Fri, Jun 19, 2020 at 10:33 AM, Joonsoo Kim wrote:
>
> On Wed, Jun 17, 2020 at 05:17:17AM -0700, Matthew Wilcox wrote:
> > On Wed, Jun 17, 2020 at 02:26:21PM +0900, js1...@gmail.com wrote:
> > > From: Joonsoo Kim 
> > >
> > > Swapcache doesn't handle the exceptional entries since there is no case
> >
> > Don't call them exceptional entries.
> >
> > The radix tree has/had the concecpt of exceptional entries.  The swapcache
> > doesn't use the radix tree any more, it uses the XArray.  The XArray
> > has value entries.
> >
> > But you shouldn't call them value entries either; that's an XArray
> > concept.  The swap cache and indeed page cache use value entries to
> > implement shadow entries (they're also used to implement dax entries and
> > swap entries in the page cache).  So just call them shadow entries here.
> >
> > I know there are still places which use the term 'nrexceptional' in
> > the kernel.  I just haven't got round to replacing them yet.  Please
> > don't add more.
>
> Okay! Thanks for commenting.
>
> >
> > > +void clear_shadow_from_swap_cache(int type, unsigned long begin,
> > > +   unsigned long end)
> > > +{
> > > +   unsigned long curr;
> > > +   void *old;
> > > +   swp_entry_t entry = swp_entry(type, begin);
> > > +   struct address_space *address_space = swap_address_space(entry);
> > > +   XA_STATE(xas, &address_space->i_pages, begin);
> > > +
> > > +retry:
> > > +   xa_lock_irq(&address_space->i_pages);
> > > +   for (curr = begin; curr <= end; curr++) {
> > > +   entry = swp_entry(type, curr);
> > > +   if (swap_address_space(entry) != address_space) {
> > > +   xa_unlock_irq(&address_space->i_pages);
> > > +   address_space = swap_address_space(entry);
> > > +   begin = curr;
> > > +   xas_set(&xas, begin);
> > > +   goto retry;
> > > +   }
> > > +
> > > +   old = xas_load(&xas);
> > > +   if (!xa_is_value(old))
> > > +   continue;
> > > +   xas_store(&xas, NULL);
> > > +   address_space->nrexceptional--;
> > > +   xas_next(&xas);
> > > +   }
> > > +   xa_unlock_irq(&address_space->i_pages);
> > > +}
> >
> > This is a very clunky loop.  I'm not sure it's even right, given that
> > you change address space without changing the xas's address space.  How
> > about this?
>
> You are correct. The xas's address space should be changed.
>
>
> >   for (;;) {
>   XA_STATE(xas, &address_space->i_pages, begin);
> >   unsigned long nr_shadows = 0;
> >
>   xas_lock_irq(&xas);
>   xas_for_each(&xas, entry, end) {
> >   if (!xa_is_value(entry))
> >   continue;
>   xas_store(&xas, NULL);
> >   nr_shadows++;
> >   }
> >   address_space->nr_exceptionals -= nr_shadows;
>   xas_unlock_irq(&xas);
> >
> >   if (xas.xa_index >= end)
> >   break;
> >   entry = swp_entry(type, xas.xa_index);
> >   address_space = swap_address_space(entry);
> >   }
>
> Thanks for suggestion.
>
> I make a patch based on your suggestion. IIUC about Xarray,
> after running xas_for_each(), xas.xa_index can be less than the end if
> there are empty slots on last portion of array. Handling this case is
> also considered in my patch.

Hello, Matthew.

Could you check if the following patch (Xarray part) is correct?
Since I made a patch based on your suggestion, I'd like to get your review. :)

Thanks.

> Thanks.
>
> --->8
> From 72e97600ea294372a13ab8e208ebd3c0e1889408 Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim 
> Date: Fri, 15 Nov 2019 09:48:32 +0900
> Subject: [PATCH v6 4/6] mm/swapcache: support to handle the shadow entries
>
> Workingset detection for anonymous page will be implemented in the
> following patch and it requires to store the shadow entries into the
> swapcache. This patch implements an infrastructure to store the shadow
> entry in the swapcache.
>
> Acked-by: Johannes Weiner 
> Signed-off-by: Joonsoo Kim 
> ---
>  include/linux/swap.h | 17 
>  mm/shmem.c   |  3 ++-
>  mm/swap_state.c  | 57 
> +

Re: [PATCH v3 6/8] mm/gup: use a standard migration target allocation callback

2020-06-25 Thread Joonsoo Kim
On Thu, Jun 25, 2020 at 9:08 PM, Michal Hocko wrote:
>
> On Tue 23-06-20 15:13:46, Joonsoo Kim wrote:
> > From: Joonsoo Kim 
> >
> > There is a well-defined migration target allocation callback.
> > It's mostly similar with new_non_cma_page() except considering CMA pages.
> >
> > This patch adds a CMA consideration to the standard migration target
> > allocation callback and use it on gup.c.
>
> We already can express that by a missing __GFP_MOVABLE so I would rather
> not introduce a duplication in form of another flag.

I replied to this question in a previous email.

Thanks.


Re: [PATCH v3 5/8] mm/migrate: make a standard migration target allocation function

2020-06-25 Thread Joonsoo Kim
On Thu, Jun 25, 2020 at 9:05 PM, Michal Hocko wrote:
>
> On Tue 23-06-20 15:13:45, Joonsoo Kim wrote:
> > From: Joonsoo Kim 
> >
> > There are some similar functions for migration target allocation. Since
> > there is no fundamental difference, it's better to keep just one rather
> > than keeping all variants. This patch implements base migration target
> > allocation function. In the following patches, variants will be converted
> > to use this function.
> >
> > Note that PageHighmem() call in previous function is changed to open-code
> > "is_highmem_idx()" since it provides more readability.
>
> I was little bit surprised that alloc_migration_target still uses
> private argument while it only accepts migration_target_control
> structure pointer but then I have noticed that you are using it as a
> real callback in a later patch.
>
> > Signed-off-by: Joonsoo Kim 
>
> Few questions inline
> [...]
>
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 47b8ccb..820ea5e 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1648,9 +1648,13 @@ EXPORT_SYMBOL(unpoison_memory);
> >
> >  static struct page *new_page(struct page *p, unsigned long private)
> >  {
> > - int nid = page_to_nid(p);
> > + struct migration_target_control mtc = {
> > + .nid = page_to_nid(p),
> > + .nmask = &node_states[N_MEMORY],
>
> This could be .namsk = NULL, right? alloc_migration_target doesn't
> modify the node mask and NULL nodemask is always interpreted as all
> available nodes.

Will do.

> > + .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
> > + };
> >
> > - return new_page_nodemask(p, nid, &node_states[N_MEMORY]);
> > + return alloc_migration_target(p, (unsigned long)&mtc);
> >  }
> >
> [...]
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 634f1ea..3afff59 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1536,29 +1536,34 @@ int migrate_pages(struct list_head *from, 
> > new_page_t get_new_page,
> >   return rc;
> >  }
> >
> > -struct page *new_page_nodemask(struct page *page,
> > - int preferred_nid, nodemask_t *nodemask)
> > +struct page *alloc_migration_target(struct page *page, unsigned long 
> > private)
> >  {
> > - gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL;
> > + struct migration_target_control *mtc;
> > + gfp_t gfp_mask;
> >   unsigned int order = 0;
> >   struct page *new_page = NULL;
> > + int zidx;
> > +
> > + mtc = (struct migration_target_control *)private;
> > + gfp_mask = mtc->gfp_mask;
> >
> >   if (PageHuge(page)) {
> >   return alloc_huge_page_nodemask(
> > - page_hstate(compound_head(page)),
> > - preferred_nid, nodemask, 0, false);
> > + page_hstate(compound_head(page)), mtc->nid,
> > + mtc->nmask, gfp_mask, false);
> >   }
> >
> >   if (PageTransHuge(page)) {
> > + gfp_mask &= ~__GFP_RECLAIM;
>
> What's up with this gfp_mask modification?

THP page allocation uses the standard gfp masks GFP_TRANSHUGE_LIGHT and
GFP_TRANSHUGE. The __GFP_RECLAIM flags are a big part of this standard
mask design, so I clear them here so as not to disrupt the THP gfp mask.

> >   gfp_mask |= GFP_TRANSHUGE;
> >   order = HPAGE_PMD_ORDER;
> >   }
> > -
> > - if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
> > + zidx = zone_idx(page_zone(page));
> > + if (is_highmem_idx(zidx) || zidx == ZONE_MOVABLE)
> >   gfp_mask |= __GFP_HIGHMEM;
> >
> >   new_page = __alloc_pages_nodemask(gfp_mask, order,
> > - preferred_nid, nodemask);
> > + mtc->nid, mtc->nmask);
> >
> >   if (new_page && PageTransHuge(new_page))
> >   prep_transhuge_page(new_page);
> > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > index aec26d9..adba031 100644
> > --- a/mm/page_isolation.c
> > +++ b/mm/page_isolation.c
> > @@ -309,7 +309,11 @@ int test_pages_isolated(unsigned long start_pfn, 
> > unsigned long end_pfn,
> >
> >  struct page *alloc_migrate_target(struct page *page, unsigned long private)
> >  {
> > - int nid = page_to_nid(page);
> > + struct migration_target_control mtc = {
> > + .nid = page_to_nid(page),
> > + .nmask = &node_states[N_MEMORY],
>
> nmask = NULL again

Okay.

Thanks.


Re: [PATCH v3 4/8] mm/hugetlb: make hugetlb migration callback CMA aware

2020-06-25 Thread Joonsoo Kim
On Thu, Jun 25, 2020 at 8:54 PM, Michal Hocko wrote:
>
> On Tue 23-06-20 15:13:44, Joonsoo Kim wrote:
> > From: Joonsoo Kim 
> >
> > new_non_cma_page() in gup.c which try to allocate migration target page
> > requires to allocate the new page that is not on the CMA area.
> > new_non_cma_page() implements it by removing __GFP_MOVABLE flag. This way
> > works well for THP page or normal page but not for hugetlb page.
>
> Could you explain why? I mean why cannot you simply remove __GFP_MOVABLE
> flag when calling alloc_huge_page_nodemask and check for it in dequeue
> path?

If we remove __GFP_MOVABLE when calling alloc_huge_page_nodemask(), we
cannot use pages in ZONE_MOVABLE when dequeuing.

__GFP_MOVABLE is not only used as a CMA selector but also as a zone
selector. If we clear it, we cannot use pages in ZONE_MOVABLE even if
they are not CMA pages. For THP or normal page allocation, there is no
way to avoid this weakness without introducing another flag or argument.
To me, introducing another flag or argument for those functions looks
like over-engineering, so I don't change them and leave them as they are
(removing __GFP_MOVABLE).

But for alloc_huge_page_nodemask(), introducing a new argument doesn't
seem to be a problem since it is not a general function but just a
migration target allocation function.

If you agree with this argument, I will add more description to the patch.

Thanks.


Re: [PATCH v3 3/8] mm/hugetlb: unify migration callbacks

2020-06-25 Thread Joonsoo Kim
On Thu, Jun 25, 2020 at 8:26 PM, Michal Hocko wrote:
>
> On Tue 23-06-20 15:13:43, Joonsoo Kim wrote:
> > From: Joonsoo Kim 
> >
> > There is no difference between two migration callback functions,
> > alloc_huge_page_node() and alloc_huge_page_nodemask(), except
> > __GFP_THISNODE handling. This patch adds an argument, gfp_mask, on
> > alloc_huge_page_nodemask() and replace the callsite for
> > alloc_huge_page_node() with the call to
> > alloc_huge_page_nodemask(..., __GFP_THISNODE).
> >
> > It's safe to remove a node id check in alloc_huge_page_node() since
> > there is no caller passing NUMA_NO_NODE as a node id.
>
> Yes this is indeed safe. alloc_huge_page_node used to be called from
> other internal hugetlb allocation layer and that allowed NUMA_NO_NODE as
> well. Now it is called only from the mempolicy migration callback and
> that always specifies a node and want to stick with that node.
>
> But I have to say I really dislike the gfp semantic because it is
> different from any other allocation function I can think of. It
> specifies what to be added rather than what should be used.
>
> Removing the function is ok but please use the full gfp mask instead
> or if that is impractical for some reason (wich shouldn't be the case
> as htlb_alloc_mask should be trivial to make static inline) make it
> explicit that this is not a gfp_mask but a gfp modifier and explicitly
> state which modifiers are allowed.

Okay. I will try to address your concern. A concrete solution is not yet
prepared, but perhaps I will use the full gfp_mask by using
htlb_alloc_mask() at the caller sites.

Thanks.


Re: [PATCH v6 4/6] mm/swapcache: support to handle the exceptional entries in swapcache

2020-06-18 Thread Joonsoo Kim
On Wed, Jun 17, 2020 at 05:17:17AM -0700, Matthew Wilcox wrote:
> On Wed, Jun 17, 2020 at 02:26:21PM +0900, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> > 
> > Swapcache doesn't handle the exceptional entries since there is no case
> 
> Don't call them exceptional entries.
> 
> The radix tree has/had the concecpt of exceptional entries.  The swapcache
> doesn't use the radix tree any more, it uses the XArray.  The XArray
> has value entries.
> 
> But you shouldn't call them value entries either; that's an XArray
> concept.  The swap cache and indeed page cache use value entries to
> implement shadow entries (they're also used to implement dax entries and
> swap entries in the page cache).  So just call them shadow entries here.
> 
> I know there are still places which use the term 'nrexceptional' in
> the kernel.  I just haven't got round to replacing them yet.  Please
> don't add more.

Okay! Thanks for commenting.

> 
> > +void clear_shadow_from_swap_cache(int type, unsigned long begin,
> > +   unsigned long end)
> > +{
> > +   unsigned long curr;
> > +   void *old;
> > +   swp_entry_t entry = swp_entry(type, begin);
> > +   struct address_space *address_space = swap_address_space(entry);
> > +   XA_STATE(xas, &address_space->i_pages, begin);
> > +
> > +retry:
> > +   xa_lock_irq(&address_space->i_pages);
> > +   for (curr = begin; curr <= end; curr++) {
> > +   entry = swp_entry(type, curr);
> > +   if (swap_address_space(entry) != address_space) {
> > +   xa_unlock_irq(&address_space->i_pages);
> > +   address_space = swap_address_space(entry);
> > +   begin = curr;
> > +   xas_set(&xas, begin);
> > +   goto retry;
> > +   }
> > +
> > +   old = xas_load(&xas);
> > +   if (!xa_is_value(old))
> > +   continue;
> > +   xas_store(&xas, NULL);
> > +   address_space->nrexceptional--;
> > +   xas_next(&xas);
> > +   }
> > +   xa_unlock_irq(&address_space->i_pages);
> > +}
> 
> This is a very clunky loop.  I'm not sure it's even right, given that
> you change address space without changing the xas's address space.  How
> about this?

You are correct. The xas's address space should be changed.


>   for (;;) {
>   XA_STATE(xas, &address_space->i_pages, begin);
>   unsigned long nr_shadows = 0;
> 
>   xas_lock_irq(&xas);
>   xas_for_each(&xas, entry, end) {
>   if (!xa_is_value(entry))
>   continue;
>   xas_store(&xas, NULL);
>   nr_shadows++;
>   }
>   address_space->nr_exceptionals -= nr_shadows;
>   xas_unlock_irq(&xas);
> 
>   if (xas.xa_index >= end)
>   break;
>   entry = swp_entry(type, xas.xa_index);
>   address_space = swap_address_space(entry);
>   }

Thanks for the suggestion.

I made a patch based on your suggestion. IIUC about the XArray,
after running xas_for_each(), xas.xa_index can be less than the end if
there are empty slots in the last portion of the array. Handling this
case is also considered in my patch.

Thanks.

--->8
From 72e97600ea294372a13ab8e208ebd3c0e1889408 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim 
Date: Fri, 15 Nov 2019 09:48:32 +0900
Subject: [PATCH v6 4/6] mm/swapcache: support to handle the shadow entries

Workingset detection for anonymous page will be implemented in the
following patch and it requires to store the shadow entries into the
swapcache. This patch implements an infrastructure to store the shadow
entry in the swapcache.

Acked-by: Johannes Weiner 
Signed-off-by: Joonsoo Kim 
---
 include/linux/swap.h | 17 
 mm/shmem.c   |  3 ++-
 mm/swap_state.c  | 57 ++--
 mm/swapfile.c|  2 ++
 mm/vmscan.c  |  2 +-
 5 files changed, 69 insertions(+), 12 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index f4f5f94..901da54 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -416,9 +416,13 @@ extern struct address_space *swapper_spaces[];
 extern unsigned long total_swapcache_pages(void);
 extern void show_swap_cache_info(void);
 extern int add_to_swap(struct page *page);
-extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
-extern void __delete_from_swap_cache(struct page *, swp_entry_t entry);
+extern int add_to_swap_cache(struct page *page, swp_entry_t entry,
+   gfp_t gfp, void **shadowp);
+extern

Re: [PATCH for v5.8 2/3] mm/swap: fix for "mm: workingset: age nonresident information alongside anonymous pages"

2020-06-16 Thread Joonsoo Kim
On Wed, Jun 17, 2020 at 3:36 AM, Andrew Morton wrote:
>
> On Tue, 16 Jun 2020 15:16:43 +0900 js1...@gmail.com wrote:
>
> > Subject: [PATCH for v5.8 2/3] mm/swap: fix for "mm: workingset: age 
> > nonresident information alongside anonymous pages"
>
> I'm having trouble locating such a patch.
>
> > Non-file-lru page could also be activated in mark_page_accessed()
> > and we need to count this activation for nonresident_age.
> >
> > Note that it's better for this patch to be squashed into the patch
> > "mm: workingset: age nonresident information alongside anonymous pages".
> >
> > Signed-off-by: Joonsoo Kim 
> > ---
> >  mm/swap.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 667133d..c5d5114 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -443,8 +443,7 @@ void mark_page_accessed(struct page *page)
> >   else
> >   __lru_cache_activate_page(page);
> >   ClearPageReferenced(page);
> > - if (page_is_file_lru(page))
> > - workingset_activation(page);
> > + workingset_activation(page);
> >   }
> >   if (page_is_idle(page))
> >   clear_page_idle(page);
>
> AFAICT this patch Fixes: a528910e12ec7ee ("mm: thrash detection-based file
> cache sizing")?

No.

This patch could be squashed into the previous patch in this patchset,
"mm: workingset: age nonresident information alongside anonymous pages".
I intentionally did not squash them myself since review is required.

Thanks.


Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon

2020-06-16 Thread Joonsoo Kim
On Mon, Jun 15, 2020 at 10:41 PM, Johannes Weiner wrote:
>
> On Fri, Jun 12, 2020 at 12:19:58PM +0900, Joonsoo Kim wrote:
> > On Fri, Jun 5, 2020 at 12:06 AM, Johannes Weiner wrote:
> > >
> > > On Thu, Jun 04, 2020 at 03:35:27PM +0200, Vlastimil Babka wrote:
> > > > On 6/1/20 10:44 PM, Johannes Weiner wrote:
> > > > > From a8faceabc1454dfd878caee2a8422493d937a394 Mon Sep 17 00:00:00 2001
> > > > > From: Johannes Weiner 
> > > > > Date: Mon, 1 Jun 2020 14:04:09 -0400
> > > > > Subject: [PATCH] mm: workingset: let cache workingset challenge anon 
> > > > > fix
> > > >
> > > > Looks like the whole series is now in mainline (that was quick!) 
> > > > without this
> > > > followup? As such it won't be squashed so perhaps the subject should be 
> > > > more
> > > > "standalone" now, description referencing commit 34e58cac6d8f ("mm: 
> > > > workingset:
> > > > let cache workingset challenge anon"), possibly with Fixes: tag etc...
> > >
> > > Yep, I'll send a stand-alone version of this. It was a bit fat to get
> > > squashed last minute anyway, and I don't mind having a bit of extra
> > > time to quantify the actual impact of this delta.
> >
> > Hello, Johannes.
> >
> > Now, a week has passed. I hope that this patch is merged as soon as 
> > possible,
> > since my WIP patchset depends on this patch. How about trying to merge
> > this patch now? If you don't mind, I could send it on your behalf.
>
> I didn't realize you directly depended on it, my apologies.

No problem.

> Do you depend on it code-wise or do you have test case that shows a
> the difference?

Code-wise. My patchset also requires activation counting for the anon pages,
and a conflict could occur in this case.

> Here is the latest version again, you can include it in your patch
> series until we get it merged. But I think we should be able to merge
> it as a follow-up fix into 5.8 still.

Yes. I will send it separately for 5.8. And I will send some minor fixes, too.
I would appreciate it if you could review them.

Thanks.


Re: [External] Re: [PATCH 3/3] mm/slub: Fix release all resources used by a slab cache

2020-06-15 Thread Joonsoo Kim
On Mon, Jun 15, 2020 at 3:41 PM, Muchun Song wrote:
>
> On Mon, Jun 15, 2020 at 2:23 PM Joonsoo Kim  wrote:
> >
> > On Sun, Jun 14, 2020 at 9:39 PM, Muchun Song wrote:
> > >
> > > The function of __kmem_cache_shutdown() is that release all resources
> > > used by the slab cache, while currently it stop release resources when
> > > the preceding node is not empty.
> > >
> > > Signed-off-by: Muchun Song 
> > > ---
> > >  mm/slub.c | 7 ---
> > >  1 file changed, 4 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index b73505df3de2..4e477ef0f2b9 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -3839,6 +3839,7 @@ bool __kmem_cache_empty(struct kmem_cache *s)
> > >   */
> > >  int __kmem_cache_shutdown(struct kmem_cache *s)
> > >  {
> > > +   int ret = 0;
> > > int node;
> > > struct kmem_cache_node *n;
> > >
> > > @@ -3846,11 +3847,11 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
> > > /* Attempt to free all objects */
> > > for_each_kmem_cache_node(s, node, n) {
> > > free_partial(s, n);
> > > -   if (node_nr_slabs(n))
> > > -   return 1;
> > > +   if (!ret && node_nr_slabs(n))
> > > +   ret = 1;
> > > }
> >
> > I don't think that this is an improvement.
> >
> > If the shutdown condition isn't met, we don't need to process further.
> > Just 'return 1' looks okay to me.
> >
> > And, with this change, sysfs_slab_remove() is called even if the
> > shutdown is failed.
> > It's better not to have side effects when failing.
>
> If someone calls __kmem_cache_shutdown, he may want to release
> resources used by the slab cache as much as possible. If we continue,
> we may release more pages. From this point, is it an improvement?

My opinion is not strong, but I still think that it's not useful enough to
justify complicating the code.

If shutdown fails, it implies there are some bugs and someone should fix them.
Releasing more resources would mitigate the resource problem but doesn't
change the fact that someone should fix it soon.

Anyway, I won't object further if you don't agree with my opinion. In that case,
please fix it so that sysfs_slab_remove() is not called when ret is 1.

Thanks.


Re: [PATCH 3/3] mm/slub: Fix release all resources used by a slab cache

2020-06-15 Thread Joonsoo Kim
On Sun, Jun 14, 2020 at 9:39 PM, Muchun Song wrote:
>
> The function of __kmem_cache_shutdown() is that release all resources
> used by the slab cache, while currently it stop release resources when
> the preceding node is not empty.
>
> Signed-off-by: Muchun Song 
> ---
>  mm/slub.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index b73505df3de2..4e477ef0f2b9 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3839,6 +3839,7 @@ bool __kmem_cache_empty(struct kmem_cache *s)
>   */
>  int __kmem_cache_shutdown(struct kmem_cache *s)
>  {
> +   int ret = 0;
> int node;
> struct kmem_cache_node *n;
>
> @@ -3846,11 +3847,11 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
> /* Attempt to free all objects */
> for_each_kmem_cache_node(s, node, n) {
> free_partial(s, n);
> -   if (node_nr_slabs(n))
> -   return 1;
> +   if (!ret && node_nr_slabs(n))
> +   ret = 1;
> }

I don't think that this is an improvement.

If the shutdown condition isn't met, we don't need to process further.
Just 'return 1' looks okay to me.

And, with this change, sysfs_slab_remove() is called even if the shutdown fails.
It's better not to have side effects when failing.

Thanks.


Re: [PATCH 2/3] mm/slub: Use node_nr_slabs() instead of slabs_node()

2020-06-15 Thread Joonsoo Kim
On Sun, Jun 14, 2020 at 9:39 PM, Muchun Song wrote:
>
> In the some code, we already get the kmem_cache_node, so we can
> use node_nr_slabs() directly instead of slabs_node(). Check the
> condition of n->nr_partial can also be removed because we can get
> the correct result via node_nr_slabs().
>
> Signed-off-by: Muchun Song 

After applying this patch, there is no reference to slabs_node() left. Please remove it as well.

Thanks.


Re: [PATCH 1/3] mm/slub: Fix slabs_node return value when CONFIG_SLUB_DEBUG disabled

2020-06-15 Thread Joonsoo Kim
On Sun, Jun 14, 2020 at 9:39 PM, Muchun Song wrote:
>
> The slabs_node() always return zero when CONFIG_SLUB_DEBUG is disabled.
> But some codes determine whether slab is empty by checking the return
> value of slabs_node(). As you know, the result is not correct. This
> problem can be reproduce by the follow code(and boot system with the
> cmdline of "slub_nomerge"):
>
>   void *objs[32];
>   struct kmem_cache *cache = kmem_cache_create("kmem-test", 128, 0,
> 0, 0);
>
>   if (cache) {
> int i;
>
> /* Make a full slab */
> for (i = 0; i < ARRAY_SIZE(objs); i++)
> objs[i] = kmem_cache_alloc(cache, GFP_KERNEL_ACCOUNT);
>
> /*
>  * This really should fail because the slab cache still has
>  * objects. But we did destroy the @cache because of zero
>  * returned by slabs_node().
>  */
> kmem_cache_destroy(cache);
>   }
>
> To fix it, we can move the nr_slabs of kmem_cache_node out of the
> CONFIG_SLUB_DEBUG. So we can get the corrent value returned by the
> slabs_node().
>
> With this patch applied, we will get a warning message and stack
> trace in the dmesg.
>
> Signed-off-by: Muchun Song 
> ---
>  mm/slab.h |  2 +-
>  mm/slub.c | 80 
> +--
>  2 files changed, 43 insertions(+), 39 deletions(-)
>
> diff --git a/mm/slab.h b/mm/slab.h
> index 0b91f2a7b033..062d4542b7e2 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -619,8 +619,8 @@ struct kmem_cache_node {
>  #ifdef CONFIG_SLUB
> unsigned long nr_partial;
> struct list_head partial;
> -#ifdef CONFIG_SLUB_DEBUG
> atomic_long_t nr_slabs;
> +#ifdef CONFIG_SLUB_DEBUG
> atomic_long_t total_objects;
> struct list_head full;
>  #endif
> diff --git a/mm/slub.c b/mm/slub.c
> index 49b5cb7da318..1a3e6a5b7287 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c

Hello,

You also need to initialize nr_slabs in init_kmem_cache_node()
on !CONFIG_SLUB_DEBUG.
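
Something like this, for example (an illustrative sketch of the needed change,
not a tested patch):

static void
init_kmem_cache_node(struct kmem_cache_node *n)
{
	n->nr_partial = 0;
	spin_lock_init(&n->list_lock);
	INIT_LIST_HEAD(&n->partial);
	/* nr_slabs is now outside CONFIG_SLUB_DEBUG, so always initialize it */
	atomic_long_set(&n->nr_slabs, 0);
#ifdef CONFIG_SLUB_DEBUG
	atomic_long_set(&n->total_objects, 0);
	INIT_LIST_HEAD(&n->full);
#endif
}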

Otherwise, looks good to me.

Thanks.


Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon

2020-06-11 Thread Joonsoo Kim
On Fri, Jun 5, 2020 at 12:06 AM, Johannes Weiner wrote:
>
> On Thu, Jun 04, 2020 at 03:35:27PM +0200, Vlastimil Babka wrote:
> > On 6/1/20 10:44 PM, Johannes Weiner wrote:
> > > From a8faceabc1454dfd878caee2a8422493d937a394 Mon Sep 17 00:00:00 2001
> > > From: Johannes Weiner 
> > > Date: Mon, 1 Jun 2020 14:04:09 -0400
> > > Subject: [PATCH] mm: workingset: let cache workingset challenge anon fix
> >
> > Looks like the whole series is now in mainline (that was quick!) without 
> > this
> > followup? As such it won't be squashed so perhaps the subject should be more
> > "standalone" now, description referencing commit 34e58cac6d8f ("mm: 
> > workingset:
> > let cache workingset challenge anon"), possibly with Fixes: tag etc...
>
> Yep, I'll send a stand-alone version of this. It was a bit fat to get
> squashed last minute anyway, and I don't mind having a bit of extra
> time to quantify the actual impact of this delta.

Hello, Johannes.

Now, a week has passed. I hope that this patch is merged as soon as possible,
since my WIP patchset depends on this patch. How about trying to merge
this patch now? If you don't mind, I could send it on your behalf.

Thanks.


Re: [patch 113/131] mm: balance LRU lists based on relative thrashing

2020-06-09 Thread Joonsoo Kim
On Tue, Jun 9, 2020 at 11:46 PM, Johannes Weiner wrote:
>
> On Tue, Jun 09, 2020 at 05:15:33PM +0800, Alex Shi wrote:
> >
> >
> > On 2020/6/4 at 7:03 AM, Andrew Morton wrote:
> > >
> > > +   /* XXX: Move to lru_cache_add() when it supports new vs putback */
> >
> > Hi Hannes,
> >
> > Sorry for a bit lost, would you like to explain a bit more of your idea 
> > here?
> >
> > > +   spin_lock_irq(&page_pgdat(page)->lru_lock);
> > > +   lru_note_cost(page);
> > > +   spin_unlock_irq(&page_pgdat(page)->lru_lock);
> > > +
> >
> >
> > What could we see here w/o the lru_lock?
>
> It'll just be part of the existing LRU locking in
> pagevec_lru_move_fn(), when the new pages are added to the LRU in
> batch. See this older patch for example:
>
> https://lore.kernel.org/linux-mm/20160606194836.3624-6-han...@cmpxchg.org/
>
> I didn't include it in this series to reduce conflict with Joonsoo's
> WIP series that also operates in this area and does something similar:

Thanks!

> https://lkml.org/lkml/2020/4/3/63

I haven't completed the rebase of my series, but I guess that the referenced patch
"https://lkml.org/lkml/2020/4/3/63" would be removed in the next version.

Before the I/O cost model, a new anonymous page contributed to the LRU reclaim
balance. But now a new anonymous page doesn't contribute to the I/O cost,
so this adjusting patch would not be needed anymore.

If anyone wants to change this part,
"/* XXX: Move to lru_cache_add() when it supports new vs putback */", feel free
to do it.

Thanks.


Re: [PATCH v2 07/12] mm/hugetlb: do not modify user provided gfp_mask

2020-06-09 Thread Joonsoo Kim
On Tue, Jun 9, 2020 at 10:54 PM, Michal Hocko wrote:
>
> On Wed 27-05-20 15:44:58, Joonsoo Kim wrote:
> > From: Joonsoo Kim 
> >
> > It's not good practice to modify user input. Instead of using it to
> > build correct gfp_mask for APIs, this patch introduces another gfp_mask
> > field, __gfp_mask, for internal usage.
>
> Ugh, this is really ugly. It is just hugetlb to add __GFP_THISNODE as a
> special case. This is an ugly hack but I do not think we want to work
> around it by yet another hack. Moreover it seems that the __GFP_THISNODE
> might be not needed anymore as pointed out in a reply to earlier patch.

If you mean that the __GFP_THISNODE handling is ugly: as you pointed out,
the __GFP_THISNODE handling would be removed in the next version.

If you mean that introducing __gfp_mask is ugly: I will try to use a local variable
to keep the modified gfp_mask rather than introducing a field in alloc_control.
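
For example, something along these lines (illustrative only; the gfp_mask
parameter on the internal helpers is hypothetical here, since they would have
to take the mask instead of reading ac->gfp_mask):

struct page *alloc_huge_page_nodemask(struct hstate *h, struct alloc_control *ac)
{
	/* Build the working mask locally; ac->gfp_mask is never modified. */
	gfp_t gfp_mask = ac->gfp_mask | htlb_alloc_mask(h);
	struct page *page = NULL;

	spin_lock(&hugetlb_lock);
	if (h->free_huge_pages - h->resv_huge_pages > 0)
		page = dequeue_huge_page_nodemask(h, gfp_mask, ac->nid, ac->nmask);
	spin_unlock(&hugetlb_lock);

	if (!page)
		page = alloc_migrate_huge_page(h, gfp_mask, ac->nid, ac->nmask);

	return page;
}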

Thanks.


Re: [PATCH v2 08/12] mm/migrate: change the interface of the migration target alloc/free functions

2020-06-09 Thread Joonsoo Kim
On Tue, Jun 9, 2020 at 11:04 PM, Michal Hocko wrote:
>
> On Wed 27-05-20 15:44:59, Joonsoo Kim wrote:
> > From: Joonsoo Kim 
> >
> > To prepare unifying duplicated functions in following patches, this patch
> > changes the interface of the migration target alloc/free functions.
> > Functions now use struct alloc_control as an argument.
>
> It also pulls private argument into alloc_control and keeps it that way.
> Wouldn't it be better to use explicit types and names in a union? Each
> allocation callback has to understant the meaning anyway. I would
> consider the resulting code cleaner that way. What do you think?

Your suggestion sounds reasonable. Thanks.

My plan is that, as Vlastimil suggested, I will keep the private argument in
the migration callback and let the allocation caller pass the appropriate
private argument. There will be no private field on alloc_control.
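
Roughly, the unified callback would keep the standard new_page_t signature and
cast the private value back, something like this sketch (the details are
illustrative and may differ in v3):

struct page *alloc_migration_target(struct page *page, unsigned long private)
{
	struct alloc_control *ac = (struct alloc_control *)private;
	gfp_t gfp_mask = ac->gfp_mask;
	unsigned int order = 0;
	struct page *new_page;

	if (PageHuge(page))
		return alloc_huge_page_nodemask(page_hstate(compound_head(page)), ac);

	if (PageTransHuge(page)) {
		gfp_mask |= GFP_TRANSHUGE;
		order = HPAGE_PMD_ORDER;
	}

	new_page = __alloc_pages_nodemask(gfp_mask, order, ac->nid, ac->nmask);
	if (new_page && PageTransHuge(new_page))
		prep_transhuge_page(new_page);

	return new_page;
}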

Thanks.


Re: [PATCH v2 06/12] mm/hugetlb: make hugetlb migration target allocation APIs CMA aware

2020-06-09 Thread Joonsoo Kim
On Tue, Jun 9, 2020 at 10:53 PM, Michal Hocko wrote:
>
> On Wed 27-05-20 15:44:57, Joonsoo Kim wrote:
> > From: Joonsoo Kim 
> >
> > There is a user who do not want to use CMA memory for migration. Until
> > now, it is implemented by caller side but it's not optimal since there
> > is limited information on caller. This patch implements it on callee side
> > to get better result.
>
> I do not follow this changelog and honestly do not see an improvement.
> skip_cma in the alloc_control sound like a hack to me. I can now see

new_non_cma_page() wants to allocate a new page that is not in the CMA area.
new_non_cma_page() implements this by not specifying (or by removing) the
__GFP_MOVABLE mask.

hugetlb page allocation has two steps. The first is dequeuing from the pool. Then,
if there is no available page in the pool, it allocates from the page allocator.

new_non_cma_page() can control allocation from the page allocator in hugetlb
via the gfp flags. However, dequeuing cannot be controlled this way, so it
skips dequeuing completely. This is why new_non_cma_page() uses
alloc_migrate_huge_page() instead of alloc_huge_page_nodemask().

My patch makes the hugetlb code CMA aware so that new_non_cma_page()
can get the benefit of the hugetlb pool.
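
In other words, the dequeue path itself learns to skip CMA pages when the caller
asked for non-CMA memory. A minimal sketch (the skip_cma flag and where it gets
plumbed in are illustrative, not the final patch):

static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid,
						 bool skip_cma)
{
	struct page *page;

	list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
		/* Caller asked for non-CMA memory: leave CMA pages in the pool. */
		if (skip_cma && is_migrate_cma_page(page))
			continue;
		if (PageHWPoison(page))
			continue;

		list_move(&page->lru, &h->hugepage_activelist);
		set_page_refcounted(page);
		h->free_huge_pages--;
		h->free_huge_pages_node[nid]--;
		return page;
	}

	return NULL;
}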

> why your earlier patch has started to or the given gfp_mask. If anything
> this should be folded here. But even then I do not like a partial
> gfp_mask (__GFP_NOWARN on its own really has GFP_NOWAIT like semantic).

I will not use a partial gfp_mask.

Thanks.


Re: [PATCH v2 05/12] mm/hugetlb: unify hugetlb migration callback function

2020-06-09 Thread Joonsoo Kim
On Tue, Jun 9, 2020 at 10:43 PM, Michal Hocko wrote:
>
> On Wed 27-05-20 15:44:56, Joonsoo Kim wrote:
> [...]
> > -/* page migration callback function */
> >  struct page *alloc_huge_page_nodemask(struct hstate *h,
> >   struct alloc_control *ac)
> >  {
> >   ac->gfp_mask |= htlb_alloc_mask(h);
> > + if (ac->nid == NUMA_NO_NODE)
> > + ac->gfp_mask &= ~__GFP_THISNODE;
>
> Is this really needed? alloc_huge_page_node is currently only called
> from numa migration code and the target node should be always defined.

Thanks! When I read the code, I was not sure that the target node is always
defined, so I left this code in. However, if that's true, this code isn't
needed at all. I will consider your suggestion in the next version.

Thanks.


Re: [PATCH v2 04/12] mm/hugetlb: use provided ac->gfp_mask for allocation

2020-06-09 Thread Joonsoo Kim
On Tue, Jun 9, 2020 at 10:26 PM, Michal Hocko wrote:
>
> On Wed 27-05-20 15:44:55, Joonsoo Kim wrote:
> > From: Joonsoo Kim 
> >
> > gfp_mask handling on alloc_huge_page_(node|nodemask) is
> > slightly changed, from ASSIGN to OR. It's safe since caller of these
> > functions doesn't pass extra gfp_mask except htlb_alloc_mask().
> >
> > This is a preparation step for following patches.
>
> This patch on its own doesn't make much sense to me. Should it be folded
> in the patch which uses that?

Splitting this patch was requested by Roman. :)

Anyway, the next version will not have this patch since many things will be
changed.

Thanks.


Re: [PATCH v2 03/12] mm/hugetlb: introduce alloc_control structure to simplify migration target allocation APIs

2020-06-09 Thread Joonsoo Kim
On Tue, Jun 9, 2020 at 10:24 PM, Michal Hocko wrote:
>
> On Wed 27-05-20 15:44:54, Joonsoo Kim wrote:
> > From: Joonsoo Kim 
> >
> > Currently, page allocation functions for migration requires some arguments.
> > More worse, in the following patch, more argument will be needed to unify
> > the similar functions. To simplify them, in this patch, unified data
> > structure that controls allocation behaviour is introduced.
> >
> > For clean-up, function declarations are re-ordered.
>
> This is really hard to review without having a clear picture of the
> resulting code so bear with me. I can see some reasons why allocation
> callbacks might benefit from a agragated argument but you seem to touch
> the internal hugetlb dequeue_huge_page_vma which shouldn't really need
> that. I wouldn't mind much but I remember the hugetlb allocation
> functions layering is quite complex for hugetlb specific reasons (see
> 0c397daea1d4 ("mm, hugetlb: further simplify hugetlb allocation API")
> for more background).
>
> Is there any reason why the agregated argument cannot be limited only to
> migration callbacks. That would be alloc_huge_page_node, 
> alloc_huge_page_nodemask
> and alloc_huge_page_vma.

I did it that way since it was simple for me, but, yes, it's not good to touch
the internal functions.

Anyway, Vlastimil already suggested not introducing alloc_control into any
hugetlb functions. I will try that in the next version, so the next version
would not have alloc_control in any hugetlb function.

Thanks.


Re: [PATCH v5 00/10] workingset protection/detection on the anonymous LRU list

2020-06-02 Thread Joonsoo Kim
On Wed, Jun 3, 2020 at 12:57 PM, Suren Baghdasaryan wrote:
>
> On Wed, Apr 8, 2020 at 5:50 PM Joonsoo Kim  wrote:
> >
> > On Thu, Apr 9, 2020 at 1:55 AM, Vlastimil Babka wrote:
> > >
> > > On 4/3/20 7:40 AM, js1...@gmail.com wrote:
> > > > From: Joonsoo Kim 
> > > >
> > > > Hello,
> > > >
> > > > This patchset implements workingset protection and detection on
> > > > the anonymous LRU list.
> > >
> > > Hi!
> >
> > Hi!
> >
> > > > I did another test to show the performance effect of this patchset.
> > > >
> > > > - ebizzy (with modified random function)
> > > > ebizzy is the test program that main thread allocates lots of memory and
> > > > child threads access them randomly during the given times. Swap-in/out
> > > > will happen if allocated memory is larger than the system memory.
> > > >
> > > > The random function that represents the zipf distribution is used to
> > > > make hot/cold memory. Hot/cold ratio is controlled by the parameter. If
> > > > the parameter is high, hot memory is accessed much larger than cold one.
> > > > If the parameter is low, the number of access on each memory would be
> > > > similar. I uses various parameters in order to show the effect of
> > > > patchset on various hot/cold ratio workload.
> > > >
> > > > My test setup is a virtual machine with 8 cpus and 1024MB memory.
> > > >
> > > > Result format is as following.
> > > >
> > > > Parameter 0.1 ... 1.3
> > > > Allocated memory size
> > > > Throughput for base (larger is better)
> > > > Throughput for patchset (larger is better)
> > > > Improvement (larger is better)
> > > >
> > > >
> > > > * single thread
> > > >
> > > >  0.1  0.3  0.5  0.7  0.9  1.1  1.3
> > > > <512M>
> > > >   7009.0   7372.0   7774.0   8523.0   9569.0  10724.0  11936.0
> > > >   6973.0   7342.0   7745.0   8576.0   9441.0  10730.0  12033.0
> > > >-0.01 -0.0 -0.0 0.01-0.01  0.0 0.01
> > > > <768M>
> > > >915.0   1039.0   1275.0   1687.0   2328.0   3486.0   5445.0
> > > >920.0   1037.0   1238.0   1689.0   2384.0   3638.0   5381.0
> > > > 0.01 -0.0-0.03  0.0 0.02 0.04-0.01
> > > > <1024M>
> > > >425.0471.0539.0753.0   1183.0   2130.0   3839.0
> > > >414.0468.0553.0770.0   1242.0   2187.0   3932.0
> > > >-0.03-0.01 0.03 0.02 0.05 0.03 0.02
> > > > <1280M>
> > > >320.0346.0410.0556.0871.0   1654.0   3298.0
> > > >316.0346.0411.0550.0892.0   1652.0   3293.0
> > > >-0.01  0.0  0.0-0.01 0.02 -0.0 -0.0
> > > > <1536M>
> > > >273.0290.0341.0458.0733.0   1381.0   2925.0
> > > >271.0293.0344.0462.0740.0   1398.0   2969.0
> > > >-0.01 0.01 0.01 0.01 0.01 0.01 0.02
> > > > <2048M>
> > > > 77.0 79.0 95.0147.0276.0690.0   1816.0
> > > > 91.0 94.0115.0170.0321.0770.0   2018.0
> > > > 0.18 0.19 0.21 0.16 0.16 0.12 0.11
> > > >
> > > >
> > > > * multi thread (8)
> > > >
> > > >  0.1  0.3  0.5  0.7  0.9  1.1  1.3
> > > > <512M>
> > > >  29083.0  29648.0  30145.0  31668.0  33964.0  38414.0  43707.0
> > > >  29238.0  29701.0  30301.0  31328.0  33809.0  37991.0  43667.0
> > > > 0.01  0.0 0.01-0.01 -0.0-0.01 -0.0
> > > > <768M>
> > > >   3332.0   3699.0   4673.0   5830.0   8307.0  12969.0  17665.0
> > > >   3579.0   3992.0   4432.0   6111.0   8699.0  12604.0  18061.0
> > > > 0.07 0.08-0.05 0.05 0.05-0.03 0.02
> > > > <1024M>
> > > >   1921.0   2141.0   2484.0   3296.0   5391.0   8227.0  14574.0
> > > >   1989.0   2155.0   2609.0   3565.0   5463.0   8170.0  15642.0
> > > > 0.04 0.01 0.05 0.08 0.01-0.01 0.07
> > > > <1280M>
> > > >   1524.0   1625.0   1931.0   2581.0   4155.0   6959.0  1

Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon

2020-06-02 Thread Joonsoo Kim
On Wed, Jun 3, 2020 at 1:48 AM, Johannes Weiner wrote:
>
> On Tue, Jun 02, 2020 at 11:34:17AM +0900, Joonsoo Kim wrote:
> > On Tue, Jun 2, 2020 at 12:56 AM, Johannes Weiner wrote:
> > > On Mon, Jun 01, 2020 at 03:14:24PM +0900, Joonsoo Kim wrote:
> > > > But, I still think that modified refault activation equation isn't
> > > > safe. The next
> > > > problem I found is related to the scan ratio limit patch ("limit the 
> > > > range of
> > > > LRU type balancing") on this series. See the below example.
> > > >
> > > > anon: Hot (X M)
> > > > file: Hot (200 M) / dummy (200 M)
> > > > P: 1200 M (3 parts, each one 400 M, P1, P2, P3)
> > > > Access Pattern: A -> F(H) -> P1 -> A -> F(H) -> P2 -> ... ->
> > > >
> > > > Without this patch, A and F(H) are kept on the memory and look like
> > > > it's correct.
> > > >
> > > > With this patch and below fix, refault equation for Pn would be:
> > > >
> > > > Refault dist of Pn = 1200 (from file non-resident) + 1200 * anon scan
> > > > ratio (from anon non-resident)
> > > > anon + active file = X + 200
> > > > 1200 + 1200 * anon scan ratio (0.5 ~ 2) < X + 200
> > >
> > > That doesn't look quite right to me. The anon part of the refault
> > > distance is driven by X, so the left-hand of this formula contains X
> > > as well.
> > >
> > > 1000 file (1200M reuse distance, 200M in-core size) + F(H) reactivations 
> > > + X * scan ratio < X + 1000
> >
> > As I said before, there is no X on left-hand of this formula. To
> > access all Pn and
> > re-access P1, we need 1200M file list scan and reclaim. More scan isn't 
> > needed.
> > With your patch "limit the range of LRU type balancing", scan ratio
> > between file/anon
> > list is limited to 0.5 ~ 2.0, so, maximum anon scan would be 1200 M *
> > 2.0, that is,
> > 2400 M and not bounded by X. That means that file list cannot be
> > stable with some X.
>
> Oh, no X on the left because you're talking about the number of pages
> scanned until the first refaults, which is fixed - so why are we still
> interpreting the refault distance against a variable anon size X?

Looks like I was confused again. Your formula is correct and mine is wrong.
My mistake was that I thought your patch "limit the range of LRU type balancing",
which limits the scan *ratio* to 2:1, would also lead to an actual scan *count*
ratio between anon/file of 2:1. But now I realize that 2:1 is just the scan
ratio, and the actual scan *count* ratio could be far larger for certain list
sizes. It would be X * scan ratio in the above example, so my explanation is
wrong and you are right.

Sorry for the trouble.

> Well, that's misleading. We compare against anon because part of the
> cache is already encoded in the refault distance. What we're really
> checking is access distance against total amount of available RAM.
>
> Consider this. We want to activate pages where
>
> access_distance <= RAM
>
> and our measure of access distance is:
>
> access_distance = refault_distance + inactive_file
>
> So the comparison becomes:
>
> refault_distance + inactive_file < RAM
>
> which we simplify to:
>
> refault_distance < active_file + anon
>
> There is a certain threshold for X simply because there is a certain
> threshold for RAM beyond which we need to start activating. X cannot
> be arbitrary, it must be X + cache filling up memory - after all we
> have page reclaim evicting pages.
>
> Again, this isn't new. In the current code, we activate when:
>
> refault_distance < active_file
>
> which is
>
> access_distance <= RAM - anon
>
> You can see, whether things are stable or not always depends on the
> existing workingset size. It's just a proxy for how much total RAM we
> have potentially available to the refaulting page.
>
> > If my lastly found example is a correct example (your confirm is required),
> > it is also related to the correctness issue since cold pages causes
> > eviction of the hot pages repeatedly.
>
> I think your example is correct, but it benefits from the VM
> arbitrarily making an assumption that has a 50/50 shot of being true.
>
> You and I know which pages are hot and which are cold because you
> designed the example.
>
> All the VM sees is this:
>
> - We have an established workingset that has previously shown an
>   access distance <= RAM and therefor was activated.
>
> - We now have another set that also appear

Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon

2020-06-01 Thread Joonsoo Kim
On Tue, Jun 2, 2020 at 12:56 AM, Johannes Weiner wrote:
>
> On Mon, Jun 01, 2020 at 03:14:24PM +0900, Joonsoo Kim wrote:
> > On Sat, May 30, 2020 at 12:12 AM, Johannes Weiner wrote:
> > >
> > > On Fri, May 29, 2020 at 03:48:00PM +0900, Joonsoo Kim wrote:
> > > > On Fri, May 29, 2020 at 2:02 AM, Johannes Weiner wrote:
> > > > > On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote:
> > > > > > On Wed, May 27, 2020 at 10:43 PM, Johannes Weiner wrote:
> > > > > > > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > > > > The only way they could get reclaimed is if their access distance ends
> > > > > up bigger than the file cache. But if that's the case, then the
> > > > > workingset is overcommitted, and none of the pages qualify for reclaim
> > > > > protection. Picking a subset to protect against the rest is arbitrary.
> > > >
> > > > In the fixed example, although other file (500 MB) is repeatedly 
> > > > accessed,
> > > > it's not workingset. If we have unified list (file + anon), access 
> > > > distance of
> > > > Pn will be larger than whole memory size. Therefore, it's not 
> > > > overcommitted
> > > > workingset and this patch wrongly try to activate it. As I said before,
> > > > without considering inactive_age for anon list, this calculation can 
> > > > not be
> > > > correct.
> > >
> > > You're right. If we don't take anon age into account, the activations
> > > could be over-eager; however, so would counting IO cost and exerting
> > > pressure on anon be, which means my previous patch to split these two
> > > wouldn't fix fundamental the problem you're pointing out. We simply
> >
> > Splitting would not fix the fundamental problem (over-eager) but it would
> > greatly weaken the problem. Just counting IO cost doesn't break the
> > active/inactive separation in file list. It does cause more scan on anon 
> > list
> > but I think that it's endurable.
>
> I think the split is a good idea.
>
> The only thing I'm not sure yet is if we can get away without an
> additional page flag if the active flag cannot be reused to denote
> thrashing. I'll keep at it, maybe I can figure something out.
>
> But I think it would be follow-up work.
>
> > > have to take anon age into account for the refaults to be comparable.
> >
> > Yes, taking anon age into account is also a good candidate to fix the 
> > problem.
>
> Okay, good.
>
> > > However, your example cannot have a completely silent stable state. As
> > > we stop workingset aging, the refault distances will slowly increase
> > > again. We will always have a bit of churn, and rightfully so, because
> > > the workingset *could* go stale.
> > >
> > > That's the same situation in my cache-only example above. Anytime you
> > > have a subset of pages that by itself could fit into memory, but can't
> > > because of an established workingset, ongoing sampling is necessary.
> > >
> > > But the rate definitely needs to reduce as we detect that in-memory
> > > pages are indeed hot. Otherwise we cause more churn than is required
> > > for an appropriate rate of workingset sampling.
> > >
> > > How about the patch below? It looks correct, but I will have to re-run
> > > my tests to make sure I / we are not missing anything.
> >
> > Much better! It may solve my concern mostly.
>
> Okay thanks for confirming. I'll send a proper version to Andrew.

Okay.

> > But, I still think that modified refault activation equation isn't
> > safe. The next
> > problem I found is related to the scan ratio limit patch ("limit the range 
> > of
> > LRU type balancing") on this series. See the below example.
> >
> > anon: Hot (X M)
> > file: Hot (200 M) / dummy (200 M)
> > P: 1200 M (3 parts, each one 400 M, P1, P2, P3)
> > Access Pattern: A -> F(H) -> P1 -> A -> F(H) -> P2 -> ... ->
> >
> > Without this patch, A and F(H) are kept on the memory and look like
> > it's correct.
> >
> > With this patch and below fix, refault equation for Pn would be:
> >
> > Refault dist of Pn = 1200 (from file non-resident) + 1200 * anon scan
> > ratio (from anon non-resident)
> > anon + active file = X + 200
> > 1200 + 1200 * anon scan ratio (0.5 ~ 2) < X + 200
>
> That doesn't look quite right to me. The anon part of the refault
> distance is 

Re: [PATCH v2 00/12] clean-up the migration target allocation functions

2020-06-01 Thread Joonsoo Kim
On Fri, May 29, 2020 at 3:50 PM, Joonsoo Kim wrote:
>
> On Fri, May 29, 2020 at 4:25 AM, Vlastimil Babka wrote:
> >
> > On 5/27/20 8:44 AM, js1...@gmail.com wrote:
> > > From: Joonsoo Kim 
> > >
> > > This patchset clean-up the migration target allocation functions.
> > >
> > > * Changes on v2
> > > - add acked-by tags
> > > - fix missing compound_head() call for the patch #3
> > > - remove thisnode field on alloc_control and use __GFP_THISNODE directly
> > > - fix missing __gfp_mask setup for the patch
> > > "mm/hugetlb: do not modify user provided gfp_mask"
> > >
> > > * Cover-letter
> > >
> > > Contributions of this patchset are:
> > > 1. unify two hugetlb alloc functions. As a result, one is remained.
> > > 2. make one external hugetlb alloc function to internal one.
> > > 3. unify three functions for migration target allocation.
> > >
> > > The patchset is based on next-20200526.
> > > The patchset is available on:
> >
> > I went through the series and I'd like to make some high-level suggestions
> > first, that should hopefully simplify the code a bit more and reduce churn:
>
> Thanks for review!
> I have not enough time today to check your suggestions.
> I will check on next week and then reply again.
>
> Thanks.
>
> > - in the series, alloc_huge_page_nodemask() becomes the only caller of
> > alloc_migrate_huge_page(). So you can inline the code there, and it's one 
> > less
> > function out of many with similar name :)
> >
> > - after that, alloc_huge_page_nodemask(ac) uses ac mostly just to extract
> > individual fields, and only pass it as a whole to 
> > dequeue_huge_page_nodemask().
> > The only other caller of dequeue...() is dequeue_huge_page_vma() who has to
> > construct ac from scratch. It might be probably simpler not to introduce 
> > struct
> > alloc_control into hugetlb code at all, and only keep it for
> > alloc_migrate_target(), at which point it can have a more specific name as
> > discussed and there's less churn
> >
> > - I'd suggest not change signature of migrate_pages(), free_page_t and
> > new_page_t, keeping the opaque private field is fine as not all callbacks 
> > use
> > struct alloc_context pointer, and then e.g. compaction_alloc has to use the
> > private field etc. alloc_migration_target() can simply cast the private to
> > struct alloc_control *ac as the first thing

Looks like all your suggestions are reasonable. I will try them and make v3.

Thanks.


Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon

2020-06-01 Thread Joonsoo Kim
On Sat, May 30, 2020 at 12:12 AM, Johannes Weiner wrote:
>
> On Fri, May 29, 2020 at 03:48:00PM +0900, Joonsoo Kim wrote:
> > On Fri, May 29, 2020 at 2:02 AM, Johannes Weiner wrote:
> > > On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote:
> > > > On Wed, May 27, 2020 at 10:43 PM, Johannes Weiner wrote:
> > > > > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > > *It would require another page flag to tell whether a refaulting cache
> > > page has challenged the anon set once (transitioning) or repeatedly
> > > (thrashing), as we currently use the active state for that. If we
> > > would repurpose PG_workingset to tell the first from the second
> > > refault, we'd need a new flag to mark a page for memstall accounting.
> >
> > I don't understand why a new flag is needed. Whenever we found that
> > challenge is needed (dist < active + anon), we need to add up IO cost.
>
> It sounds like this was cleared up later on in the email.
>
> > > > It could cause thrashing for your patch.
> > > > Without the patch, current logic try to
> > > > find most hottest file pages that are fit into the current file list
> > > > size and protect them
> > > > successfully. Assume that access distance of 50 MB hot file pages is 60 
> > > > MB
> > > > which is less than whole file list size but larger than inactive list
> > > > size. Without
> > > > your patch, 50 MB (hot) pages are not evicted at all. All these hot
> > > > pages will be
> > > > protected from the 100MB low access frequency pages. 100 MB low access
> > > > frequency pages will be refaulted repeatedely but it's correct 
> > > > behaviour.
> > >
> > > Hm, something doesn't quite add up. Why is the 50M hot set evicted
> > > with my patch?
> >
> > Thanks for kind explanation. I read all and I found that I was confused 
> > before.
> > Please let me correct the example.
> >
> > Environment:
> > anon: 500 MB (hot) / 500 MB (hot)
> > file: 50 MB (so hot) / 50 MB (dummy)
> >
> > I will call 50 MB file hot pages as F(H).
> > Let's assume that periodical access to other file (500 MB) is started. That
> > file consists of 5 parts and each one is 100 MB. I will call it P1, P2, 
> > ..., P5.
> >
> > Problem will occur on following access pattern.
> >
> > F(H) -> P1 -> F(H) -> P2 -> ... -> F(H) -> P5 -> F(H) -> P1 -> ...
> >
> > With this access pattern, access distance of F(H) and Pn is:
> >
> > F(H) = 150 MB
> > Pn = 750 MB
> >
> > Without your patch, F(H) is kept on the memory since deactivation would not
> > happen. However, with your patch, deactivation happens since Pn's refault
> > distance is less than 'active file + anon'. In the end, F(H) would be 
> > finally
> > evicted.
>
> Okay, this example makes sense to me.
>
> I do think P needs to challenge the workingset - at first. P could
> easily fit into memory by itself if anon and active_file were cold, so
> we need to challenge them to find out that they're hot. As you can
> see, if anon and F(H) were completely unused, the current behavior
> would be incorrect.
>
> The current behavior would do the same in a cache-only example:
>
> anon = 1G (unreclaimable)
> file = 500M (hot) / 300M (dummy)
>
> P = 400M
>
> F(H) -> P1 -> F(H) -> P2 ...
>
> If F(H) is already active, the first P refaults would have a distance
> of 100M, thereby challenging F(H). As F(H) reactivates, its size will
> be reflected in the refault distances, pushing them beyond the size of
> memory that is available to the cache: 600M refault distance > 500M
> active cache, or 900M access distance > 800M cache space.

Hmm... It seems that the current behavior (before your patch) for this
example has no problem. It causes deactivation but doesn't cause eviction
so there is no workingset thrashing.

> However, I agree with your observation about the anon age below. When
> we start aging the anon set, we have to reflect that in the refault
> distances. Once we know that the 1G anon pages are indeed hotter than
> the pages in P, there is no reason to keep churning the workingset.

Okay.

> > > The only way they could get reclaimed is if their access distance ends
> > > up bigger than the file cache. But if that's the case, then the
> > > workingset is overcommitted, and none of the pages qualify for reclaim
> > > protection. Picking a subset to protect against the rest is arbitrary.
> >
> >

Re: [PATCH v2 00/12] clean-up the migration target allocation functions

2020-05-29 Thread Joonsoo Kim
On Fri, May 29, 2020 at 4:25 AM, Vlastimil Babka wrote:
>
> On 5/27/20 8:44 AM, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> >
> > This patchset clean-up the migration target allocation functions.
> >
> > * Changes on v2
> > - add acked-by tags
> > - fix missing compound_head() call for the patch #3
> > - remove thisnode field on alloc_control and use __GFP_THISNODE directly
> > - fix missing __gfp_mask setup for the patch
> > "mm/hugetlb: do not modify user provided gfp_mask"
> >
> > * Cover-letter
> >
> > Contributions of this patchset are:
> > 1. unify two hugetlb alloc functions. As a result, one is remained.
> > 2. make one external hugetlb alloc function to internal one.
> > 3. unify three functions for migration target allocation.
> >
> > The patchset is based on next-20200526.
> > The patchset is available on:
>
> I went through the series and I'd like to make some high-level suggestions
> first, that should hopefully simplify the code a bit more and reduce churn:

Thanks for the review!
I don't have enough time today to check your suggestions.
I will check them next week and then reply again.

Thanks.

> - in the series, alloc_huge_page_nodemask() becomes the only caller of
> alloc_migrate_huge_page(). So you can inline the code there, and it's one less
> function out of many with similar name :)
>
> - after that, alloc_huge_page_nodemask(ac) uses ac mostly just to extract
> individual fields, and only pass it as a whole to 
> dequeue_huge_page_nodemask().
> The only other caller of dequeue...() is dequeue_huge_page_vma() who has to
> construct ac from scratch. It might be probably simpler not to introduce 
> struct
> alloc_control into hugetlb code at all, and only keep it for
> alloc_migrate_target(), at which point it can have a more specific name as
> discussed and there's less churn
>
> - I'd suggest not change signature of migrate_pages(), free_page_t and
> new_page_t, keeping the opaque private field is fine as not all callbacks use
> struct alloc_context pointer, and then e.g. compaction_alloc has to use the
> private field etc. alloc_migration_target() can simply cast the private to
> struct alloc_control *ac as the first thing


Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon

2020-05-29 Thread Joonsoo Kim
On Fri, May 29, 2020 at 2:02 AM, Johannes Weiner wrote:
>
> On Thu, May 28, 2020 at 04:16:50PM +0900, Joonsoo Kim wrote:
> > On Wed, May 27, 2020 at 10:43 PM, Johannes Weiner wrote:
> > >
> > > On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > > > On Thu, May 21, 2020 at 8:26 AM, Johannes Weiner wrote:
> > > > >
> > > > > We activate cache refaults with reuse distances in pages smaller than
> > > > > the size of the total cache. This allows new pages with competitive
> > > > > access frequencies to establish themselves, as well as challenge and
> > > > > potentially displace pages on the active list that have gone cold.
> > > > >
> > > > > However, that assumes that active cache can only replace other active
> > > > > cache in a competition for the hottest memory. This is not a great
> > > > > default assumption. The page cache might be thrashing while there are
> > > > > enough completely cold and unused anonymous pages sitting around that
> > > > > we'd only have to write to swap once to stop all IO from the cache.
> > > > >
> > > > > Activate cache refaults when their reuse distance in pages is smaller
> > > > > than the total userspace workingset, including anonymous pages.
> > > >
> > > > Hmm... I'm not sure the correctness of this change.
> > > >
> > > > IIUC, this patch leads to more activations in the file list and more 
> > > > activations
> > > > here will challenge the anon list since rotation ratio for the file
> > > > list will be increased.
> > >
> > > Yes.
> > >
> > > > However, this change breaks active/inactive concept of the file list.
> > > > active/inactive
> > > > separation is implemented by in-list refault distance. anon list size 
> > > > has
> > > > no direct connection with refault distance of the file list so using
> > > > anon list size
> > > > to detect workingset for file page breaks the concept.
> > >
> > > This is intentional, because there IS a connection: they both take up
> > > space in RAM, and they both cost IO to bring back once reclaimed.
> >
> > I know that. This is the reason that I said 'no direct connection'. The anon
> > list size is directly related the *possible* file list size. But,
> > active/inactive
> > separation in one list is firstly based on *current* list size rather than
> > the possible list size. Adding anon list size to detect workingset means
> > to use the possible list size and I think that it's wrong.
>
> Hm so you're saying we should try to grow the page cache, but maintain
> an inactive/active balance within the cache based on its current size
> in case growing is not possible.
>
> I think it would be implementable*, but I'm not quite convinced that
> it's worth distinguishing. We're talking about an overcommitted
> workingset and thereby an inherently unstable state that may thrash
> purely based on timing differences anyway. See below.

I think that this patch wrongly judges the workingset as overcommitted.
See the new example below in this reply.

> *It would require another page flag to tell whether a refaulting cache
> page has challenged the anon set once (transitioning) or repeatedly
> (thrashing), as we currently use the active state for that. If we
> would repurpose PG_workingset to tell the first from the second
> refault, we'd need a new flag to mark a page for memstall accounting.

I don't understand why a new flag is needed. Whenever we find that a
challenge is needed (dist < active + anon), we need to add up the IO cost.

> > > When file is refaulting, it means we need to make more space for
> > > cache. That space can come from stale active file pages. But what if
> > > active cache is all hot, and meanwhile there are cold anon pages that
> > > we could swap out once and then serve everything from RAM?
> > >
> > > When file is refaulting, we should find the coldest data that is
> > > taking up RAM and kick it out. It doesn't matter whether it's file or
> > > anon: the goal is to free up RAM with the least amount of IO risk.
> >
> > I understand your purpose and agree with it. We need to find a solution.
> > To achieve your goal, my suggestion is:
> >
> > - refault distance < active file, then do activation and add up IO cost
> > - refault distance < active file + anon list, then add up IO cost
> >
> > This doesn't break workingset detection on file list and challenge

Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon

2020-05-28 Thread Joonsoo Kim
On Wed, May 27, 2020 at 10:43 PM, Johannes Weiner wrote:
>
> On Wed, May 27, 2020 at 11:06:47AM +0900, Joonsoo Kim wrote:
> > On Thu, May 21, 2020 at 8:26 AM, Johannes Weiner wrote:
> > >
> > > We activate cache refaults with reuse distances in pages smaller than
> > > the size of the total cache. This allows new pages with competitive
> > > access frequencies to establish themselves, as well as challenge and
> > > potentially displace pages on the active list that have gone cold.
> > >
> > > However, that assumes that active cache can only replace other active
> > > cache in a competition for the hottest memory. This is not a great
> > > default assumption. The page cache might be thrashing while there are
> > > enough completely cold and unused anonymous pages sitting around that
> > > we'd only have to write to swap once to stop all IO from the cache.
> > >
> > > Activate cache refaults when their reuse distance in pages is smaller
> > > than the total userspace workingset, including anonymous pages.
> >
> > Hmm... I'm not sure the correctness of this change.
> >
> > IIUC, this patch leads to more activations in the file list and more 
> > activations
> > here will challenge the anon list since rotation ratio for the file
> > list will be increased.
>
> Yes.
>
> > However, this change breaks active/inactive concept of the file list.
> > active/inactive
> > separation is implemented by in-list refault distance. anon list size has
> > no direct connection with refault distance of the file list so using
> > anon list size
> > to detect workingset for file page breaks the concept.
>
> This is intentional, because there IS a connection: they both take up
> space in RAM, and they both cost IO to bring back once reclaimed.

I know that. This is the reason that I said 'no direct connection'. The anon
list size is directly related to the *possible* file list size. But
active/inactive separation within one list is primarily based on the *current*
list size rather than the possible list size. Adding the anon list size to
detect the workingset means using the possible list size, and I think that's wrong.

> When file is refaulting, it means we need to make more space for
> cache. That space can come from stale active file pages. But what if
> active cache is all hot, and meanwhile there are cold anon pages that
> we could swap out once and then serve everything from RAM?
>
> When file is refaulting, we should find the coldest data that is
> taking up RAM and kick it out. It doesn't matter whether it's file or
> anon: the goal is to free up RAM with the least amount of IO risk.

I understand your purpose and agree with it. We need to find a solution.
To achieve your goal, my suggestion is:

- refault distance < active file, then do activation and add up IO cost
- refault distance < active file + anon list, then add up IO cost

This doesn't break workingset detection on the file list and challenges
the anon list to the same degree as your patch does.
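
In workingset_refault() terms, the suggestion is a fragment like this (using
names in the style of mm/workingset.c; 'anon' here would be a new local for the
anon LRU size; a sketch of the idea, not a tested patch):

	refault_distance = (refault - eviction) & EVICTION_MASK;

	/* Completely cold: neither activate nor charge IO cost. */
	if (refault_distance >= active_file + anon)
		goto out;

	/*
	 * The refault challenges file + anon: mark it as thrashing so its
	 * refault IO gets charged (PageWorkingset feeds memstall accounting).
	 */
	SetPageWorkingset(page);

	/* But only an in-file-list distance justifies activation. */
	if (refault_distance < active_file) {
		SetPageActive(page);
		inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE);
	}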

> Remember that the file/anon split, and the inactive/active split, are
> there to optimize reclaim. It doesn't mean that these memory pools are
> independent from each other.
>
> The file list is split in two because of use-once cache. The anon and
> file lists are split because of different IO patterns, because we may
> not have swap etc. But once we are out of use-once cache, have swap
> space available, and have corrected for the different cost of IO,
> there needs to be a relative order between all pages in the system to
> find the optimal candidates to reclaim.
>
> > My suspicion is started by this counter example.
> >
> > Environment:
> > anon: 500 MB (so hot) / 500 MB (so hot)
> > file: 50 MB (hot) / 50 MB (cold)
> >
> > Think about the situation that there is periodical access to other file 
> > (100 MB)
> > with low frequency (refault distance is 500 MB)
> >
> > Without your change, this periodical access doesn't make thrashing for 
> > cached
> > active file page since refault distance of periodical access is larger
> > than the size of
> > the active file list. However, with your change, it causes thrashing
> > on the file list.
>
> It doesn't cause thrashing. It causes scanning because that 100M file
> IS thrashing: with or without my patch, that refault IO is occuring.

It could cause thrashing with your patch. Without the patch, the current logic
tries to find the hottest file pages that fit into the current file list size
and protects them successfully. Assume that the access distance of the 50 MB hot
file pages is 60 MB, which is less than the whole file list size but larger than
the inactive list size. Without your patch, the 50 MB (hot) pages are not evicted
at all. All these hot pages will be protected from the 100 MB low-access-frequency
pages. The 100 MB low-access-frequency pages will be refaulted repeatedly, but
that's correct behaviour.

Re: [PATCH 01/11] mm/page_isolation: prefer the node of the source page

2020-05-27 Thread Joonsoo Kim
> > >  mm/page_isolation.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > > index 2c11a38..7df89bd 100644
> > > --- a/mm/page_isolation.c
> > > +++ b/mm/page_isolation.c
> > > @@ -300,5 +300,7 @@ int test_pages_isolated(unsigned long start_pfn, 
> > > unsigned long end_pfn,
> > >
> > >  struct page *alloc_migrate_target(struct page *page, unsigned long 
> > > private)
> > >  {
> > > - return new_page_nodemask(page, numa_node_id(), 
> > > &node_states[N_MEMORY]);
> > > + int nid = page_to_nid(page);
> > > +
> > > + return new_page_nodemask(page, nid, &node_states[N_MEMORY]);
> >
> > Why not new_page_nodemask(page, page_to_nid(page), &node_states[N_MEMORY]) ?
> > Still fits into 80 characters.
>
> It's just my preference not directly using function call in argument
> as much as possible.
> If you don't like it, I will change it as you suggested. However,
> since alloc_migrate_target()
> will be removed in following patch, there will be no difference in the end.

Oops... I found that your suggestion exceeds 80 characters.
I will leave the patch as it is.

Thanks.


Re: [PATCH 05/14] mm: workingset: let cache workingset challenge anon

2020-05-26 Thread Joonsoo Kim
On Thu, May 21, 2020 at 8:26 AM, Johannes Weiner wrote:
>
> We activate cache refaults with reuse distances in pages smaller than
> the size of the total cache. This allows new pages with competitive
> access frequencies to establish themselves, as well as challenge and
> potentially displace pages on the active list that have gone cold.
>
> However, that assumes that active cache can only replace other active
> cache in a competition for the hottest memory. This is not a great
> default assumption. The page cache might be thrashing while there are
> enough completely cold and unused anonymous pages sitting around that
> we'd only have to write to swap once to stop all IO from the cache.
>
> Activate cache refaults when their reuse distance in pages is smaller
> than the total userspace workingset, including anonymous pages.

Hmm... I'm not sure about the correctness of this change.

IIUC, this patch leads to more activations in the file list, and more activations
here will challenge the anon list since the rotation ratio for the file list
will be increased.

However, this change breaks the active/inactive concept of the file list. The
active/inactive separation is implemented by the in-list refault distance. The
anon list size has no direct connection with the refault distance of the file
list, so using the anon list size to detect the workingset for file pages breaks
the concept.

My suspicion started with this counter example.

Environment:
anon: 500 MB (so hot) / 500 MB (so hot)
file: 50 MB (hot) / 50 MB (cold)

Think about the situation where there is periodical access to another file (100 MB)
with low frequency (refault distance is 500 MB).

Without your change, this periodical access doesn't cause thrashing for the cached
active file pages, since the refault distance of the periodical access is larger
than the size of the active file list. However, with your change, it causes
thrashing on the file list.
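
Plugging the numbers in (a standalone arithmetic sketch, not kernel code):

#include <stdio.h>

int main(void)
{
	long refault_distance = 500;	/* periodical 100 MB file, 500 MB reuse distance */
	long active_file = 50;		/* the hot file set */
	long anon = 500 + 500;		/* both anon halves, all hot */

	/* existing check vs. the check with this patch */
	printf("before: activate = %d\n", refault_distance < active_file);		/* 0 */
	printf("after:  activate = %d\n", refault_distance < active_file + anon);	/* 1 */
	return 0;
}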

In fact, I'm not sure that I'm thinking correctly. It's very
complicated for me. :)
Please let me know if there is a missing piece.

Thanks.


Re: [PATCH 06/11] mm/hugetlb: do not modify user provided gfp_mask

2020-05-22 Thread Joonsoo Kim
On Fri, May 22, 2020 at 7:19 AM, Mike Kravetz wrote:
>
> On 5/17/20 6:20 PM, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> >
> > It's not good practice to modify user input. Instead of using it to
> > build correct gfp_mask for APIs, this patch introduces another gfp_mask
> > field, __gfp_mask, for internal usage.
>
> Modifying the flags as is done in the existing code does not bother me
> too much, but that is just my opinion.  Adding __gfp_mask for modifications
> is fine with me if others think it is a good thing.

With the following patches, in some cases, ac->gfp_mask is set up once and
used many times. If we modify ac->gfp_mask in place, there would be side effects.

> Does dequeue_huge_page_vma() need to be modified so that it will set
> ac.__gfp_mask before calling dequeue_huge_page_nodemask()?

Good catch! Will change!

Thanks.


Re: [PATCH 04/11] mm/hugetlb: unify hugetlb migration callback function

2020-05-22 Thread Joonsoo Kim
On Fri, May 22, 2020 at 5:54 AM, Mike Kravetz wrote:
>
> On 5/17/20 6:20 PM, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> >
> > There is no difference between two migration callback functions,
> > alloc_huge_page_node() and alloc_huge_page_nodemask(), except
> > __GFP_THISNODE handling. This patch adds one more field on to
> > the alloc_control and handles this exception.
> >
> > Signed-off-by: Joonsoo Kim 
> > ---
> >  include/linux/hugetlb.h |  8 
> >  mm/hugetlb.c| 23 ++-
> >  mm/internal.h   |  1 +
> >  mm/mempolicy.c  |  3 ++-
> >  4 files changed, 5 insertions(+), 30 deletions(-)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 6da217e..4892ed3 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -505,8 +505,6 @@ struct huge_bootmem_page {
> >
> >  struct page *alloc_migrate_huge_page(struct hstate *h,
> >   struct alloc_control *ac);
> > -struct page *alloc_huge_page_node(struct hstate *h,
> > - struct alloc_control *ac);
> >  struct page *alloc_huge_page_nodemask(struct hstate *h,
> >   struct alloc_control *ac);
> >  struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct 
> > *vma,
> > @@ -755,12 +753,6 @@ static inline void huge_ptep_modify_prot_commit(struct 
> > vm_area_struct *vma,
> >  struct hstate {};
> >
> >  static inline struct page *
> > -alloc_huge_page_node(struct hstate *h, struct alloc_control *ac)
> > -{
> > - return NULL;
> > -}
> > -
> > -static inline struct page *
> >  alloc_huge_page_nodemask(struct hstate *h, struct alloc_control *ac)
> >  {
> >   return NULL;
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 859dba4..60b0983 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1981,31 +1981,12 @@ struct page *alloc_buddy_huge_page_with_mpol(struct 
> > hstate *h,
> >  }
> >
> >  /* page migration callback function */
> > -struct page *alloc_huge_page_node(struct hstate *h,
> > - struct alloc_control *ac)
> > -{
> > - struct page *page = NULL;
> > -
> > - ac->gfp_mask |= htlb_alloc_mask(h);
> > - if (ac->nid != NUMA_NO_NODE)
> > - ac->gfp_mask |= __GFP_THISNODE;
> > -
> > - spin_lock(&hugetlb_lock);
> > - if (h->free_huge_pages - h->resv_huge_pages > 0)
> > - page = dequeue_huge_page_nodemask(h, ac);
> > - spin_unlock(&hugetlb_lock);
> > -
> > - if (!page)
> > - page = alloc_migrate_huge_page(h, ac);
> > -
> > - return page;
> > -}
> > -
> > -/* page migration callback function */
> >  struct page *alloc_huge_page_nodemask(struct hstate *h,
> >   struct alloc_control *ac)
> >  {
> >   ac->gfp_mask |= htlb_alloc_mask(h);
> > + if (ac->thisnode && ac->nid != NUMA_NO_NODE)
> > + ac->gfp_mask |= __GFP_THISNODE;
> >
> >   spin_lock(&hugetlb_lock);
> >   if (h->free_huge_pages - h->resv_huge_pages > 0) {
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 75b3f8e..574722d0 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -618,6 +618,7 @@ struct alloc_control {
> >   int nid;
> >   nodemask_t *nmask;
> >   gfp_t gfp_mask;
> > + bool thisnode;
>
> I wonder if the new field is necessary?
>
> IIUC, it simply prevents the check for NUMA_NO_NODE and possible setting
> of __GFP_THISNODE in previous alloc_huge_page_nodemask() calling sequences.
> However, it appears that node (preferred_nid) is always set to something
> other than NUMA_NO_NODE in those callers.

Okay. I will check it.

> It obviously makes sense to add the field to guarantee no changes to
> functionality while making the conversions.  However, it it is not really
> necessary then it may cause confusion later.

Agreed.

Thanks.


Re: [PATCH 03/11] mm/hugetlb: introduce alloc_control structure to simplify migration target allocation APIs

2020-05-21 Thread Joonsoo Kim
On Fri, May 22, 2020 at 3:57 AM, Mike Kravetz wrote:
>
> On 5/17/20 6:20 PM, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> >
> > Currently, page allocation functions for migration requires some arguments.
> > More worse, in the following patch, more argument will be needed to unify
> > the similar functions. To simplify them, in this patch, unified data
> > structure that controls allocation behaviour is introduced.
>
> As a followup to Roman's question and your answer about adding a suffix/prefix
> to the new structure.  It 'may' be a bit confusing as alloc_context is already
> defined and *ac is passsed around for page allocations.  Perhaps, this new
> structure could somehow have migrate in the name as it is all about allocating
> migrate targets?

I considered that, but I couldn't find an appropriate prefix. In the hugetlb
code, struct alloc_control is passed to internal functions which are not
fully dedicated to migration, so 'migrate' would not be an appropriate
prefix.

alloc_context is used by the page allocation core and alloc_control would be
used outside of it, so I think we can live with it. If there is a good
suggestion, I will happily change the name.

> >
> > For clean-up, function declarations are re-ordered.
> >
> > Note that, gfp_mask handling on alloc_huge_page_(node|nodemask) is
> > slightly changed, from ASSIGN to OR. It's safe since caller of these
> > functions doesn't pass extra gfp_mask except htlb_alloc_mask().
> >
> > Signed-off-by: Joonsoo Kim 
>
> Patch makes sense.

Thanks!

> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index a298a8c..94d2386 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1526,10 +1526,15 @@ struct page *new_page_nodemask(struct page *page,
> >   unsigned int order = 0;
> >   struct page *new_page = NULL;
> >
> > - if (PageHuge(page))
> > - return alloc_huge_page_nodemask(
> > - page_hstate(compound_head(page)),
> > - preferred_nid, nodemask);
> > + if (PageHuge(page)) {
> > + struct hstate *h = page_hstate(page);
>
> I assume the removal of compound_head(page) was intentional?  Just asking
> because PageHuge will look at head page while page_hstate will not.  So,
> if passed a non-head page things could go bad.

I was thinking that page_hstate() could handle a tail page, but it seems that
it can't. Thanks for the correction. I will change it in the next version.

Thanks.


Re: [PATCH 04/11] mm/hugetlb: unify hugetlb migration callback function

2020-05-20 Thread Joonsoo Kim
On Thu, May 21, 2020 at 9:46 AM, Roman Gushchin wrote:
>
> On Mon, May 18, 2020 at 10:20:50AM +0900, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> >
> > There is no difference between two migration callback functions,
> > alloc_huge_page_node() and alloc_huge_page_nodemask(), except
> > __GFP_THISNODE handling. This patch adds one more field on to
> > the alloc_control and handles this exception.
> >
> > Signed-off-by: Joonsoo Kim 
> > ---
> >  include/linux/hugetlb.h |  8 
> >  mm/hugetlb.c| 23 ++-
> >  mm/internal.h   |  1 +
> >  mm/mempolicy.c  |  3 ++-
> >  4 files changed, 5 insertions(+), 30 deletions(-)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 6da217e..4892ed3 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -505,8 +505,6 @@ struct huge_bootmem_page {
> >
> >  struct page *alloc_migrate_huge_page(struct hstate *h,
> >   struct alloc_control *ac);
> > -struct page *alloc_huge_page_node(struct hstate *h,
> > - struct alloc_control *ac);
> >  struct page *alloc_huge_page_nodemask(struct hstate *h,
> >   struct alloc_control *ac);
>
> Maybe drop _nodemask suffix in the remaining function?

We cannot remove the suffix since there is already a function with that name,
alloc_huge_page().

> >  struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct 
> > *vma,
> > @@ -755,12 +753,6 @@ static inline void huge_ptep_modify_prot_commit(struct 
> > vm_area_struct *vma,
> >  struct hstate {};
> >
> >  static inline struct page *
> > -alloc_huge_page_node(struct hstate *h, struct alloc_control *ac)
> > -{
> > - return NULL;
> > -}
> > -
> > -static inline struct page *
> >  alloc_huge_page_nodemask(struct hstate *h, struct alloc_control *ac)
> >  {
> >   return NULL;
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 859dba4..60b0983 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1981,31 +1981,12 @@ struct page *alloc_buddy_huge_page_with_mpol(struct 
> > hstate *h,
> >  }
> >
> >  /* page migration callback function */
> > -struct page *alloc_huge_page_node(struct hstate *h,
> > - struct alloc_control *ac)
> > -{
> > - struct page *page = NULL;
> > -
> > - ac->gfp_mask |= htlb_alloc_mask(h);
> > - if (ac->nid != NUMA_NO_NODE)
> > - ac->gfp_mask |= __GFP_THISNODE;
> > -
> > - spin_lock(&hugetlb_lock);
> > - if (h->free_huge_pages - h->resv_huge_pages > 0)
> > - page = dequeue_huge_page_nodemask(h, ac);
> > - spin_unlock(&hugetlb_lock);
> > -
> > - if (!page)
> > - page = alloc_migrate_huge_page(h, ac);
> > -
> > - return page;
> > -}
> > -
> > -/* page migration callback function */
> >  struct page *alloc_huge_page_nodemask(struct hstate *h,
> >   struct alloc_control *ac)
> >  {
> >   ac->gfp_mask |= htlb_alloc_mask(h);
> > + if (ac->thisnode && ac->nid != NUMA_NO_NODE)
> > + ac->gfp_mask |= __GFP_THISNODE;
> >
> >   spin_lock(&hugetlb_lock);
> >   if (h->free_huge_pages - h->resv_huge_pages > 0) {
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 75b3f8e..574722d0 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -618,6 +618,7 @@ struct alloc_control {
> >   int nid;
> >   nodemask_t *nmask;
> >   gfp_t gfp_mask;
> > + bool thisnode;
> >  };
>
> Can you, please, add some comments what each field exactly mean?

Will do.
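
Something along these lines, with the meanings as I use the fields in this
series (just a sketch here; the actual comments will come with the next
version):

	struct alloc_control {
		int nid;		/* preferred node id */
		nodemask_t *nmask;	/* allowed nodes for the allocation */
		gfp_t gfp_mask;		/* gfp flags for the allocation */
		bool thisnode;		/* restrict the allocation to nid,
					   i.e. add __GFP_THISNODE */
	};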

Thanks.


Re: [PATCH 03/11] mm/hugetlb: introduce alloc_control structure to simplify migration target allocation APIs

2020-05-20 Thread Joonsoo Kim
On Thu, May 21, 2020 at 9:43 AM, Roman Gushchin wrote:
>
> On Mon, May 18, 2020 at 10:20:49AM +0900, js1...@gmail.com wrote:
> > From: Joonsoo Kim 
> >
> > Currently, page allocation functions for migration requires some arguments.
> > More worse, in the following patch, more argument will be needed to unify
> > the similar functions. To simplify them, in this patch, unified data
> > structure that controls allocation behaviour is introduced.
>
> Is it all about huge pages only? If so, maybe adding huge_page suffix/prefix
> to struct alloc_control? E.g. huge_alloc_control or something like this.

No, it will also be used by other migration target allocation functions. You
can see that in the following patches.

> >
> > For clean-up, function declarations are re-ordered.
> >
> > Note that, gfp_mask handling on alloc_huge_page_(node|nodemask) is
> > slightly changed, from ASSIGN to OR. It's safe since caller of these
> > functions doesn't pass extra gfp_mask except htlb_alloc_mask().
>
> Changes make sense to me, but it feels like this patch is doing several
> thing simultaneously. Do you mind splitting it into few?

I will try to split it in the next spin.

Thanks.

