Re: [PATCH 2/9] mm/page_alloc: Add a bulk page allocator

2021-04-12 Thread Mel Gorman
On Mon, Apr 12, 2021 at 11:59:38AM +0100, Mel Gorman wrote:
> > I don't understand this comment. Only alloc_flags_nofragment() sets this flag
> > and we don't use it here?
> > 
> 
> It's there as a reminder that there are non-obvious consequences
> to ALLOC_NOFRAGMENT that may affect the bulk allocation success
> rate. __rmqueue_fallback will only select pageblock_order pages and if that
> fails, we fall into the slow path that allocates a single page. I didn't
> deal with it because it was not obvious that it's even relevant but I bet
> in 6 months time, I'll forget that ALLOC_NOFRAGMENT may affect success
> rates without the comment. I'm waiting for a bug that can trivially trigger
> a case with a meaningful workload where the success rate is poor enough to
> affect latency before adding complexity. Ideally by then, the allocation
> paths would be unified a bit better.
> 

So this needs better clarification. ALLOC_NOFRAGMENT is not a
problem at the moment but at one point during development, it was a
non-obvious potential problem. If the paths are unified, ALLOC_NOFRAGMENT
*potentially* becomes a problem depending on how it's done and it needs
careful consideration. For example, the paths could be partially unified by
moving the alloc_flags_nofragment() call into prepare_alloc_pages() because
it always happens in __alloc_pages() and it looks like an obvious partial
unification. Hence the comment "May set ALLOC_NOFRAGMENT" because I wanted
a reminder in case I "fixed" this in 6 months' time and forgot the downside.
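
For context, this is roughly where the single-page path picks up the flag
today, paraphrased from __alloc_pages() in mm/page_alloc.c. Pulling that
alloc_flags_nofragment() call into prepare_alloc_pages() is the kind of
partial unification that would silently hand ALLOC_NOFRAGMENT to the bulk
allocator as well:

	gfp &= gfp_allowed_mask;
	alloc_gfp = gfp;
	if (!prepare_alloc_pages(gfp, order, preferred_nid, nodemask, &ac,
			&alloc_gfp, &alloc_flags))
		return NULL;

	/*
	 * Forbid the first pass from falling back to types that fragment
	 * memory until all local zones are considered.
	 */
	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp);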

-- 
Mel Gorman
SUSE Labs


[PATCH] mm/page_alloc: Add a bulk page allocator -fix -fix -fix

2021-04-12 Thread Mel Gorman
Vlastimil Babka noted that a comment is wrong, fix it. This is the third
fix to the mmotm patch mm-page_alloc-add-a-bulk-page-allocator.patch.

Signed-off-by: Mel Gorman 
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1c67c99603a3..c62862071e6a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5067,7 +5067,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int 
preferred_nid,
return 0;
gfp = alloc_gfp;
 
-   /* Find an allowed local zone that meets the high watermark. */
+   /* Find an allowed local zone that meets the low watermark. */
for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, 
ac.highest_zoneidx, ac.nodemask) {
unsigned long mark;
 


Re: [PATCH 2/9] mm/page_alloc: Add a bulk page allocator

2021-04-12 Thread Mel Gorman
On Mon, Apr 12, 2021 at 12:21:42PM +0200, Vlastimil Babka wrote:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 8a3e13277e22..eb547470a7e4 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4965,6 +4965,124 @@ static inline bool prepare_alloc_pages(gfp_t 
> > gfp_mask, unsigned int order,
> > return true;
> >  }
> >  
> > +/*
> > + * __alloc_pages_bulk - Allocate a number of order-0 pages to a list
> > + * @gfp: GFP flags for the allocation
> > + * @preferred_nid: The preferred NUMA node ID to allocate from
> > + * @nodemask: Set of nodes to allocate from, may be NULL
> > + * @nr_pages: The number of pages desired on the list
> > + * @page_list: List to store the allocated pages
> > + *
> > + * This is a batched version of the page allocator that attempts to
> > + * allocate nr_pages quickly and add them to a list.
> > + *
> > + * Returns the number of pages on the list.
> > + */
> > +int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> > +   nodemask_t *nodemask, int nr_pages,
> > +   struct list_head *page_list)
> > +{
> > +   struct page *page;
> > +   unsigned long flags;
> > +   struct zone *zone;
> > +   struct zoneref *z;
> > +   struct per_cpu_pages *pcp;
> > +   struct list_head *pcp_list;
> > +   struct alloc_context ac;
> > +   gfp_t alloc_gfp;
> > +   unsigned int alloc_flags;
> 
> Was going to complain that this is not set to ALLOC_WMARK_LOW. Must be faster
> next time...
> 

Good that you caught it anyway!

> > +   int allocated = 0;
> > +
> > +   if (WARN_ON_ONCE(nr_pages <= 0))
> > +   return 0;
> > +
> > +   /* Use the single page allocator for one page. */
> > +   if (nr_pages == 1)
> > +   goto failed;
> > +
> > +   /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> 
> I don't understand this comment. Only alloc_flags_nofragment() sets this flag
> and we don't use it here?
> 

It's there as a reminder that there are non-obvious consequences
to ALLOC_NOFRAGMENT that may affect the bulk allocation success
rate. __rmqueue_fallback will only select pageblock_order pages and if that
fails, we fall into the slow path that allocates a single page. I didn't
deal with it because it was not obvious that it's even relevant but I bet
in 6 months time, I'll forget that ALLOC_NOFRAGMENT may affect success
rates without the comment. I'm waiting for a bug that can trivially trigger
a case with a meaningful workload where the success rate is poor enough to
affect latency before adding complexity. Ideally by then, the allocation
paths would be unified a bit better.

> > +   gfp &= gfp_allowed_mask;
> > +   alloc_gfp = gfp;
> > +   if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask, &ac, 
> > &alloc_gfp, &alloc_flags))
> > +   return 0;
> > +   gfp = alloc_gfp;
> > +
> > +   /* Find an allowed local zone that meets the high watermark. */
> 
> Should it say "low watermark"?
> 

Yeah, that's leftover from an earlier prototype :(

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 2/9] mm/page_alloc: Add a bulk page allocator

2021-04-12 Thread Vlastimil Babka
On 3/25/21 12:42 PM, Mel Gorman wrote:
> This patch adds a new page allocator interface via alloc_pages_bulk,
> and __alloc_pages_bulk_nodemask. A caller requests a number of pages
> to be allocated and added to a list.
> 
> The API is not guaranteed to return the requested number of pages and
> may fail if the preferred allocation zone has limited free memory, the
> cpuset changes during the allocation or page debugging decides to fail
> an allocation. It's up to the caller to request more pages in batch
> if necessary.
> 
> Note that this implementation is not very efficient and could be improved
> but it would require refactoring. The intent is to make it available early
> to determine what semantics are required by different callers. Once the
> full semantics are nailed down, it can be refactored.
> 
> Signed-off-by: Mel Gorman 
> Acked-by: Vlastimil Babka 
> ---
>  include/linux/gfp.h |  11 +
>  mm/page_alloc.c | 118 
>  2 files changed, 129 insertions(+)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 0a88f84b08f4..4a304fd39916 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -518,6 +518,17 @@ static inline int arch_make_page_accessible(struct page 
> *page)
>  struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
>   nodemask_t *nodemask);
>  
> +int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> + nodemask_t *nodemask, int nr_pages,
> + struct list_head *list);
> +
> +/* Bulk allocate order-0 pages */
> +static inline unsigned long
> +alloc_pages_bulk(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
> +{
> + return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list);
> +}
> +
>  /*
>   * Allocate pages, preferring the node given as nid. The node must be valid 
> and
>   * online. For more general interface, see alloc_pages_node().
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8a3e13277e22..eb547470a7e4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4965,6 +4965,124 @@ static inline bool prepare_alloc_pages(gfp_t 
> gfp_mask, unsigned int order,
>   return true;
>  }
>  
> +/*
> + * __alloc_pages_bulk - Allocate a number of order-0 pages to a list
> + * @gfp: GFP flags for the allocation
> + * @preferred_nid: The preferred NUMA node ID to allocate from
> + * @nodemask: Set of nodes to allocate from, may be NULL
> + * @nr_pages: The number of pages desired on the list
> + * @page_list: List to store the allocated pages
> + *
> + * This is a batched version of the page allocator that attempts to
> + * allocate nr_pages quickly and add them to a list.
> + *
> + * Returns the number of pages on the list.
> + */
> +int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> + nodemask_t *nodemask, int nr_pages,
> + struct list_head *page_list)
> +{
> + struct page *page;
> + unsigned long flags;
> + struct zone *zone;
> + struct zoneref *z;
> + struct per_cpu_pages *pcp;
> + struct list_head *pcp_list;
> + struct alloc_context ac;
> + gfp_t alloc_gfp;
> + unsigned int alloc_flags;

Was going to complain that this is not set to ALLOC_WMARK_LOW. Must be faster
next time...

> + int allocated = 0;
> +
> + if (WARN_ON_ONCE(nr_pages <= 0))
> + return 0;
> +
> + /* Use the single page allocator for one page. */
> + if (nr_pages == 1)
> + goto failed;
> +
> + /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */

I don't understand this comment. Only alloc_flags_nofragment() sets this flag
and we don't use it here?

> + gfp &= gfp_allowed_mask;
> + alloc_gfp = gfp;
> + if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask, &ac, 
> > &alloc_gfp, &alloc_flags))
> + return 0;
> + gfp = alloc_gfp;
> +
> + /* Find an allowed local zone that meets the high watermark. */

Should it say "low watermark"?

Vlastimil


Re: [PATCH] mm/page_alloc: Add a bulk page allocator -fix -fix

2021-03-30 Thread Colin Ian King
On 30/03/2021 12:48, Mel Gorman wrote:
> Colin Ian King reported the following problem (slightly edited)
> 
>   Author: Mel Gorman 
>   Date:   Mon Mar 29 11:12:24 2021 +1100
> 
>       mm/page_alloc: add a bulk page allocator
> 
>   ...
> 
>   Static analysis on linux-next with Coverity has found a potential
>   uninitialized variable issue in function __alloc_pages_bulk with
>   the following commit:
> 
>   ...
> 
>   Uninitialized scalar variable (UNINIT)
>   15. uninit_use_in_call: Using uninitialized value alloc_flags when
>   calling prepare_alloc_pages.
> 
>   5056if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask,
>   &ac, &alloc_gfp, &alloc_flags))
> 
> The problem is that prepare_alloc_pages() only updates alloc_flags,
> which must have a valid initial value. The appropriate initial value is
> ALLOC_WMARK_LOW to avoid the bulk allocator pushing a zone below the low
> watermark without waking kswapd, assuming the GFP mask allows kswapd to
> be woken.
> 
> This is a second fix to the mmotm patch
> mm-page_alloc-add-a-bulk-page-allocator.patch . It will cause a mild conflict
> with a later patch due to renaming of an adjacent variable that is trivially
> resolved. I can post a full series with the fixes merged if that is preferred.
> 
> Signed-off-by: Mel Gorman 
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 92d55f80c289..dabef0b910c9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4990,7 +4990,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int 
> preferred_nid,
>   struct list_head *pcp_list;
>   struct alloc_context ac;
>   gfp_t alloc_gfp;
> - unsigned int alloc_flags;
> + unsigned int alloc_flags = ALLOC_WMARK_LOW;
>   int allocated = 0;
>  
>   if (WARN_ON_ONCE(nr_pages <= 0))
> 

Thanks Mel, that definitely fixes the issue.

Reviewed-by: Colin Ian King 


[PATCH] mm/page_alloc: Add a bulk page allocator -fix -fix

2021-03-30 Thread Mel Gorman
Colin Ian King reported the following problem (slightly edited)

Author: Mel Gorman 
Date:   Mon Mar 29 11:12:24 2021 +1100

mm/page_alloc: add a bulk page allocator

...

Static analysis on linux-next with Coverity has found a potential
uninitialized variable issue in function __alloc_pages_bulk with
the following commit:

...

Uninitialized scalar variable (UNINIT)
15. uninit_use_in_call: Using uninitialized value alloc_flags when
calling prepare_alloc_pages.

5056if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask,
&ac, &alloc_gfp, &alloc_flags))

The problem is that prepare_alloc_pages() only updates alloc_flags,
which must have a valid initial value. The appropriate initial value is
ALLOC_WMARK_LOW to avoid the bulk allocator pushing a zone below the low
watermark without waking kswapd, assuming the GFP mask allows kswapd to
be woken.

This is a second fix to the mmotm patch
mm-page_alloc-add-a-bulk-page-allocator.patch . It will cause a mild conflict
with a later patch due to renaming of an adjacent variable that is trivially
resolved. I can post a full series with the fixes merged if that is preferred.

Signed-off-by: Mel Gorman 
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 92d55f80c289..dabef0b910c9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4990,7 +4990,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int 
preferred_nid,
struct list_head *pcp_list;
struct alloc_context ac;
gfp_t alloc_gfp;
-   unsigned int alloc_flags;
+   unsigned int alloc_flags = ALLOC_WMARK_LOW;
int allocated = 0;
 
if (WARN_ON_ONCE(nr_pages <= 0))


Re: mm/page_alloc: add a bulk page allocator

2021-03-30 Thread Mel Gorman
On Mon, Mar 29, 2021 at 04:18:09PM +0100, Colin Ian King wrote:
> Hi,
> 
> Static analysis on linux-next with Coverity has found a potential
> uninitialized variable issue in function __alloc_pages_bulk with the
> following commit:
> 
> commit b0e0a469733fa571ddd8fe147247c9561b51b2da
> Author: Mel Gorman 
> Date:   Mon Mar 29 11:12:24 2021 +1100
> 
>     mm/page_alloc: add a bulk page allocator
> 
> The analysis is as follows:
> 
> > 
>
> 5050if (nr_pages - nr_populated == 1)
> 5051goto failed;
> 5052
> 5053/* May set ALLOC_NOFRAGMENT, fragmentation will return 1
> page. */
> 5054gfp &= gfp_allowed_mask;
> 5055alloc_gfp = gfp;
> 
> Uninitialized scalar variable (UNINIT)
> 15. uninit_use_in_call: Using uninitialized value alloc_flags when
> calling prepare_alloc_pages.
> 
> 5056if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask,
> &ac, &alloc_gfp, &alloc_flags))

Ok, so Coverity thinks that alloc_flags is potentially uninitialised and
without digging into every part of the report, Coverity is right.

> 
>
> So alloc_flags in gfp_to_alloc_flags_cma is being updated with the |=
> operator and we managed to get to this path with uninitialized
> alloc_flags.  Should alloc_flags be initialized to zero in
> __alloc_pages_bulk()?
> 

You are correct about the |= updating an initial value, but I think the
initial value should be ALLOC_WMARK_LOW. A value of 0 would be the same
as ALLOC_WMARK_MIN and that would allow the bulk allocator to potentially
consume too many pages without waking kswapd.  I'll put together a patch
shortly. Thanks Colin!
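
For reference, the watermark index flags are defined roughly as follows
(paraphrased from include/linux/mmzone.h and mm/internal.h around this
kernel version), which is why a plain 0 would be read as the MIN watermark
index:

	/* include/linux/mmzone.h */
	enum zone_watermarks {
		WMARK_MIN,	/* 0 */
		WMARK_LOW,	/* 1 */
		WMARK_HIGH,	/* 2 */
		NR_WMARK
	};

	/* mm/internal.h: the ALLOC_WMARK bits index the zone watermarks */
	#define ALLOC_WMARK_MIN		WMARK_MIN
	#define ALLOC_WMARK_LOW		WMARK_LOW
	#define ALLOC_WMARK_HIGH	WMARK_HIGH
	#define ALLOC_NO_WATERMARKS	0x04	/* don't check watermarks at all */
	#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)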

-- 
Mel Gorman
SUSE Labs


re: mm/page_alloc: add a bulk page allocator

2021-03-29 Thread Colin Ian King
Hi,

Static analysis on linux-next with Coverity has found a potential
uninitialized variable issue in function __alloc_pages_bulk with the
following commit:

commit b0e0a469733fa571ddd8fe147247c9561b51b2da
Author: Mel Gorman 
Date:   Mon Mar 29 11:12:24 2021 +1100

mm/page_alloc: add a bulk page allocator

The analysis is as follows:

5023 unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
5024nodemask_t *nodemask, int nr_pages,
5025struct list_head *page_list,
5026struct page **page_array)
5027 {
5028struct page *page;
5029unsigned long flags;
5030struct zone *zone;
5031struct zoneref *z;
5032struct per_cpu_pages *pcp;
5033struct list_head *pcp_list;
5034struct alloc_context ac;
5035gfp_t alloc_gfp;
1. var_decl: Declaring variable alloc_flags without initializer.
5036unsigned int alloc_flags;
5037int nr_populated = 0;
5038
2. Condition !!(nr_pages <= 0), taking false branch.
5039if (unlikely(nr_pages <= 0))
5040return 0;
5041
5042/*
5043 * Skip populated array elements to determine if any pages need
5044 * to be allocated before disabling IRQs.
5045 */
3. Condition page_array, taking true branch.
4. Condition page_array[nr_populated], taking true branch.
5. Condition nr_populated < nr_pages, taking true branch.
7. Condition page_array, taking true branch.
8. Condition page_array[nr_populated], taking true branch.
9. Condition nr_populated < nr_pages, taking true branch.
11. Condition page_array, taking true branch.
12. Condition page_array[nr_populated], taking true branch.
13. Condition nr_populated < nr_pages, taking false branch.
5046while (page_array && page_array[nr_populated] &&
nr_populated < nr_pages)
6. Jumping back to the beginning of the loop.
10. Jumping back to the beginning of the loop.
5047nr_populated++;
5048
5049/* Use the single page allocator for one page. */
14. Condition nr_pages - nr_populated == 1, taking false branch.
5050if (nr_pages - nr_populated == 1)
5051goto failed;
5052
5053/* May set ALLOC_NOFRAGMENT, fragmentation will return 1
page. */
5054gfp &= gfp_allowed_mask;
5055alloc_gfp = gfp;

Uninitialized scalar variable (UNINIT)
15. uninit_use_in_call: Using uninitialized value alloc_flags when
calling prepare_alloc_pages.

5056if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask,
&ac, &alloc_gfp, &alloc_flags))
5057return 0;

And in prepare_alloc_pages():

4957 static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int
order,
4958int preferred_nid, nodemask_t *nodemask,
4959struct alloc_context *ac, gfp_t *alloc_gfp,
4960unsigned int *alloc_flags)
4961 {
4962ac->highest_zoneidx = gfp_zone(gfp_mask);
4963ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
4964ac->nodemask = nodemask;
4965ac->migratetype = gfp_migratetype(gfp_mask);
4966

1. Condition cpusets_enabled(), taking false branch.

4967if (cpusets_enabled()) {
4968*alloc_gfp |= __GFP_HARDWALL;
4969/*
4970 * When we are in the interrupt context, it is
irrelevant
4971 * to the current task context. It means that any
node ok.
4972 */
4973if (!in_interrupt() && !ac->nodemask)
4974ac->nodemask = &cpuset_current_mems_allowed;
4975else
4976*alloc_flags |= ALLOC_CPUSET;
4977}
4978
4979fs_reclaim_acquire(gfp_mask);
4980fs_reclaim_release(gfp_mask);
4981
2. Condition gfp_mask & 1024U /* (gfp_t)1024U */, taking true branch.
4982might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
4983
3. Condition should_fail_alloc_page(gfp_mask, order), taking false
branch.
4984if (should_fail_alloc_page(gfp_mask, order))
4985return false;
4986
4. read_value: Reading value *alloc_flags when calling
gfp_to_alloc_flags_cma.
4987*alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);

And in call gfp_to_alloc_flags_cma():

in /mm/page_alloc.c

3853 static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
3854  unsigned int
alloc_flags)
3855 {
3856#ifdef CONFIG_CMA
1. Condition gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE, taking
true branch.
3857if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
2. read_value: Reading value alloc_flags.
3858alloc_flags |= ALLOC_CMA;
3859#endif
3860return alloc_flags;
3861 }

So alloc_flags in gfp_to_alloc_flags_cma is being updated with the |=
operator and we managed to get to this path with uninitialized
alloc_flags.  Should alloc_flags be initialized to zero in
__alloc_pages_bulk()?

Re: [PATCH 2/9] mm/page_alloc: Add a bulk page allocator

2021-03-25 Thread Mel Gorman
On Thu, Mar 25, 2021 at 12:05:25PM +, Matthew Wilcox wrote:
> On Thu, Mar 25, 2021 at 11:42:21AM +, Mel Gorman wrote:
> > +int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> > +   nodemask_t *nodemask, int nr_pages,
> > +   struct list_head *list);
> > +
> > +/* Bulk allocate order-0 pages */
> > +static inline unsigned long
> > +alloc_pages_bulk(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
> > +{
> > +   return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list);
> 
> Discrepancy in the two return types here.  Suspect they should both
> be 'unsigned int' so there's no question about "can it return an errno".
> 

I'll make it unsigned long as the nr_pages parameter is unsigned long.
It's a silly range to have for pages but it matches alloc_contig_range
even though free_contig_range takes unsigned int *sigh*

> >  
> > +/*
> 
> If you could make that "/**" instead ...
> 

I decided not to until we're reasonably sure the semantics are not going
to change.

---8<---
mm/page_alloc: Add a bulk page allocator -fix

Matthew Wilcox pointed out that the return type for alloc_pages_bulk()
and __alloc_pages_bulk() is inconsistent. Fix it.

Signed-off-by: Mel Gorman 
---
 include/linux/gfp.h | 2 +-
 mm/page_alloc.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4a304fd39916..a2be8f4174a9 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -518,7 +518,7 @@ static inline int arch_make_page_accessible(struct page 
*page)
 struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
 
-int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nodemask_t *nodemask, int nr_pages,
struct list_head *list);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eb547470a7e4..92d55f80c289 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4978,7 +4978,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
unsigned int order,
  *
  * Returns the number of pages on the list.
  */
-int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nodemask_t *nodemask, int nr_pages,
struct list_head *page_list)
 {


Re: [PATCH 2/9] mm/page_alloc: Add a bulk page allocator

2021-03-25 Thread Matthew Wilcox
On Thu, Mar 25, 2021 at 11:42:21AM +, Mel Gorman wrote:
> +int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> + nodemask_t *nodemask, int nr_pages,
> + struct list_head *list);
> +
> +/* Bulk allocate order-0 pages */
> +static inline unsigned long
> +alloc_pages_bulk(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
> +{
> + return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list);

Discrepancy in the two return types here.  Suspect they should both
be 'unsigned int' so there's no question about "can it return an errno".

>  
> +/*

If you could make that "/**" instead ...



[PATCH 2/9] mm/page_alloc: Add a bulk page allocator

2021-03-25 Thread Mel Gorman
This patch adds a new page allocator interface via alloc_pages_bulk,
and __alloc_pages_bulk_nodemask. A caller requests a number of pages
to be allocated and added to a list.

The API is not guaranteed to return the requested number of pages and
may fail if the preferred allocation zone has limited free memory, the
cpuset changes during the allocation or page debugging decides to fail
an allocation. It's up to the caller to request more pages in batch
if necessary.
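
As a purely illustrative sketch (not part of this patch), a caller that
must end up with exactly nr_pages could retry the bulk call for whatever
is still missing, for example:

	/* Hypothetical caller, not part of the patch. */
	static int fill_page_list(gfp_t gfp, unsigned long nr_pages,
				  struct list_head *list)
	{
		unsigned long nr = 0;

		while (nr < nr_pages) {
			LIST_HEAD(tmp);
			unsigned long got;

			got = alloc_pages_bulk(gfp, nr_pages - nr, &tmp);
			if (!got)
				return -ENOMEM;	/* caller could reclaim and retry */

			/* Move whatever was allocated onto the caller's list. */
			list_splice(&tmp, list);
			nr += got;
		}

		return 0;
	}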

Note that this implementation is not very efficient and could be improved
but it would require refactoring. The intent is to make it available early
to determine what semantics are required by different callers. Once the
full semantics are nailed down, it can be refactored.

Signed-off-by: Mel Gorman 
Acked-by: Vlastimil Babka 
---
 include/linux/gfp.h |  11 +
 mm/page_alloc.c | 118 
 2 files changed, 129 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0a88f84b08f4..4a304fd39916 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -518,6 +518,17 @@ static inline int arch_make_page_accessible(struct page 
*page)
 struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
 
+int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+   nodemask_t *nodemask, int nr_pages,
+   struct list_head *list);
+
+/* Bulk allocate order-0 pages */
+static inline unsigned long
+alloc_pages_bulk(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
+{
+   return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list);
+}
+
 /*
  * Allocate pages, preferring the node given as nid. The node must be valid and
  * online. For more general interface, see alloc_pages_node().
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8a3e13277e22..eb547470a7e4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4965,6 +4965,124 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
unsigned int order,
return true;
 }
 
+/*
+ * __alloc_pages_bulk - Allocate a number of order-0 pages to a list
+ * @gfp: GFP flags for the allocation
+ * @preferred_nid: The preferred NUMA node ID to allocate from
+ * @nodemask: Set of nodes to allocate from, may be NULL
+ * @nr_pages: The number of pages desired on the list
+ * @page_list: List to store the allocated pages
+ *
+ * This is a batched version of the page allocator that attempts to
+ * allocate nr_pages quickly and add them to a list.
+ *
+ * Returns the number of pages on the list.
+ */
+int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+   nodemask_t *nodemask, int nr_pages,
+   struct list_head *page_list)
+{
+   struct page *page;
+   unsigned long flags;
+   struct zone *zone;
+   struct zoneref *z;
+   struct per_cpu_pages *pcp;
+   struct list_head *pcp_list;
+   struct alloc_context ac;
+   gfp_t alloc_gfp;
+   unsigned int alloc_flags;
+   int allocated = 0;
+
+   if (WARN_ON_ONCE(nr_pages <= 0))
+   return 0;
+
+   /* Use the single page allocator for one page. */
+   if (nr_pages == 1)
+   goto failed;
+
+   /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
+   gfp &= gfp_allowed_mask;
+   alloc_gfp = gfp;
+   if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask, &ac, 
&alloc_gfp, &alloc_flags))
+   return 0;
+   gfp = alloc_gfp;
+
+   /* Find an allowed local zone that meets the high watermark. */
+   for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, 
ac.highest_zoneidx, ac.nodemask) {
+   unsigned long mark;
+
+   if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
+   !__cpuset_zone_allowed(zone, gfp)) {
+   continue;
+   }
+
+   if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
+   zone_to_nid(zone) != 
zone_to_nid(ac.preferred_zoneref->zone)) {
+   goto failed;
+   }
+
+   mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + 
nr_pages;
+   if (zone_watermark_fast(zone, 0,  mark,
+   zonelist_zone_idx(ac.preferred_zoneref),
+   alloc_flags, gfp)) {
+   break;
+   }
+   }
+
+   /*
+* If there are no allowed local zones that meets the watermarks then
+* try to allocate a single page and reclaim if necessary.
+*/
+   if (!zone)
+   goto failed;
+
+   /* Attempt the batch allocation */
+   local_irq_save(flags);
+   pcp = &this_cpu_ptr(zone->pageset)->pcp;
+   pcp_list = &pcp->lists[ac.migratetype];
+
+   while (allocated < nr_pages) {
+   page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,

Re: [PATCH 2/3] mm/page_alloc: Add a bulk page allocator

2021-03-23 Thread Mel Gorman
On Tue, Mar 23, 2021 at 05:00:08PM +0100, Jesper Dangaard Brouer wrote:
> > +   /*
> > +* If there are no allowed local zones that meets the watermarks then
> > +* try to allocate a single page and reclaim if necessary.
> > +*/
> > +   if (!zone)
> > +   goto failed;
> > +
> > +   /* Attempt the batch allocation */
> > +   local_irq_save(flags);
> > +   pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > +   pcp_list = &pcp->lists[ac.migratetype];
> > +
> > +   while (allocated < nr_pages) {
> > +   page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
> > +   pcp, pcp_list);
> 
> The function __rmqueue_pcplist() is now used two places, this cause the
> compiler to uninline the static function.
> 

This was expected. It was not something I was particularly happy with
but avoiding it was problematic without major refactoring.

> My tests show you should inline __rmqueue_pcplist().  See the patch I'm
> using below my signature, which also has some benchmark notes. (Please
> squash it into your patch and drop these notes.)
> 

The cycle savings per element are very marginal at just 4 cycles. I
expect just the silly stat updates are way more costly but the series
that addresses that is likely to be controversial. As I know the cycle
budget for processing a packet is tight, I've applied the patch but am
keeping it separate to preserve the data in case someone points out that
it is a big function to inline and "fixes" it.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 2/3] mm/page_alloc: Add a bulk page allocator

2021-03-23 Thread Jesper Dangaard Brouer
On Mon, 22 Mar 2021 09:18:44 +
Mel Gorman  wrote:

> This patch adds a new page allocator interface via alloc_pages_bulk,
> and __alloc_pages_bulk_nodemask. A caller requests a number of pages
> to be allocated and added to a list.
> 
> The API is not guaranteed to return the requested number of pages and
> may fail if the preferred allocation zone has limited free memory, the
> cpuset changes during the allocation or page debugging decides to fail
> an allocation. It's up to the caller to request more pages in batch
> if necessary.
> 
> Note that this implementation is not very efficient and could be improved
> but it would require refactoring. The intent is to make it available early
> to determine what semantics are required by different callers. Once the
> full semantics are nailed down, it can be refactored.
> 
> Signed-off-by: Mel Gorman 
> Acked-by: Vlastimil Babka 
> ---
>  include/linux/gfp.h |  11 
>  mm/page_alloc.c | 124 
>  2 files changed, 135 insertions(+)
[...]

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8a3e13277e22..3f4d56854c74 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4965,6 +4965,130 @@ static inline bool prepare_alloc_pages(gfp_t 
> gfp_mask, unsigned int order,
>   return true;
>  }
>  
> +/*
> + * __alloc_pages_bulk - Allocate a number of order-0 pages to a list
> + * @gfp: GFP flags for the allocation
> + * @preferred_nid: The preferred NUMA node ID to allocate from
> + * @nodemask: Set of nodes to allocate from, may be NULL
> + * @nr_pages: The number of pages requested
> + * @page_list: List to store the allocated pages, must be empty
> + *
> + * This is a batched version of the page allocator that attempts to
> + * allocate nr_pages quickly and add them to a list. The list must be
> + * empty to allow new pages to be prepped with IRQs enabled.
> + *
> + * Returns the number of pages allocated.
> + */
> +int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> + nodemask_t *nodemask, int nr_pages,
> + struct list_head *page_list)
> +{
[...]
> + /*
> +  * If there are no allowed local zones that meets the watermarks then
> +  * try to allocate a single page and reclaim if necessary.
> +  */
> + if (!zone)
> + goto failed;
> +
> + /* Attempt the batch allocation */
> + local_irq_save(flags);
> + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> + pcp_list = &pcp->lists[ac.migratetype];
> +
> + while (allocated < nr_pages) {
> + page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
> + pcp, pcp_list);

The function __rmqueue_pcplist() is now used two places, this cause the
compiler to uninline the static function.

My tests show you should inline __rmqueue_pcplist().  See the patch I'm
using below my signature, which also has some benchmark notes. (Please
squash it into your patch and drop these notes.)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[PATCH] mm: inline __rmqueue_pcplist

From: Jesper Dangaard Brouer 

With __alloc_pages_bulk() introduced, there are two callers of
__rmqueue_pcplist() and the compiler chooses not to inline
this function.

 ./scripts/bloat-o-meter vmlinux-before vmlinux-inline__rmqueue_pcplist
add/remove: 0/1 grow/shrink: 2/0 up/down: 164/-125 (39)
Function old new   delta
rmqueue 21972296 +99
__alloc_pages_bulk  19211986 +65
__rmqueue_pcplist125   --125
Total: Before=19374127, After=19374166, chg +0.00%

modprobe page_bench04_bulk loops=$((10**7))

Type:time_bulk_page_alloc_free_array
 -  Per elem: 106 cycles(tsc) 29.595 ns (step:64)
 - (measurement period time:0.295955434 sec time_interval:295955434)
 - (invoke count:1000 tsc_interval:1065447105)

Before:
 - Per elem: 110 cycles(tsc) 30.633 ns (step:64)

Signed-off-by: Jesper Dangaard Brouer 
---
 mm/page_alloc.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2cbb8da811ab..f60f51a97a7b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3415,7 +3415,8 @@ static inline void zone_statistics(struct zone 
*preferred_zone, struct zone *z)
 }
 
 /* Remove page from the per-cpu list, caller must protect the list */
-static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
+static inline
+struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
unsigned int alloc_flags,
struct per_cpu_pages *pcp,
struct list_head *list)



[PATCH 2/3] mm/page_alloc: Add a bulk page allocator

2021-03-22 Thread Mel Gorman
This patch adds a new page allocator interface via alloc_pages_bulk,
and __alloc_pages_bulk_nodemask. A caller requests a number of pages
to be allocated and added to a list.

The API is not guaranteed to return the requested number of pages and
may fail if the preferred allocation zone has limited free memory, the
cpuset changes during the allocation or page debugging decides to fail
an allocation. It's up to the caller to request more pages in batch
if necessary.

Note that this implementation is not very efficient and could be improved
but it would require refactoring. The intent is to make it available early
to determine what semantics are required by different callers. Once the
full semantics are nailed down, it can be refactored.

Signed-off-by: Mel Gorman 
Acked-by: Vlastimil Babka 
---
 include/linux/gfp.h |  11 
 mm/page_alloc.c | 124 
 2 files changed, 135 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0a88f84b08f4..4a304fd39916 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -518,6 +518,17 @@ static inline int arch_make_page_accessible(struct page 
*page)
 struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
 
+int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+   nodemask_t *nodemask, int nr_pages,
+   struct list_head *list);
+
+/* Bulk allocate order-0 pages */
+static inline unsigned long
+alloc_pages_bulk(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
+{
+   return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list);
+}
+
 /*
  * Allocate pages, preferring the node given as nid. The node must be valid and
  * online. For more general interface, see alloc_pages_node().
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8a3e13277e22..3f4d56854c74 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4965,6 +4965,130 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
unsigned int order,
return true;
 }
 
+/*
+ * __alloc_pages_bulk - Allocate a number of order-0 pages to a list
+ * @gfp: GFP flags for the allocation
+ * @preferred_nid: The preferred NUMA node ID to allocate from
+ * @nodemask: Set of nodes to allocate from, may be NULL
+ * @nr_pages: The number of pages requested
+ * @page_list: List to store the allocated pages, must be empty
+ *
+ * This is a batched version of the page allocator that attempts to
+ * allocate nr_pages quickly and add them to a list. The list must be
+ * empty to allow new pages to be prepped with IRQs enabled.
+ *
+ * Returns the number of pages allocated.
+ */
+int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+   nodemask_t *nodemask, int nr_pages,
+   struct list_head *page_list)
+{
+   struct page *page;
+   unsigned long flags;
+   struct zone *zone;
+   struct zoneref *z;
+   struct per_cpu_pages *pcp;
+   struct list_head *pcp_list;
+   struct alloc_context ac;
+   gfp_t alloc_gfp;
+   unsigned int alloc_flags;
+   int allocated = 0;
+
+   if (WARN_ON_ONCE(nr_pages <= 0))
+   return 0;
+
+   if (WARN_ON_ONCE(!list_empty(page_list)))
+   return 0;
+
+   if (nr_pages == 1)
+   goto failed;
+
+   /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
+   gfp &= gfp_allowed_mask;
+   alloc_gfp = gfp;
+   if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask, &ac, 
&alloc_gfp, &alloc_flags))
+   return 0;
+   gfp = alloc_gfp;
+
+   /* Find an allowed local zone that meets the high watermark. */
+   for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, 
ac.highest_zoneidx, ac.nodemask) {
+   unsigned long mark;
+
+   if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
+   !__cpuset_zone_allowed(zone, gfp)) {
+   continue;
+   }
+
+   if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
+   zone_to_nid(zone) != 
zone_to_nid(ac.preferred_zoneref->zone)) {
+   goto failed;
+   }
+
+   mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + 
nr_pages;
+   if (zone_watermark_fast(zone, 0,  mark,
+   zonelist_zone_idx(ac.preferred_zoneref),
+   alloc_flags, gfp)) {
+   break;
+   }
+   }
+
+   /*
+* If there are no allowed local zones that meets the watermarks then
+* try to allocate a single page and reclaim if necessary.
+*/
+   if (!zone)
+   goto failed;
+
+   /* Attempt the batch allocation */
+   local_irq_save(flags);
+   pcp = &this_cpu_ptr(zone->pageset)->pcp;
+   pcp_list = &pcp->lists[ac.migratetype];
+
+   while 

Re: [PATCH 3/7] mm/page_alloc: Add a bulk page allocator

2021-03-22 Thread Mel Gorman
On Fri, Mar 19, 2021 at 07:18:32PM +0100, Vlastimil Babka wrote:
> On 3/12/21 4:43 PM, Mel Gorman wrote:
> > This patch adds a new page allocator interface via alloc_pages_bulk,
> > and __alloc_pages_bulk_nodemask. A caller requests a number of pages
> > to be allocated and added to a list. They can be freed in bulk using
> > free_pages_bulk().
> > 
> > The API is not guaranteed to return the requested number of pages and
> > may fail if the preferred allocation zone has limited free memory, the
> > cpuset changes during the allocation or page debugging decides to fail
> > an allocation. It's up to the caller to request more pages in batch
> > if necessary.
> > 
> > Note that this implementation is not very efficient and could be improved
> > but it would require refactoring. The intent is to make it available early
> > to determine what semantics are required by different callers. Once the
> > full semantics are nailed down, it can be refactored.
> > 
> > Signed-off-by: Mel Gorman 
> 
> Acked-by: Vlastimil Babka 
> 
> Although maybe premature, if it changes significantly due to the users'
> performance feedback, let's see :)
> 

Indeed. The next version will have no users so that Jesper and Chuck
can check whether an array-based or LRU-based version is better. There were
also bugs, such as broken accounting of stats, that had to be fixed, which
increases the overhead.

> Some nits below:
> 
> ...
> 
> > @@ -4963,6 +4978,107 @@ static inline bool prepare_alloc_pages(gfp_t gfp, 
> > unsigned int order,
> > return true;
> >  }
> >  
> > +/*
> > + * This is a batched version of the page allocator that attempts to
> > + * allocate nr_pages quickly from the preferred zone and add them to list.
> > + *
> > + * Returns the number of pages allocated.
> > + */
> > +int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> > +   nodemask_t *nodemask, int nr_pages,
> > +   struct list_head *alloc_list)
> > +{
> > +   struct page *page;
> > +   unsigned long flags;
> > +   struct zone *zone;
> > +   struct zoneref *z;
> > +   struct per_cpu_pages *pcp;
> > +   struct list_head *pcp_list;
> > +   struct alloc_context ac;
> > +   gfp_t alloc_gfp;
> > +   unsigned int alloc_flags;
> > +   int allocated = 0;
> > +
> > +   if (WARN_ON_ONCE(nr_pages <= 0))
> > +   return 0;
> > +
> > +   if (nr_pages == 1)
> > +   goto failed;
> > +
> > +   /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> > +   if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask, &ac,
> > +   &alloc_gfp, &alloc_flags))
> 
> Unusual identation here.
> 

Fixed

> > +   return 0;
> > +   gfp = alloc_gfp;
> > +
> > +   /* Find an allowed local zone that meets the high watermark. */
> > +   for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, 
> > ac.highest_zoneidx, ac.nodemask) {
> > +   unsigned long mark;
> > +
> > +   if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> > +   !__cpuset_zone_allowed(zone, gfp)) {
> > +   continue;
> > +   }
> > +
> > +   if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
> > +   zone_to_nid(zone) != 
> > zone_to_nid(ac.preferred_zoneref->zone)) {
> > +   goto failed;
> > +   }
> > +
> > +   mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + 
> > nr_pages;
> > +   if (zone_watermark_fast(zone, 0,  mark,
> > +   zonelist_zone_idx(ac.preferred_zoneref),
> > +   alloc_flags, gfp)) {
> > +   break;
> > +   }
> > +   }
> > +   if (!zone)
> > +   return 0;
> 
> Why not also "goto failed;" here?

Good question. When first written, it was because the zone search for the
normal allocator was almost certainly going to fail to find a zone and
it was expected that callers would prefer to fail fast over blocking.
Now we know that sunrpc can sleep on a failing allocation and it would
be better to enter the single page allocator and reclaim pages instead of
"sleep and hope for the best".

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 3/7] mm/page_alloc: Add a bulk page allocator

2021-03-19 Thread Vlastimil Babka
On 3/12/21 4:43 PM, Mel Gorman wrote:
> This patch adds a new page allocator interface via alloc_pages_bulk,
> and __alloc_pages_bulk_nodemask. A caller requests a number of pages
> to be allocated and added to a list. They can be freed in bulk using
> free_pages_bulk().
> 
> The API is not guaranteed to return the requested number of pages and
> may fail if the preferred allocation zone has limited free memory, the
> cpuset changes during the allocation or page debugging decides to fail
> an allocation. It's up to the caller to request more pages in batch
> if necessary.
> 
> Note that this implementation is not very efficient and could be improved
> but it would require refactoring. The intent is to make it available early
> to determine what semantics are required by different callers. Once the
> full semantics are nailed down, it can be refactored.
> 
> Signed-off-by: Mel Gorman 

Acked-by: Vlastimil Babka 

Although maybe premature, if it changes significantly due to the users'
performance feedback, let's see :)

Some nits below:

...

> @@ -4963,6 +4978,107 @@ static inline bool prepare_alloc_pages(gfp_t gfp, 
> unsigned int order,
>   return true;
>  }
>  
> +/*
> + * This is a batched version of the page allocator that attempts to
> + * allocate nr_pages quickly from the preferred zone and add them to list.
> + *
> + * Returns the number of pages allocated.
> + */
> +int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> + nodemask_t *nodemask, int nr_pages,
> + struct list_head *alloc_list)
> +{
> + struct page *page;
> + unsigned long flags;
> + struct zone *zone;
> + struct zoneref *z;
> + struct per_cpu_pages *pcp;
> + struct list_head *pcp_list;
> + struct alloc_context ac;
> + gfp_t alloc_gfp;
> + unsigned int alloc_flags;
> + int allocated = 0;
> +
> + if (WARN_ON_ONCE(nr_pages <= 0))
> + return 0;
> +
> + if (nr_pages == 1)
> + goto failed;
> +
> + /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> + if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask, &ac,
> + &alloc_gfp, &alloc_flags))

Unusual identation here.

> + return 0;
> + gfp = alloc_gfp;
> +
> + /* Find an allowed local zone that meets the high watermark. */
> + for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, 
> ac.highest_zoneidx, ac.nodemask) {
> + unsigned long mark;
> +
> + if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> + !__cpuset_zone_allowed(zone, gfp)) {
> + continue;
> + }
> +
> + if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
> + zone_to_nid(zone) != 
> zone_to_nid(ac.preferred_zoneref->zone)) {
> + goto failed;
> + }
> +
> + mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + 
> nr_pages;
> + if (zone_watermark_fast(zone, 0,  mark,
> + zonelist_zone_idx(ac.preferred_zoneref),
> + alloc_flags, gfp)) {
> + break;
> + }
> + }
> + if (!zone)
> + return 0;

Why not also "goto failed;" here?


Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-19 Thread Jesper Dangaard Brouer
On Sun, 14 Mar 2021 12:52:32 +
Mel Gorman  wrote:

> mm/page_alloc: Add an array-based interface to the bulk page allocator
> 
> The existing callers for the bulk allocator are storing the pages in
> arrays. This patch adds an array-based interface to the API to avoid
> multiple list iterations. The page list interface is preserved to
> avoid requiring all users of the bulk API to allocate and manage
> enough storage to store the pages.

I'm testing this patch; see results below and in commit [1].  The array
variant is clearly faster in these micro-benchmarks.
(Some comments are inlined about the code.)

The change to my page_bench04_bulk is here[1]:
 [1] https://github.com/netoptimizer/prototype-kernel/commit/4c41fe0d4107f514

Notice these "per elem" measurements are alloc+free cost for order-0 pages.

BASELINE
 single_page alloc+put: 207 cycles(tsc) 57.773 ns

LIST variant: time_bulk_page_alloc_free_list: step=bulk size

 Per elem: 294 cycles(tsc) 81.866 ns (step:1)
 Per elem: 214 cycles(tsc) 59.477 ns (step:2)
 Per elem: 199 cycles(tsc) 55.504 ns (step:3)
 Per elem: 192 cycles(tsc) 53.489 ns (step:4)
 Per elem: 188 cycles(tsc) 52.456 ns (step:8)
 Per elem: 184 cycles(tsc) 51.346 ns (step:16)
 Per elem: 183 cycles(tsc) 50.883 ns (step:32)
 Per elem: 184 cycles(tsc) 51.236 ns (step:64)
 Per elem: 189 cycles(tsc) 52.620 ns (step:128)

ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size

 Per elem: 195 cycles(tsc) 54.174 ns (step:1)
 Per elem: 123 cycles(tsc) 34.224 ns (step:2)
 Per elem: 113 cycles(tsc) 31.430 ns (step:3)
 Per elem: 108 cycles(tsc) 30.003 ns (step:4)
 Per elem: 102 cycles(tsc) 28.417 ns (step:8)
 Per elem:  98 cycles(tsc) 27.475 ns (step:16)
 Per elem:  96 cycles(tsc) 26.901 ns (step:32)
 Per elem:  95 cycles(tsc) 26.487 ns (step:64)
 Per elem:  94 cycles(tsc) 26.170 ns (step:128)

The array variant is clearly faster.

It is worth mentioning that bulk=1 results in a fallback to normal
single page allocation via __alloc_pages().


> Signed-off-by: Mel Gorman 
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 4a304fd39916..fb6234e1fe59 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -520,13 +520,20 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int 
> order, int preferred_nid,
>  
>  int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>   nodemask_t *nodemask, int nr_pages,
> - struct list_head *list);
> + struct list_head *page_list,
> + struct page **page_array);
>  
>  /* Bulk allocate order-0 pages */
>  static inline unsigned long
> -alloc_pages_bulk(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
> +alloc_pages_bulk_list(gfp_t gfp, unsigned long nr_pages, struct list_head 
> *list)
>  {
> - return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list);
> + return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list, 
> NULL);
> +}
> +
> +static inline unsigned long
> +alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page 
> **page_array)
> +{
> + return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, 
> page_array);
>  }
>  
>  /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3e0c87c588d3..96590f0726c7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4965,13 +4965,20 @@ static inline bool prepare_alloc_pages(gfp_t gfp, 
> unsigned int order,
>  
>  /*
>   * This is a batched version of the page allocator that attempts to
> - * allocate nr_pages quickly from the preferred zone and add them to list.
> + * allocate nr_pages quickly from the preferred zone. Pages are added
> + * to page_list if page_list is not NULL, otherwise it is assumed
> + * that the page_array is valid.
> + *
> + * If using page_array, only NULL elements are populated with pages.
> + * The caller must ensure that the array has enough NULL elements
> + * to store nr_pages or the buffer overruns.
>   *
>   * Returns the number of pages allocated.
>   */
>  int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>   nodemask_t *nodemask, int nr_pages,
> - struct list_head *alloc_list)
> + struct list_head *page_list,
> + struct page **page_array)
>  {
>   struct page *page;
>   unsigned long flags;
> @@ -4987,6 +4994,9 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>   if (WARN_ON_ONCE(nr_pages <= 0))
>   return 0;
>  
> + if (WARN_ON_ONCE(!page_list && !page_array))
> + return 0;
> +
>   if (nr_pages == 1)
>   goto failed;
>  
> @@ -5035,7 +5045,24 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>   break;
>   }
>  
> - list_add(&page->lru, alloc_list);
> + if (page_list) {
> + /* New page prep is deferred */
> + list_add(&page->lru, page_list);
> +  

Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-15 Thread Jesper Dangaard Brouer
On Mon, 15 Mar 2021 10:42:05 +
Mel Gorman  wrote:

> On Sun, Mar 14, 2021 at 03:22:02PM +, Chuck Lever III wrote:
> > >> Anyway, I'm not arguing against a bulk allocator, nor even saying this
> > >> is a bad interface.  It just maybe could be better.
> > >>   
> > > 
> > > I think it puts more responsibility on the caller to use the API correctly
> > > but I also see no value in arguing about it further because there is no
> > > supporting data either way (I don't have routine access to a sufficiently
> > > fast network to generate the data). I can add the following patch and let
> > > callers figure out which interface is preferred. If one of the interfaces
> > > is dead in a year, it can be removed.
> > > 
> > > As there are a couple of ways the arrays could be used, I'm leaving it
> > > up to Jesper and Chuck which interface they want to use. In particular,
> > > it would be preferred if the array has no valid struct pages in it but
> > > it's up to them to judge how practical that is.  
> > 
> > I'm interested to hear from Jesper.
> > 
> > My two cents (US):
> > 
> > If svc_alloc_arg() is the /only/ consumer that wants to fill
> > a partially populated array of page pointers, then there's no
> > code-duplication benefit to changing the synopsis of
> > alloc_pages_bulk() at this point.
> > 
> > Also, if the consumers still have to pass in the number of
> > pages the array needs, rather than having the bulk allocator
> > figure it out, then there's not much additional benefit, IMO.
> > 
> > Ideally (for SUNRPC) alloc_pages_bulk() would take a pointer
> > to a sparsely-populated array and the total number of elements
> > in that array, and fill in the NULL elements. The return value
> > would be "success -- all elements are populated" or "failure --
> > some elements remain NULL".
> >   
> 
> If the array API interface was expected to handle sparse arrays, it would
> make sense to define nr_pages as the number of pages that need to be
> in the array instead of the number of pages to allocate. The preamble
> would skip the first N already-allocated pages and decrement nr_pages
> accordingly before the watermark check. The return value would then be the
> last populated array element and the caller decides if that is enough to
> proceed or if the API needs to be called again. There is a slight risk
> that a sparse array that only needs 1 page in reality would fail
> the watermark check but on low memory, allocations take more work anyway.
> That definition of nr_pages would avoid the potential buffer overrun but
> both you and Jesper would need to agree that it's an appropriate API.

I actually like the idea of doing it this way, even though the
page_pool fast-path (__page_pool_get_cached()) doesn't clear/mark the
"consumed" elements with NULL.  I'm ready to change page_pool to handle
this when calling this API, as I think it will be faster than walking
the linked list.

Even though my page_pool use-case doesn't have a sparse array to
populate (like NFS/SUNRPC), I can still use the API that Chuck is
suggesting. Thus, I'm fine with this :-)


(P.S. I'm working on implementing Alexander Duyck's suggestions, but I
don't have it ready yet; I will try to send a new patch tomorrow. And I
do realize that with this API change I have to reimplement it again,
but as long as we make forward progress, I'll happily do it.)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

/* fast path */
static struct page *__page_pool_get_cached(struct page_pool *pool)
{
struct page *page;

/* Caller MUST guarantee safe non-concurrent access, e.g. softirq */
if (likely(pool->alloc.count)) {
/* Fast-path */
page = pool->alloc.cache[--pool->alloc.count];
} else {
page = page_pool_refill_alloc_cache(pool);
}

return page;
}



Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-15 Thread Mel Gorman
On Sun, Mar 14, 2021 at 03:22:02PM +, Chuck Lever III wrote:
> >> Anyway, I'm not arguing against a bulk allocator, nor even saying this
> >> is a bad interface.  It just maybe could be better.
> >> 
> > 
> > I think it puts more responsibility on the caller to use the API correctly
> > but I also see no value in arguing about it further because there is no
> > supporting data either way (I don't have routine access to a sufficiently
> > fast network to generate the data). I can add the following patch and let
> > callers figure out which interface is preferred. If one of the interfaces
> > is dead in a year, it can be removed.
> > 
> > As there are a couple of ways the arrays could be used, I'm leaving it
> > up to Jesper and Chuck which interface they want to use. In particular,
> > it would be preferred if the array has no valid struct pages in it but
> > it's up to them to judge how practical that is.
> 
> I'm interested to hear from Jesper.
> 
> My two cents (US):
> 
> If svc_alloc_arg() is the /only/ consumer that wants to fill
> a partially populated array of page pointers, then there's no
> code-duplication benefit to changing the synopsis of
> alloc_pages_bulk() at this point.
> 
> Also, if the consumers still have to pass in the number of
> pages the array needs, rather than having the bulk allocator
> figure it out, then there's not much additional benefit, IMO.
> 
> Ideally (for SUNRPC) alloc_pages_bulk() would take a pointer
> to a sparsely-populated array and the total number of elements
> in that array, and fill in the NULL elements. The return value
> would be "success -- all elements are populated" or "failure --
> some elements remain NULL".
> 

If the array API interface was expected to handle sparse arrays, it would
make sense to define nr_pages as the number of pages that need to be
in the array instead of the number of pages to allocate. The preamble
would skip the first N already-allocated pages and decrement nr_pages
accordingly before the watermark check. The return value would then be the
last populated array element and the caller decides if that is enough to
proceed or if the API needs to be called again. There is a slight risk
that a sparse array that only needs 1 page in reality would fail
the watermark check but on low memory, allocations take more work anyway.
That definition of nr_pages would avoid the potential buffer overrun but
both you and Jesper would need to agree that it's an appropriate API.
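
A minimal sketch of that preamble, assuming the array variant and treating
nr_pages as the total number of elements that must end up populated (the
helper name is made up for illustration):

	/*
	 * Illustrative only: count how many leading elements the caller has
	 * already populated so that only the remainder is allocated and the
	 * watermark check is scaled accordingly.
	 */
	static int bulk_count_populated(struct page **page_array, int nr_pages)
	{
		int nr_populated = 0;

		while (page_array && nr_populated < nr_pages &&
		       page_array[nr_populated])
			nr_populated++;

		return nr_populated;
	}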

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-14 Thread Chuck Lever III



> On Mar 14, 2021, at 8:52 AM, Mel Gorman  wrote:
> 
> On Sat, Mar 13, 2021 at 07:33:43PM +, Matthew Wilcox wrote:
>> On Sat, Mar 13, 2021 at 04:56:31PM +, Chuck Lever III wrote:
>>> IME lists are indeed less CPU-efficient, but I wonder if that
>>> expense is insignificant compared to serialization primitives like
>>> disabling and re-enabling IRQs, which we are avoiding by using
>>> bulk page allocation.
>> 
>> Cache misses are a worse problem than serialisation.  Paul McKenney had
>> a neat demonstration where he took a sheet of toilet paper to represent
>> an instruction, and then unrolled two rolls of toilet paper around the
>> lecture theatre to represent an L3 cache miss.  Obviously a serialising
>> instruction is worse than an add instruction, but i'm thinking maybe
>> 50-100 sheets of paper, not an entire roll?
>> 
> 
> I'm well aware of the advantages of arrays over lists. The reality is that
> the penalty is incurred unconditionally as the pages have to be removed
> from the per-cpu or buddy lists and the cache footprint of the allocator
> and the data copies are already large. It's also the case that bulk free
> interfaces already exist that operate on lists (free_unref_page_list)
> so there is existing precedent. The bulk free API in this series was not
> used by the callers so I've deleted it.
> 
> Obviously the callers would need to be adjusted to use the array
> interface. The sunrpc user has an array but it is coded in a way that
> expects the array could be partially populated or has holes so the API has
> to skip populated elements. The caller is responsible for making sure that
> there are enough NULL elements available to store nr_pages or the buffer
> overruns. nr_elements could be passed in to avoid the buffer overrun but
> then further logic is needed to distinguish between a failed allocation
> and a failure to have enough space in the array to store the pointer.
> It also means that prep_new_page() should not be deferred outside of
> the IRQ disabled section as it does not have the storage to track which
> pages were freshly allocated and which ones were already on the array. It
> could be tracked using the lower bit of the pointer but that is not free
> either. Ideally the callers simply would ensure the array does not have
> valid struct page pointers in it already so prepping the new page could
> always be deferred.  Obviously the callers are also responsible for
> protecting the array from parallel access if necessary while
> calling into the allocator.
> 
>> Anyway, I'm not arguing against a bulk allocator, nor even saying this
>> is a bad interface.  It just maybe could be better.
>> 
> 
> I think it puts more responsibility on the caller to use the API correctly
> but I also see no value in arguing about it further because there is no
> supporting data either way (I don't have routine access to a sufficiently
> fast network to generate the data). I can add the following patch and let
> callers figure out which interface is preferred. If one of the interfaces
> is dead in a year, it can be removed.
> 
> As there are a couple of ways the arrays could be used, I'm leaving it
> up to Jesper and Chuck which interface they want to use. In particular,
> it would be preferred if the array has no valid struct pages in it but
> it's up to them to judge how practical that is.

I'm interested to hear from Jesper.

My two cents (US):

If svc_alloc_arg() is the /only/ consumer that wants to fill
a partially populated array of page pointers, then there's no
code-duplication benefit to changing the synopsis of
alloc_pages_bulk() at this point.

Also, if the consumers still have to pass in the number of
pages the array needs, rather than having the bulk allocator
figure it out, then there's not much additional benefit, IMO.

Ideally (for SUNRPC) alloc_pages_bulk() would take a pointer
to a sparsely-populated array and the total number of elements
in that array, and fill in the NULL elements. The return value
would be "success -- all elements are populated" or "failure --
some elements remain NULL".

But again, if no other consumer finds that useful, or that API
design obscures the performance benefits of the bulk allocator,
I can be very happy with the list-centric API. My interest in
this part of the exercise is simply to reduce the overall amount
of new complexity across mm/ and consumers of the bulk allocator.
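
For concreteness, the shape I have in mind is roughly the following,
purely as an illustration of the semantics and not a proposal for the
final signature:

	/*
	 * Fill the NULL slots of a sparsely-populated array of
	 * nr_elements page pointers. Returns 0 if every slot is
	 * populated, or a negative errno if some slots remain NULL.
	 */
	int alloc_pages_bulk(gfp_t gfp, unsigned long nr_elements,
			     struct page **page_array);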


> Patch is only lightly tested with a poor conversion of the sunrpc code
> to use the array interface.
> 
> ---8<---
> mm/page_alloc: Add an array-based interface to the bulk page allocator
> 
> The existing callers for the bulk allocator are storing the pages in
> arrays. This patch adds an array-based interface to the API to avoid
> multiple list iterations. The page list interface is preserved to
> avoid requiring all users of the bulk API to allocate and manage
> enough storage to store the pages.
> 
> Signed-off-by: Mel Gorman 
> 
> diff --git 

Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-14 Thread Mel Gorman
On Sat, Mar 13, 2021 at 07:33:43PM +, Matthew Wilcox wrote:
> On Sat, Mar 13, 2021 at 04:56:31PM +, Chuck Lever III wrote:
> > IME lists are indeed less CPU-efficient, but I wonder if that
> > expense is insignificant compared to serialization primitives like
> > disabling and re-enabling IRQs, which we are avoiding by using
> > bulk page allocation.
> 
> Cache misses are a worse problem than serialisation.  Paul McKenney had
> a neat demonstration where he took a sheet of toilet paper to represent
> an instruction, and then unrolled two rolls of toilet paper around the
> lecture theatre to represent an L3 cache miss.  Obviously a serialising
> instruction is worse than an add instruction, but i'm thinking maybe
> 50-100 sheets of paper, not an entire roll?
> 

I'm well aware of the advantages of arrays over lists. The reality is that
the penalty is incurred unconditionally as the pages have to be removed
from the per-cpu or buddy lists and the cache footprint of the allocator
and the data copies are already large. It's also the case that bulk free
interfaces already exist that operate on lists (free_unref_page_list)
so there is existing precedent. The bulk free API in this series was not
used by the callers so I've deleted it.

Obviously the callers would need to be adjusted to use the array
interface. The sunrpc user has an array but it is coded in a way that
expects the array may be partially populated or have holes so the API has
to skip populated elements. The caller is responsible for making sure that
there are enough NULL elements available to store nr_pages or the buffer
overruns. nr_elements could be passed in to avoid the buffer overrun but
then further logic is needed to distinguish between a failed allocation
and a failure to have enough space in the array to store the pointer.
It also means that prep_new_page() should not be deferred outside of
the IRQ disabled section as it does not have the storage to track which
pages were freshly allocated and which ones were already on the array. It
could be tracked using the lower bit of the pointer but that is not free
either. Ideally the callers simply would ensure the array does not have
valid struct page pointers in it already so prepping the new page could
always be deferred.  Obviously the callers are also responsible for
protecting the array from parallel access if necessary while
calling into the allocator.

> Anyway, I'm not arguing against a bulk allocator, nor even saying this
> is a bad interface.  It just maybe could be better.
> 

I think it puts more responsibility on the caller to use the API correctly
but I also see no value in arguing about it further because there is no
supporting data either way (I don't have routine access to a sufficiently
fast network to generate the data). I can add the following patch and let
callers figure out which interface is preferred. If one of the interfaces
is dead in a year, it can be removed.

As there are a couple of ways the arrays could be used, I'm leaving it
up to Jesper and Chuck which interface they want to use. In particular,
it would be preferred if the array has no valid struct pages in it but
it's up to them to judge how practical that is.

Patch is only lightly tested with a poor conversion of the sunrpc code
to use the array interface.

---8<---
mm/page_alloc: Add an array-based interface to the bulk page allocator

The existing callers for the bulk allocator are storing the pages in
arrays. This patch adds an array-based interface to the API to avoid
multiple list iterations. The page list interface is preserved to
avoid requiring all users of the bulk API to allocate and manage
enough storage to store the pages.

Signed-off-by: Mel Gorman 

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4a304fd39916..fb6234e1fe59 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -520,13 +520,20 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, 
int preferred_nid,
 
 int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nodemask_t *nodemask, int nr_pages,
-   struct list_head *list);
+   struct list_head *page_list,
+   struct page **page_array);
 
 /* Bulk allocate order-0 pages */
 static inline unsigned long
-alloc_pages_bulk(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
+alloc_pages_bulk_list(gfp_t gfp, unsigned long nr_pages, struct list_head 
*list)
 {
-   return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list);
+   return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list, 
NULL);
+}
+
+static inline unsigned long
+alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page 
**page_array)
+{
+   return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, 
page_array);
 }
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e0c87c588d3..96590f0726c7 100644
--- 
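
For illustration, a minimal sketch of how a caller might use either
entry point from the gfp.h hunk above (error handling and page use
elided; NR_PAGES is just a placeholder, this is not part of the patch):

	LIST_HEAD(page_list);
	struct page *page_array[NR_PAGES] = { NULL };
	unsigned long nr;

	/* List interface: pages are linked through page->lru. */
	nr = alloc_pages_bulk_list(GFP_KERNEL, NR_PAGES, &page_list);

	/* Array interface: the caller supplies the storage. */
	nr = alloc_pages_bulk_array(GFP_KERNEL, NR_PAGES, page_array);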

Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-13 Thread Matthew Wilcox
On Sat, Mar 13, 2021 at 04:56:31PM +, Chuck Lever III wrote:
> IME lists are indeed less CPU-efficient, but I wonder if that
> expense is insignificant compared to serialization primitives like
> disabling and re-enabling IRQs, which we are avoiding by using
> bulk page allocation.

Cache misses are a worse problem than serialisation.  Paul McKenney had
a neat demonstration where he took a sheet of toilet paper to represent
an instruction, and then unrolled two rolls of toilet paper around the
lecture theatre to represent an L3 cache miss.  Obviously a serialising
instruction is worse than an add instruction, but i'm thinking maybe
50-100 sheets of paper, not an entire roll?

Anyway, I'm not arguing against a bulk allocator, nor even saying this
is a bad interface.  It just maybe could be better.

> My initial experience with the current interface left me feeling
> uneasy about re-using the lru list field. That seems to expose an
> internal API feature to consumers of the page allocator. If we
> continue with a list-centric bulk allocator API I hope there can
> be some conveniently-placed documentation that explains when it is
> safe to use that field. Or perhaps the field should be renamed.

Heh.  Spoken like a filesystem developer who's never been exposed to the
->readpages API (it's almost dead).  It's fairly common in the memory
management world to string pages together through the lru list_head.
Slab does it, as does put_pages_list() in mm/swap.c.  It's natural for
Mel to keep using this pattern ... and I dislike it intensely.

> I have a mild preference for an array-style interface because that's
> more natural for the NFSD consumer, but I'm happy to have a bulk
> allocator either way. Purely from a code-reuse point of view, I
> wonder how many consumers of alloc_pages_bulk() will be like
> svc_alloc_arg(), where they need to fill in pages in an array. Each
> such consumer would need to repeat the logic to convert the returned
> list into an array. We have, for instance, release_pages(), which is
> an array-centric page allocator API. Maybe a helper function or two
> might prevent duplication of the list conversion logic.
> 
> And I agree with Mel that passing a single large array seems more
> useful than having to build code at each consumer call-site to
> iterate over smaller page_vecs until that array is filled.

So how about this?

You provide the interface you'd _actually_ like to use (array-based) and
implement it on top of Mel's lru-list implementation.  If it's general
enough to be used by Jesper's use-case, we lift it to page_alloc.c.
If we go a year and there are no users of the lru-list interface, we
can just change the implementation.
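
As a starting point, something along these lines (an untested sketch of
an array wrapper layered on the list-based __alloc_pages_bulk() in this
series):

	static unsigned long
	alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages,
			       struct page **page_array)
	{
		LIST_HEAD(list);
		struct page *page, *next;
		unsigned long i = 0;

		__alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, &list);

		/* Move the pages from the list into the caller's array. */
		list_for_each_entry_safe(page, next, &list, lru) {
			list_del(&page->lru);
			page_array[i++] = page;
		}

		return i;
	}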



Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-13 Thread Chuck Lever III



> On Mar 13, 2021, at 11:39 AM, Matthew Wilcox  wrote:
> 
> On Sat, Mar 13, 2021 at 01:16:48PM +, Mel Gorman wrote:
>>> I'm not claiming the pagevec is definitely a win, but it's very
>>> unclear which tradeoff is actually going to lead to better performance.
>>> Hopefully Jesper or Chuck can do some tests and figure out what actually
>>> works better with their hardware & usage patterns.
>> 
>> The NFS user is often going to need to make round trips to get the pages it
>> needs. The pagevec would have to be copied into the target array meaning
>> it's not much better than a list manipulation.
> 
> I don't think you fully realise how bad CPUs are at list manipulation.
> See the attached program (and run it on your own hardware).  On my
> less-than-a-year-old core-i7:
> 
> $ gcc -W -Wall -O2 -g array-vs-list.c -o array-vs-list
> $ ./array-vs-list 
> walked sequential array in 0.001765s
> walked sequential list in 0.002920s
> walked sequential array in 0.001777s
> walked shuffled list in 0.081955s
> walked shuffled array in 0.007367s
> 
> If you happen to get the objects in-order, it's only 64% worse to walk
> a list than an array.  If they're out of order, it's *11.1* times as bad.
> 

IME lists are indeed less CPU-efficient, but I wonder if that
expense is insignificant compared to serialization primitives like
disabling and re-enabling IRQs, which we are avoiding by using
bulk page allocation.

My initial experience with the current interface left me feeling
uneasy about re-using the lru list field. That seems to expose an
internal API feature to consumers of the page allocator. If we
continue with a list-centric bulk allocator API I hope there can
be some conveniently-placed documentation that explains when it is
safe to use that field. Or perhaps the field should be renamed.

I have a mild preference for an array-style interface because that's
more natural for the NFSD consumer, but I'm happy to have a bulk
allocator either way. Purely from a code-reuse point of view, I
wonder how many consumers of alloc_pages_bulk() will be like
svc_alloc_arg(), where they need to fill in pages in an array. Each
such consumer would need to repeat the logic to convert the returned
list into an array. We have, for instance, release_pages(), which is
an array-centric page allocator API. Maybe a helper function or two
might prevent duplication of the list conversion logic.

And I agree with Mel that passing a single large array seems more
useful than having to build code at each consumer call-site to
iterate over smaller page_vecs until that array is filled.


--
Chuck Lever





Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-13 Thread Matthew Wilcox
On Sat, Mar 13, 2021 at 01:16:48PM +, Mel Gorman wrote:
> > I'm not claiming the pagevec is definitely a win, but it's very
> > unclear which tradeoff is actually going to lead to better performance.
> > Hopefully Jesper or Chuck can do some tests and figure out what actually
> > works better with their hardware & usage patterns.
> 
> The NFS user is often going to need to make round trips to get the pages it
> needs. The pagevec would have to be copied into the target array meaning
> it's not much better than a list manipulation.

I don't think you fully realise how bad CPUs are at list manipulation.
See the attached program (and run it on your own hardware).  On my
less-than-a-year-old core-i7:

$ gcc -W -Wall -O2 -g array-vs-list.c -o array-vs-list
$ ./array-vs-list 
walked sequential array in 0.001765s
walked sequential list in 0.002920s
walked sequential array in 0.001777s
walked shuffled list in 0.081955s
walked shuffled array in 0.007367s

If you happen to get the objects in-order, it's only 64% worse to walk
a list than an array.  If they're out of order, it's *11.1* times as bad.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

unsigned long count = 1000 * 1000;
unsigned int verbose;

struct object {
struct object *next;
struct object *prev;
unsigned int value;
};

#define printv(level, fmt, ...) \
if (level <= verbose) { printf(fmt, ##__VA_ARGS__); }

void check_total(unsigned long total)
{
if (total * 2 != count * (count + 1))
printf("Check your staging! (%lu %lu)\n", total, count);
}

void alloc_objs(struct object **array)
{
unsigned long i;

for (i = 0; i < count; i++) {
struct object *obj = malloc(sizeof(*obj));

obj->value = i + 1;
/* Add to the array */
array[i] = obj;
}
}

void shuffle(struct object **array, unsigned long seed)
{
unsigned long i;

printv(1, "random seed %lu\n", seed);
srand48(seed);

/* Shuffle the array */
for (i = 1; i < count; i++) {
struct object *obj;
unsigned long j = (unsigned int)mrand48() % (i + 1);

if (i == j)
continue;
obj = array[j];
array[j] = array[i];
array[i] = obj;
}
}

void create_list(struct object **array, struct object *list)
{
unsigned long i;

list->next = list;
list->prev = list;

for (i = 0; i < count; i++) {
struct object *obj = array[i];
/* Add to the tail of the list */
obj->next = list;
obj->prev = list->prev;
list->prev->next = obj;
list->prev = obj;
}
}

void walk_list(struct object *list)
{
unsigned long total = 0;
struct object *obj;

for (obj = list->next; obj != list; obj = obj->next) {
total += obj->value;
}

check_total(total);
}

void walk_array(struct object **array)
{
unsigned long total = 0;
unsigned long i;

for (i = 0; i < count; i++) {
total += array[i]->value;
}

check_total(total);
}

/* time2 - time1 */
double diff_time(struct timespec *time1, struct timespec *time2)
{
double result = time2->tv_nsec - time1->tv_nsec;

return time2->tv_sec - time1->tv_sec + result / 1000 / 1000 / 1000;
}

int main(int argc, char **argv)
{
int opt;
unsigned long seed = time(NULL);
struct object **array;
struct object list;
struct timespec time1, time2;

while ((opt = getopt(argc, argv, "c:s:v")) != -1) {
if (opt == 'c')
count *= strtoul(optarg, NULL, 0);
else if (opt == 's')
seed = strtoul(optarg, NULL, 0);
else if (opt == 'v')
verbose++;
}

clock_gettime(CLOCK_MONOTONIC, &time1);
array = calloc(count, sizeof(void *));
alloc_objs(array);
clock_gettime(CLOCK_MONOTONIC, &time2);
printv(1, "allocated %lu items in %fs\n", count,
diff_time(&time1, &time2));

clock_gettime(CLOCK_MONOTONIC, &time1);
walk_array(array);
clock_gettime(CLOCK_MONOTONIC, &time2);
printf("walked sequential array in %fs\n",
diff_time(&time1, &time2));

clock_gettime(CLOCK_MONOTONIC, &time1);
create_list(array, &list);
clock_gettime(CLOCK_MONOTONIC, &time2);
printv(1, "created list in %fs\n",
diff_time(&time1, &time2));

clock_gettime(CLOCK_MONOTONIC, &time1);
walk_list(&list);
clock_gettime(CLOCK_MONOTONIC, &time2);
printf("walked sequential list in %fs\n",
diff_time(&time1, &time2));

clock_gettime(CLOCK_MONOTONIC, &time1);
walk_array(array);
clock_gettime(CLOCK_MONOTONIC, &time2);
printf("walked 

Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-13 Thread Mel Gorman
On Fri, Mar 12, 2021 at 09:08:23PM +, Matthew Wilcox wrote:
> > > > The result of the API is to deliver pages as a double-linked list via
> > > > LRU (page->lru member).  If you are planning to use llist, then how to
> > > > handle this API change later?
> > > > 
> > > > Have you noticed that the two users store the struct-page pointers in an
> > > > array?  We could have the caller provide the array to store struct-page
> > > > pointers, like we do with kmem_cache_alloc_bulk API.
> > > 
> > > My preference would be for a pagevec.  That does limit you to 15 pages
> > > per call [1], but I do think that might be enough.  And the overhead of
> > > manipulating a linked list isn't free.
> > > 
> > 
> > I'm opposed to a pagevec because it unnecessarily limits the caller.  The
> > sunrpc user for example knows how many pages it needs at the time the bulk
> > allocator is called but it's not the same value every time.  When tracing,
> > I found it sometimes requested 1 page (most common request actually) and
> > other times requested 200+ pages. Forcing it to call the batch allocator
> > in chunks of 15 means the caller incurs the cost of multiple allocation
> > requests which is almost as bad as calling __alloc_pages in a loop.
> 
> Well, no.  It reduces the cost by a factor of 15 -- or by 93%.  200 is
> an interesting example because putting 200 pages on a list costs 200 *
> 64 bytes of dirty cachelines, or 12KiB. 

That's a somewhat limited view. Yes, the overall cost gets reduced by
some factor but forcing the caller to limit the batch sizes incurs an
unnecessary cost. The SUNRPC user is particularly relevant as it cannot
make progress until it gets all the pages it requests -- it sleeps if
it cannot get the pages it needs. The whole point of the bulk allocator
is to avoid multiple round-trips through the page allocator. Forcing a
limit in the API requiring multiple round trips is just weird.

> That's larger than some CPU L1
> caches (mine's 48KB, 12-way set associative), but I think it's safe to say
> some of those 200 cache lines are going to force others out into L2 cache.
> Compared to a smaller batch of 15 pages in a pagevec, it'll dirty two cache
> lines (admittedly the 15 struct pages are also going to get dirtied by being
> allocated and then by being set up for whatever use they're getting, but
> they should stay in L1 cache while that's happening).
> 

The cache footprint is irrelevant if the caller *requires* the pages. If
the caller has to zero the pages then the cache gets thrashed anyway.
Even if non-temporal zeroing was used, the cache is likely thrashed by the
data copies. The page allocator in general is a cache nightmare because
of the number of cache lines it potentially dirties, particularly if it
has to call into the buddy allocator to split/merge pages for allocations
and frees respectively.

> I'm not claiming the pagevec is definitely a win, but it's very
> unclear which tradeoff is actually going to lead to better performance.
> Hopefully Jesper or Chuck can do some tests and figure out what actually
> works better with their hardware & usage patterns.
> 

The NFS user is often going to need to make round trips to get the pages it
needs. The pagevec would have to be copied into the target array meaning
it's not much better than a list manipulation.

Pagevecs are a bad interface in general simply because they put hard
constraints on how many pages can be bulk allocated. Pagevecs are
primarily there to avoid excessive LRU lock acquisition and they are
bad at the job. These days, the LRU lock protects such a massive amount
of data that the pagevec is barely a band aid. Increasing its size just
shifts the problem slightly. I see very little value in introducing a
fundamental limitation into the bulk allocator by mandating pagevecs.

Now, I can see a case where the API moves to using arrays when there is a
user that is such a common hot path and using arrays that it is justified
but we're not there yet. The two callers are somewhat of corner cases and
both of them are limited by wire speed of networking. Not all users may
require arrays -- SLUB using batched order-0 pages on a high-allocation
failure for example would not need an array. Such an intensively hot user
does not currently exist so it's premature to even consider it.

> > I think the first version should have an easy API to start with. Optimise
> > the implementation if it is a bottleneck. Only make the API harder to
> > use if the callers are really willing to always allocate and size the
> > array in advance and it's shown that it really makes a big difference
> > performance-wise.
> 
> I'm not entirely sure that a pagevec is harder to use than a list_head.

Leaving aside the limitations of pagevecs, arrays get messy if the caller
does not necessarily use all the pages returned by the allocator. The
arrays would need to be tracked and/or preserved for some time. The
order pages are taken out of the array matters 

Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-12 Thread Matthew Wilcox
On Fri, Mar 12, 2021 at 04:03:50PM +, Mel Gorman wrote:
> On Fri, Mar 12, 2021 at 02:58:14PM +, Matthew Wilcox wrote:
> > On Fri, Mar 12, 2021 at 12:46:09PM +0100, Jesper Dangaard Brouer wrote:
> > > In my page_pool patch I'm bulk allocating 64 pages. I wanted to ask if
> > > this is too much? (PP_ALLOC_CACHE_REFILL=64).
> > > 
> > > The mlx5 driver has a while loop for allocating 64 pages, which it
> > > used in this case, that is why 64 is chosen.  If we choose a lower
> > > bulk number, then the bulk-alloc will just be called more times.
> > 
> > The thing about batching is that smaller batches are often better.
> > Let's suppose you need to allocate 100 pages for something, and the page
> > allocator takes up 90% of your latency budget.  Batching just ten pages
> > at a time is going to reduce the overhead to 9%.  Going to 64 pages
> > reduces the overhead from 9% to 2% -- maybe that's important, but
> > possibly not.
> > 
> 
> I do not think that something like that can be properly assessed in
> advance. It heavily depends on whether the caller is willing to amortise
> the cost of the batch allocation or if the timing of the bulk request is
> critical every single time.
> 
> > > The result of the API is to deliver pages as a double-linked list via
> > > LRU (page->lru member).  If you are planning to use llist, then how to
> > > handle this API change later?
> > > 
> > > Have you noticed that the two users store the struct-page pointers in an
> > > array?  We could have the caller provide the array to store struct-page
> > > pointers, like we do with kmem_cache_alloc_bulk API.
> > 
> > My preference would be for a pagevec.  That does limit you to 15 pages
> > per call [1], but I do think that might be enough.  And the overhead of
> > manipulating a linked list isn't free.
> > 
> 
> I'm opposed to a pagevec because it unnecessarily limits the caller.  The
> sunrpc user for example knows how many pages it needs at the time the bulk
> allocator is called but it's not the same value every time.  When tracing,
> I found it sometimes requested 1 page (most common request actually) and
> other times requested 200+ pages. Forcing it to call the batch allocator
> in chunks of 15 means the caller incurs the cost of multiple allocation
> requests which is almost as bad as calling __alloc_pages in a loop.

Well, no.  It reduces the cost by a factor of 15 -- or by 93%.  200 is
an interesting example because putting 200 pages on a list costs 200 *
64 bytes of dirty cachelines, or 12KiB.  That's larger than some CPU L1
caches (mine's 48KB, 12-way set associative), but I think it's safe to say
some of those 200 cache lines are going to force others out into L2 cache.
Compared to a smaller batch of 15 pages in a pagevec, it'll dirty two cache
lines (admittedly the 15 struct pages are also going to get dirtied by being
allocated and then by being set up for whatever use they're getting, but
they should stay in L1 cache while that's happening).

I'm not claiming the pagevec is definitely a win, but it's very
unclear which tradeoff is actually going to lead to better performance.
Hopefully Jesper or Chuck can do some tests and figure out what actually
works better with their hardware & usage patterns.

> I think the first version should have an easy API to start with. Optimise
> the implementation if it is a bottleneck. Only make the API harder to
> use if the callers are really willing to always allocate and size the
> array in advance and it's shown that it really makes a big difference
> performance-wise.

I'm not entirely sure that a pagevec is harder to use than a list_head.


Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-12 Thread Mel Gorman
On Fri, Mar 12, 2021 at 02:58:14PM +, Matthew Wilcox wrote:
> On Fri, Mar 12, 2021 at 12:46:09PM +0100, Jesper Dangaard Brouer wrote:
> > In my page_pool patch I'm bulk allocating 64 pages. I wanted to ask if
> > this is too much? (PP_ALLOC_CACHE_REFILL=64).
> > 
> > The mlx5 driver has a while loop for allocating 64 pages, which it
> > used in this case, that is why 64 is chosen.  If we choose a lower
> > bulk number, then the bulk-alloc will just be called more times.
> 
> The thing about batching is that smaller batches are often better.
> Let's suppose you need to allocate 100 pages for something, and the page
> allocator takes up 90% of your latency budget.  Batching just ten pages
> at a time is going to reduce the overhead to 9%.  Going to 64 pages
> reduces the overhead from 9% to 2% -- maybe that's important, but
> possibly not.
> 

I do not think that something like that can be properly assessed in
advance. It heavily depends on whether the caller is willing to amortise
the cost of the batch allocation or if the timing of the bulk request is
critical every single time.

> > The result of the API is to deliver pages as a double-linked list via
> > LRU (page->lru member).  If you are planning to use llist, then how to
> > handle this API change later?
> > 
> > Have you noticed that the two users store the struct-page pointers in an
> > array?  We could have the caller provide the array to store struct-page
> > pointers, like we do with kmem_cache_alloc_bulk API.
> 
> My preference would be for a pagevec.  That does limit you to 15 pages
> per call [1], but I do think that might be enough.  And the overhead of
> manipulating a linked list isn't free.
> 

I'm opposed to a pagevec because it unnecessarily limits the caller.  The
sunrpc user for example knows how many pages it needs at the time the bulk
allocator is called but it's not the same value every time.  When tracing,
I found it sometimes requested 1 page (most common request actually) and
other times requested 200+ pages. Forcing it to call the batch allocator
in chunks of 15 means the caller incurs the cost of multiple allocation
requests which is almost as bad as calling __alloc_pages in a loop.

I think the first version should have an easy API to start with. Optimise
the implementation if it is a bottleneck. Only make the API harder to
use if the callers are really willing to always allocate and size the
array in advance and it's shown that it really makes a big difference
performance-wise.

-- 
Mel Gorman
SUSE Labs


[PATCH 3/7] mm/page_alloc: Add a bulk page allocator

2021-03-12 Thread Mel Gorman
This patch adds a new page allocator interface via alloc_pages_bulk,
and __alloc_pages_bulk_nodemask. A caller requests a number of pages
to be allocated and added to a list. They can be freed in bulk using
free_pages_bulk().

The API is not guaranteed to return the requested number of pages and
may fail if the preferred allocation zone has limited free memory, the
cpuset changes during the allocation or page debugging decides to fail
an allocation. It's up to the caller to request more pages in batch
if necessary.

Note that this implementation is not very efficient and could be improved
but it would require refactoring. The intent is to make it available early
to determine what semantics are required by different callers. Once the
full semantics are nailed down, it can be refactored.
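
As a purely illustrative example of the intended calling pattern (not
part of the patch, the requested count is arbitrary):

	LIST_HEAD(page_list);
	unsigned long allocated;

	allocated = alloc_pages_bulk(GFP_KERNEL, 16, &page_list);
	/* ... use the 'allocated' pages linked via page->lru ... */
	free_pages_bulk(&page_list);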

Signed-off-by: Mel Gorman 
---
 include/linux/gfp.h |  12 +
 mm/page_alloc.c | 116 
 2 files changed, 128 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0a88f84b08f4..e2cd98dba72e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -518,6 +518,17 @@ static inline int arch_make_page_accessible(struct page 
*page)
 struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
 
+int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+   nodemask_t *nodemask, int nr_pages,
+   struct list_head *list);
+
+/* Bulk allocate order-0 pages */
+static inline unsigned long
+alloc_pages_bulk(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
+{
+   return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list);
+}
+
 /*
  * Allocate pages, preferring the node given as nid. The node must be valid and
  * online. For more general interface, see alloc_pages_node().
@@ -581,6 +592,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t 
size, gfp_t gfp_mask);
 
 extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
+extern void free_pages_bulk(struct list_head *list);
 
 struct page_frag_cache;
 extern void __page_frag_cache_drain(struct page *page, unsigned int count);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 880b1d6368bd..f48f94375b66 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4436,6 +4436,21 @@ static void wake_all_kswapds(unsigned int order, gfp_t 
gfp_mask,
}
 }
 
+/* Drop reference counts and free order-0 pages from a list. */
+void free_pages_bulk(struct list_head *list)
+{
+   struct page *page, *next;
+
+   list_for_each_entry_safe(page, next, list, lru) {
+   trace_mm_page_free_batched(page);
+   if (put_page_testzero(page)) {
+   list_del(&page->lru);
+   __free_pages_ok(page, 0, FPI_NONE);
+   }
+   }
+}
+EXPORT_SYMBOL_GPL(free_pages_bulk);
+
 static inline unsigned int
 gfp_to_alloc_flags(gfp_t gfp_mask)
 {
@@ -4963,6 +4978,107 @@ static inline bool prepare_alloc_pages(gfp_t gfp, 
unsigned int order,
return true;
 }
 
+/*
+ * This is a batched version of the page allocator that attempts to
+ * allocate nr_pages quickly from the preferred zone and add them to list.
+ *
+ * Returns the number of pages allocated.
+ */
+int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+   nodemask_t *nodemask, int nr_pages,
+   struct list_head *alloc_list)
+{
+   struct page *page;
+   unsigned long flags;
+   struct zone *zone;
+   struct zoneref *z;
+   struct per_cpu_pages *pcp;
+   struct list_head *pcp_list;
+   struct alloc_context ac;
+   gfp_t alloc_gfp;
+   unsigned int alloc_flags;
+   int allocated = 0;
+
+   if (WARN_ON_ONCE(nr_pages <= 0))
+   return 0;
+
+   if (nr_pages == 1)
+   goto failed;
+
+   /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
+   if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask, &ac,
+   &alloc_gfp, &alloc_flags))
+   return 0;
+   gfp = alloc_gfp;
+
+   /* Find an allowed local zone that meets the high watermark. */
+   for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, 
ac.highest_zoneidx, ac.nodemask) {
+   unsigned long mark;
+
+   if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
+   !__cpuset_zone_allowed(zone, gfp)) {
+   continue;
+   }
+
+   if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
+   zone_to_nid(zone) != 
zone_to_nid(ac.preferred_zoneref->zone)) {
+   goto failed;
+   }
+
+   mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + 
nr_pages;
+   if (zone_watermark_fast(zone, 0,  mark,
+   

Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-12 Thread Matthew Wilcox
On Fri, Mar 12, 2021 at 12:46:09PM +0100, Jesper Dangaard Brouer wrote:
> In my page_pool patch I'm bulk allocating 64 pages. I wanted to ask if
> this is too much? (PP_ALLOC_CACHE_REFILL=64).
> 
> The mlx5 driver has a while loop for allocating 64 pages, which it
> used in this case, that is why 64 is chosen.  If we choose a lower
> bulk number, then the bulk-alloc will just be called more times.

The thing about batching is that smaller batches are often better.
Let's suppose you need to allocate 100 pages for something, and the page
allocator takes up 90% of your latency budget.  Batching just ten pages
at a time is going to reduce the overhead to 9%.  Going to 64 pages
reduces the overhead from 9% to 2% -- maybe that's important, but
possibly not.

> The result of the API is to deliver pages as a double-linked list via
> LRU (page->lru member).  If you are planning to use llist, then how to
> handle this API change later?
> 
> Have you noticed that the two users store the struct-page pointers in an
> array?  We could have the caller provide the array to store struct-page
> pointers, like we do with kmem_cache_alloc_bulk API.

My preference would be for a pagevec.  That does limit you to 15 pages
per call [1], but I do think that might be enough.  And the overhead of
manipulating a linked list isn't free.

[1] patches exist to increase this, because it turns out that 15 may
not be enough for all systems!  but it would limit to 255 as an absolute
hard cap.


Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-12 Thread Mel Gorman
On Fri, Mar 12, 2021 at 12:43:31PM +, Matthew Wilcox wrote:
> On Wed, Mar 10, 2021 at 10:46:15AM +, Mel Gorman wrote:
> > +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> > +   nodemask_t *nodemask, int nr_pages,
> > +   struct list_head *list);
> 
> For the next revision, can you ditch the '_nodemask' part of the name?
> Andrew just took this patch from me:
> 

Ok, the first three patches are needed from that series. For convenience,
I'm going to post the same series with the rest of the patches as a
pre-requisite to avoid people having to take patches out of mmotm to test.
For review purposes, they can be ignored.

> > 
> >
> > @@ -4919,6 +4934,9 @@ static inline bool prepare_alloc_pages(gfp_t 
> > gfp_mask, unsigned int order,
> > struct alloc_context *ac, gfp_t *alloc_mask,
> > unsigned int *alloc_flags)
> >  {
> > +   gfp_mask &= gfp_allowed_mask;
> > +   *alloc_mask = gfp_mask;
> 
> Also I renamed alloc_mask to alloc_gfp.
> 

It then becomes obvious that prepare_alloc_pages does not share the same
naming convention as __alloc_pages(). In an effort to keep the naming
convention consistent, I updated the patch to also rename gfp_mask to
gfp in prepare_alloc_pages.

As a complete aside, I don't actually like the gfp name and would have
preferred gfp_flags because GFP is just an acronym and the context of the
variable is that it's a set of GFP Flags. The mask naming was wrong I admit
because it's not a mask but I'm not interested in naming the bike shed :)

Thanks for pointing this out early because it would have been a merge
headache!

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-12 Thread Mel Gorman
On Fri, Mar 12, 2021 at 12:46:09PM +0100, Jesper Dangaard Brouer wrote:
> > > > 
> > > > +   if (!zone)
> > > > +   return 0;
> > > > +
> > > > +   /* Attempt the batch allocation */
> > > > +   local_irq_save(flags);
> > > > +   pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > > > +   pcp_list = &pcp->lists[ac.migratetype];
> > > > +
> > > > +   while (alloced < nr_pages) {
> > > > +   page = __rmqueue_pcplist(zone, ac.migratetype, 
> > > > alloc_flags,
> > > > +   pcp, 
> > > > pcp_list);
> > > > +   if (!page)
> > > > +   break;
> > > > +
> > > > +   prep_new_page(page, 0, gfp_mask, 0);  
> > > 
> > > I wonder if it would be worth running prep_new_page() in a second pass,
> > > after reenabling interrupts.
> > >   
> > 
> > Possibly, I could add another patch on top that does this because it's
> > trading the time that IRQs are disabled for a list iteration.
> 
> I for one like this idea, of moving prep_new_page() to a second pass.
> As per below realtime concern, to reduce the time that IRQs are
> disabled.
> 

Already done.

> > > Speaking of which, will the realtime people get upset about the
> > > irqs-off latency?  How many pages are we talking about here?
> > >   
> 
> In my page_pool patch I'm bulk allocating 64 pages. I wanted to ask if
> this is too much? (PP_ALLOC_CACHE_REFILL=64).
> 

I expect no, it's not too much. The refill path should be short.

> > At the moment, it looks like batches of up to a few hundred at worst. I
> > don't think realtime sensitive applications are likely to be using the
> > bulk allocator API at this point.
> > 
> > The realtime people have a worse problem in that the per-cpu list does
> > not use local_lock and disable IRQs more than it needs to on x86 in
> > particular. I've a prototype series for this as well which splits the
> > locking for the per-cpu list and statistic handling and then converts the
> > per-cpu list to local_lock but I'm getting this off the table first because
> > I don't want multiple page allocator series in flight at the same time.
> > Thomas, Peter and Ingo would need to be cc'd on that series to review
> > the local_lock aspects.
> > 
> > Even with local_lock, it's not clear to me why per-cpu lists need to be
> > locked at all because potentially it could use a lock-free llist with some
> > struct page overloading. That one is harder to predict when batches are
> > taken into account as splicing a batch of free pages with llist would be
> > unsafe so batch free might exchange IRQ disabling overhead with multiple
> > atomics. I'd need to recheck things like whether NMI handlers ever call
> > the page allocator (they shouldn't but it should be checked).  It would
> > need a lot of review and testing.
> 
> The result of the API is to deliver pages as a double-linked list via
> LRU (page->lru member).  If you are planning to use llist, then how to
> handle this API change later?
> 

I would not have to. The per-cpu list can use llist internally
while pages returned to the bulk allocator user can still be a doubly
linked list. An llist_node fits in less space than the list_head lru.

> Have you noticed that the two users store the struct-page pointers in an
> array?  We could have the caller provide the array to store struct-page
> pointers, like we do with kmem_cache_alloc_bulk API.
> 

That is a possibility but it ties the caller into declaring an array,
either via kmalloc, within an existing struct or on-stack. They would
then need to ensure that nr_pages does not exceed the array size or pass
in the array size. It's more error prone and a harder API to use.

> You likely have good reasons for returning the pages as a list (via
> lru), as I can see/imagine that there are some potential for grabbing
> the entire PCP-list.
> 

I used a list so that the user was only required to define a list_head on
the stack to use the API.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-12 Thread Matthew Wilcox
On Wed, Mar 10, 2021 at 10:46:15AM +, Mel Gorman wrote:
> +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> + nodemask_t *nodemask, int nr_pages,
> + struct list_head *list);

For the next revision, can you ditch the '_nodemask' part of the name?
Andrew just took this patch from me:

mm/page_alloc: combine __alloc_pages and __alloc_pages_nodemask

There are only two callers of __alloc_pages() so prune the thicket of
alloc_page variants by combining the two functions together.  Current
callers of __alloc_pages() simply add an extra 'NULL' parameter and
current callers of __alloc_pages_nodemask() call __alloc_pages() instead.

...

-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
-   nodemask_t *nodemask);
-
-static inline struct page *
-__alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid)
-{
-   return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
-}
+struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
+   nodemask_t *nodemask);

So calling this function __alloc_pages_bulk() fits with the new naming
scheme.

> @@ -4919,6 +4934,9 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
> unsigned int order,
>   struct alloc_context *ac, gfp_t *alloc_mask,
>   unsigned int *alloc_flags)
>  {
> + gfp_mask &= gfp_allowed_mask;
> + *alloc_mask = gfp_mask;

Also I renamed alloc_mask to alloc_gfp.



Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-12 Thread Jesper Dangaard Brouer
On Wed, 10 Mar 2021 11:38:36 +
Mel Gorman  wrote:

> On Wed, Mar 10, 2021 at 01:04:17PM +0200, Shay Agroskin wrote:
> > 
> > Mel Gorman  writes:
> >   
> > > 
> > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > index 8572a1474e16..4903d1cc48dc 100644
> > > --- a/include/linux/gfp.h
> > > +++ b/include/linux/gfp.h
> > > @@ -515,6 +515,10 @@ static inline int arch_make_page_accessible(struct
> > > page *page)
> > >  }
> > >  #endif
> > >   +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> > > + nodemask_t *nodemask, int nr_pages,
> > > + struct list_head *list);
> > > +
> > >  struct page *
> > >  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int
> > > preferred_nid,
> > >   nodemask_t  *nodemask);
> > > @@ -525,6 +529,14 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
> > > int preferred_nid)
> > >   return __alloc_pages_nodemask(gfp_mask, order,  preferred_nid, NULL);
> > >  }
> > > +/* Bulk allocate order-0 pages */
> > > +static inline unsigned long
> > > +alloc_pages_bulk(gfp_t gfp_mask, unsigned long nr_pages, struct
> > > list_head *list)
> > > +{
> > > + return __alloc_pages_bulk_nodemask(gfp_mask, numa_mem_id(), NULL,
> > > + nr_pages, list);  
> > 
> > Is the second line indentation intentional ? Why not align it to the first
> > argument (gfp_mask) ?
> >   
> 
> No particular reason. I usually pick this as it's visually clearer to me
> that it's part of the same line when the multi-line is part of an if block.
> 
> > > +}
> > > +
[...]
> > 
> > Same indentation comment as before
> >   
> 
> Again, simple personal perference to avoid any possibility it's mixed
> up with a later line. There has not been consistent code styling
> enforcement of what indentation style should be used for a multi-line
> within mm/page_alloc.c

Hi Shay, it might be surprising that indentation style actually
differs slightly in different parts of the kernel.  I started in the
networking area of the kernel, and I was also surprised when I started
working in the MM area that the coding style differs.  I can tell you that
the indentation style Mel chose is consistent with the code styling in
the MM area.  I usually respect that even though I prefer the networking
style as I was "raised" with that style.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-12 Thread Jesper Dangaard Brouer
On Thu, 11 Mar 2021 08:42:00 +
Mel Gorman  wrote:

> On Wed, Mar 10, 2021 at 03:46:50PM -0800, Andrew Morton wrote:
> > On Wed, 10 Mar 2021 10:46:15 + Mel Gorman  
> > wrote:
> >   
> > > This patch adds a new page allocator interface via alloc_pages_bulk,
> > > and __alloc_pages_bulk_nodemask. A caller requests a number of pages
> > > to be allocated and added to a list. They can be freed in bulk using
> > > free_pages_bulk().  
> > 
> > Why am I surprised we don't already have this.
> >   
> 
> It was prototyped a few years ago and discussed at LSF/MM so it's in
> the back of your memory somewhere. It never got merged because it lacked
> users and I didn't think carrying dead untested code was appropriate.

And I guess I didn't push hard enough and show the use-case in code.
Thus, I will also take part of the blame for this stalling out.


> > > The API is not guaranteed to return the requested number of pages and
> > > may fail if the preferred allocation zone has limited free memory, the
> > > cpuset changes during the allocation or page debugging decides to fail
> > > an allocation. It's up to the caller to request more pages in batch
> > > if necessary.
> > > 
> > > Note that this implementation is not very efficient and could be improved
> > > but it would require refactoring. The intent is to make it available early
> > > to determine what semantics are required by different callers. Once the
> > > full semantics are nailed down, it can be refactored.
> > > 
> > > ...
> > >
> > > +/* Drop reference counts and free order-0 pages from a list. */
> > > +void free_pages_bulk(struct list_head *list)
> > > +{
> > > + struct page *page, *next;
> > > +
> > > + list_for_each_entry_safe(page, next, list, lru) {
> > > + trace_mm_page_free_batched(page);
> > > + if (put_page_testzero(page)) {
> > > + list_del(&page->lru);
> > > + __free_pages_ok(page, 0, FPI_NONE);
> > > + }
> > > + }
> > > +}
> > > +EXPORT_SYMBOL_GPL(free_pages_bulk);  
> > 
> > I expect that batching games are planned in here as well?
> >   
> 
> Potentially it could be done but the page allocator would need to be
> fundamentally aware of batching to make it tidy or the per-cpu allocator
> would need knowledge of how to handle batches in the free path.  Batch
> freeing to the buddy allocator is problematic as buddy merging has to
> happen. Batch freeing to per-cpu hits pcp->high limitations.
> 
> There are a couple of ways it *could* be done. Per-cpu lists could be
> allowed to temporarily exceed the high limits and reduce them out-of-band
> like what happens with counter updates or remote pcp freeing. Care
> would need to be taken when memory is low to avoid premature OOM
> and to guarantee draining happens in a timely fashion. There would be
> additional benefits to this. For example, release_pages() can hammer the
> zone lock when freeing very large batches and would benefit from either
> large batching or "plugging" the per-cpu list. I prototyped a series to
> allow the batch limits to be temporarily exceeded but it did not actually
> improve performance because of errors in the implementation and it needs
> a lot of work.
> 
> > >  static inline unsigned int
> > >  gfp_to_alloc_flags(gfp_t gfp_mask)
> > >  {
> > > @@ -4919,6 +4934,9 @@ static inline bool prepare_alloc_pages(gfp_t 
> > > gfp_mask, unsigned int order,
> > >   struct alloc_context *ac, gfp_t *alloc_mask,
> > >   unsigned int *alloc_flags)
> > >  {
> > > + gfp_mask &= gfp_allowed_mask;
> > > + *alloc_mask = gfp_mask;
> > > +
> > >   ac->highest_zoneidx = gfp_zone(gfp_mask);
> > >   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > >   ac->nodemask = nodemask;
> > > @@ -4960,6 +4978,99 @@ static inline bool prepare_alloc_pages(gfp_t 
> > > gfp_mask, unsigned int order,
> > >   return true;
> > >  }
> > >  
> > > +/*
> > > + * This is a batched version of the page allocator that attempts to
> > > + * allocate nr_pages quickly from the preferred zone and add them to 
> > > list.
> > > + */  
> > 
> > Documentation is rather lame.  Returns number of pages allocated...
> >   
> 
> I added a note on the return value. The documentation is lame because at
> this point, we do not know what the required semantics for future users
> are. We have two examples at the moment in this series but I think it
> would be better to add kerneldoc documentation when there is a reasonable
> expectation that the API will not change. For example, SLUB could use
> this API when it fails to allocate a high-order page and instead allocate
> batches of order-0 pages but I did not investigate how feasible that
> is. Similarly, it's possible that we really need to deal with high-order
> batch allocations in which case, the per-cpu list should be high-order
> aware or the core buddy allocator needs to be batch-allocation aware.
> 
> > > +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> > > +  

Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-11 Thread Mel Gorman
On Thu, Mar 11, 2021 at 08:42:16AM -0800, Alexander Duyck wrote:
> > @@ -4919,6 +4934,9 @@ static inline bool prepare_alloc_pages(gfp_t 
> > gfp_mask, unsigned int order,
> > struct alloc_context *ac, gfp_t *alloc_mask,
> > unsigned int *alloc_flags)
> >  {
> > +   gfp_mask &= gfp_allowed_mask;
> > +   *alloc_mask = gfp_mask;
> > +
> > ac->highest_zoneidx = gfp_zone(gfp_mask);
> > ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > ac->nodemask = nodemask;
> 
> It might be better to pull this and the change from the bottom out
> into a seperate patch. I was reviewing this and when I hit the bottom
> I apparently had the same question other reviewers had wondering if it
> was intentional. By splitting it out it would be easier to review.
> 

Done. I felt it was obvious from context that the paths were sharing code
and splitting it out felt like patch count stuffing. Still, you're the
second person to point it out so now it's a separate patch in v4.

> > @@ -4960,6 +4978,104 @@ static inline bool prepare_alloc_pages(gfp_t 
> > gfp_mask, unsigned int order,
> > return true;
> >  }
> >
> > +/*
> > + * This is a batched version of the page allocator that attempts to
> > + * allocate nr_pages quickly from the preferred zone and add them to list.
> > + *
> > + * Returns the number of pages allocated.
> > + */
> > +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> > +   nodemask_t *nodemask, int nr_pages,
> > +   struct list_head *alloc_list)
> > +{
> > +   struct page *page;
> > +   unsigned long flags;
> > +   struct zone *zone;
> > +   struct zoneref *z;
> > +   struct per_cpu_pages *pcp;
> > +   struct list_head *pcp_list;
> > +   struct alloc_context ac;
> > +   gfp_t alloc_mask;
> > +   unsigned int alloc_flags;
> > +   int alloced = 0;
> > +
> > +   if (nr_pages == 1)
> > +   goto failed;
> 
> I might change this to "<= 1" just to cover the case where somebody
> messed something up and passed a negative value.
> 

I put in a WARN_ON_ONCE check that returns 0 allocated pages. It should
be the case that it only happens during the development of a new user but
better safe than sorry. It's an open question whether the max nr_pages
should be clamped but stupidly large values will either fail the watermark
check or wrap and hit the <= 0 check. I guess it's still possible the zone
would hit a dangerously low level of free pages but that is no different
to a user calling __alloc_pages_nodemask a stupidly large number of times.

> > +
> > +   /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> > +   if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac,
> > &alloc_mask, &alloc_flags))
> > +   return 0;
> > +   gfp_mask = alloc_mask;
> > +
> > +   /* Find an allowed local zone that meets the high watermark. */
> > +   for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, 
> > ac.highest_zoneidx, ac.nodemask) {
> > +   unsigned long mark;
> > +
> > +   if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> > +   !__cpuset_zone_allowed(zone, gfp_mask)) {
> > +   continue;
> > +   }
> > +
> > +   if (nr_online_nodes > 1 && zone != 
> > ac.preferred_zoneref->zone &&
> > +   zone_to_nid(zone) != 
> > zone_to_nid(ac.preferred_zoneref->zone)) {
> > +   goto failed;
> > +   }
> > +
> > +   mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + 
> > nr_pages;
> > +   if (zone_watermark_fast(zone, 0,  mark,
> > +   zonelist_zone_idx(ac.preferred_zoneref),
> > +   alloc_flags, gfp_mask)) {
> > +   break;
> > +   }
> > +   }
> > +   if (!zone)
> > +   return 0;
> > +
> > +   /* Attempt the batch allocation */
> > +   local_irq_save(flags);
> > +   pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > +   pcp_list = &pcp->lists[ac.migratetype];
> > +
> > +   while (alloced < nr_pages) {
> > +   page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
> > +   pcp, 
> > pcp_list);
> > +   if (!page)
> > +   break;
> > +
> > +   list_add(&page->lru, alloc_list);
> > +   alloced++;
> > +   }
> > +
> > +   if (!alloced)
> > +   goto failed_irq;
> 
> Since we already covered the case above verifying the nr_pages is
> greater than one it might make sense to move this check inside the
> loop for the !page case. Then we only are checking this if we failed
> an allocation.
> 

Yes, good idea, it moves a branch into a very unlikely path.
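
i.e. roughly the following reshuffle (sketch only, untested):

	while (alloced < nr_pages) {
		page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
							pcp, pcp_list);
		if (!page) {
			/* Total failure is the unlikely path. */
			if (!alloced)
				goto failed_irq;
			break;
		}

		list_add(&page->lru, alloc_list);
		alloced++;
	}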

> > +
> > +   if (alloced) {
> 
> Isn't this 

Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-11 Thread Alexander Duyck
On Thu, Mar 11, 2021 at 3:49 AM Mel Gorman  wrote:
>
> This patch adds a new page allocator interface via alloc_pages_bulk,
> and __alloc_pages_bulk_nodemask. A caller requests a number of pages
> to be allocated and added to a list. They can be freed in bulk using
> free_pages_bulk().
>
> The API is not guaranteed to return the requested number of pages and
> may fail if the preferred allocation zone has limited free memory, the
> cpuset changes during the allocation or page debugging decides to fail
> an allocation. It's up to the caller to request more pages in batch
> if necessary.
>
> Note that this implementation is not very efficient and could be improved
> but it would require refactoring. The intent is to make it available early
> to determine what semantics are required by different callers. Once the
> full semantics are nailed down, it can be refactored.
>
> Signed-off-by: Mel Gorman 
> ---
>  include/linux/gfp.h |  13 +
>  mm/page_alloc.c | 118 +++-
>  2 files changed, 129 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 8572a1474e16..4903d1cc48dc 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -515,6 +515,10 @@ static inline int arch_make_page_accessible(struct page 
> *page)
>  }
>  #endif
>
> +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> +   nodemask_t *nodemask, int nr_pages,
> +   struct list_head *list);
> +
>  struct page *
>  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> nodemask_t *nodemask);
> @@ -525,6 +529,14 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order, int 
> preferred_nid)
> return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
>  }
>
> +/* Bulk allocate order-0 pages */
> +static inline unsigned long
> +alloc_pages_bulk(gfp_t gfp_mask, unsigned long nr_pages, struct list_head 
> *list)
> +{
> +   return __alloc_pages_bulk_nodemask(gfp_mask, numa_mem_id(), NULL,
> +   nr_pages, list);
> +}
> +
>  /*
>   * Allocate pages, preferring the node given as nid. The node must be valid 
> and
>   * online. For more general interface, see alloc_pages_node().
> @@ -594,6 +606,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t 
> size, gfp_t gfp_mask);
>
>  extern void __free_pages(struct page *page, unsigned int order);
>  extern void free_pages(unsigned long addr, unsigned int order);
> +extern void free_pages_bulk(struct list_head *list);
>
>  struct page_frag_cache;
>  extern void __page_frag_cache_drain(struct page *page, unsigned int count);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3e4b29ee2b1e..415059324dc3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4436,6 +4436,21 @@ static void wake_all_kswapds(unsigned int order, gfp_t 
> gfp_mask,
> }
>  }
>
> +/* Drop reference counts and free order-0 pages from a list. */
> +void free_pages_bulk(struct list_head *list)
> +{
> +   struct page *page, *next;
> +
> +   list_for_each_entry_safe(page, next, list, lru) {
> +   trace_mm_page_free_batched(page);
> +   if (put_page_testzero(page)) {
> +   list_del(>lru);
> +   __free_pages_ok(page, 0, FPI_NONE);
> +   }
> +   }
> +}
> +EXPORT_SYMBOL_GPL(free_pages_bulk);
> +
>  static inline unsigned int
>  gfp_to_alloc_flags(gfp_t gfp_mask)
>  {
> @@ -4919,6 +4934,9 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
> unsigned int order,
> struct alloc_context *ac, gfp_t *alloc_mask,
> unsigned int *alloc_flags)
>  {
> +   gfp_mask &= gfp_allowed_mask;
> +   *alloc_mask = gfp_mask;
> +
> ac->highest_zoneidx = gfp_zone(gfp_mask);
> ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> ac->nodemask = nodemask;

It might be better to pull this and the change from the bottom out
into a separate patch. When I hit the bottom of the review, I had the
same question other reviewers had about whether it was intentional.
Splitting it out would make it easier to review.

> @@ -4960,6 +4978,104 @@ static inline bool prepare_alloc_pages(gfp_t 
> gfp_mask, unsigned int order,
> return true;
>  }
>
> +/*
> + * This is a batched version of the page allocator that attempts to
> + * allocate nr_pages quickly from the preferred zone and add them to list.
> + *
> + * Returns the number of pages allocated.
> + */
> +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> +   nodemask_t *nodemask, int nr_pages,
> +   struct list_head *alloc_list)
> +{
> +   struct page *page;
> +   unsigned long flags;
> +   struct zone *zone;
> +   struct zoneref *z;
> + 

[PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-11 Thread Mel Gorman
This patch adds a new page allocator interface via alloc_pages_bulk,
and __alloc_pages_bulk_nodemask. A caller requests a number of pages
to be allocated and added to a list. They can be freed in bulk using
free_pages_bulk().

The API is not guaranteed to return the requested number of pages and
may fail if the preferred allocation zone has limited free memory, the
cpuset changes during the allocation or page debugging decides to fail
an allocation. It's up to the caller to request more pages in batch
if necessary.

Note that this implementation is not very efficient and could be improved
but it would require refactoring. The intent is to make it available early
to determine what semantics are required by different callers. Once the
full semantics are nailed down, it can be refactored.

Signed-off-by: Mel Gorman 
---
 include/linux/gfp.h |  13 +
 mm/page_alloc.c | 118 +++-
 2 files changed, 129 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 8572a1474e16..4903d1cc48dc 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -515,6 +515,10 @@ static inline int arch_make_page_accessible(struct page 
*page)
 }
 #endif
 
+int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
+   nodemask_t *nodemask, int nr_pages,
+   struct list_head *list);
+
 struct page *
 __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
@@ -525,6 +529,14 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order, int 
preferred_nid)
return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
 }
 
+/* Bulk allocate order-0 pages */
+static inline unsigned long
+alloc_pages_bulk(gfp_t gfp_mask, unsigned long nr_pages, struct list_head 
*list)
+{
+   return __alloc_pages_bulk_nodemask(gfp_mask, numa_mem_id(), NULL,
+   nr_pages, list);
+}
+
 /*
  * Allocate pages, preferring the node given as nid. The node must be valid and
  * online. For more general interface, see alloc_pages_node().
@@ -594,6 +606,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t 
size, gfp_t gfp_mask);
 
 extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
+extern void free_pages_bulk(struct list_head *list);
 
 struct page_frag_cache;
 extern void __page_frag_cache_drain(struct page *page, unsigned int count);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e4b29ee2b1e..415059324dc3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4436,6 +4436,21 @@ static void wake_all_kswapds(unsigned int order, gfp_t 
gfp_mask,
}
 }
 
+/* Drop reference counts and free order-0 pages from a list. */
+void free_pages_bulk(struct list_head *list)
+{
+   struct page *page, *next;
+
+   list_for_each_entry_safe(page, next, list, lru) {
+   trace_mm_page_free_batched(page);
+   if (put_page_testzero(page)) {
+   list_del(&page->lru);
+   __free_pages_ok(page, 0, FPI_NONE);
+   }
+   }
+}
+EXPORT_SYMBOL_GPL(free_pages_bulk);
+
 static inline unsigned int
 gfp_to_alloc_flags(gfp_t gfp_mask)
 {
@@ -4919,6 +4934,9 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
unsigned int order,
struct alloc_context *ac, gfp_t *alloc_mask,
unsigned int *alloc_flags)
 {
+   gfp_mask &= gfp_allowed_mask;
+   *alloc_mask = gfp_mask;
+
ac->highest_zoneidx = gfp_zone(gfp_mask);
ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
ac->nodemask = nodemask;
@@ -4960,6 +4978,104 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
unsigned int order,
return true;
 }
 
+/*
+ * This is a batched version of the page allocator that attempts to
+ * allocate nr_pages quickly from the preferred zone and add them to list.
+ *
+ * Returns the number of pages allocated.
+ */
+int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
+   nodemask_t *nodemask, int nr_pages,
+   struct list_head *alloc_list)
+{
+   struct page *page;
+   unsigned long flags;
+   struct zone *zone;
+   struct zoneref *z;
+   struct per_cpu_pages *pcp;
+   struct list_head *pcp_list;
+   struct alloc_context ac;
+   gfp_t alloc_mask;
+   unsigned int alloc_flags;
+   int alloced = 0;
+
+   if (nr_pages == 1)
+   goto failed;
+
+   /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
+   if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
+   return 0;
+   gfp_mask = alloc_mask;
+
+   /* Find an allowed local zone that meets the high watermark. */
+   

Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-11 Thread Mel Gorman
On Wed, Mar 10, 2021 at 03:46:50PM -0800, Andrew Morton wrote:
> On Wed, 10 Mar 2021 10:46:15 + Mel Gorman  
> wrote:
> 
> > This patch adds a new page allocator interface via alloc_pages_bulk,
> > and __alloc_pages_bulk_nodemask. A caller requests a number of pages
> > to be allocated and added to a list. They can be freed in bulk using
> > free_pages_bulk().
> 
> Why am I surprised we don't already have this.
> 

It was prototyped a few years ago and discussed at LSF/MM so it's in
the back of your memory somewhere. It never got merged because it lacked
users and I didn't think carrying dead untested code was appropriate.

> > The API is not guaranteed to return the requested number of pages and
> > may fail if the preferred allocation zone has limited free memory, the
> > cpuset changes during the allocation or page debugging decides to fail
> > an allocation. It's up to the caller to request more pages in batch
> > if necessary.
> > 
> > Note that this implementation is not very efficient and could be improved
> > but it would require refactoring. The intent is to make it available early
> > to determine what semantics are required by different callers. Once the
> > full semantics are nailed down, it can be refactored.
> > 
> > ...
> >
> > +/* Drop reference counts and free order-0 pages from a list. */
> > +void free_pages_bulk(struct list_head *list)
> > +{
> > +   struct page *page, *next;
> > +
> > +   list_for_each_entry_safe(page, next, list, lru) {
> > +   trace_mm_page_free_batched(page);
> > +   if (put_page_testzero(page)) {
> > +   list_del(&page->lru);
> > +   __free_pages_ok(page, 0, FPI_NONE);
> > +   }
> > +   }
> > +}
> > +EXPORT_SYMBOL_GPL(free_pages_bulk);
> 
> I expect that batching games are planned in here as well?
> 

Potentially it could be done but the page allocator would need to be
fundamentally aware of batching to make it tidy or the per-cpu allocator
would need knowledge of how to handle batches in the free path.  Batch
freeing to the buddy allocator is problematic as buddy merging has to
happen. Batch freeing to per-cpu hits pcp->high limitations.

There are a couple of ways it *could* be done. Per-cpu lists could be
allowed to temporarily exceed the high limits and reduce them out-of-band
like what happens with counter updates or remote pcp freeing. Care
would need to be taken when memory is low to avoid premature OOM
and to guarantee draining happens in a timely fashion. There would be
additional benefits to this. For example, release_pages() can hammer the
zone lock when freeing very large batches and would benefit from either
large batching or "plugging" the per-cpu list. I prototyped a series to
allow the batch limits to be temporarily exceeded but it did not actually
improve performance because of errors in the implementation and it needs
a lot of work.
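
For illustration only, the per-cpu side of that idea might look vaguely
like the sketch below. It is not the prototype mentioned above: it
hand-waves away the preparation and debug checks the normal free path
does, assumes every page in the batch shares a migratetype and glosses
over how the deferred trim back below pcp->high would be scheduled.

    /* Hypothetical sketch: splice a whole batch onto the pcp list at once. */
    static void free_pages_bulk_pcp(struct zone *zone, struct list_head *list,
                                    int migratetype, int count)
    {
            unsigned long flags;
            struct per_cpu_pages *pcp;

            local_irq_save(flags);
            pcp = &this_cpu_ptr(zone->pageset)->pcp;

            /* pcp->high may be exceeded briefly here. */
            list_splice_init(list, &pcp->lists[migratetype]);
            pcp->count += count;

            /* Trimming would ideally happen out of band; done inline here. */
            if (pcp->count > pcp->high + pcp->batch)
                    free_pcppages_bulk(zone, pcp->batch, pcp);

            local_irq_restore(flags);
    }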

> >  static inline unsigned int
> >  gfp_to_alloc_flags(gfp_t gfp_mask)
> >  {
> > @@ -4919,6 +4934,9 @@ static inline bool prepare_alloc_pages(gfp_t 
> > gfp_mask, unsigned int order,
> > struct alloc_context *ac, gfp_t *alloc_mask,
> > unsigned int *alloc_flags)
> >  {
> > +   gfp_mask &= gfp_allowed_mask;
> > +   *alloc_mask = gfp_mask;
> > +
> > ac->highest_zoneidx = gfp_zone(gfp_mask);
> > ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > ac->nodemask = nodemask;
> > @@ -4960,6 +4978,99 @@ static inline bool prepare_alloc_pages(gfp_t 
> > gfp_mask, unsigned int order,
> > return true;
> >  }
> >  
> > +/*
> > + * This is a batched version of the page allocator that attempts to
> > + * allocate nr_pages quickly from the preferred zone and add them to list.
> > + */
> 
> Documentation is rather lame.  Returns number of pages allocated...
> 

I added a note on the return value. The documentation is lame because at
this point, we do not know what the required semantics for future users
are. We have two examples at the moment in this series but I think it
would be better to add kerneldoc documentation when there is a reasonable
expectation that the API will not change. For example, SLUB could use
this API when it fails to allocate a high-order page and instead allocate
batches of order-0 pages but I did not investigate how feasible that
is. Similarly, it's possible that we really need to deal with high-order
batch allocations in which case, the per-cpu list should be high-order
aware or the core buddy allocator needs to be batch-allocation aware.
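
For now, the working assumption is simply that callers top up
themselves. A purely hypothetical caller (gather_pages() is made up;
the rest is the API from this series) would look something like:

    static unsigned long gather_pages(gfp_t gfp, unsigned long nr,
                                      struct list_head *list)
    {
            struct page *page;
            unsigned long allocated = 0;
            unsigned long got;

            while (allocated < nr) {
                    got = alloc_pages_bulk(gfp, nr - allocated, list);
                    if (got) {
                            allocated += got;
                            continue;
                    }

                    /* Bulk path returned nothing; fall back to single pages. */
                    page = alloc_page(gfp);
                    if (!page)
                            break;
                    list_add(&page->lru, list);
                    allocated++;
            }

            return allocated;
    }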

> > +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> > +   nodemask_t *nodemask, int nr_pages,
> > +   struct list_head *alloc_list)
> > +{
> > +   struct page *page;
> > +   unsigned long flags;
> > +   struct zone *zone;
> > +   struct zoneref *z;
> > +   struct per_cpu_pages *pcp;
> > +   struct list_head *pcp_list;
> > +   struct alloc_context ac;
> > +   gfp_t alloc_mask;
> > +   unsigned int alloc_flags;
> > +   

Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-10 Thread Andrew Morton
On Wed, 10 Mar 2021 10:46:15 + Mel Gorman  
wrote:

> This patch adds a new page allocator interface via alloc_pages_bulk,
> and __alloc_pages_bulk_nodemask. A caller requests a number of pages
> to be allocated and added to a list. They can be freed in bulk using
> free_pages_bulk().

Why am I surprised we don't already have this.

> The API is not guaranteed to return the requested number of pages and
> may fail if the preferred allocation zone has limited free memory, the
> cpuset changes during the allocation or page debugging decides to fail
> an allocation. It's up to the caller to request more pages in batch
> if necessary.
> 
> Note that this implementation is not very efficient and could be improved
> but it would require refactoring. The intent is to make it available early
> to determine what semantics are required by different callers. Once the
> full semantics are nailed down, it can be refactored.
> 
> ...
>
> +/* Drop reference counts and free order-0 pages from a list. */
> +void free_pages_bulk(struct list_head *list)
> +{
> + struct page *page, *next;
> +
> + list_for_each_entry_safe(page, next, list, lru) {
> + trace_mm_page_free_batched(page);
> + if (put_page_testzero(page)) {
> + list_del(&page->lru);
> + __free_pages_ok(page, 0, FPI_NONE);
> + }
> + }
> +}
> +EXPORT_SYMBOL_GPL(free_pages_bulk);

I expect that batching games are planned in here as well?

>  static inline unsigned int
>  gfp_to_alloc_flags(gfp_t gfp_mask)
>  {
> @@ -4919,6 +4934,9 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
> unsigned int order,
>   struct alloc_context *ac, gfp_t *alloc_mask,
>   unsigned int *alloc_flags)
>  {
> + gfp_mask &= gfp_allowed_mask;
> + *alloc_mask = gfp_mask;
> +
>   ac->highest_zoneidx = gfp_zone(gfp_mask);
>   ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
>   ac->nodemask = nodemask;
> @@ -4960,6 +4978,99 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
> unsigned int order,
>   return true;
>  }
>  
> +/*
> + * This is a batched version of the page allocator that attempts to
> + * allocate nr_pages quickly from the preferred zone and add them to list.
> + */

Documentation is rather lame.  Returns number of pages allocated...

> +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> + nodemask_t *nodemask, int nr_pages,
> + struct list_head *alloc_list)
> +{
> + struct page *page;
> + unsigned long flags;
> + struct zone *zone;
> + struct zoneref *z;
> + struct per_cpu_pages *pcp;
> + struct list_head *pcp_list;
> + struct alloc_context ac;
> + gfp_t alloc_mask;
> + unsigned int alloc_flags;
> + int alloced = 0;
> +
> + if (nr_pages == 1)
> + goto failed;
> +
> + /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> + if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
> + return 0;
> + gfp_mask = alloc_mask;
> +
> + /* Find an allowed local zone that meets the high watermark. */
> + for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, 
> ac.highest_zoneidx, ac.nodemask) {
> + unsigned long mark;
> +
> + if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> + !__cpuset_zone_allowed(zone, gfp_mask)) {
> + continue;
> + }
> +
> + if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
> + zone_to_nid(zone) != 
> zone_to_nid(ac.preferred_zoneref->zone)) {
> + goto failed;
> + }
> +
> + mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + 
> nr_pages;
> + if (zone_watermark_fast(zone, 0,  mark,
> + zonelist_zone_idx(ac.preferred_zoneref),
> + alloc_flags, gfp_mask)) {
> + break;
> + }
> + }

I suspect the above was stolen from elsewhere and that some code
commonification is planned.


> + if (!zone)
> + return 0;
> +
> + /* Attempt the batch allocation */
> + local_irq_save(flags);
> + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> + pcp_list = &pcp->lists[ac.migratetype];
> +
> + while (alloced < nr_pages) {
> + page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
> + pcp, pcp_list);
> + if (!page)
> + break;
> +
> + prep_new_page(page, 0, gfp_mask, 0);

I wonder if it would be worth running prep_new_page() in a second pass,
after reenabling interrupts.

Speaking of which, will the realtime people get upset about the
irqs-off latency?  How many pages are we talking about here?
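
Something like the two-pass split below is what I have in mind; it
would also shave the prep work off the IRQs-off window (sketch only,
with the vmstat accounting and failure handling left out):

    /* Pass 1: pull pages off the pcp list with IRQs disabled. */
    while (alloced < nr_pages) {
            page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
                                            pcp, pcp_list);
            if (!page)
                    break;
            list_add(&page->lru, alloc_list);
            alloced++;
    }
    local_irq_restore(flags);

    /* Pass 2: prep the pages after IRQs have been re-enabled. */
    list_for_each_entry(page, alloc_list, lru)
            prep_new_page(page, 0, gfp_mask, 0);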

> + list_add(&page->lru, alloc_list);
> +

Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-10 Thread Mel Gorman
On Wed, Mar 10, 2021 at 01:04:17PM +0200, Shay Agroskin wrote:
> 
> Mel Gorman  writes:
> 
> > 
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 8572a1474e16..4903d1cc48dc 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -515,6 +515,10 @@ static inline int arch_make_page_accessible(struct
> > page *page)
> >  }
> >  #endif
> >   +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> > +   nodemask_t *nodemask, int nr_pages,
> > +   struct list_head *list);
> > +
> >  struct page *
> >  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int
> > preferred_nid,
> > nodemask_t  *nodemask);
> > @@ -525,6 +529,14 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
> > int preferred_nid)
> > return __alloc_pages_nodemask(gfp_mask, order,  preferred_nid, NULL);
> >  }
> > +/* Bulk allocate order-0 pages */
> > +static inline unsigned long
> > +alloc_pages_bulk(gfp_t gfp_mask, unsigned long nr_pages, struct
> > list_head *list)
> > +{
> > +   return __alloc_pages_bulk_nodemask(gfp_mask, numa_mem_id(), NULL,
> > +   nr_pages, list);
> 
> Is the second line indentation intentional ? Why not align it to the first
> argument (gfp_mask) ?
> 

No particular reason. I usually pick this as it's visually clearer to me
that it's part of the same line when the multi-line is part of an if block.

> > +}
> > +
> >  /*
> >   * Allocate pages, preferring the node given as nid. The node   must be
> > valid and
> >   * online. For more general interface, see alloc_pages_node().
> > @@ -594,6 +606,7 @@ void * __meminit alloc_pages_exact_nid(int nid,
> > size_t size, gfp_t gfp_mask);
> >extern void __free_pages(struct page *page, unsigned int  order);
> >  extern void free_pages(unsigned long addr, unsigned int order);
> > +extern void free_pages_bulk(struct list_head *list);
> >  struct page_frag_cache;
> >  extern void __page_frag_cache_drain(struct page *page, unsigned  int
> > count);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3e4b29ee2b1e..ff1e55793786 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4436,6 +4436,21 @@ static void wake_all_kswapds(unsigned int order,
> > gfp_t gfp_mask,
> > }
> >  }
> > ...
> > +/*
> > + * This is a batched version of the page allocator that attempts to
> > + * allocate nr_pages quickly from the preferred zone and add them to
> > list.
> > + */
> > +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> > +   nodemask_t *nodemask, int nr_pages,
> > +   struct list_head *alloc_list)
> > +{
> > +   struct page *page;
> > +   unsigned long flags;
> > +   struct zone *zone;
> > +   struct zoneref *z;
> > +   struct per_cpu_pages *pcp;
> > +   struct list_head *pcp_list;
> > +   struct alloc_context ac;
> > +   gfp_t alloc_mask;
> > +   unsigned int alloc_flags;
> > +   int alloced = 0;
> 
> Does alloced count the number of allocated pages ?

Yes.

> Do you mind renaming it to 'allocated' ?

I will if there is another version as I do not feel particularly strongly
about alloced vs allocated. alloc was to match the function name and I
don't think the change makes it much clearer.


> > 
> > +   /* Attempt the batch allocation */
> > +   local_irq_save(flags);
> > +   pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > +   pcp_list = &pcp->lists[ac.migratetype];
> > +
> > +   while (alloced < nr_pages) {
> > +   page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
> > + pcp, pcp_list);
> 
> Same indentation comment as before
> 

Again, simple personal preference to avoid any possibility it's mixed
up with a later line. There has not been consistent enforcement of what
indentation style should be used for multi-line statements within
mm/page_alloc.c.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-10 Thread Shay Agroskin



Mel Gorman  writes:



diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 8572a1474e16..4903d1cc48dc 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -515,6 +515,10 @@ static inline int 
arch_make_page_accessible(struct page *page)

 }
 #endif
 
+int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int 
preferred_nid,
+nodemask_t *nodemask, int 
nr_pages,

+   struct list_head *list);
+
 struct page *
 __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int 
 preferred_nid,
 			nodemask_t 
 *nodemask);
@@ -525,6 +529,14 @@ __alloc_pages(gfp_t gfp_mask, unsigned int 
order, int preferred_nid)
 	return __alloc_pages_nodemask(gfp_mask, order, 
 preferred_nid, NULL);

 }
 
+/* Bulk allocate order-0 pages */

+static inline unsigned long
+alloc_pages_bulk(gfp_t gfp_mask, unsigned long nr_pages, struct 
list_head *list)

+{
+	return __alloc_pages_bulk_nodemask(gfp_mask, 
numa_mem_id(), NULL,
+			nr_pages, 
list);


Is the second line indentation intentional ? Why not align it to 
the first argument (gfp_mask) ?



+}
+
 /*
  * Allocate pages, preferring the node given as nid. The node 
  must be valid and

  * online. For more general interface, see alloc_pages_node().
@@ -594,6 +606,7 @@ void * __meminit alloc_pages_exact_nid(int 
nid, size_t size, gfp_t gfp_mask);
 
 extern void __free_pages(struct page *page, unsigned int 
 order);

 extern void free_pages(unsigned long addr, unsigned int order);
+extern void free_pages_bulk(struct list_head *list);
 
 struct page_frag_cache;
 extern void __page_frag_cache_drain(struct page *page, unsigned 
 int count);

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e4b29ee2b1e..ff1e55793786 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4436,6 +4436,21 @@ static void wake_all_kswapds(unsigned int 
order, gfp_t gfp_mask,

}
 }
...
 
+/*
+ * This is a batched version of the page allocator that 
attempts to
+ * allocate nr_pages quickly from the preferred zone and add 
them to list.

+ */
+int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int 
preferred_nid,

+   nodemask_t *nodemask, int nr_pages,
+   struct list_head *alloc_list)
+{
+   struct page *page;
+   unsigned long flags;
+   struct zone *zone;
+   struct zoneref *z;
+   struct per_cpu_pages *pcp;
+   struct list_head *pcp_list;
+   struct alloc_context ac;
+   gfp_t alloc_mask;
+   unsigned int alloc_flags;
+   int alloced = 0;


Does alloced count the number of allocated pages ? Do you mind 
renaming it to 'allocated' ?



+
+   if (nr_pages == 1)
+   goto failed;
+
+	/* May set ALLOC_NOFRAGMENT, fragmentation will return 1 
page. */
+	if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))

+   return 0;
+   gfp_mask = alloc_mask;
+
+	/* Find an allowed local zone that meets the high 
watermark. */
+	for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, 
ac.highest_zoneidx, ac.nodemask) {

+   unsigned long mark;
+
+		if (cpusets_enabled() && (alloc_flags & 
ALLOC_CPUSET) &&

+   !__cpuset_zone_allowed(zone, gfp_mask)) {
+   continue;
+   }
+
+		if (nr_online_nodes > 1 && zone != 
ac.preferred_zoneref->zone &&
+		zone_to_nid(zone) != 
zone_to_nid(ac.preferred_zoneref->zone)) {

+   goto failed;
+   }
+
+		mark = wmark_pages(zone, alloc_flags & 
ALLOC_WMARK_MASK) + nr_pages;

+   if (zone_watermark_fast(zone, 0,  mark,
+ 
zonelist_zone_idx(ac.preferred_zoneref),

+   alloc_flags, gfp_mask)) {
+   break;
+   }
+   }
+   if (!zone)
+   return 0;
+
+   /* Attempt the batch allocation */
+   local_irq_save(flags);
+   pcp = &this_cpu_ptr(zone->pageset)->pcp;
+   pcp_list = &pcp->lists[ac.migratetype];
+
+   while (alloced < nr_pages) {
+		page = __rmqueue_pcplist(zone, ac.migratetype, 
alloc_flags,
+ 
pcp, pcp_list);


Same indentation comment as before


+   if (!page)
+   break;
+
+   prep_new_page(page, 0, gfp_mask, 0);
+   list_add(&page->lru, alloc_list);
+   alloced++;
+   }
+
+   if (!alloced)
+   goto failed_irq;
+
+   if (alloced) {
+		__count_zid_vm_events(PGALLOC, zone_idx(zone), 
alloced);

+   zone_statistics(zone, zone);
+   }
+
+   local_irq_restore(flags);
+
+   return alloced;
+
+failed_irq:
+   local_irq_restore(flags);
+
+failed:
+	page = __alloc_pages_nodemask(gfp_mask, 0, preferred_nid, 
nodemask);

+   if (page) {
+   alloced++;
+   list_add(&page->lru, alloc_list);
+   }
+
+   return alloced;
+}
+EXPORT_SYMBOL_GPL(__alloc_pages_bulk_nodemask);
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
@@ -4981,8 +5092,6 @@ 

[PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-10 Thread Mel Gorman
This patch adds a new page allocator interface via alloc_pages_bulk,
and __alloc_pages_bulk_nodemask. A caller requests a number of pages
to be allocated and added to a list. They can be freed in bulk using
free_pages_bulk().

The API is not guaranteed to return the requested number of pages and
may fail if the preferred allocation zone has limited free memory, the
cpuset changes during the allocation or page debugging decides to fail
an allocation. It's up to the caller to request more pages in batch
if necessary.

Note that this implementation is not very efficient and could be improved
but it would require refactoring. The intent is to make it available early
to determine what semantics are required by different callers. Once the
full semantics are nailed down, it can be refactored.

Signed-off-by: Mel Gorman 
---
 include/linux/gfp.h |  13 +
 mm/page_alloc.c | 113 +++-
 2 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 8572a1474e16..4903d1cc48dc 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -515,6 +515,10 @@ static inline int arch_make_page_accessible(struct page 
*page)
 }
 #endif
 
+int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
+   nodemask_t *nodemask, int nr_pages,
+   struct list_head *list);
+
 struct page *
 __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
@@ -525,6 +529,14 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order, int 
preferred_nid)
return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
 }
 
+/* Bulk allocate order-0 pages */
+static inline unsigned long
+alloc_pages_bulk(gfp_t gfp_mask, unsigned long nr_pages, struct list_head 
*list)
+{
+   return __alloc_pages_bulk_nodemask(gfp_mask, numa_mem_id(), NULL,
+   nr_pages, list);
+}
+
 /*
  * Allocate pages, preferring the node given as nid. The node must be valid and
  * online. For more general interface, see alloc_pages_node().
@@ -594,6 +606,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t 
size, gfp_t gfp_mask);
 
 extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
+extern void free_pages_bulk(struct list_head *list);
 
 struct page_frag_cache;
 extern void __page_frag_cache_drain(struct page *page, unsigned int count);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e4b29ee2b1e..ff1e55793786 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4436,6 +4436,21 @@ static void wake_all_kswapds(unsigned int order, gfp_t 
gfp_mask,
}
 }
 
+/* Drop reference counts and free order-0 pages from a list. */
+void free_pages_bulk(struct list_head *list)
+{
+   struct page *page, *next;
+
+   list_for_each_entry_safe(page, next, list, lru) {
+   trace_mm_page_free_batched(page);
+   if (put_page_testzero(page)) {
+   list_del(&page->lru);
+   __free_pages_ok(page, 0, FPI_NONE);
+   }
+   }
+}
+EXPORT_SYMBOL_GPL(free_pages_bulk);
+
 static inline unsigned int
 gfp_to_alloc_flags(gfp_t gfp_mask)
 {
@@ -4919,6 +4934,9 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
unsigned int order,
struct alloc_context *ac, gfp_t *alloc_mask,
unsigned int *alloc_flags)
 {
+   gfp_mask &= gfp_allowed_mask;
+   *alloc_mask = gfp_mask;
+
ac->highest_zoneidx = gfp_zone(gfp_mask);
ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
ac->nodemask = nodemask;
@@ -4960,6 +4978,99 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
unsigned int order,
return true;
 }
 
+/*
+ * This is a batched version of the page allocator that attempts to
+ * allocate nr_pages quickly from the preferred zone and add them to list.
+ */
+int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
+   nodemask_t *nodemask, int nr_pages,
+   struct list_head *alloc_list)
+{
+   struct page *page;
+   unsigned long flags;
+   struct zone *zone;
+   struct zoneref *z;
+   struct per_cpu_pages *pcp;
+   struct list_head *pcp_list;
+   struct alloc_context ac;
+   gfp_t alloc_mask;
+   unsigned int alloc_flags;
+   int alloced = 0;
+
+   if (nr_pages == 1)
+   goto failed;
+
+   /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
+   if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
+   return 0;
+   gfp_mask = alloc_mask;
+
+   /* Find an allowed local zone that meets the high watermark. */
+   for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, 

Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-09 Thread Mel Gorman
On Tue, Mar 09, 2021 at 05:12:30PM +, Christoph Hellwig wrote:
> Would vmalloc be another good user of this API? 
> 
> > +   /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> > +   if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
> 
> This crazy long line is really hard to follow.
> 

It's not crazier than what is already in alloc_pages_nodemask to share
code.

> > +   return 0;
> > +   gfp_mask = alloc_mask;
> > +
> > +   /* Find an allowed local zone that meets the high watermark. */
> > +   for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, 
> > ac.highest_zoneidx, ac.nodemask) {
> 
> Same here.
> 

Similar to what happens in get_page_from_freelist with the
for_next_zone_zonelist_nodemask iterator.

> > +   unsigned long mark;
> > +
> > +   if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> > +   !__cpuset_zone_allowed(zone, gfp_mask)) {
> > +   continue;
> > +   }
> 
> No need for the curly braces.
> 

Yes, but it's for coding style. MM has no hard coding style guidelines
around this but for sched, it's generally preferred that if the "if"
statement spans multiple lines then it should use {} even if the block
is one line long for clarity.

> > }
> >  
> > -   gfp_mask &= gfp_allowed_mask;
> > -   alloc_mask = gfp_mask;
> 
> Is this change intentional?

Yes so that prepare_alloc_pages works for both the single page and bulk
allocator. Slightly less code duplication.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-09 Thread Christoph Hellwig
Would vmalloc be another good user of this API? 

> + /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> + if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))

This crazy long line is really hard to follow.

> + return 0;
> + gfp_mask = alloc_mask;
> +
> + /* Find an allowed local zone that meets the high watermark. */
> + for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, 
> ac.highest_zoneidx, ac.nodemask) {

Same here.

> + unsigned long mark;
> +
> + if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> + !__cpuset_zone_allowed(zone, gfp_mask)) {
> + continue;
> + }

No need for the curly braces.

>   }
>  
> - gfp_mask &= gfp_allowed_mask;
> - alloc_mask = gfp_mask;

Is this change intentional?


[PATCH 2/5] mm/page_alloc: Add a bulk page allocator

2021-03-01 Thread Mel Gorman
This patch adds a new page allocator interface via alloc_pages_bulk,
and __alloc_pages_bulk_nodemask. A caller requests a number of pages
to be allocated and added to a list. They can be freed in bulk using
free_pages_bulk().

The API is not guaranteed to return the requested number of pages and
may fail if the preferred allocation zone has limited free memory, the
cpuset changes during the allocation or page debugging decides to fail
an allocation. It's up to the caller to request more pages in batch
if necessary.

Note that this implementation is not very efficient and could be improved
but it would require refactoring. The intent is to make it available early
to determine what semantics are required by different callers. Once the
full semantics are nailed down, it can be refactored.

Signed-off-by: Mel Gorman 
---
 include/linux/gfp.h |  13 +
 mm/page_alloc.c | 113 +++-
 2 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 8572a1474e16..4903d1cc48dc 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -515,6 +515,10 @@ static inline int arch_make_page_accessible(struct page 
*page)
 }
 #endif
 
+int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
+   nodemask_t *nodemask, int nr_pages,
+   struct list_head *list);
+
 struct page *
 __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
@@ -525,6 +529,14 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order, int 
preferred_nid)
return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
 }
 
+/* Bulk allocate order-0 pages */
+static inline unsigned long
+alloc_pages_bulk(gfp_t gfp_mask, unsigned long nr_pages, struct list_head 
*list)
+{
+   return __alloc_pages_bulk_nodemask(gfp_mask, numa_mem_id(), NULL,
+   nr_pages, list);
+}
+
 /*
  * Allocate pages, preferring the node given as nid. The node must be valid and
  * online. For more general interface, see alloc_pages_node().
@@ -594,6 +606,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t 
size, gfp_t gfp_mask);
 
 extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
+extern void free_pages_bulk(struct list_head *list);
 
 struct page_frag_cache;
 extern void __page_frag_cache_drain(struct page *page, unsigned int count);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e4b29ee2b1e..ff1e55793786 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4436,6 +4436,21 @@ static void wake_all_kswapds(unsigned int order, gfp_t 
gfp_mask,
}
 }
 
+/* Drop reference counts and free order-0 pages from a list. */
+void free_pages_bulk(struct list_head *list)
+{
+   struct page *page, *next;
+
+   list_for_each_entry_safe(page, next, list, lru) {
+   trace_mm_page_free_batched(page);
+   if (put_page_testzero(page)) {
+   list_del(&page->lru);
+   __free_pages_ok(page, 0, FPI_NONE);
+   }
+   }
+}
+EXPORT_SYMBOL_GPL(free_pages_bulk);
+
 static inline unsigned int
 gfp_to_alloc_flags(gfp_t gfp_mask)
 {
@@ -4919,6 +4934,9 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
unsigned int order,
struct alloc_context *ac, gfp_t *alloc_mask,
unsigned int *alloc_flags)
 {
+   gfp_mask &= gfp_allowed_mask;
+   *alloc_mask = gfp_mask;
+
ac->highest_zoneidx = gfp_zone(gfp_mask);
ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
ac->nodemask = nodemask;
@@ -4960,6 +4978,99 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
unsigned int order,
return true;
 }
 
+/*
+ * This is a batched version of the page allocator that attempts to
+ * allocate nr_pages quickly from the preferred zone and add them to list.
+ */
+int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
+   nodemask_t *nodemask, int nr_pages,
+   struct list_head *alloc_list)
+{
+   struct page *page;
+   unsigned long flags;
+   struct zone *zone;
+   struct zoneref *z;
+   struct per_cpu_pages *pcp;
+   struct list_head *pcp_list;
+   struct alloc_context ac;
+   gfp_t alloc_mask;
+   unsigned int alloc_flags;
+   int alloced = 0;
+
+   if (nr_pages == 1)
+   goto failed;
+
+   /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
+   if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
+   return 0;
+   gfp_mask = alloc_mask;
+
+   /* Find an allowed local zone that meets the high watermark. */
+   for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, 

[PATCH 2/3] mm, page_alloc: Add a bulk page allocator

2021-02-24 Thread Mel Gorman
This patch adds a new page allocator interface via alloc_pages_bulk,
and __alloc_pages_bulk_nodemask. A caller requests a number of pages
to be allocated and added to a list. They can be freed in bulk using
free_pages_bulk().

The API is not guaranteed to return the requested number of pages and
may fail if the preferred allocation zone has limited free memory, the
cpuset changes during the allocation or page debugging decides to fail
an allocation. It's up to the caller to request more pages in batch
if necessary.

Note that this implementation is not very efficient and could be improved
but it would require refactoring. The intent is to make it available early
to determine what semantics are required by different callers. Once the
full semantics are nailed down, it can be refactored.

Signed-off-by: Mel Gorman 
---
 include/linux/gfp.h |  13 +
 mm/page_alloc.c | 113 +++-
 2 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 6e479e9c48ce..f2a1ae4b95b9 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -501,6 +501,10 @@ static inline int arch_make_page_accessible(struct page 
*page)
 }
 #endif
 
+int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
+   nodemask_t *nodemask, int nr_pages,
+   struct list_head *list);
+
 struct page *
 __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
@@ -511,6 +515,14 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order, int 
preferred_nid)
return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
 }
 
+/* Bulk allocate order-0 pages */
+static inline unsigned long
+alloc_pages_bulk(gfp_t gfp_mask, unsigned long nr_pages, struct list_head 
*list)
+{
+   return __alloc_pages_bulk_nodemask(gfp_mask, numa_mem_id(), NULL,
+   nr_pages, list);
+}
+
 /*
  * Allocate pages, preferring the node given as nid. The node must be valid and
  * online. For more general interface, see alloc_pages_node().
@@ -580,6 +592,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t 
size, gfp_t gfp_mask);
 
 extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
+extern void free_pages_bulk(struct list_head *list);
 
 struct page_frag_cache;
 extern void __page_frag_cache_drain(struct page *page, unsigned int count);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 519a60d5b6f7..a36344bc1045 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4435,6 +4435,21 @@ static void wake_all_kswapds(unsigned int order, gfp_t 
gfp_mask,
}
 }
 
+/* Drop reference counts and free order-0 pages from a list. */
+void free_pages_bulk(struct list_head *list)
+{
+   struct page *page, *next;
+
+   list_for_each_entry_safe(page, next, list, lru) {
+   trace_mm_page_free_batched(page);
+   if (put_page_testzero(page)) {
+   list_del(&page->lru);
+   __free_pages_ok(page, 0, FPI_NONE);
+   }
+   }
+}
+EXPORT_SYMBOL_GPL(free_pages_bulk);
+
 static inline unsigned int
 gfp_to_alloc_flags(gfp_t gfp_mask)
 {
@@ -4918,6 +4933,9 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
unsigned int order,
struct alloc_context *ac, gfp_t *alloc_mask,
unsigned int *alloc_flags)
 {
+   gfp_mask &= gfp_allowed_mask;
+   *alloc_mask = gfp_mask;
+
ac->highest_zoneidx = gfp_zone(gfp_mask);
ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
ac->nodemask = nodemask;
@@ -4959,6 +4977,99 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, 
unsigned int order,
return true;
 }
 
+/*
+ * This is a batched version of the page allocator that attempts to
+ * allocate nr_pages quickly from the preferred zone and add them to list.
+ */
+int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
+   nodemask_t *nodemask, int nr_pages,
+   struct list_head *alloc_list)
+{
+   struct page *page;
+   unsigned long flags;
+   struct zone *zone;
+   struct zoneref *z;
+   struct per_cpu_pages *pcp;
+   struct list_head *pcp_list;
+   struct alloc_context ac;
+   gfp_t alloc_mask;
+   unsigned int alloc_flags;
+   int alloced = 0;
+
+   if (nr_pages == 1)
+   goto failed;
+
+   /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
+   if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
+   return 0;
+   gfp_mask = alloc_mask;
+
+   /* Find an allowed local zone that meets the high watermark. */
+   for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, 

Re: [PATCH 4/4] mm, page_alloc: Add a bulk page allocator

2017-01-16 Thread Mel Gorman
On Mon, Jan 16, 2017 at 03:25:18PM +0100, Jesper Dangaard Brouer wrote:
> On Mon,  9 Jan 2017 16:35:18 +
> Mel Gorman  wrote:
> 
> > This patch adds a new page allocator interface via alloc_pages_bulk,
> > __alloc_pages_bulk and __alloc_pages_bulk_nodemask. A caller requests a
> > number of pages to be allocated and added to a list. They can be freed in
> > bulk using free_pages_bulk(). Note that it would theoretically be possible
> > to use free_hot_cold_page_list for faster frees if the symbol was exported,
> > the refcounts were 0 and the caller guaranteed it was not in an interrupt.
> > This would be significantly faster in the free path but also more unsafe
> > and a harder API to use.
> > 
> > The API is not guaranteed to return the requested number of pages and
> > may fail if the preferred allocation zone has limited free memory, the
> > cpuset changes during the allocation or page debugging decides to fail
> > an allocation. It's up to the caller to request more pages in batch if
> > necessary.
> > 
> > The following compares the allocation cost per page for different batch
> > sizes. The baseline is allocating them one at a time and it compares with
> > the performance when using the new allocation interface.
> 
> I've also played with testing the bulking API here:
>  [1] 
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c
> 
> My baseline single (order-0 page) show: 158 cycles(tsc) 39.593 ns
> 
> Using bulking API:
>  Bulk:   1 cycles: 128 nanosec: 32.134
>  Bulk:   2 cycles: 107 nanosec: 26.783
>  Bulk:   3 cycles: 100 nanosec: 25.047
>  Bulk:   4 cycles:  95 nanosec: 23.988
>  Bulk:   8 cycles:  91 nanosec: 22.823
>  Bulk:  16 cycles:  88 nanosec: 22.093
>  Bulk:  32 cycles:  85 nanosec: 21.338
>  Bulk:  64 cycles:  85 nanosec: 21.315
>  Bulk: 128 cycles:  84 nanosec: 21.214
>  Bulk: 256 cycles: 115 nanosec: 28.979
> 
> This bulk API (and other improvements part of patchset) definitely
> moves the speed of the page allocator closer to my (crazy) time budget
> target of between 201 to 269 cycles per packet[1].  Remember I was
> reporting[2] order-0 cost between 231 to 277 cycles, at MM-summit
> 2016, so this is a huge improvement since then.
> 

Good to hear.

> The bulk numbers are great, but it still cannot compete with the
> recycles tricks used by drivers.  Looking at the code (and as Mel also
> mentions) there is room for improvements especially on the bulk free-side.
> 

A major component there is how the ref handling is done and the safety
checks. If necessary, you could mandate that callers drop the reference
count or allow pages to be freed with an elevated count to avoid the atomic
ops. In an early prototype, I made the refcount "mistake" and freeing was
half the cost. I restored it in the final version to have an API that was
almost identical to the existing allocator other than the bulking aspects.

You could also disable all the other safety checks and flag that the bulk
alloc/free potentially frees pages in inconsistent state.  That would
increase the performance at the cost of safety but that may be acceptable
given that driver recycling of pages also avoids the same checks.

You could also consider disabling the statistics updates to avoid a bunch
of per-cpu stat operations, particularly if the pages were mostly recycled
by the generic pool allocator.
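
To make the refcount point concrete: if free_hot_cold_page_list() were
exported and the caller took on the reference handling itself, usage
would look roughly like the sketch below (hypothetical, and exactly the
kind of shortcut that trades safety for speed):

    LIST_HEAD(pages);
    struct page *page, *next;

    /* ... bulk allocate into 'pages' and use the pages ... */

    /* Caller drops the references; keep pages someone else still holds. */
    list_for_each_entry_safe(page, next, &pages, lru) {
            if (!put_page_testzero(page))
                    list_del(&page->lru);
    }

    free_hot_cold_page_list(&pages, false);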

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 4/4] mm, page_alloc: Add a bulk page allocator

2017-01-16 Thread Jesper Dangaard Brouer
On Mon,  9 Jan 2017 16:35:18 +
Mel Gorman  wrote:

> This patch adds a new page allocator interface via alloc_pages_bulk,
> __alloc_pages_bulk and __alloc_pages_bulk_nodemask. A caller requests a
> number of pages to be allocated and added to a list. They can be freed in
> bulk using free_pages_bulk(). Note that it would theoretically be possible
> to use free_hot_cold_page_list for faster frees if the symbol was exported,
> the refcounts were 0 and the caller guaranteed it was not in an interrupt.
> This would be significantly faster in the free path but also more unsafer
> and a harder API to use.
> 
> The API is not guaranteed to return the requested number of pages and
> may fail if the preferred allocation zone has limited free memory, the
> cpuset changes during the allocation or page debugging decides to fail
> an allocation. It's up to the caller to request more pages in batch if
> necessary.
> 
> The following compares the allocation cost per page for different batch
> sizes. The baseline is allocating them one at a time and it compares with
> the performance when using the new allocation interface.

I've also played with testing the bulking API here:
 [1] 
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c

My baseline single (order-0 page) show: 158 cycles(tsc) 39.593 ns

Using bulking API:
 Bulk:   1 cycles: 128 nanosec: 32.134
 Bulk:   2 cycles: 107 nanosec: 26.783
 Bulk:   3 cycles: 100 nanosec: 25.047
 Bulk:   4 cycles:  95 nanosec: 23.988
 Bulk:   8 cycles:  91 nanosec: 22.823
 Bulk:  16 cycles:  88 nanosec: 22.093
 Bulk:  32 cycles:  85 nanosec: 21.338
 Bulk:  64 cycles:  85 nanosec: 21.315
 Bulk: 128 cycles:  84 nanosec: 21.214
 Bulk: 256 cycles: 115 nanosec: 28.979

This bulk API (and other improvements part of patchset) definitely
moves the speed of the page allocator closer to my (crazy) time budget
target of between 201 to 269 cycles per packet[1].  Remember I was
reporting[2] order-0 cost between 231 to 277 cycles, at MM-summit
2016, so this is a huge improvement since then.

The bulk numbers are great, but it still cannot compete with the
recycles tricks used by drivers.  Looking at the code (and as Mel also
mentions) there is room for improvements especially on the bulk free-side.


[1] 
http://people.netfilter.org/hawk/presentations/devconf2016/net_stack_challenges_100G_Feb2016.pdf
[2] 
http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf

> pagealloc
>                                4.10.0-rc2             4.10.0-rc2
>                             one-at-a-time                bulk-v2
> Amean    alloc-odr0-1      259.54 (  0.00%)      106.62 ( 58.92%)
> Amean    alloc-odr0-2      193.38 (  0.00%)       76.38 ( 60.50%)
> Amean    alloc-odr0-4      162.38 (  0.00%)       57.23 ( 64.76%)
> Amean    alloc-odr0-8      144.31 (  0.00%)       48.77 ( 66.20%)
> Amean    alloc-odr0-16     134.08 (  0.00%)       45.38 ( 66.15%)
> Amean    alloc-odr0-32     128.62 (  0.00%)       42.77 ( 66.75%)
> Amean    alloc-odr0-64     126.00 (  0.00%)       41.00 ( 67.46%)
> Amean    alloc-odr0-128    125.00 (  0.00%)       40.08 ( 67.94%)
> Amean    alloc-odr0-256    136.62 (  0.00%)       56.00 ( 59.01%)
> Amean    alloc-odr0-512    152.00 (  0.00%)       69.00 ( 54.61%)
> Amean    alloc-odr0-1024   158.00 (  0.00%)       76.23 ( 51.75%)
> Amean    alloc-odr0-2048   163.00 (  0.00%)       81.15 ( 50.21%)
> Amean    alloc-odr0-4096   169.77 (  0.00%)       85.92 ( 49.39%)
> Amean    alloc-odr0-8192   170.00 (  0.00%)       88.00 ( 48.24%)
> Amean    alloc-odr0-16384  170.00 (  0.00%)       89.00 ( 47.65%)
> Amean    free-odr0-1        88.69 (  0.00%)       55.69 ( 37.21%)
> Amean    free-odr0-2        66.00 (  0.00%)       49.38 ( 25.17%)
> Amean    free-odr0-4        54.23 (  0.00%)       45.38 ( 16.31%)
> Amean    free-odr0-8        48.23 (  0.00%)       44.23 (  8.29%)
> Amean    free-odr0-16       47.00 (  0.00%)       45.00 (  4.26%)
> Amean    free-odr0-32       44.77 (  0.00%)       43.92 (  1.89%)
> Amean    free-odr0-64       44.00 (  0.00%)       43.00 (  2.27%)
> Amean    free-odr0-128      43.00 (  0.00%)       43.00 (  0.00%)
> Amean    free-odr0-256      60.69 (  0.00%)       60.46 (  0.38%)
> Amean    free-odr0-512      79.23 (  0.00%)       76.00 (  4.08%)
> Amean    free-odr0-1024     86.00 (  0.00%)       85.38 (  0.72%)
> Amean    free-odr0-2048     91.00 (  0.00%)       91.23 ( -0.25%)
> Amean

Re: [PATCH 4/4] mm, page_alloc: Add a bulk page allocator

2017-01-10 Thread Mel Gorman
On Tue, Jan 10, 2017 at 12:00:27PM +0800, Hillf Danton wrote:
> > It shows a roughly 50-60% reduction in the cost of allocating pages.
> > The free paths are not improved as much but relatively little can be batched
> > there. It's not quite as fast as it could be but taking further shortcuts
> > would require making a lot of assumptions about the state of the page and
> > the context of the caller.
> > 
> > Signed-off-by: Mel Gorman 
> > ---
> Acked-by: Hillf Danton 
> 

Thanks.

> > @@ -2485,7 +2485,7 @@ void free_hot_cold_page(struct page *page, bool cold)
> >  }
> > 
> >  /*
> > - * Free a list of 0-order pages
> > + * Free a list of 0-order pages whose reference count is already zero.
> >   */
> >  void free_hot_cold_page_list(struct list_head *list, bool cold)
> >  {
> > @@ -2495,7 +2495,28 @@ void free_hot_cold_page_list(struct list_head *list, 
> > bool cold)
> > trace_mm_page_free_batched(page, cold);
> > free_hot_cold_page(page, cold);
> > }
> > +
> > +   INIT_LIST_HEAD(list);
> 
> Nit: can we cut this overhead off?

Yes, but note that any caller of free_hot_cold_page_list() is then
required to reinit the list themselves or it'll cause corruption. It's
unlikely that a user of the bulk interface will handle the refcounts and
be able to use this interface properly but if they do, they need to
either reinit this or add the hunk back in.

It happens that all callers currently don't care.
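
To make the requirement concrete, here is a minimal, hypothetical caller
sketch showing what would be needed if that INIT_LIST_HEAD() hunk were
dropped; the helper name is made up and free_hot_cold_page_list() is used
with the signature quoted above.

	/* Pages on the list must already have a zero reference count. */
	static void drain_page_list(struct list_head *pages)
	{
		free_hot_cold_page_list(pages, false);

		/*
		 * Without the INIT_LIST_HEAD() inside
		 * free_hot_cold_page_list(), the list head would still point
		 * at freed pages here, so the caller has to reinitialise it
		 * before reusing the list.
		 */
		INIT_LIST_HEAD(pages);
	}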

> >  /*
> >   * split_page takes a non-compound higher-order page, and splits it into
> > @@ -3887,6 +3908,99 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int 
> > order,
> >  EXPORT_SYMBOL(__alloc_pages_nodemask);
> > 
> >  /*
> > + * This is a batched version of the page allocator that attempts to
> > + * allocate nr_pages quickly from the preferred zone and add them to list.
> > + * Note that there is no guarantee that nr_pages will be allocated although
> > + * every effort will be made to allocate at least one. Unlike the core
> > + * allocator, no special effort is made to recover from transient
> > + * failures caused by changes in cpusets. It should only be used from !IRQ
> > + * context. An attempt to allocate a batch of pages from an interrupt
> > + * will allocate a single page.
> > + */
> > +unsigned long
> > +__alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
> > +   struct zonelist *zonelist, nodemask_t *nodemask,
> > +   unsigned long nr_pages, struct list_head *alloc_list)
> > +{
> > +   struct page *page;
> > +   unsigned long alloced = 0;
> > +   unsigned int alloc_flags = ALLOC_WMARK_LOW;
> > +   struct zone *zone;
> > +   struct per_cpu_pages *pcp;
> > +   struct list_head *pcp_list;
> > +   int migratetype;
> > +   gfp_t alloc_mask = gfp_mask; /* The gfp_t that was actually used for 
> > allocation */
> > +   struct alloc_context ac = { };
> > +   bool cold = ((gfp_mask & __GFP_COLD) != 0);
> > +
> > +   /* If there are already pages on the list, don't bother */
> > +   if (!list_empty(alloc_list))
> > +   return 0;
> 
> Nit: can we move the check to the call site?

Yes, but it makes the API slightly more hazardous to use.

> > +
> > +   /* Only handle bulk allocation of order-0 */
> > +   if (order || in_interrupt())
> > +   goto failed;
> 
> Ditto
> 

Same, if the caller is in interrupt context, there is a slight risk that
they'll corrupt the list in a manner that will be tricky to catch. The
checks are to minimise the risk of being surprising.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 4/4] mm, page_alloc: Add a bulk page allocator

2017-01-09 Thread Hillf Danton
On Tuesday, January 10, 2017 12:35 AM Mel Gorman wrote: 
> 
> This patch adds a new page allocator interface via alloc_pages_bulk,
> __alloc_pages_bulk and __alloc_pages_bulk_nodemask. A caller requests a
> number of pages to be allocated and added to a list. They can be freed in
> bulk using free_pages_bulk(). Note that it would theoretically be possible
> to use free_hot_cold_page_list for faster frees if the symbol was exported,
> the refcounts were 0 and the caller guaranteed it was not in an interrupt.
> This would be significantly faster in the free path but also more unsafe
> and a harder API to use.
> 
> The API is not guaranteed to return the requested number of pages and
> may fail if the preferred allocation zone has limited free memory, the
> cpuset changes during the allocation or page debugging decides to fail
> an allocation. It's up to the caller to request more pages in batch if
> necessary.
> 
> The following compares the allocation cost per page for different batch
> sizes. The baseline is allocating them one at a time and it compares with
> the performance when using the new allocation interface.
> 
> pagealloc
>                              4.10.0-rc2             4.10.0-rc2
>                           one-at-a-time                bulk-v2
> Amean    alloc-odr0-1       259.54 (  0.00%)   106.62 ( 58.92%)
> Amean    alloc-odr0-2       193.38 (  0.00%)    76.38 ( 60.50%)
> Amean    alloc-odr0-4       162.38 (  0.00%)    57.23 ( 64.76%)
> Amean    alloc-odr0-8       144.31 (  0.00%)    48.77 ( 66.20%)
> Amean    alloc-odr0-16      134.08 (  0.00%)    45.38 ( 66.15%)
> Amean    alloc-odr0-32      128.62 (  0.00%)    42.77 ( 66.75%)
> Amean    alloc-odr0-64      126.00 (  0.00%)    41.00 ( 67.46%)
> Amean    alloc-odr0-128     125.00 (  0.00%)    40.08 ( 67.94%)
> Amean    alloc-odr0-256     136.62 (  0.00%)    56.00 ( 59.01%)
> Amean    alloc-odr0-512     152.00 (  0.00%)    69.00 ( 54.61%)
> Amean    alloc-odr0-1024    158.00 (  0.00%)    76.23 ( 51.75%)
> Amean    alloc-odr0-2048    163.00 (  0.00%)    81.15 ( 50.21%)
> Amean    alloc-odr0-4096    169.77 (  0.00%)    85.92 ( 49.39%)
> Amean    alloc-odr0-8192    170.00 (  0.00%)    88.00 ( 48.24%)
> Amean    alloc-odr0-16384   170.00 (  0.00%)    89.00 ( 47.65%)
> Amean    free-odr0-1         88.69 (  0.00%)    55.69 ( 37.21%)
> Amean    free-odr0-2         66.00 (  0.00%)    49.38 ( 25.17%)
> Amean    free-odr0-4         54.23 (  0.00%)    45.38 ( 16.31%)
> Amean    free-odr0-8         48.23 (  0.00%)    44.23 (  8.29%)
> Amean    free-odr0-16        47.00 (  0.00%)    45.00 (  4.26%)
> Amean    free-odr0-32        44.77 (  0.00%)    43.92 (  1.89%)
> Amean    free-odr0-64        44.00 (  0.00%)    43.00 (  2.27%)
> Amean    free-odr0-128       43.00 (  0.00%)    43.00 (  0.00%)
> Amean    free-odr0-256       60.69 (  0.00%)    60.46 (  0.38%)
> Amean    free-odr0-512       79.23 (  0.00%)    76.00 (  4.08%)
> Amean    free-odr0-1024      86.00 (  0.00%)    85.38 (  0.72%)
> Amean    free-odr0-2048      91.00 (  0.00%)    91.23 ( -0.25%)
> Amean    free-odr0-4096      94.85 (  0.00%)    95.62 ( -0.81%)
> Amean    free-odr0-8192      97.00 (  0.00%)    97.00 (  0.00%)
> Amean    free-odr0-16384     98.00 (  0.00%)    97.46 (  0.55%)
> Amean    total-odr0-1       348.23 (  0.00%)   162.31 ( 53.39%)
> Amean    total-odr0-2       259.38 (  0.00%)   125.77 ( 51.51%)
> Amean    total-odr0-4       216.62 (  0.00%)   102.62 ( 52.63%)
> Amean    total-odr0-8       192.54 (  0.00%)    93.00 ( 51.70%)
> Amean    total-odr0-16      181.08 (  0.00%)    90.38 ( 50.08%)
> Amean    total-odr0-32      173.38 (  0.00%)    86.69 ( 50.00%)
> Amean    total-odr0-64      170.00 (  0.00%)    84.00 ( 50.59%)
> Amean    total-odr0-128     168.00 (  0.00%)    83.08 ( 50.55%)
> Amean    total-odr0-256     197.31 (  0.00%)   116.46 ( 40.97%)
> Amean    total-odr0-512     231.23 (  0.00%)   145.00 ( 37.29%)
> Amean    total-odr0-1024    244.00 (  0.00%)   161.62 ( 33.76%)
> Amean    total-odr0-2048    254.00 (  0.00%)   172.38 ( 32.13%)
> Amean    total-odr0-4096    264.62 (  0.00%)   181.54 ( 31.40%)
> Amean

[PATCH 4/4] mm, page_alloc: Add a bulk page allocator

2017-01-09 Thread Mel Gorman
This patch adds a new page allocator interface via alloc_pages_bulk,
__alloc_pages_bulk and __alloc_pages_bulk_nodemask. A caller requests a
number of pages to be allocated and added to a list. They can be freed in
bulk using free_pages_bulk(). Note that it would theoretically be possible
to use free_hot_cold_page_list for faster frees if the symbol was exported,
the refcounts were 0 and the caller guaranteed it was not in an interrupt.
This would be significantly faster in the free path but also more unsafe
and a harder API to use.

The API is not guaranteed to return the requested number of pages and
may fail if the preferred allocation zone has limited free memory, the
cpuset changes during the allocation or page debugging decides to fail
an allocation. It's up to the caller to request more pages in batch if
necessary.
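
As a rough sketch of what "request more pages in batch" could look like on the
caller side, under the behaviour of the function as posted (it refuses to add
to a non-empty list and so returns 0 in that case), something like the
following would work; the helper and variable names are invented for
illustration.

	static unsigned long fill_page_list(unsigned long want,
					    struct list_head *list)
	{
		unsigned long got = 0;

		while (got < want) {
			LIST_HEAD(batch);
			unsigned long nr;

			/*
			 * Allocate into a private list and splice, because
			 * the bulk API returns 0 if the list passed in is
			 * not empty.
			 */
			nr = alloc_pages_bulk(GFP_KERNEL, 0, want - got, &batch);
			if (!nr)
				break;	/* zone is short on memory, give up */

			list_splice_tail(&batch, list);
			got += nr;
		}

		return got;
	}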

The following compares the allocation cost per page for different batch
sizes. The baseline is allocating them one at a time and it compares with
the performance when using the new allocation interface.

pagealloc
                             4.10.0-rc2             4.10.0-rc2
                          one-at-a-time                bulk-v2
Amean    alloc-odr0-1       259.54 (  0.00%)   106.62 ( 58.92%)
Amean    alloc-odr0-2       193.38 (  0.00%)    76.38 ( 60.50%)
Amean    alloc-odr0-4       162.38 (  0.00%)    57.23 ( 64.76%)
Amean    alloc-odr0-8       144.31 (  0.00%)    48.77 ( 66.20%)
Amean    alloc-odr0-16      134.08 (  0.00%)    45.38 ( 66.15%)
Amean    alloc-odr0-32      128.62 (  0.00%)    42.77 ( 66.75%)
Amean    alloc-odr0-64      126.00 (  0.00%)    41.00 ( 67.46%)
Amean    alloc-odr0-128     125.00 (  0.00%)    40.08 ( 67.94%)
Amean    alloc-odr0-256     136.62 (  0.00%)    56.00 ( 59.01%)
Amean    alloc-odr0-512     152.00 (  0.00%)    69.00 ( 54.61%)
Amean    alloc-odr0-1024    158.00 (  0.00%)    76.23 ( 51.75%)
Amean    alloc-odr0-2048    163.00 (  0.00%)    81.15 ( 50.21%)
Amean    alloc-odr0-4096    169.77 (  0.00%)    85.92 ( 49.39%)
Amean    alloc-odr0-8192    170.00 (  0.00%)    88.00 ( 48.24%)
Amean    alloc-odr0-16384   170.00 (  0.00%)    89.00 ( 47.65%)
Amean    free-odr0-1         88.69 (  0.00%)    55.69 ( 37.21%)
Amean    free-odr0-2         66.00 (  0.00%)    49.38 ( 25.17%)
Amean    free-odr0-4         54.23 (  0.00%)    45.38 ( 16.31%)
Amean    free-odr0-8         48.23 (  0.00%)    44.23 (  8.29%)
Amean    free-odr0-16        47.00 (  0.00%)    45.00 (  4.26%)
Amean    free-odr0-32        44.77 (  0.00%)    43.92 (  1.89%)
Amean    free-odr0-64        44.00 (  0.00%)    43.00 (  2.27%)
Amean    free-odr0-128       43.00 (  0.00%)    43.00 (  0.00%)
Amean    free-odr0-256       60.69 (  0.00%)    60.46 (  0.38%)
Amean    free-odr0-512       79.23 (  0.00%)    76.00 (  4.08%)
Amean    free-odr0-1024      86.00 (  0.00%)    85.38 (  0.72%)
Amean    free-odr0-2048      91.00 (  0.00%)    91.23 ( -0.25%)
Amean    free-odr0-4096      94.85 (  0.00%)    95.62 ( -0.81%)
Amean    free-odr0-8192      97.00 (  0.00%)    97.00 (  0.00%)
Amean    free-odr0-16384     98.00 (  0.00%)    97.46 (  0.55%)
Amean    total-odr0-1       348.23 (  0.00%)   162.31 ( 53.39%)
Amean    total-odr0-2       259.38 (  0.00%)   125.77 ( 51.51%)
Amean    total-odr0-4       216.62 (  0.00%)   102.62 ( 52.63%)
Amean    total-odr0-8       192.54 (  0.00%)    93.00 ( 51.70%)
Amean    total-odr0-16      181.08 (  0.00%)    90.38 ( 50.08%)
Amean    total-odr0-32      173.38 (  0.00%)    86.69 ( 50.00%)
Amean    total-odr0-64      170.00 (  0.00%)    84.00 ( 50.59%)
Amean    total-odr0-128     168.00 (  0.00%)    83.08 ( 50.55%)
Amean    total-odr0-256     197.31 (  0.00%)   116.46 ( 40.97%)
Amean    total-odr0-512     231.23 (  0.00%)   145.00 ( 37.29%)
Amean    total-odr0-1024    244.00 (  0.00%)   161.62 ( 33.76%)
Amean    total-odr0-2048    254.00 (  0.00%)   172.38 ( 32.13%)
Amean    total-odr0-4096    264.62 (  0.00%)   181.54 ( 31.40%)
Amean    total-odr0-8192    267.00 (  0.00%)   185.00 ( 30.71%)
Amean    total-odr0-16384   268.00 (  0.00%)   186.46 ( 30.42%)

It shows a roughly 50-60% reduction in the cost of allocating pages.
The free paths are not improved as much but relatively little can be batched
there. It's not quite as fast as it could 

Re: [PATCH 4/4] mm, page_alloc: Add a bulk page allocator

2017-01-04 Thread Mel Gorman
On Wed, Jan 04, 2017 at 02:48:44PM +0100, Jesper Dangaard Brouer wrote:
> On Wed,  4 Jan 2017 11:10:49 +
> > The API is not guaranteed to return the requested number of pages and
> > may fail if the preferred allocation zone has limited free memory,
> > the cpuset changes during the allocation or page debugging decides
> > to fail an allocation. It's up to the caller to request more pages
> > in batch if necessary.
> 
> I generally like it, thanks! :-)
> 

No problem.

> > +   /*
> > +* Only attempt a batch allocation if watermarks on the preferred zone
> > +* are safe.
> > +*/
> > +   zone = ac.preferred_zoneref->zone;
> > +   if (!zone_watermark_fast(zone, order, zone->watermark[ALLOC_WMARK_HIGH] + nr_pages,
> > +zonelist_zone_idx(ac.preferred_zoneref), alloc_flags))
> > +   goto failed;
> > +
> > +   /* Attempt the batch allocation */
> > +   migratetype = ac.migratetype;
> > +
> > +   local_irq_save(flags);
> 
> It would be a win if we could either use local_irq_{disable,enable} or
> preempt_{disable,enable} here, by dictating it can only be used from
> irq-safe context (like you did in patch 3).
> 

This was a stupid mistake during a rebase. I should have removed all the
IRQ-disabling entirely and made it only usable from non-IRQ context to
keep the motivation of patch 3 in place. It was a botched rebase of the
patch on top of patch 3 that wasn't properly tested. It still illustrates
the general shape at least. For extra safety, I should force it to return
just a single page if called from interrupt context.

Is bulk allocation from IRQ context a requirement? If so, the motivation
for patch 3 disappears which is a pity but IRQ safety has a price. The
shape of V2 depends on the answer.
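
For what it's worth, a rough sketch of that fallback, based on the function as
posted in this thread (the helper name is invented for illustration), could be
as simple as:

	/*
	 * When the batch path cannot be used (interrupt context, order > 0,
	 * or watermarks too low), fall back to one ordinary allocation so
	 * the caller still gets at least one page.
	 */
	static unsigned long bulk_alloc_fallback(gfp_t gfp_mask, unsigned int order,
						 struct zonelist *zonelist,
						 nodemask_t *nodemask,
						 struct list_head *alloc_list)
	{
		struct page *page;

		page = __alloc_pages_nodemask(gfp_mask, order, zonelist, nodemask);
		if (!page)
			return 0;

		list_add(&page->lru, alloc_list);
		return 1;
	}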

> 
> > +   pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > +   pcp_list = &pcp->lists[migratetype];
> > +
> > +   while (nr_pages) {
> > +   page = __rmqueue_pcplist(zone, order, gfp_mask, migratetype,
> > +   cold, pcp, 
> > pcp_list);
> > +   if (!page)
> > +   break;
> > +
> > +   nr_pages--;
> > +   alloced++;
> > +   list_add(&page->lru, alloc_list);
> > +   }
> > +
> > +   if (!alloced) {
> > +   local_irq_restore(flags);
> > +   preempt_enable();
> 
> The preempt_enable here looks wrong.
> 

It is because I screwed up the rebase.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 4/4] mm, page_alloc: Add a bulk page allocator

2017-01-04 Thread Jesper Dangaard Brouer
On Wed,  4 Jan 2017 11:10:49 +
Mel Gorman  wrote:

> This patch adds a new page allocator interface via alloc_pages_bulk,
> __alloc_pages_bulk and __alloc_pages_bulk_nodemask. A caller requests
> a number of pages to be allocated and added to a list. They can be
> freed in bulk using free_hot_cold_page_list.
> 
> The API is not guaranteed to return the requested number of pages and
> may fail if the preferred allocation zone has limited free memory,
> the cpuset changes during the allocation or page debugging decides
> to fail an allocation. It's up to the caller to request more pages
> in batch if necessary.

I generally like it, thanks! :-)

> Signed-off-by: Mel Gorman 
> ---
>  include/linux/gfp.h | 23 ++
>  mm/page_alloc.c | 92 
> +
>  2 files changed, 115 insertions(+)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 4175dca4ac39..1da3a9a48701 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -433,6 +433,29 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
>   return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
>  }
>  
> +unsigned long
> +__alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
> + struct zonelist *zonelist, nodemask_t *nodemask,
> + unsigned long nr_pages, struct list_head *alloc_list);
> +
> +static inline unsigned long
> +__alloc_pages_bulk(gfp_t gfp_mask, unsigned int order,
> + struct zonelist *zonelist, unsigned long nr_pages,
> + struct list_head *list)
> +{
> + return __alloc_pages_bulk_nodemask(gfp_mask, order, zonelist, NULL,
> + nr_pages, list);
> +}
> +
> +static inline unsigned long
> +alloc_pages_bulk(gfp_t gfp_mask, unsigned int order,
> + unsigned long nr_pages, struct list_head *list)
> +{
> + int nid = numa_mem_id();
> + return __alloc_pages_bulk(gfp_mask, order,
> + node_zonelist(nid, gfp_mask), nr_pages, list);
> +}
> +
>  /*
>   * Allocate pages, preferring the node given as nid. The node must be valid 
> and
>   * online. For more general interface, see alloc_pages_node().
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 01b09f9da288..307ad4299dec 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3887,6 +3887,98 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int 
> order,
>  EXPORT_SYMBOL(__alloc_pages_nodemask);
>  
>  /*
> + * This is a batched version of the page allocator that attempts to
> + * allocate nr_pages quickly from the preferred zone and add them to list.
> + * Note that there is no guarantee that nr_pages will be allocated although
> + * every effort will be made to allocate at least one. Unlike the core
> + * allocator, no special effort is made to recover from transient
> + * failures caused by changes in cpusets.
> + */
> +unsigned long
> +__alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
> + struct zonelist *zonelist, nodemask_t *nodemask,
> + unsigned long nr_pages, struct list_head *alloc_list)
> +{
> + struct page *page;
> + unsigned long alloced = 0;
> + unsigned int alloc_flags = ALLOC_WMARK_LOW;
> + struct zone *zone;
> + struct per_cpu_pages *pcp;
> + struct list_head *pcp_list;
> + unsigned long flags;
> + int migratetype;
> + gfp_t alloc_mask = gfp_mask; /* The gfp_t that was actually used for 
> allocation */
> + struct alloc_context ac = { };
> + bool cold = ((gfp_mask & __GFP_COLD) != 0);
> +
> + /* If there are already pages on the list, don't bother */
> + if (!list_empty(alloc_list))
> + return 0;
> +
> + /* Only handle bulk allocation of order-0 */
> + if (order)
> + goto failed;
> +
> + gfp_mask &= gfp_allowed_mask;
> + if (!prepare_alloc_pages(gfp_mask, order, zonelist, nodemask, &ac, &alloc_mask, &alloc_flags))
> + return 0;
> +
> + finalise_ac(gfp_mask, order, &ac);
> + if (!ac.preferred_zoneref)
> + return 0;
> +
> + /*
> +  * Only attempt a batch allocation if watermarks on the preferred zone
> +  * are safe.
> +  */
> + zone = ac.preferred_zoneref->zone;
> + if (!zone_watermark_fast(zone, order, zone->watermark[ALLOC_WMARK_HIGH] + nr_pages,
> +  zonelist_zone_idx(ac.preferred_zoneref), alloc_flags))
> + goto failed;
> +
> + /* Attempt the batch allocation */
> + migratetype = ac.migratetype;
> +
> + local_irq_save(flags);

It would be a win if we could either use local_irq_{disable,enable} or
preempt_{disable,enable} here, by dictating it can only be used from
irq-safe context (like you did in patch 3).
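
For illustration, the preempt-based variant being suggested could look roughly
like the helper below, which mirrors the loop quoted underneath but pins the
CPU with preempt_disable() instead of disabling interrupts. This is a sketch,
not the posted code: the helper name is invented, and it is only safe if the
pcp lists are never manipulated from hard interrupt context, which is what
patch 3 of the series arranges.

	static unsigned long rmqueue_bulk_pcplist(struct zone *zone,
				unsigned int order, gfp_t gfp_mask,
				int migratetype, bool cold,
				unsigned long nr_pages,
				struct list_head *alloc_list)
	{
		struct per_cpu_pages *pcp;
		struct list_head *pcp_list;
		struct page *page;
		unsigned long alloced = 0;

		/* Pin this CPU; IRQ safety is the caller's responsibility. */
		preempt_disable();
		pcp = &this_cpu_ptr(zone->pageset)->pcp;
		pcp_list = &pcp->lists[migratetype];

		while (nr_pages) {
			page = __rmqueue_pcplist(zone, order, gfp_mask,
						 migratetype, cold, pcp,
						 pcp_list);
			if (!page)
				break;

			nr_pages--;
			alloced++;
			list_add(&page->lru, alloc_list);
		}

		preempt_enable();
		return alloced;
	}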


> + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> + pcp_list = &pcp->lists[migratetype];
> +
> + while (nr_pages) {

[PATCH 4/4] mm, page_alloc: Add a bulk page allocator

2017-01-04 Thread Mel Gorman
This patch adds a new page allocator interface via alloc_pages_bulk,
__alloc_pages_bulk and __alloc_pages_bulk_nodemask. A caller requests
a number of pages to be allocated and added to a list. They can be
freed in bulk using free_hot_cold_page_list.

The API is not guaranteed to return the requested number of pages and
may fail if the preferred allocation zone has limited free memory,
the cpuset changes during the allocation or page debugging decides
to fail an allocation. It's up to the caller to request more pages
in batch if necessary.

Signed-off-by: Mel Gorman 
---
 include/linux/gfp.h | 23 ++
 mm/page_alloc.c | 92 +
 2 files changed, 115 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4175dca4ac39..1da3a9a48701 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -433,6 +433,29 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order,
return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
 }
 
+unsigned long
+__alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
+   struct zonelist *zonelist, nodemask_t *nodemask,
+   unsigned long nr_pages, struct list_head *alloc_list);
+
+static inline unsigned long
+__alloc_pages_bulk(gfp_t gfp_mask, unsigned int order,
+   struct zonelist *zonelist, unsigned long nr_pages,
+   struct list_head *list)
+{
+   return __alloc_pages_bulk_nodemask(gfp_mask, order, zonelist, NULL,
+   nr_pages, list);
+}
+
+static inline unsigned long
+alloc_pages_bulk(gfp_t gfp_mask, unsigned int order,
+   unsigned long nr_pages, struct list_head *list)
+{
+   int nid = numa_mem_id();
+   return __alloc_pages_bulk(gfp_mask, order,
+   node_zonelist(nid, gfp_mask), nr_pages, list);
+}
+
 /*
  * Allocate pages, preferring the node given as nid. The node must be valid and
  * online. For more general interface, see alloc_pages_node().
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 01b09f9da288..307ad4299dec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3887,6 +3887,98 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int 
order,
 EXPORT_SYMBOL(__alloc_pages_nodemask);
 
 /*
+ * This is a batched version of the page allocator that attempts to
+ * allocate nr_pages quickly from the preferred zone and add them to list.
+ * Note that there is no guarantee that nr_pages will be allocated although
+ * every effort will be made to allocate at least one. Unlike the core
+ * allocator, no special effort is made to recover from transient
+ * failures caused by changes in cpusets.
+ */
+unsigned long
+__alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
+   struct zonelist *zonelist, nodemask_t *nodemask,
+   unsigned long nr_pages, struct list_head *alloc_list)
+{
+   struct page *page;
+   unsigned long alloced = 0;
+   unsigned int alloc_flags = ALLOC_WMARK_LOW;
+   struct zone *zone;
+   struct per_cpu_pages *pcp;
+   struct list_head *pcp_list;
+   unsigned long flags;
+   int migratetype;
+   gfp_t alloc_mask = gfp_mask; /* The gfp_t that was actually used for 
allocation */
+   struct alloc_context ac = { };
+   bool cold = ((gfp_mask & __GFP_COLD) != 0);
+
+   /* If there are already pages on the list, don't bother */
+   if (!list_empty(alloc_list))
+   return 0;
+
+   /* Only handle bulk allocation of order-0 */
+   if (order)
+   goto failed;
+
+   gfp_mask &= gfp_allowed_mask;
+   if (!prepare_alloc_pages(gfp_mask, order, zonelist, nodemask, &ac, &alloc_mask, &alloc_flags))
+   return 0;
+
+   finalise_ac(gfp_mask, order, &ac);
+   if (!ac.preferred_zoneref)
+   return 0;
+
+   /*
+* Only attempt a batch allocation if watermarks on the preferred zone
+* are safe.
+*/
+   zone = ac.preferred_zoneref->zone;
+   if (!zone_watermark_fast(zone, order, zone->watermark[ALLOC_WMARK_HIGH] + nr_pages,
+zonelist_zone_idx(ac.preferred_zoneref), alloc_flags))
+   goto failed;
+
+   /* Attempt the batch allocation */
+   migratetype = ac.migratetype;
+
+   local_irq_save(flags);
+   pcp = &this_cpu_ptr(zone->pageset)->pcp;
+   pcp_list = &pcp->lists[migratetype];
+
+   while (nr_pages) {
+   page = __rmqueue_pcplist(zone, order, gfp_mask, migratetype,
+   cold, pcp, pcp_list);
+   if (!page)
+   break;
+
+   nr_pages--;
+   alloced++;
+   list_add(&page->lru, alloc_list);
+   }
+
+   if (!alloced) {
+   local_irq_restore(flags);
+   preempt_enable();
+   goto failed;
+   }
+
+