Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v4

2016-12-01 Thread Vlastimil Babka

On 12/01/2016 03:24 PM, Mel Gorman wrote:

> On Thu, Dec 01, 2016 at 02:41:29PM +0100, Vlastimil Babka wrote:
> > On 12/01/2016 01:24 AM, Mel Gorman wrote:
> > 
> > ...
> > 
> > Hmm I think that if this hits, we don't decrease count/increase nr_freed and
> > pcp->count will become wrong.
> 
> Ok, I think you're right but I also think it's relatively trivial to fix
> with
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 94808f565f74..8777aefc1b8e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1134,13 +1134,13 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> if (unlikely(isolated_pageblocks))
> mt = get_pageblock_migratetype(page);
> 
> +   nr_freed += (1 << order);
> +   count -= (1 << order);
> if (bulkfree_pcp_prepare(page))
> continue;
> 
> __free_one_page(page, page_to_pfn(page), zone, order, mt);
> trace_mm_page_pcpu_drain(page, order, mt);
> -   nr_freed += (1 << order);
> -   count -= (1 << order);
> } while (count > 0 && --batch_free && !list_empty(list));
> }
> spin_unlock(&zone->lock);
> 
> > And if we are unlucky/doing full drain, all
> > lists will get empty, but as count stays e.g. 1, we loop forever on the
> > outer while()?
> > 
> 
> Potentially yes. Granted the system is already in a bad state as pages
> are being freed in a bad or unknown state but we haven't halted the
> system for that in the past.
> 
> > BTW, I think there's a similar problem (but not introduced by this patch) in
> > rmqueue_bulk() and its
> > 
> > if (unlikely(check_pcp_refill(page)))
> > continue;
> > 
> 
> Potentially yes. It's outside the scope of this patch but it needs
> fixing.
> 
> If you agree with the above fix, I'll roll it into a v5 and append
> another patch for this issue.


Yeah, looks fine. Thanks.




Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v4

2016-12-01 Thread Mel Gorman
On Thu, Dec 01, 2016 at 02:41:29PM +0100, Vlastimil Babka wrote:
> On 12/01/2016 01:24 AM, Mel Gorman wrote:
> 
> ...
> 
> > @@ -1096,28 +1097,29 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > if (nr_scanned)
> > __mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
> > 
> > -   while (count) {
> > +   while (count > 0) {
> > struct page *page;
> > struct list_head *list;
> > +   unsigned int order;
> > 
> > /*
> >  * Remove pages from lists in a round-robin fashion. A
> >  * batch_free count is maintained that is incremented when an
> > -* empty list is encountered.  This is so more pages are freed
> > -* off fuller lists instead of spinning excessively around empty
> > -* lists
> > +* empty list is encountered. This is not exact due to
> > +* high-order but precision is not required.
> >  */
> > do {
> > batch_free++;
> > -   if (++migratetype == MIGRATE_PCPTYPES)
> > -   migratetype = 0;
> > -   list = &pcp->lists[migratetype];
> > +   if (++pindex == NR_PCP_LISTS)
> > +   pindex = 0;
> > +   list = &pcp->lists[pindex];
> > } while (list_empty(list));
> > 
> > /* This is the only non-empty list. Free them all. */
> > -   if (batch_free == MIGRATE_PCPTYPES)
> > +   if (batch_free == NR_PCP_LISTS)
> > batch_free = count;
> > 
> > +   order = pindex_to_order(pindex);
> > do {
> > int mt; /* migratetype of the to-be-freed page */
> > 
> > @@ -1135,11 +1137,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > if (bulkfree_pcp_prepare(page))
> > continue;
> 
> Hmm I think that if this hits, we don't decrease count/increase nr_freed and
> pcp->count will become wrong.

Ok, I think you're right but I also think it's relatively trivial to fix
with

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 94808f565f74..8777aefc1b8e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1134,13 +1134,13 @@ static void free_pcppages_bulk(struct zone *zone, int count,
if (unlikely(isolated_pageblocks))
mt = get_pageblock_migratetype(page);
 
+   nr_freed += (1 << order);
+   count -= (1 << order);
if (bulkfree_pcp_prepare(page))
continue;
 
__free_one_page(page, page_to_pfn(page), zone, order, mt);
trace_mm_page_pcpu_drain(page, order, mt);
-   nr_freed += (1 << order);
-   count -= (1 << order);
} while (count > 0 && --batch_free && !list_empty(list));
}
spin_unlock(&zone->lock);

> And if we are unlucky/doing full drain, all
> lists will get empty, but as count stays e.g. 1, we loop forever on the
> outer while()?
> 

Potentially yes. Granted the system is already in a bad state as pages
are being freed in a bad or unknown state but we haven't halted the
system for that in the past.

> BTW, I think there's a similar problem (but not introduced by this patch) in
> rmqueue_bulk() and its
> 
> if (unlikely(check_pcp_refill(page)))
> continue;
> 

Potentially yes. It's outside the scope of this patch but it needs
fixing.

If you agree with the above fix, I'll roll it into a v5 and append
another patch for this issue.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v4

2016-12-01 Thread Michal Hocko
On Thu 01-12-16 00:24:40, Mel Gorman wrote:
> Changelog since v3
> o Allow high-order atomic allocations to use reserves
> 
> Changelog since v2
> o Correct initialisation to avoid -Woverflow warning
> 
> SLUB has been the default small kernel object allocator for quite some time
> but it is not universally used due to performance concerns and a reliance
> on high-order pages. The high-order concerns have two major components --
> high-order pages are not always available and high-order page allocations
> potentially contend on the zone->lock. This patch addresses some concerns
> about the zone lock contention by extending the per-cpu page allocator to
> cache high-order pages. The patch makes the following modifications
> 
> o New per-cpu lists are added to cache the high-order pages. This increases
>   the cache footprint of the per-cpu allocator and overall usage but for
>   some workloads, this will be offset by reduced contention on zone->lock.
>   The first MIGRATE_PCPTYPES entries in the list are per-migratetype. The
>   remaining are high-order caches up to and including
>   PAGE_ALLOC_COSTLY_ORDER
> 
> o pcp accounting during free is now confined to free_pcppages_bulk as it's
>   impossible for the caller to know exactly how many pages were freed.
>   Due to the high-order caches, the number of pages drained for a request
>   is no longer precise.
> 
> o The high watermark for per-cpu pages is increased to reduce the probability
>   that a single refill causes a drain on the next free.
> 
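
The hunks quoted elsewhere in the thread use NR_PCP_LISTS and pindex_to_order()
without showing their definitions. As a rough guide to the layout described in
the first bullet above, here is one plausible mapping: MIGRATE_PCPTYPES and
PAGE_ALLOC_COSTLY_ORDER are existing kernel constants, while NR_PCP_LISTS and
the helper body below are assumptions for illustration, not the patch's actual
code.

#define MIGRATE_PCPTYPES	3	/* unmovable, movable, reclaimable */
#define PAGE_ALLOC_COSTLY_ORDER	3

/* order-0 lists per migratetype, then one list per order 1..COSTLY_ORDER */
#define NR_PCP_LISTS	(MIGRATE_PCPTYPES + PAGE_ALLOC_COSTLY_ORDER)

static inline unsigned int pindex_to_order(unsigned int pindex)
{
	/* lists 0..2 cache order-0 pages; lists 3..5 cache orders 1..3 */
	return pindex < MIGRATE_PCPTYPES ? 0 : pindex - MIGRATE_PCPTYPES + 1;
}
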
> The benefit depends on both the workload and the machine as ultimately the
> determining factor is whether cache line bounces on zone->lock or contention
> is a problem. The patch was tested on a variety of workloads and machines,
> some of which are reported here.
> 
> This is the result from netperf running UDP_STREAM on localhost. It was
> selected on the basis that it is slab-intensive and has been the subject
> of previous SLAB vs SLUB comparisons with the caveat that this is not
> testing between two physical hosts.
> 
> 2-socket modern machine
>                            4.9.0-rc5             4.9.0-rc5
>                              vanilla             hopcpu-v4
> Hmean    send-64       178.38 (  0.00%)      259.91 ( 45.70%)
> Hmean    send-128      351.49 (  0.00%)      532.46 ( 51.49%)
> Hmean    send-256      671.23 (  0.00%)     1022.99 ( 52.40%)
> Hmean    send-1024    2663.60 (  0.00%)     3995.17 ( 49.99%)
> Hmean    send-2048    5126.53 (  0.00%)     7587.05 ( 48.00%)
> Hmean    send-3312    7949.99 (  0.00%)    11697.16 ( 47.13%)
> Hmean    send-4096    9433.56 (  0.00%)    13188.14 ( 39.80%)
> Hmean    send-8192   15940.64 (  0.00%)    22023.98 ( 38.16%)
> Hmean    send-16384  26699.54 (  0.00%)    32703.27 ( 22.49%)
> Hmean    recv-64       178.38 (  0.00%)      259.89 ( 45.70%)
> Hmean    recv-128      351.49 (  0.00%)      532.43 ( 51.48%)
> Hmean    recv-256      671.20 (  0.00%)     1022.75 ( 52.38%)
> Hmean    recv-1024    2663.45 (  0.00%)     3994.49 ( 49.97%)
> Hmean    recv-2048    5126.26 (  0.00%)     7585.38 ( 47.97%)
> Hmean    recv-3312    7949.50 (  0.00%)    11695.43 ( 47.12%)
> Hmean    recv-4096    9433.04 (  0.00%)    13186.34 ( 39.79%)
> Hmean    recv-8192   15939.64 (  0.00%)    22021.41 ( 38.15%)
> Hmean    recv-16384  26698.44 (  0.00%)    32699.75 ( 22.48%)
> 
> 1-socket 6 year old machine
>                            4.9.0-rc5             4.9.0-rc5
>                              vanilla             hopcpu-v4
> Hmean    send-64        87.47 (  0.00%)      129.07 ( 47.56%)
> Hmean    send-128      174.36 (  0.00%)      258.53 ( 48.27%)
> Hmean    send-256      347.52 (  0.00%)      511.02 ( 47.05%)
> Hmean    send-1024    1363.03 (  0.00%)     1990.02 ( 46.00%)
> Hmean    send-2048    2632.68 (  0.00%)     3775.89 ( 43.42%)
> Hmean    send-3312    4123.19 (  0.00%)     5915.77 ( 43.48%)
> Hmean    send-4096    5056.48 (  0.00%)     7152.31 ( 41.45%)
> Hmean    send-8192    8784.22 (  0.00%)    12089.53 ( 37.63%)
> Hmean    send-16384  15081.60 (  0.00%)    19650.42 ( 30.29%)
> Hmean    recv-64        86.19 (  0.00%)      128.43 ( 49.01%)
> Hmean    recv-128      173.93 (  0.00%)      257.28 ( 47.92%)
> Hmean    recv-256      346.19 (  0.00%)      508.40 ( 46.86%)
> Hmean    recv-1024    1358.28 (  0.00%)     1979.21 ( 45.71%)
> Hmean    recv-2048    2623.45 (  0.00%)     3747.30 ( 42.84%)
> Hmean    recv-3312    4108.63 (  0.00%)     5873.42 ( 42.95%)
> Hmean    recv-4096    5037.25 (  0.00%)     7101.54 ( 40.98%)
> Hmean    recv-8192    8762.32 (  0.00%)    12009.88 ( 37.06%)
> Hmean    recv-16384  15042.36 (  0.00%)    19513.27 ( 29.72%)
> 
> This is somewhat dramatic but it's also not universal. For example, it was
> observed on an older HP machine using pcc-cpufreq that there was almost
> no difference but pcc-cpufreq is also a known performance hazard.
> 
> These are quite dif

Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v4

2016-12-01 Thread Vlastimil Babka

On 12/01/2016 01:24 AM, Mel Gorman wrote:

...


> @@ -1096,28 +1097,29 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> if (nr_scanned)
> __mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
> 
> -   while (count) {
> +   while (count > 0) {
> struct page *page;
> struct list_head *list;
> +   unsigned int order;
> 
> /*
>  * Remove pages from lists in a round-robin fashion. A
>  * batch_free count is maintained that is incremented when an
> -* empty list is encountered.  This is so more pages are freed
> -* off fuller lists instead of spinning excessively around empty
> -* lists
> +* empty list is encountered. This is not exact due to
> +* high-order but precision is not required.
>  */
> do {
> batch_free++;
> -   if (++migratetype == MIGRATE_PCPTYPES)
> -   migratetype = 0;
> -   list = &pcp->lists[migratetype];
> +   if (++pindex == NR_PCP_LISTS)
> +   pindex = 0;
> +   list = &pcp->lists[pindex];
> } while (list_empty(list));
> 
> /* This is the only non-empty list. Free them all. */
> -   if (batch_free == MIGRATE_PCPTYPES)
> +   if (batch_free == NR_PCP_LISTS)
> batch_free = count;
> 
> +   order = pindex_to_order(pindex);
> do {
> int mt; /* migratetype of the to-be-freed page */
> 
> @@ -1135,11 +1137,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> if (bulkfree_pcp_prepare(page))
> continue;


Hmm I think that if this hits, we don't decrease count/increase nr_freed and 
pcp->count will become wrong. And if we are unlucky/doing full drain, all lists 
will get empty, but as count stays e.g. 1, we loop forever on the outer while()?
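
To make the failure mode concrete, here is a minimal user-space sketch of the
pattern being discussed (names and numbers are invented for illustration; this
is not the kernel code). Once a page is rejected before the accounting runs,
count can never reach zero even after every list has drained, so the outer
loop keeps spinning:

#include <stdio.h>

#define NR_LISTS  2
#define MAX_SPINS 32	/* bound the demo so it terminates */

int main(void)
{
	int list_len[NR_LISTS] = { 2, 1 };	/* pages left on each pcp-style list */
	int bad_left = 1;			/* one page will fail the check */
	int count = 3;				/* pages the caller asked to free */
	int nr_freed = 0;
	int spins = 0;

	while (count > 0 && spins++ < MAX_SPINS) {
		for (int i = 0; i < NR_LISTS; i++) {
			if (list_len[i] == 0)
				continue;	/* empty list, try the next one */

			list_len[i]--;

			/* models bulkfree_pcp_prepare() rejecting a page */
			if (i == 1 && bad_left) {
				bad_left = 0;
				continue;	/* bug: skips the accounting below */
			}

			nr_freed++;
			count--;
		}
	}

	/* prints count=1, nr_freed=2: count never reached zero */
	printf("count=%d nr_freed=%d spins=%d\n", count, nr_freed, spins);
	return 0;
}

The fix proposed later in the thread moves the count/nr_freed updates in front
of the check, so a rejected page still consumes its share of count.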


BTW, I think there's a similar problem (but not introduced by this patch) in 
rmqueue_bulk() and its


if (unlikely(check_pcp_refill(page)))
continue;

This might result in pcp->count being higher than actual pages. That one would 
be introduced by 479f854a207c ("mm, page_alloc: defer debugging checks of pages 
allocated from the PCP").
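
The refill side can be sketched the same way (again a user-space model with
invented names, not the actual rmqueue_bulk()): if the loop skips bad pages
with continue but the caller adds the raw iteration count to pcp->count, the
per-cpu count ends up larger than the number of pages actually on the list.

#include <stdio.h>

/* stand-in for check_pcp_refill(): pretend every third page is bad */
static int page_is_bad(int pfn)
{
	return pfn % 3 == 0;
}

/* model of a bulk refill that reports how many iterations it ran */
static int refill_bulk(int requested, int *pages_on_list)
{
	int i;

	for (i = 0; i < requested; i++) {
		int pfn = i + 1;	/* stand-in for a page from __rmqueue() */

		if (page_is_bad(pfn))
			continue;	/* skipped, but i still counts it */

		(*pages_on_list)++;	/* page actually added to the list */
	}
	return i;
}

int main(void)
{
	int pages_on_list = 0;
	int pcp_count = 0;

	/* the caller trusts the return value when updating its count */
	pcp_count += refill_bulk(8, &pages_on_list);

	/* prints pcp_count=8 with only 6 pages on the list */
	printf("pcp_count=%d pages_on_list=%d\n", pcp_count, pages_on_list);
	return 0;
}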




> -   __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> -   trace_mm_page_pcpu_drain(page, 0, mt);
> -   } while (--count && --batch_free && !list_empty(list));
> +   __free_one_page(page, page_to_pfn(page), zone, order, mt);
> +   trace_mm_page_pcpu_drain(page, order, mt);
> +   nr_freed += (1 << order);
> +   count -= (1 << order);
> +   } while (count > 0 && --batch_free && !list_empty(list));
> }
> spin_unlock(&zone->lock);
> +   pcp->count -= nr_freed;
>  }
> 
>  static void free_one_page(struct zone *zone,