Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v4
On 12/01/2016 03:24 PM, Mel Gorman wrote:
> On Thu, Dec 01, 2016 at 02:41:29PM +0100, Vlastimil Babka wrote:
>> On 12/01/2016 01:24 AM, Mel Gorman wrote:
>> ...
>>
>> Hmm I think that if this hits, we don't decrease count/increase nr_freed
>> and pcp->count will become wrong.
>
> Ok, I think you're right but I also think it's relatively trivial to fix
> with
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 94808f565f74..8777aefc1b8e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1134,13 +1134,13 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  			if (unlikely(isolated_pageblocks))
>  				mt = get_pageblock_migratetype(page);
>
> +			nr_freed += (1 << order);
> +			count -= (1 << order);
>  			if (bulkfree_pcp_prepare(page))
>  				continue;
>
>  			__free_one_page(page, page_to_pfn(page), zone, order, mt);
>  			trace_mm_page_pcpu_drain(page, order, mt);
> -			nr_freed += (1 << order);
> -			count -= (1 << order);
>  		} while (count > 0 && --batch_free && !list_empty(list));
>  	}
>  	spin_unlock(&zone->lock);
>
>> And if we are unlucky/doing full drain, all lists will get empty, but as
>> count stays e.g. 1, we loop forever on the outer while()?
>
> Potentially yes. Granted the system is already in a bad state as pages
> are being freed in a bad or unknown state but we haven't halted the
> system for that in the past.
>
>> BTW, I think there's a similar problem (but not introduced by this patch)
>> in rmqueue_bulk() and its
>>
>> 	if (unlikely(check_pcp_refill(page)))
>> 		continue;
>
> Potentially yes. It's outside the scope of this patch but it needs
> fixing. If you agree with the above fix, I'll roll it into a v5 and
> append another patch for this issue.

Yeah, looks fine. Thanks.
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v4
On Thu, Dec 01, 2016 at 02:41:29PM +0100, Vlastimil Babka wrote:
> On 12/01/2016 01:24 AM, Mel Gorman wrote:
>
> ...
>
> > @@ -1096,28 +1097,29 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  	if (nr_scanned)
> >  		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
> >
> > -	while (count) {
> > +	while (count > 0) {
> >  		struct page *page;
> >  		struct list_head *list;
> > +		unsigned int order;
> >
> >  		/*
> >  		 * Remove pages from lists in a round-robin fashion. A
> >  		 * batch_free count is maintained that is incremented when an
> > -		 * empty list is encountered. This is so more pages are freed
> > -		 * off fuller lists instead of spinning excessively around empty
> > -		 * lists
> > +		 * empty list is encountered. This is not exact due to
> > +		 * high-order but precision is not required.
> >  		 */
> >  		do {
> >  			batch_free++;
> > -			if (++migratetype == MIGRATE_PCPTYPES)
> > -				migratetype = 0;
> > -			list = &pcp->lists[migratetype];
> > +			if (++pindex == NR_PCP_LISTS)
> > +				pindex = 0;
> > +			list = &pcp->lists[pindex];
> >  		} while (list_empty(list));
> >
> >  		/* This is the only non-empty list. Free them all. */
> > -		if (batch_free == MIGRATE_PCPTYPES)
> > +		if (batch_free == NR_PCP_LISTS)
> >  			batch_free = count;
> >
> > +		order = pindex_to_order(pindex);
> >  		do {
> >  			int mt;	/* migratetype of the to-be-freed page */
> >
> > @@ -1135,11 +1137,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >  			if (bulkfree_pcp_prepare(page))
> >  				continue;
>
> Hmm I think that if this hits, we don't decrease count/increase nr_freed
> and pcp->count will become wrong.

Ok, I think you're right but I also think it's relatively trivial to fix
with

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 94808f565f74..8777aefc1b8e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1134,13 +1134,13 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			if (unlikely(isolated_pageblocks))
 				mt = get_pageblock_migratetype(page);

+			nr_freed += (1 << order);
+			count -= (1 << order);
 			if (bulkfree_pcp_prepare(page))
 				continue;

 			__free_one_page(page, page_to_pfn(page), zone, order, mt);
 			trace_mm_page_pcpu_drain(page, order, mt);
-			nr_freed += (1 << order);
-			count -= (1 << order);
 		} while (count > 0 && --batch_free && !list_empty(list));
 	}
 	spin_unlock(&zone->lock);

> And if we are unlucky/doing full drain, all lists will get empty, but as
> count stays e.g. 1, we loop forever on the outer while()?
>

Potentially yes. Granted the system is already in a bad state as pages
are being freed in a bad or unknown state but we haven't halted the
system for that in the past.

> BTW, I think there's a similar problem (but not introduced by this patch)
> in rmqueue_bulk() and its
>
> 	if (unlikely(check_pcp_refill(page)))
> 		continue;
>

Potentially yes. It's outside the scope of this patch but it needs
fixing. If you agree with the above fix, I'll roll it into a v5 and
append another patch for this issue.

-- 
Mel Gorman
SUSE Labs
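For clarity, the sketch below assembles the hunks quoted in this thread to
show how the inner loop of free_pcppages_bulk() reads with Mel's fix folded
into the v4 patch. The page-selection lines are taken from the surrounding
4.9-era context, and pindex_to_order()/bulkfree_pcp_prepare() come from the
patch; this is an illustration of the intended control flow, not the final
committed code.

	order = pindex_to_order(pindex);
	do {
		int mt;	/* migratetype of the to-be-freed page */

		page = list_last_entry(list, struct page, lru);
		/* must delete as __free_one_page list manipulates */
		list_del(&page->lru);

		mt = get_pcppage_migratetype(page);
		if (unlikely(isolated_pageblocks))
			mt = get_pageblock_migratetype(page);

		/*
		 * Account before the debug check: a page that fails
		 * bulkfree_pcp_prepare() is skipped but has still left
		 * the pcp list, so it must count as drained or the
		 * outer while (count > 0) loop can spin forever.
		 */
		nr_freed += (1 << order);
		count -= (1 << order);

		if (bulkfree_pcp_prepare(page))
			continue;

		__free_one_page(page, page_to_pfn(page), zone, order, mt);
		trace_mm_page_pcpu_drain(page, order, mt);
	} while (count > 0 && --batch_free && !list_empty(list));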
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v4
On Thu 01-12-16 00:24:40, Mel Gorman wrote:
> Changelog since v3
> o Allow high-order atomic allocations to use reserves
>
> Changelog since v2
> o Correct initialisation to avoid -Woverflow warning
>
> SLUB has been the default small kernel object allocator for quite some time
> but it is not universally used due to performance concerns and a reliance
> on high-order pages. The high-order concerns have two major components --
> high-order pages are not always available and high-order page allocations
> potentially contend on the zone->lock. This patch addresses some concerns
> about the zone lock contention by extending the per-cpu page allocator to
> cache high-order pages. The patch makes the following modifications:
>
> o New per-cpu lists are added to cache the high-order pages. This increases
>   the cache footprint of the per-cpu allocator and overall usage but for
>   some workloads, this will be offset by reduced contention on zone->lock.
>   The first MIGRATE_PCPTYPES entries in the list are per-migratetype. The
>   remaining are high-order caches up to and including
>   PAGE_ALLOC_COSTLY_ORDER.
>
> o pcp accounting during free is now confined to free_pcppages_bulk as it's
>   impossible for the caller to know exactly how many pages were freed.
>   Due to the high-order caches, the number of pages drained for a request
>   is no longer precise.
>
> o The high watermark for per-cpu pages is increased to reduce the
>   probability that a single refill causes a drain on the next free.
>
> The benefit depends on both the workload and the machine as ultimately the
> determining factor is whether cache line bounces or contention on
> zone->lock is a problem. The patch was tested on a variety of workloads
> and machines, some of which are reported here.
>
> This is the result from netperf running UDP_STREAM on localhost. It was
> selected on the basis that it is slab-intensive and has been the subject
> of previous SLAB vs SLUB comparisons with the caveat that this is not
> testing between two physical hosts.
> 2-socket modern machine
>                                  4.9.0-rc5             4.9.0-rc5
>                                    vanilla             hopcpu-v4
> Hmean    send-64         178.38 (  0.00%)      259.91 ( 45.70%)
> Hmean    send-128        351.49 (  0.00%)      532.46 ( 51.49%)
> Hmean    send-256        671.23 (  0.00%)     1022.99 ( 52.40%)
> Hmean    send-1024      2663.60 (  0.00%)     3995.17 ( 49.99%)
> Hmean    send-2048      5126.53 (  0.00%)     7587.05 ( 48.00%)
> Hmean    send-3312      7949.99 (  0.00%)    11697.16 ( 47.13%)
> Hmean    send-4096      9433.56 (  0.00%)    13188.14 ( 39.80%)
> Hmean    send-8192     15940.64 (  0.00%)    22023.98 ( 38.16%)
> Hmean    send-16384    26699.54 (  0.00%)    32703.27 ( 22.49%)
> Hmean    recv-64         178.38 (  0.00%)      259.89 ( 45.70%)
> Hmean    recv-128        351.49 (  0.00%)      532.43 ( 51.48%)
> Hmean    recv-256        671.20 (  0.00%)     1022.75 ( 52.38%)
> Hmean    recv-1024      2663.45 (  0.00%)     3994.49 ( 49.97%)
> Hmean    recv-2048      5126.26 (  0.00%)     7585.38 ( 47.97%)
> Hmean    recv-3312      7949.50 (  0.00%)    11695.43 ( 47.12%)
> Hmean    recv-4096      9433.04 (  0.00%)    13186.34 ( 39.79%)
> Hmean    recv-8192     15939.64 (  0.00%)    22021.41 ( 38.15%)
> Hmean    recv-16384    26698.44 (  0.00%)    32699.75 ( 22.48%)
>
> 1-socket 6 year old machine
>                                  4.9.0-rc5             4.9.0-rc5
>                                    vanilla             hopcpu-v4
> Hmean    send-64          87.47 (  0.00%)      129.07 ( 47.56%)
> Hmean    send-128        174.36 (  0.00%)      258.53 ( 48.27%)
> Hmean    send-256        347.52 (  0.00%)      511.02 ( 47.05%)
> Hmean    send-1024      1363.03 (  0.00%)     1990.02 ( 46.00%)
> Hmean    send-2048      2632.68 (  0.00%)     3775.89 ( 43.42%)
> Hmean    send-3312      4123.19 (  0.00%)     5915.77 ( 43.48%)
> Hmean    send-4096      5056.48 (  0.00%)     7152.31 ( 41.45%)
> Hmean    send-8192      8784.22 (  0.00%)    12089.53 ( 37.63%)
> Hmean    send-16384    15081.60 (  0.00%)    19650.42 ( 30.29%)
> Hmean    recv-64          86.19 (  0.00%)      128.43 ( 49.01%)
> Hmean    recv-128        173.93 (  0.00%)      257.28 ( 47.92%)
> Hmean    recv-256        346.19 (  0.00%)      508.40 ( 46.86%)
> Hmean    recv-1024      1358.28 (  0.00%)     1979.21 ( 45.71%)
> Hmean    recv-2048      2623.45 (  0.00%)     3747.30 ( 42.84%)
> Hmean    recv-3312      4108.63 (  0.00%)     5873.42 ( 42.95%)
> Hmean    recv-4096      5037.25 (  0.00%)     7101.54 ( 40.98%)
> Hmean    recv-8192      8762.32 (  0.00%)    12009.88 ( 37.06%)
> Hmean    recv-16384    15042.36 (  0.00%)    19513.27 ( 29.72%)
>
> This is somewhat dramatic but it's also not universal. For example, it was
> observed on an older HP machine using pcc-cpufreq that there was almost
> no difference but pcc-cpufreq is also a known performance hazard.
>
> These are quite dif
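The list layout described in the changelog above implies a mapping between
a pcp list index ("pindex") and an (order, migratetype) pair. The thread
only shows pindex_to_order() being used; the helpers below are a minimal
sketch of what such a mapping could look like in mm/page_alloc.c, where
NR_PCP_LISTS and order_to_pindex() are assumptions derived from the
description (MIGRATE_PCPTYPES order-0 lists first, then one list per order
up to PAGE_ALLOC_COSTLY_ORDER), not quotes from the patch.

/* Assumed: one extra list per high order from 1 to PAGE_ALLOC_COSTLY_ORDER */
#define NR_PCP_LISTS (MIGRATE_PCPTYPES + PAGE_ALLOC_COSTLY_ORDER)

static inline unsigned int pindex_to_order(unsigned int pindex)
{
	/* The first MIGRATE_PCPTYPES lists hold order-0 pages */
	if (pindex < MIGRATE_PCPTYPES)
		return 0;

	/* High-order lists follow, one per order starting at 1 */
	return pindex - MIGRATE_PCPTYPES + 1;
}

static inline unsigned int order_to_pindex(int migratetype, unsigned int order)
{
	/* Order-0 pages remain segregated by migratetype */
	if (order == 0)
		return migratetype;

	/* High-order pages share a single list per order */
	return MIGRATE_PCPTYPES + order - 1;
}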
Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v4
On 12/01/2016 01:24 AM, Mel Gorman wrote:

...

> @@ -1096,28 +1097,29 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  	if (nr_scanned)
>  		__mod_node_page_state(zone->zone_pgdat, NR_PAGES_SCANNED, -nr_scanned);
>
> -	while (count) {
> +	while (count > 0) {
>  		struct page *page;
>  		struct list_head *list;
> +		unsigned int order;
>
>  		/*
>  		 * Remove pages from lists in a round-robin fashion. A
>  		 * batch_free count is maintained that is incremented when an
> -		 * empty list is encountered. This is so more pages are freed
> -		 * off fuller lists instead of spinning excessively around empty
> -		 * lists
> +		 * empty list is encountered. This is not exact due to
> +		 * high-order but precision is not required.
>  		 */
>  		do {
>  			batch_free++;
> -			if (++migratetype == MIGRATE_PCPTYPES)
> -				migratetype = 0;
> -			list = &pcp->lists[migratetype];
> +			if (++pindex == NR_PCP_LISTS)
> +				pindex = 0;
> +			list = &pcp->lists[pindex];
>  		} while (list_empty(list));
>
>  		/* This is the only non-empty list. Free them all. */
> -		if (batch_free == MIGRATE_PCPTYPES)
> +		if (batch_free == NR_PCP_LISTS)
>  			batch_free = count;
>
> +		order = pindex_to_order(pindex);
>  		do {
>  			int mt;	/* migratetype of the to-be-freed page */
>
> @@ -1135,11 +1137,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>  			if (bulkfree_pcp_prepare(page))
>  				continue;

Hmm I think that if this hits, we don't decrease count/increase nr_freed
and pcp->count will become wrong. And if we are unlucky/doing full drain,
all lists will get empty, but as count stays e.g. 1, we loop forever on
the outer while()?

BTW, I think there's a similar problem (but not introduced by this patch)
in rmqueue_bulk() and its

	if (unlikely(check_pcp_refill(page)))
		continue;

This might result in pcp->count being higher than actual pages. That one
would be introduced by 479f854a207c ("mm, page_alloc: defer debugging
checks of pages allocated from the PCP").

> -			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
> -			trace_mm_page_pcpu_drain(page, 0, mt);
> -		} while (--count && --batch_free && !list_empty(list));
> +			__free_one_page(page, page_to_pfn(page), zone, order, mt);
> +			trace_mm_page_pcpu_drain(page, order, mt);
> +			nr_freed += (1 << order);
> +			count -= (1 << order);
> +		} while (count > 0 && --batch_free && !list_empty(list));
>  	}
>  	spin_unlock(&zone->lock);
> +	pcp->count -= nr_freed;
>  }
>
>  static void free_one_page(struct zone *zone,
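Mel agrees elsewhere in the thread that the rmqueue_bulk() problem needs a
separate patch. The sketch below shows one shape such a fix could take:
count only the pages that actually make it onto the list, and let the
caller adjust pcp->count by the returned value instead of by the requested
count. The "alloced" counter and the abbreviated body (zone statistics
updates are omitted) are illustrative assumptions, not the actual
follow-up patch.

static int rmqueue_bulk(struct zone *zone, unsigned int order,
			unsigned long count, struct list_head *list,
			int migratetype, bool cold)
{
	int i, alloced = 0;

	spin_lock(&zone->lock);
	for (i = 0; i < count; ++i) {
		struct page *page = __rmqueue(zone, order, migratetype);

		if (unlikely(page == NULL))
			break;

		/* Skip bad pages without pretending they were handed out */
		if (unlikely(check_pcp_refill(page)))
			continue;

		if (likely(!cold))
			list_add(&page->lru, list);
		else
			list_add_tail(&page->lru, list);
		alloced++;
	}
	spin_unlock(&zone->lock);

	/* Caller does pcp->count += alloced rather than += count */
	return alloced;
}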