Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5

2016-12-06 Thread Mel Gorman
On Tue, Dec 06, 2016 at 11:43:45AM +0900, Joonsoo Kim wrote:
> > actually clear at all that it's an unfair situation, particularly given that
> > the vanilla code is also unfair -- the vanilla code can artificially preserve
> > MIGRATE_UNMOVABLE without any clear indication that it is a universal win.
> > The only deciding factor there was that a fault-intensive workload would mask
> > the overhead of the page allocator due to the page zeroing cost, which
> > UNMOVABLE allocations may or may not require. Even that is vague considering
> > that page-table allocations require zeroing even if many kernel allocations do not.
> 
> "Vanilla works like that" doesn't seem to be reasonable to justify
> this change.  Vanilla code works with three lists and it now become
> six lists and each list can have different size of page. We need to
> think that previous approach will also work fine with current one. I
> think that there is a problem although it's not permanent and would be
> minor. However, it's better to fix it when it is found.
> 

This is going in circles. I prototyped the modification that increases
the per-cpu structure slightly and will evaluate it. It takes about a day
to run through the full set of tests. If it causes no harm, I'll release
another version.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5

2016-12-05 Thread Joonsoo Kim
On Mon, Dec 05, 2016 at 09:57:39AM +, Mel Gorman wrote:
> On Mon, Dec 05, 2016 at 12:06:19PM +0900, Joonsoo Kim wrote:
> > On Fri, Dec 02, 2016 at 09:04:49AM +, Mel Gorman wrote:
> > > On Fri, Dec 02, 2016 at 03:03:46PM +0900, Joonsoo Kim wrote:
> > > > > @@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > > > >   if (unlikely(isolated_pageblocks))
> > > > >   mt = get_pageblock_migratetype(page);
> > > > >  
> > > > > + nr_freed += (1 << order);
> > > > > + count -= (1 << order);
> > > > >   if (bulkfree_pcp_prepare(page))
> > > > >   continue;
> > > > >  
> > > > > - __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > > > > - trace_mm_page_pcpu_drain(page, 0, mt);
> > > > > - } while (--count && --batch_free && !list_empty(list));
> > > > > + __free_one_page(page, page_to_pfn(page), zone, order, mt);
> > > > > + trace_mm_page_pcpu_drain(page, order, mt);
> > > > > + } while (count > 0 && --batch_free && !list_empty(list));
> > > > >   }
> > > > >   spin_unlock(&zone->lock);
> > > > > + pcp->count -= nr_freed;
> > > > >  }
> > > > 
> > > > I guess that this patch would cause the following problems.
> > > > 
> > > > 1. If pcp->batch is too small, a high-order page will not be freed
> > > > easily and will survive longer. Think about the following situation.
> > > > 
> > > > Batch count: 7
> > > > MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> > > > -> order 2...
> > > > 
> > > > free count: 1 + 1 + 1 + 2 + 4 = 9
> > > > so order 3 would not be freed.
> > > > 
> > > 
> > > You're relying on the batch count to be 7 where in a lot of cases it's
> > > 31. Even if low batch counts are common on another platform, or you
> > > adjusted the other counts to be higher values until they equal 30, it
> > > would only be for this drain that no order-3 pages were freed. It's not
> > > a permanent situation.
> > > 
> > > When or if it gets freed depends on the allocation request stream, but
> > > the same applies to the existing caches. If a high-order request
> > > arrives, it'll be used. If all the requests are for the other orders,
> > > then eventually the frees will hit the high watermark often enough that
> > > the round-robin batch freeing will free the order-3 entry in the cache.
> > 
> > I know that it isn't a permanent situation and that it depends on the
> > workload. However, it is clearly an unfair situation. We don't have any
> > good reason to cache higher-order freepages longer. Moreover, a batch
> > count of 7 means that it is a small system, and on this kind of system
> > there is no reason to keep high-order freepages in the cache for longer.
> > 
> 
> Without knowing the future allocation request stream, there is no reason
> to favour one part of the per-cpu cache over another. To me, it's not

What I suggest is exactly that: don't favour one part of the per-cpu cache
over another.

> actually clear at all that it's an unfair situation, particularly given that
> the vanilla code is also unfair -- the vanilla code can artificially preserve
> MIGRATE_UNMOVABLE without any clear indication that it is a universal win.
> The only deciding factor there was that a fault-intensive workload would mask
> the overhead of the page allocator due to the page zeroing cost, which
> UNMOVABLE allocations may or may not require. Even that is vague considering
> that page-table allocations require zeroing even if many kernel allocations do not.

"Vanilla works like that" doesn't seem to be reasonable to justify
this change.  Vanilla code works with three lists and it now become
six lists and each list can have different size of page. We need to
think that previous approach will also work fine with current one. I
think that there is a problem although it's not permanent and would be
minor. However, it's better to fix it when it is found.
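
For reference, the six-list layout referred to above follows from the patch's
changelog: the first MIGRATE_PCPTYPES lists hold order-0 pages, one per
migratetype, and the remaining lists each cache one order up to
PAGE_ALLOC_COSTLY_ORDER. The body of the patch's order_to_pindex() helper
(used in the buffered_rmqueue() hunk quoted further down this thread) is not
part of this thread, so the standalone sketch below is only an inference from
that description, not the patch's actual implementation:

#include <stdio.h>

#define MIGRATE_PCPTYPES	3	/* UNMOVABLE, MOVABLE, RECLAIMABLE */
#define PAGE_ALLOC_COSTLY_ORDER	3

/* Inferred mapping: per-migratetype lists for order 0, then one list per order. */
static unsigned int order_to_pindex(int migratetype, unsigned int order)
{
	if (order == 0)
		return migratetype;
	return MIGRATE_PCPTYPES + order - 1;
}

int main(void)
{
	unsigned int order;
	int mt;

	for (order = 0; order <= PAGE_ALLOC_COSTLY_ORDER; order++)
		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++)
			printf("order %u, migratetype %d -> pindex %u\n",
			       order, mt, order_to_pindex(mt, order));
	return 0;
}

Under that assumption there are six distinct pindex values (0-5), which is
where the "six lists" above come from.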

> > The other potential problem is that if we change
> > PAGE_ALLOC_COSTLY_ORDER to 5 in the future, this 31 batch count also
> > doesn't guarantee that free_pcppages_bulk() will work fairly and we
> > will not notice it easily.
> > 
> 
> In the event the high-order cache is increased, then the high watermark
> would also need to be adjusted to account for that just as this patch
> does.

pcp->high will be adjusted automatically when the high-order cache is
increased by your change. What we miss is pcp->batch: there is no
indication anywhere that the number of high-order caches and pcp->batch
have any association.

> > I think that it can be simply solved by maintaining a last pindex in
> > pcp. How about it?
> > 
> 
> That would rely on the previous allocation stream to drive the freeing
> which is slightly related to the fact the per-cpu cache contents are
> related to the previous request 

Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5

2016-12-05 Thread Mel Gorman
On Mon, Dec 05, 2016 at 12:06:19PM +0900, Joonsoo Kim wrote:
> On Fri, Dec 02, 2016 at 09:04:49AM +, Mel Gorman wrote:
> > On Fri, Dec 02, 2016 at 03:03:46PM +0900, Joonsoo Kim wrote:
> > > > @@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > > > if (unlikely(isolated_pageblocks))
> > > > mt = get_pageblock_migratetype(page);
> > > >  
> > > > +   nr_freed += (1 << order);
> > > > +   count -= (1 << order);
> > > > if (bulkfree_pcp_prepare(page))
> > > > continue;
> > > >  
> > > > -   __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > > > -   trace_mm_page_pcpu_drain(page, 0, mt);
> > > > -   } while (--count && --batch_free && !list_empty(list));
> > > > +   __free_one_page(page, page_to_pfn(page), zone, order, mt);
> > > > +   trace_mm_page_pcpu_drain(page, order, mt);
> > > > +   } while (count > 0 && --batch_free && !list_empty(list));
> > > > }
> > > > spin_unlock(&zone->lock);
> > > > +   pcp->count -= nr_freed;
> > > >  }
> > > 
> > > I guess that this patch would cause the following problems.
> > > 
> > > 1. If pcp->batch is too small, a high-order page will not be freed
> > > easily and will survive longer. Think about the following situation.
> > > 
> > > Batch count: 7
> > > MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> > > -> order 2...
> > > 
> > > free count: 1 + 1 + 1 + 2 + 4 = 9
> > > so order 3 would not be freed.
> > > 
> > 
> > You're relying on the batch count to be 7 where in a lot of cases it's
> > 31. Even if low batch counts are common on another platform, or you adjusted
> > the other counts to be higher values until they equal 30, it would only be for
> > this drain that no order-3 pages were freed. It's not a permanent situation.
> > 
> > When or if it gets freed depends on the allocation request stream, but the
> > same applies to the existing caches. If a high-order request arrives, it'll
> > be used. If all the requests are for the other orders, then eventually
> > the frees will hit the high watermark often enough that the round-robin batch
> > freeing will free the order-3 entry in the cache.
> 
> I know that it isn't a permanent situation and that it depends on the
> workload. However, it is clearly an unfair situation. We don't have any
> good reason to cache higher-order freepages longer. Moreover, a batch
> count of 7 means that it is a small system, and on this kind of system
> there is no reason to keep high-order freepages in the cache for longer.
> 

Without knowing the future allocation request stream, there is no reason
to favour one part of the per-cpu cache over another. To me, it's not
actually clear at all that it's an unfair situation, particularly given that
the vanilla code is also unfair -- the vanilla code can artificially preserve
MIGRATE_UNMOVABLE without any clear indication that it is a universal win.
The only deciding factor there was that a fault-intensive workload would mask
the overhead of the page allocator due to the page zeroing cost, which
UNMOVABLE allocations may or may not require. Even that is vague considering
that page-table allocations require zeroing even if many kernel allocations do not.

> The other potential problem is that if we change
> PAGE_ALLOC_COSTLY_ORDER to 5 in the future, this 31 batch count also
> doesn't guarantee that free_pcppages_bulk() will work fairly and we
> will not notice it easily.
> 

In the event the high-order cache is increased, then the high watermark
would also need to be adjusted to account for that just as this patch
does.

> I think that it can be simply solved by maintaining a last pindex in
> pcp. How about it?
> 

That would rely on the previous allocation stream to drive the freeing,
which is slightly related to the fact that the per-cpu cache contents are
related to the previous request stream. It's still not guaranteed to be
related to the future request stream.

Adding a new pindex cache adds complexity to the free path without any
guarantee it benefits anything. The use of such a heuristic should be
driven by a workload demonstrating it's a problem. Granted, half of the
cost of a free operation is due to irq enable/disable, but there is no
reason to make it unnecessarily expensive.
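
For illustration only, the toy program below (user-space C, not the patch and
not proposed kernel code) contrasts the two behaviours being discussed:
restarting the round-robin from list 0 on every drain versus resuming from a
remembered last pindex. The six-list layout, the batch of 7, one entry per
list and the refill pattern between drains are all simplifying assumptions.

#include <stdio.h>

#define NR_LISTS 6

/* Assumed pcp layout: three order-0 lists, then orders 1, 2 and 3. */
static const int list_order[NR_LISTS] = { 0, 0, 0, 1, 2, 3 };

/* Free entries round-robin until 'batch' base pages are freed or all lists are empty. */
static void drain(int entries[NR_LISTS], int batch, int *pindex)
{
	int budget = batch;
	int empty_scans = 0;

	while (budget > 0 && empty_scans < NR_LISTS) {
		if (entries[*pindex] > 0) {
			entries[*pindex]--;
			budget -= 1 << list_order[*pindex];
			empty_scans = 0;
		} else {
			empty_scans++;
		}
		*pindex = (*pindex + 1) % NR_LISTS;
	}
}

static void run(const char *name, int remember_pindex)
{
	int entries[NR_LISTS] = { 1, 1, 1, 1, 1, 1 };
	int pindex = 0;
	int i;

	for (i = 1; i <= 3; i++) {
		if (!remember_pindex)
			pindex = 0;	/* restart the round-robin every drain */
		drain(entries, 7, &pindex);
		printf("%s: after drain %d the order-3 entry %s\n",
		       name, i, entries[5] ? "survives" : "has been freed");
		/* Ongoing traffic refills every list except order-3. */
		entries[0] = entries[1] = entries[2] = 1;
		entries[3] = entries[4] = 1;
	}
}

int main(void)
{
	run("restart from zero", 0);
	run("remember pindex  ", 1);
	return 0;
}

With the restart behaviour the order-3 entry survives every drain for as long
as the lower lists keep being refilled; with a remembered pindex it is freed
on the second drain. Whether that matters for a real workload is the question
being argued here.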

> > > 3. I guess that order-0 file/anon page alloc/free is dominant in many
> > > workloads. If that is the case, it invalidates the effect of the high-order
> > > cache in the pcp, since the cached high-order pages would also be freed to
> > > the buddy when a burst of order-0 frees happens.
> > > 
> > 
> > A large burst of order-0 frees will free the high-order cache if it's not
> > being used but I don't see what your point is or why that is a problem.
> > It is pretty much guaranteed that there will be workloads that benefit
> > from 

Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5

2016-12-04 Thread Joonsoo Kim
On Fri, Dec 02, 2016 at 09:21:08AM +0100, Michal Hocko wrote:
> On Fri 02-12-16 15:03:46, Joonsoo Kim wrote:
> [...]
> > > o pcp accounting during free is now confined to free_pcppages_bulk as it's
> > >   impossible for the caller to know exactly how many pages were freed.
> > >   Due to the high-order caches, the number of pages drained for a request
> > >   is no longer precise.
> > > 
> > > o The high watermark for per-cpu pages is increased to reduce the 
> > > probability
> > >   that a single refill causes a drain on the next free.
> [...]
> > I guess that this patch would cause the following problems.
> > 
> > 1. If pcp->batch is too small, a high-order page will not be freed
> > easily and will survive longer. Think about the following situation.
> > 
> > Batch count: 7
> > MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> > -> order 2...
> > 
> > free count: 1 + 1 + 1 + 2 + 4 = 9
> > so order 3 would not be freed.
> 
> I guess the second paragraph above in the changelog tries to clarify
> that...

It doesn't perfectly clarify my concern. This is a different problem.

>  
> > 2. And, it seems that this logic penalizes high-order pages. One free
> > of a high-order page means 1 << order pages freed rather than just
> > one page freed. This logic does round-robin to choose the target list, so
> > the amount freed will differ by order.
> 
> Yes this is indeed possible. The first paragraph above mentions this
> problem.

Yes, it is mentioned briefly, but we cannot easily notice from it that the
above penalty for high-order pages is there.

Thanks.



Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5

2016-12-04 Thread Joonsoo Kim
On Fri, Dec 02, 2016 at 09:04:49AM +, Mel Gorman wrote:
> On Fri, Dec 02, 2016 at 03:03:46PM +0900, Joonsoo Kim wrote:
> > > @@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > >   if (unlikely(isolated_pageblocks))
> > >   mt = get_pageblock_migratetype(page);
> > >  
> > > + nr_freed += (1 << order);
> > > + count -= (1 << order);
> > >   if (bulkfree_pcp_prepare(page))
> > >   continue;
> > >  
> > > - __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > > - trace_mm_page_pcpu_drain(page, 0, mt);
> > > - } while (--count && --batch_free && !list_empty(list));
> > > + __free_one_page(page, page_to_pfn(page), zone, order, mt);
> > > + trace_mm_page_pcpu_drain(page, order, mt);
> > > + } while (count > 0 && --batch_free && !list_empty(list));
> > >   }
> > >   spin_unlock(&zone->lock);
> > > + pcp->count -= nr_freed;
> > >  }
> > 
> > I guess that this patch would cause the following problems.
> > 
> > 1. If pcp->batch is too small, a high-order page will not be freed
> > easily and will survive longer. Think about the following situation.
> > 
> > Batch count: 7
> > MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> > -> order 2...
> > 
> > free count: 1 + 1 + 1 + 2 + 4 = 9
> > so order 3 would not be freed.
> > 
> 
> You're relying on the batch count to be 7 where in a lot of cases it's
> 31. Even if low batch counts are common on another platform, or you adjusted
> the other counts to be higher values until they equal 30, it would only be for
> this drain that no order-3 pages were freed. It's not a permanent situation.
> 
> When or if it gets freed depends on the allocation request stream, but the
> same applies to the existing caches. If a high-order request arrives, it'll
> be used. If all the requests are for the other orders, then eventually
> the frees will hit the high watermark often enough that the round-robin batch
> freeing will free the order-3 entry in the cache.

I know that it isn't a permanent situation and that it depends on the
workload. However, it is clearly an unfair situation. We don't have any
good reason to cache higher-order freepages longer. Moreover, a batch
count of 7 means that it is a small system, and on this kind of system
there is no reason to keep high-order freepages in the cache for longer.

The other potential problem is that if we change
PAGE_ALLOC_COSTLY_ORDER to 5 in the future, this batch count of 31 also
doesn't guarantee that free_pcppages_bulk() will work fairly, and we
will not notice it easily. For example, with lists up to order 5, a single
round-robin pass can consume 1 + 1 + 1 + 2 + 4 + 8 + 16 = 33 > 31 pages
before the order-5 list is ever reached.

I think that it can be simply solved by maintaining a last pindex in
pcp. How about it?

> 
> > 2. And, it seems that this logic penalizes high-order pages. One free
> > of a high-order page means 1 << order pages freed rather than just
> > one page freed. This logic does round-robin to choose the target list, so
> > the amount freed will differ by order. I think that it
> > makes some sense, because high-order pages are less important to cache
> > in the pcp than lower orders, but I'd like to know if it is intended or not.
> > If intended, it deserves a comment.
> > 
> 
> It's intended, but I'm not sure what else you want me to explain outside
> the code itself in this case. The round-robin nature of the bulk drain
> already doesn't attach any special importance to the migrate type of the
> list, and there is no good reason to assume that high-order pages in the
> cache when the high watermark is reached deserve special protection.

The non-trivial part is that the round-robin approach penalizes caching of
high-order pages. We usually think that round-robin is fair, but in this
case it isn't. Some people will notice that the amount freed per turn
differs, but some may not. It is different from the past situation, where
the amount freed per turn was the same even if the migratetype differed. I
think that it deserves some comment, but I don't feel strongly about it.

> > 3. I guess that order-0 file/anon page alloc/free is dominant in many
> > workloads. If that is the case, it invalidates the effect of the high-order
> > cache in the pcp, since the cached high-order pages would also be freed to
> > the buddy when a burst of order-0 frees happens.
> > 
> 
> A large burst of order-0 frees will free the high-order cache if it's not
> being used but I don't see what your point is or why that is a problem.
> It is pretty much guaranteed that there will be workloads that benefit
> from protecting the high-order cache (SLUB-intensive alloc/free
> intensive workloads) while others suffer (Fault-intensive map/unmap
> workloads).
> 
> What's there at the moment behaves reasonably on a variety of workloads
> across 8 machines.

Yes, I see that this patch improves some workloads. What I'd like to say
is that I have found some weaknesses, and if they are fixed, we can get a
better result. This patch implements a unified pcp cache for 

Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5

2016-12-02 Thread Mel Gorman
On Fri, Dec 02, 2016 at 03:03:46PM +0900, Joonsoo Kim wrote:
> > @@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > if (unlikely(isolated_pageblocks))
> > mt = get_pageblock_migratetype(page);
> >  
> > +   nr_freed += (1 << order);
> > +   count -= (1 << order);
> > if (bulkfree_pcp_prepare(page))
> > continue;
> >  
> > -   __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > -   trace_mm_page_pcpu_drain(page, 0, mt);
> > -   } while (--count && --batch_free && !list_empty(list));
> > +   __free_one_page(page, page_to_pfn(page), zone, order, mt);
> > +   trace_mm_page_pcpu_drain(page, order, mt);
> > +   } while (count > 0 && --batch_free && !list_empty(list));
> > }
> > spin_unlock(&zone->lock);
> > +   pcp->count -= nr_freed;
> >  }
> 
> I guess that this patch would cause the following problems.
> 
> 1. If pcp->batch is too small, a high-order page will not be freed
> easily and will survive longer. Think about the following situation.
> 
> Batch count: 7
> MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> -> order 2...
> 
> free count: 1 + 1 + 1 + 2 + 4 = 9
> so order 3 would not be freed.
> 

You're relying on the batch count to be 7 where in a lot of cases it's
31. Even if low batch counts are common on another platform, or you adjusted
the other counts to be higher values until they equal 30, it would only be for
this drain that no order-3 pages were freed. It's not a permanent situation.

When or if it gets freed depends on the allocation request stream, but the
same applies to the existing caches. If a high-order request arrives, it'll
be used. If all the requests are for the other orders, then eventually
the frees will hit the high watermark often enough that the round-robin batch
freeing will free the order-3 entry in the cache.
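
To make the two batch values concrete, here is a throwaway user-space toy
(not kernel code; the six-list layout, one entry per list and the stop
conditions are simplifying assumptions) that performs a single round-robin
drain with each batch value:

#include <stdio.h>

#define NR_LISTS 6

static const int list_order[NR_LISTS] = { 0, 0, 0, 1, 2, 3 };

/* One drain: free round-robin until 'batch' base pages are freed or all lists are empty. */
static int order3_survives(int batch)
{
	int entries[NR_LISTS] = { 1, 1, 1, 1, 1, 1 };
	int budget = batch;
	int pindex = 0, empty_scans = 0;

	while (budget > 0 && empty_scans < NR_LISTS) {
		if (entries[pindex] > 0) {
			entries[pindex]--;
			budget -= 1 << list_order[pindex];
			empty_scans = 0;
		} else {
			empty_scans++;
		}
		pindex = (pindex + 1) % NR_LISTS;
	}
	return entries[NR_LISTS - 1];
}

int main(void)
{
	printf("batch 7:  order-3 entry %s after one drain\n",
	       order3_survives(7) ? "survives" : "is freed");
	printf("batch 31: order-3 entry %s after one drain\n",
	       order3_survives(31) ? "survives" : "is freed");
	return 0;
}

With a batch of 7 the budget is exhausted at 1 + 1 + 1 + 2 + 4 = 9 base pages
before the order-3 list is reached; with the more common batch of 31 the same
single drain reaches the order-3 list and frees it.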

> 2. And, it seems that this logic penalizes high-order pages. One free
> of a high-order page means 1 << order pages freed rather than just
> one page freed. This logic does round-robin to choose the target list, so
> the amount freed will differ by order. I think that it
> makes some sense, because high-order pages are less important to cache
> in the pcp than lower orders, but I'd like to know if it is intended or not.
> If intended, it deserves a comment.
> 

It's intended, but I'm not sure what else you want me to explain outside
the code itself in this case. The round-robin nature of the bulk drain
already doesn't attach any special importance to the migrate type of the
list, and there is no good reason to assume that high-order pages in the
cache when the high watermark is reached deserve special protection.

> 3. I guess that order-0 file/anon page alloc/free is dominant in many
> workloads. If that is the case, it invalidates the effect of the high-order
> cache in the pcp, since the cached high-order pages would also be freed to
> the buddy when a burst of order-0 frees happens.
> 

A large burst of order-0 frees will free the high-order cache if it's not
being used but I don't see what your point is or why that is a problem.
It is pretty much guaranteed that there will be workloads that benefit
from protecting the high-order cache (SLUB-intensive alloc/free
intensive workloads) while others suffer (Fault-intensive map/unmap
workloads).

What's there at the moment behaves reasonably on a variety of workloads
across 8 machines.

> > @@ -2589,20 +2595,33 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> > struct page *page;
> > bool cold = ((gfp_flags & __GFP_COLD) != 0);
> >  
> > -   if (likely(order == 0)) {
> > +   if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
> > struct per_cpu_pages *pcp;
> > struct list_head *list;
> >  
> > local_irq_save(flags);
> > do {
> > +   unsigned int pindex;
> > +
> > +   pindex = order_to_pindex(migratetype, order);
> > pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > -   list = &pcp->lists[migratetype];
> > +   list = &pcp->lists[pindex];
> > if (list_empty(list)) {
> > -   pcp->count += rmqueue_bulk(zone, 0,
> > +   int nr_pages = rmqueue_bulk(zone, order,
> > pcp->batch, list,
> > migratetype, cold);
> 
> Maybe, you need to fix rmqueue_bulk(). rmqueue_bulk() allocates batch
> * (1 << order) pages and pcp->count can easily overflow pcp->high
> because list empty here doesn't mean that pcp->count is zero.
> 

Potentially a refill can cause a drain on another list. However, I adjusted
the high watermark in pageset_set_batch to make it unlikely that a single
refill will cause a drain and added a 

Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5

2016-12-02 Thread Michal Hocko
On Fri 02-12-16 00:22:44, Mel Gorman wrote:
> Changelog since v4
> o Avoid pcp->count getting out of sync if struct page gets corrupted
> 
> Changelog since v3
> o Allow high-order atomic allocations to use reserves
> 
> Changelog since v2
> o Correct initialisation to avoid -Woverflow warning
> 
> SLUB has been the default small kernel object allocator for quite some time
> but it is not universally used due to performance concerns and a reliance
> on high-order pages. The high-order concerns have two major components --
> high-order pages are not always available and high-order page allocations
> potentially contend on the zone->lock. This patch addresses some concerns
> about the zone lock contention by extending the per-cpu page allocator to
> cache high-order pages. The patch makes the following modifications
> 
> o New per-cpu lists are added to cache the high-order pages. This increases
>   the cache footprint of the per-cpu allocator and overall usage but for
>   some workloads, this will be offset by reduced contention on zone->lock.
>   The first MIGRATE_PCPTYPE entries in the list are per-migratetype. The
>   remaining are high-order caches up to and including
>   PAGE_ALLOC_COSTLY_ORDER
> 
> o pcp accounting during free is now confined to free_pcppages_bulk as it's
>   impossible for the caller to know exactly how many pages were freed.
>   Due to the high-order caches, the number of pages drained for a request
>   is no longer precise.
> 
> o The high watermark for per-cpu pages is increased to reduce the probability
>   that a single refill causes a drain on the next free.
> 
> The benefit depends on both the workload and the machine as ultimately the
> determining factor is whether cache line bounces on zone->lock or contention
> is a problem. The patch was tested on a variety of workloads and machines,
> some of which are reported here.
> 
> This is the result from netperf running UDP_STREAM on localhost. It was
> selected on the basis that it is slab-intensive and has been the subject
> of previous SLAB vs SLUB comparisons with the caveat that this is not
> testing between two physical hosts.
> 
> 2-socket modern machine
>                               4.9.0-rc5             4.9.0-rc5
>                                 vanilla             hopcpu-v5
> Hmean    send-64       178.38 (  0.00%)      260.54 ( 46.06%)
> Hmean    send-128      351.49 (  0.00%)      518.56 ( 47.53%)
> Hmean    send-256      671.23 (  0.00%)     1005.72 ( 49.83%)
> Hmean    send-1024    2663.60 (  0.00%)     3880.54 ( 45.69%)
> Hmean    send-2048    5126.53 (  0.00%)     7545.38 ( 47.18%)
> Hmean    send-3312    7949.99 (  0.00%)    11324.34 ( 42.44%)
> Hmean    send-4096    9433.56 (  0.00%)    12818.85 ( 35.89%)
> Hmean    send-8192   15940.64 (  0.00%)    21404.98 ( 34.28%)
> Hmean    send-16384  26699.54 (  0.00%)    32810.08 ( 22.89%)
> Hmean    recv-64       178.38 (  0.00%)      260.52 ( 46.05%)
> Hmean    recv-128      351.49 (  0.00%)      518.53 ( 47.53%)
> Hmean    recv-256      671.20 (  0.00%)     1005.42 ( 49.79%)
> Hmean    recv-1024    2663.45 (  0.00%)     3879.75 ( 45.67%)
> Hmean    recv-2048    5126.26 (  0.00%)     7544.23 ( 47.17%)
> Hmean    recv-3312    7949.50 (  0.00%)    11322.52 ( 42.43%)
> Hmean    recv-4096    9433.04 (  0.00%)    12816.68 ( 35.87%)
> Hmean    recv-8192   15939.64 (  0.00%)    21402.75 ( 34.27%)
> Hmean    recv-16384  26698.44 (  0.00%)    32806.81 ( 22.88%)
> 
> 1-socket 6 year old machine
>                               4.9.0-rc5             4.9.0-rc5
>                                 vanilla             hopcpu-v4
> Hmean    send-64        87.47 (  0.00%)      127.01 ( 45.21%)
> Hmean    send-128      174.36 (  0.00%)      254.86 ( 46.17%)
> Hmean    send-256      347.52 (  0.00%)      505.91 ( 45.58%)
> Hmean    send-1024    1363.03 (  0.00%)     1962.49 ( 43.98%)
> Hmean    send-2048    2632.68 (  0.00%)     3731.74 ( 41.75%)
> Hmean    send-3312    4123.19 (  0.00%)     5859.08 ( 42.10%)
> Hmean    send-4096    5056.48 (  0.00%)     7058.00 ( 39.58%)
> Hmean    send-8192    8784.22 (  0.00%)    12134.53 ( 38.14%)
> Hmean    send-16384  15081.60 (  0.00%)    19638.90 ( 30.22%)
> Hmean    recv-64        86.19 (  0.00%)      126.34 ( 46.58%)
> Hmean    recv-128      173.93 (  0.00%)      253.51 ( 45.75%)
> Hmean    recv-256      346.19 (  0.00%)      503.34 ( 45.40%)
> Hmean    recv-1024    1358.28 (  0.00%)     1951.63 ( 43.68%)
> Hmean    recv-2048    2623.45 (  0.00%)     3701.67 ( 41.10%)
> Hmean    recv-3312    4108.63 (  0.00%)     5817.75 ( 41.60%)
> Hmean    recv-4096    5037.25 (  0.00%)     7004.79 ( 39.06%)
> Hmean    recv-8192    8762.32 (  0.00%)    12059.83 ( 37.63%)
> Hmean    recv-16384  15042.36 (  0.00%)    19514.33 ( 29.73%)
> 
> This is somewhat dramatic but it's also not universal. For example, it was
> observed on an older HP machine using pcc-cpufreq that there was 

Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5

2016-12-02 Thread Michal Hocko
On Fri 02-12-16 15:03:46, Joonsoo Kim wrote:
[...]
> > o pcp accounting during free is now confined to free_pcppages_bulk as it's
> >   impossible for the caller to know exactly how many pages were freed.
> >   Due to the high-order caches, the number of pages drained for a request
> >   is no longer precise.
> > 
> > o The high watermark for per-cpu pages is increased to reduce the 
> > probability
> >   that a single refill causes a drain on the next free.
[...]
> I guess that this patch would cause the following problems.
> 
> 1. If pcp->batch is too small, a high-order page will not be freed
> easily and will survive longer. Think about the following situation.
> 
> Batch count: 7
> MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> -> order 2...
> 
> free count: 1 + 1 + 1 + 2 + 4 = 9
> so order 3 would not be freed.

I guess the second paragraph above in the changelog tries to clarify
that...
 
> 2. And, it seems that this logic penalizes high-order pages. One free
> of a high-order page means 1 << order pages freed rather than just
> one page freed. This logic does round-robin to choose the target list, so
> the amount freed will differ by order.

Yes this is indeed possible. The first paragraph above mentions this
problem.

> I think that it
> makes some sense, because high-order pages are less important to cache
> in the pcp than lower orders, but I'd like to know if it is intended or not.
> If intended, it deserves a comment.
> 
> 3. I guess that order-0 file/anon page alloc/free is dominant in many
> workloads. If that is the case, it invalidates the effect of the high-order
> cache in the pcp, since the cached high-order pages would also be freed to
> the buddy when a burst of order-0 frees happens.

Yes, this is true and I was wondering the same, but I believe this can be
enhanced later on. E.g. we can check the order when crossing the pcp->high
mark and free only the given order's portion of the batch. I just wouldn't
over-optimize at this stage.
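
One possible reading of that suggestion, as a throwaway user-space sketch
rather than proposed kernel code (the list layout, the field names and the
numbers below are illustrative assumptions): when a free pushes the cached
count across the high mark, drain only from the list that was just freed to,
so that a burst of order-0 frees does not flush the high-order caches.

#include <stdio.h>

#define NR_LISTS 6

static const int list_order[NR_LISTS] = { 0, 0, 0, 1, 2, 3 };

struct pcp_sim {
	int entries[NR_LISTS];
	int count;		/* base pages currently cached */
	int high;
	int batch;
};

static void free_to_pcp(struct pcp_sim *pcp, int pindex)
{
	int drained = 0;

	pcp->entries[pindex]++;
	pcp->count += 1 << list_order[pindex];

	if (pcp->count < pcp->high)
		return;

	/* Drain roughly one batch, but only from the list being freed to. */
	while (drained < pcp->batch && pcp->entries[pindex] > 0) {
		pcp->entries[pindex]--;
		drained += 1 << list_order[pindex];
	}
	pcp->count -= drained;
	printf("crossed high: drained %d base pages from pindex %d only\n",
	       drained, pindex);
}

int main(void)
{
	struct pcp_sim pcp = {
		.entries = { 8, 8, 8, 2, 1, 1 },	/* 40 base pages */
		.count = 40, .high = 41, .batch = 7,
	};

	free_to_pcp(&pcp, 0);	/* order-0 free: only order-0 pages drained */
	free_to_pcp(&pcp, 5);	/* order-3 free: only order-3 pages drained */
	return 0;
}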
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5

2016-12-01 Thread Joonsoo Kim
Hello, Mel.

I didn't follow the previous discussion, so what I raise here may be a
duplicate. Please point me to the link if it has been answered before.

On Fri, Dec 02, 2016 at 12:22:44AM +, Mel Gorman wrote:
> Changelog since v4
> o Avoid pcp->count getting out of sync if struct page gets corrupted
> 
> Changelog since v3
> o Allow high-order atomic allocations to use reserves
> 
> Changelog since v2
> o Correct initialisation to avoid -Woverflow warning
> 
> SLUB has been the default small kernel object allocator for quite some time
> but it is not universally used due to performance concerns and a reliance
> on high-order pages. The high-order concerns has two major components --
> high-order pages are not always available and high-order page allocations
> potentially contend on the zone->lock. This patch addresses some concerns
> about the zone lock contention by extending the per-cpu page allocator to
> cache high-order pages. The patch makes the following modifications
> 
> o New per-cpu lists are added to cache the high-order pages. This increases
>   the cache footprint of the per-cpu allocator and overall usage but for
>   some workloads, this will be offset by reduced contention on zone->lock.
>   The first MIGRATE_PCPTYPE entries in the list are per-migratetype. The
>   remaining are high-order caches up to and including
>   PAGE_ALLOC_COSTLY_ORDER
> 
> o pcp accounting during free is now confined to free_pcppages_bulk as it's
>   impossible for the caller to know exactly how many pages were freed.
>   Due to the high-order caches, the number of pages drained for a request
>   is no longer precise.
> 
> o The high watermark for per-cpu pages is increased to reduce the probability
>   that a single refill causes a drain on the next free.
> 
> The benefit depends on both the workload and the machine as ultimately the
> determining factor is whether cache line bounces on zone->lock or contention
> is a problem. The patch was tested on a variety of workloads and machines,
> some of which are reported here.
> 
> This is the result from netperf running UDP_STREAM on localhost. It was
> selected on the basis that it is slab-intensive and has been the subject
> of previous SLAB vs SLUB comparisons with the caveat that this is not
> testing between two physical hosts.
> 
> 2-socket modern machine
>                           4.9.0-rc5            4.9.0-rc5
>                             vanilla            hopcpu-v5
> Hmean    send-64        178.38 (  0.00%)     260.54 ( 46.06%)
> Hmean    send-128       351.49 (  0.00%)     518.56 ( 47.53%)
> Hmean    send-256       671.23 (  0.00%)    1005.72 ( 49.83%)
> Hmean    send-1024     2663.60 (  0.00%)    3880.54 ( 45.69%)
> Hmean    send-2048     5126.53 (  0.00%)    7545.38 ( 47.18%)
> Hmean    send-3312     7949.99 (  0.00%)   11324.34 ( 42.44%)
> Hmean    send-4096     9433.56 (  0.00%)   12818.85 ( 35.89%)
> Hmean    send-8192    15940.64 (  0.00%)   21404.98 ( 34.28%)
> Hmean    send-16384   26699.54 (  0.00%)   32810.08 ( 22.89%)
> Hmean    recv-64        178.38 (  0.00%)     260.52 ( 46.05%)
> Hmean    recv-128       351.49 (  0.00%)     518.53 ( 47.53%)
> Hmean    recv-256       671.20 (  0.00%)    1005.42 ( 49.79%)
> Hmean    recv-1024     2663.45 (  0.00%)    3879.75 ( 45.67%)
> Hmean    recv-2048     5126.26 (  0.00%)    7544.23 ( 47.17%)
> Hmean    recv-3312     7949.50 (  0.00%)   11322.52 ( 42.43%)
> Hmean    recv-4096     9433.04 (  0.00%)   12816.68 ( 35.87%)
> Hmean    recv-8192    15939.64 (  0.00%)   21402.75 ( 34.27%)
> Hmean    recv-16384   26698.44 (  0.00%)   32806.81 ( 22.88%)
> 
> 1-socket 6 year old machine
>                           4.9.0-rc5            4.9.0-rc5
>                             vanilla            hopcpu-v4
> Hmean    send-64         87.47 (  0.00%)     127.01 ( 45.21%)
> Hmean    send-128       174.36 (  0.00%)     254.86 ( 46.17%)
> Hmean    send-256       347.52 (  0.00%)     505.91 ( 45.58%)
> Hmean    send-1024     1363.03 (  0.00%)    1962.49 ( 43.98%)
> Hmean    send-2048     2632.68 (  0.00%)    3731.74 ( 41.75%)
> Hmean    send-3312     4123.19 (  0.00%)    5859.08 ( 42.10%)
> Hmean    send-4096     5056.48 (  0.00%)    7058.00 ( 39.58%)
> Hmean    send-8192     8784.22 (  0.00%)   12134.53 ( 38.14%)
> Hmean    send-16384   15081.60 (  0.00%)   19638.90 ( 30.22%)
> Hmean    recv-64         86.19 (  0.00%)     126.34 ( 46.58%)
> Hmean    recv-128       173.93 (  0.00%)     253.51 ( 45.75%)
> Hmean    recv-256       346.19 (  0.00%)     503.34 ( 45.40%)
> Hmean    recv-1024     1358.28 (  0.00%)    1951.63 ( 43.68%)
> Hmean    recv-2048     2623.45 (  0.00%)    3701.67 ( 41.10%)
> Hmean    recv-3312     4108.63 (  0.00%)    5817.75 ( 41.60%)
> Hmean    recv-4096     5037.25 (  0.00%)    7004.79 ( 39.06%)
> Hmean    recv-8192     8762.32 (  0.00%)   12059.83 ( 37.63%)
> Hmean    recv-16384   15042.36 (  0.00%)   19514.33 ( 29.73%)

[PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5

2016-12-01 Thread Mel Gorman
Changelog since v4
o Avoid pcp->count getting out of sync if struct page gets corrupted

Changelog since v3
o Allow high-order atomic allocations to use reserves

Changelog since v2
o Correct initialisation to avoid -Woverflow warning

SLUB has been the default small kernel object allocator for quite some time,
but it is not universally used due to performance concerns and a reliance
on high-order pages. The high-order concerns have two major components:
high-order pages are not always available, and high-order page allocations
potentially contend on the zone->lock. This patch addresses the zone->lock
contention concern by extending the per-cpu page allocator to cache
high-order pages. The patch makes the following modifications:

o New per-cpu lists are added to cache the high-order pages. This increases
  the cache footprint of the per-cpu allocator and overall usage, but for
  some workloads this will be offset by reduced contention on zone->lock.
  The first MIGRATE_PCPTYPES entries in the list are per-migratetype. The
  remaining are high-order caches up to and including
  PAGE_ALLOC_COSTLY_ORDER (see the illustrative sketch after this list).

o pcp accounting during free is now confined to free_pcppages_bulk as it's
  impossible for the caller to know exactly how many pages were freed.
  Due to the high-order caches, the number of pages drained for a request
  is no longer precise.

o The high watermark for per-cpu pages is increased to reduce the probability
  that a single refill causes a drain on the next free.
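
As a rough illustration of the list layout described in the first point
above, here is a self-contained sketch. It is not the patch's code: the
indexing helper is hypothetical, and the constants simply mirror their 4.9
kernel values (3 and 3), giving six lists in total.

#include <stdio.h>

#define MIGRATE_PCPTYPES        3
#define PAGE_ALLOC_COSTLY_ORDER 3
#define NR_PCP_LISTS            (MIGRATE_PCPTYPES + PAGE_ALLOC_COSTLY_ORDER)

/* The first MIGRATE_PCPTYPES lists stay per-migratetype for order-0 pages;
 * each order from 1 to PAGE_ALLOC_COSTLY_ORDER gets one extra list. */
static unsigned int pcp_list_index(unsigned int order, unsigned int migratetype)
{
	if (order == 0)
		return migratetype;			/* lists 0..2 */
	return MIGRATE_PCPTYPES + order - 1;		/* lists 3..5 */
}

int main(void)
{
	printf("order-0, migratetype 1 -> list %u\n", pcp_list_index(0, 1));
	printf("order-2                -> list %u\n", pcp_list_index(2, 0));
	printf("order-3 (costly)       -> list %u\n", pcp_list_index(3, 0));
	printf("number of pcp lists    -> %d\n", NR_PCP_LISTS);
	return 0;
}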

The benefit depends on both the workload and the machine, as ultimately the
determining factor is whether cache-line bouncing or contention on zone->lock
is a problem. The patch was tested on a variety of workloads and machines,
some of which are reported here.

This is the result from netperf running UDP_STREAM on localhost. It was
selected on the basis that it is slab-intensive and has been the subject
of previous SLAB vs SLUB comparisons with the caveat that this is not
testing between two physical hosts.

2-socket modern machine
                          4.9.0-rc5            4.9.0-rc5
                            vanilla            hopcpu-v5
Hmean    send-64        178.38 (  0.00%)     260.54 ( 46.06%)
Hmean    send-128       351.49 (  0.00%)     518.56 ( 47.53%)
Hmean    send-256       671.23 (  0.00%)    1005.72 ( 49.83%)
Hmean    send-1024     2663.60 (  0.00%)    3880.54 ( 45.69%)
Hmean    send-2048     5126.53 (  0.00%)    7545.38 ( 47.18%)
Hmean    send-3312     7949.99 (  0.00%)   11324.34 ( 42.44%)
Hmean    send-4096     9433.56 (  0.00%)   12818.85 ( 35.89%)
Hmean    send-8192    15940.64 (  0.00%)   21404.98 ( 34.28%)
Hmean    send-16384   26699.54 (  0.00%)   32810.08 ( 22.89%)
Hmean    recv-64        178.38 (  0.00%)     260.52 ( 46.05%)
Hmean    recv-128       351.49 (  0.00%)     518.53 ( 47.53%)
Hmean    recv-256       671.20 (  0.00%)    1005.42 ( 49.79%)
Hmean    recv-1024     2663.45 (  0.00%)    3879.75 ( 45.67%)
Hmean    recv-2048     5126.26 (  0.00%)    7544.23 ( 47.17%)
Hmean    recv-3312     7949.50 (  0.00%)   11322.52 ( 42.43%)
Hmean    recv-4096     9433.04 (  0.00%)   12816.68 ( 35.87%)
Hmean    recv-8192    15939.64 (  0.00%)   21402.75 ( 34.27%)
Hmean    recv-16384   26698.44 (  0.00%)   32806.81 ( 22.88%)

1-socket 6 year old machine
                          4.9.0-rc5            4.9.0-rc5
                            vanilla            hopcpu-v4
Hmean    send-64         87.47 (  0.00%)     127.01 ( 45.21%)
Hmean    send-128       174.36 (  0.00%)     254.86 ( 46.17%)
Hmean    send-256       347.52 (  0.00%)     505.91 ( 45.58%)
Hmean    send-1024     1363.03 (  0.00%)    1962.49 ( 43.98%)
Hmean    send-2048     2632.68 (  0.00%)    3731.74 ( 41.75%)
Hmean    send-3312     4123.19 (  0.00%)    5859.08 ( 42.10%)
Hmean    send-4096     5056.48 (  0.00%)    7058.00 ( 39.58%)
Hmean    send-8192     8784.22 (  0.00%)   12134.53 ( 38.14%)
Hmean    send-16384   15081.60 (  0.00%)   19638.90 ( 30.22%)
Hmean    recv-64         86.19 (  0.00%)     126.34 ( 46.58%)
Hmean    recv-128       173.93 (  0.00%)     253.51 ( 45.75%)
Hmean    recv-256       346.19 (  0.00%)     503.34 ( 45.40%)
Hmean    recv-1024     1358.28 (  0.00%)    1951.63 ( 43.68%)
Hmean    recv-2048     2623.45 (  0.00%)    3701.67 ( 41.10%)
Hmean    recv-3312     4108.63 (  0.00%)    5817.75 ( 41.60%)
Hmean    recv-4096     5037.25 (  0.00%)    7004.79 ( 39.06%)
Hmean    recv-8192     8762.32 (  0.00%)   12059.83 ( 37.63%)
Hmean    recv-16384   15042.36 (  0.00%)   19514.33 ( 29.73%)

This is somewhat dramatic, but it is also not universal. For example, on an
older HP machine using pcc-cpufreq there was almost no difference, but
pcc-cpufreq is also a known performance hazard.

These are quite different results but illustrate that the patch is
dependent on the CPU. The results are similar for TCP_STREAM on
the two-socket 
