Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
On Tue, Dec 06, 2016 at 11:43:45AM +0900, Joonsoo Kim wrote:
> > actually clear at all it's an unfair situation, particularly given that the
> > vanilla code is also unfair -- the vanilla code can artificially preserve
> > MIGRATE_UNMOVABLE without any clear indication that it is a universal win.
> > The only deciding factor there was that a fault-intensive workload would mask
> > the overhead of the page allocator due to page zeroing cost, which UNMOVABLE
> > allocations may or may not require. Even that is vague considering that
> > page-table allocations are zeroed even if many kernel allocations are not.
>
> "Vanilla works like that" doesn't seem to be a reasonable justification
> for this change. The vanilla code works with three lists; it now becomes
> six lists, and each list can hold a different size of page. We need to
> check that the previous approach still works well with the current one.
> I think there is a problem, although it's not permanent and would be
> minor. However, it's better to fix it now that it has been found.
>

This is going in circles. I prototyped the modification which increases
the per-cpu structure slightly and will evaluate. It takes about a day
to run through the full set of tests. If it causes no harm, I'll release
another version.

--
Mel Gorman
SUSE Labs
Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
On Mon, Dec 05, 2016 at 09:57:39AM +0000, Mel Gorman wrote:
> On Mon, Dec 05, 2016 at 12:06:19PM +0900, Joonsoo Kim wrote:
> > On Fri, Dec 02, 2016 at 09:04:49AM +0000, Mel Gorman wrote:
> > > On Fri, Dec 02, 2016 at 03:03:46PM +0900, Joonsoo Kim wrote:
> > > > > @@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > > > >  			if (unlikely(isolated_pageblocks))
> > > > >  				mt = get_pageblock_migratetype(page);
> > > > >
> > > > > +			nr_freed += (1 << order);
> > > > > +			count -= (1 << order);
> > > > >  			if (bulkfree_pcp_prepare(page))
> > > > >  				continue;
> > > > >
> > > > > -			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > > > > -			trace_mm_page_pcpu_drain(page, 0, mt);
> > > > > -		} while (--count && --batch_free && !list_empty(list));
> > > > > +			__free_one_page(page, page_to_pfn(page), zone, order, mt);
> > > > > +			trace_mm_page_pcpu_drain(page, order, mt);
> > > > > +		} while (count > 0 && --batch_free && !list_empty(list));
> > > > >  	}
> > > > >  	spin_unlock(&zone->lock);
> > > > > +	pcp->count -= nr_freed;
> > > > >  }
> > > >
> > > > I guess that this patch would cause the following problems.
> > > >
> > > > 1. If pcp->batch is too small, high-order pages will not be freed
> > > > easily and will survive longer. Think about the following situation.
> > > >
> > > > Batch count: 7
> > > > MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> > > > -> order 2...
> > > >
> > > > free count: 1 + 1 + 1 + 2 + 4 = 9
> > > > so order 3 would not be freed.
> > > >
> > >
> > > You're relying on the batch count to be 7 where in a lot of cases it's
> > > 31. Even if low batch counts are common on another platform, or you
> > > adjusted the other counts to be higher values until they equal 30, it
> > > would only be for this drain that no order-3 pages were freed. It's
> > > not a permanent situation.
> > >
> > > When or if it gets freed depends on the allocation request stream, but
> > > the same applies to the existing caches. If a high-order request
> > > arrives, it'll be used. If all the requests are for the other orders,
> > > then eventually the frees will hit the high watermark enough that the
> > > round-robin batch freeing will free the order-3 entry in the cache.
> >
> > I know that it isn't a permanent situation and it depends on the workload.
> > However, it is clearly an unfair situation. We don't have any good reason
> > to cache higher-order freepages longer. Also, a batch count of 7 means
> > that it is a small system. On this kind of system, there is no reason
> > to keep high-order freepages longer in the cache.
> >
>
> Without knowing the future allocation request stream, there is no reason
> to favour one part of the per-cpu cache over another. To me, it's not

What I suggest is exactly that: don't favour one part of the per-cpu
cache over another.

> actually clear at all it's an unfair situation, particularly given that the
> vanilla code is also unfair -- the vanilla code can artificially preserve
> MIGRATE_UNMOVABLE without any clear indication that it is a universal win.
> The only deciding factor there was that a fault-intensive workload would mask
> the overhead of the page allocator due to page zeroing cost, which UNMOVABLE
> allocations may or may not require. Even that is vague considering that
> page-table allocations are zeroed even if many kernel allocations are not.

"Vanilla works like that" doesn't seem to be a reasonable justification
for this change. The vanilla code works with three lists; it now becomes
six lists, and each list can hold a different size of page. We need to
check that the previous approach still works well with the current one.
I think there is a problem, although it's not permanent and would be
minor. However, it's better to fix it now that it has been found.

> > The other potential problem is that if we change
> > PAGE_ALLOC_COSTLY_ORDER to 5 in the future, this batch count of 31 also
> > doesn't guarantee that free_pcppages_bulk() will work fairly, and we
> > will not notice it easily.
> >
>
> In the event the high-order cache is increased, then the high watermark
> would also need to be adjusted to account for that, just as this patch
> does.

pcp->high will be adjusted automatically when the high-order cache is
increased by your change. What we miss is pcp->batch, and there is no
indication that the number of high-order cache entries and pcp->batch
have any association.

> > I think that it can be simply solved by maintaining a last pindex in
> > the pcp. How about it?
> >
>
> That would rely on the previous allocation stream to drive the freeing,
> which is slightly related to the fact the per-cpu cache contents are
> related to the previous request stream. It's still not guaranteed to be
> related to the future request stream.
Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
On Mon, Dec 05, 2016 at 12:06:19PM +0900, Joonsoo Kim wrote:
> On Fri, Dec 02, 2016 at 09:04:49AM +0000, Mel Gorman wrote:
> > On Fri, Dec 02, 2016 at 03:03:46PM +0900, Joonsoo Kim wrote:
> > > > @@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > > >  			if (unlikely(isolated_pageblocks))
> > > >  				mt = get_pageblock_migratetype(page);
> > > >
> > > > +			nr_freed += (1 << order);
> > > > +			count -= (1 << order);
> > > >  			if (bulkfree_pcp_prepare(page))
> > > >  				continue;
> > > >
> > > > -			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > > > -			trace_mm_page_pcpu_drain(page, 0, mt);
> > > > -		} while (--count && --batch_free && !list_empty(list));
> > > > +			__free_one_page(page, page_to_pfn(page), zone, order, mt);
> > > > +			trace_mm_page_pcpu_drain(page, order, mt);
> > > > +		} while (count > 0 && --batch_free && !list_empty(list));
> > > >  	}
> > > >  	spin_unlock(&zone->lock);
> > > > +	pcp->count -= nr_freed;
> > > >  }
> > >
> > > I guess that this patch would cause the following problems.
> > >
> > > 1. If pcp->batch is too small, high-order pages will not be freed
> > > easily and will survive longer. Think about the following situation.
> > >
> > > Batch count: 7
> > > MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> > > -> order 2...
> > >
> > > free count: 1 + 1 + 1 + 2 + 4 = 9
> > > so order 3 would not be freed.
> > >
> >
> > You're relying on the batch count to be 7 where in a lot of cases it's
> > 31. Even if low batch counts are common on another platform, or you
> > adjusted the other counts to be higher values until they equal 30, it
> > would only be for this drain that no order-3 pages were freed. It's not
> > a permanent situation.
> >
> > When or if it gets freed depends on the allocation request stream, but
> > the same applies to the existing caches. If a high-order request
> > arrives, it'll be used. If all the requests are for the other orders,
> > then eventually the frees will hit the high watermark enough that the
> > round-robin batch freeing will free the order-3 entry in the cache.
>
> I know that it isn't a permanent situation and it depends on the workload.
> However, it is clearly an unfair situation. We don't have any good reason
> to cache higher-order freepages longer. Also, a batch count of 7 means
> that it is a small system. On this kind of system, there is no reason
> to keep high-order freepages longer in the cache.
>

Without knowing the future allocation request stream, there is no reason
to favour one part of the per-cpu cache over another. To me, it's not
actually clear at all that it's an unfair situation, particularly given that
the vanilla code is also unfair -- the vanilla code can artificially preserve
MIGRATE_UNMOVABLE without any clear indication that it is a universal win.
The only deciding factor there was that a fault-intensive workload would mask
the overhead of the page allocator due to page zeroing cost, which UNMOVABLE
allocations may or may not require. Even that is vague considering that
page-table allocations are zeroed even if many kernel allocations are not.

> The other potential problem is that if we change
> PAGE_ALLOC_COSTLY_ORDER to 5 in the future, this batch count of 31 also
> doesn't guarantee that free_pcppages_bulk() will work fairly, and we
> will not notice it easily.
>

In the event the high-order cache is increased, then the high watermark
would also need to be adjusted to account for that, just as this patch
does.

> I think that it can be simply solved by maintaining a last pindex in
> the pcp. How about it?
>

That would rely on the previous allocation stream to drive the freeing,
which is slightly related to the fact the per-cpu cache contents are
related to the previous request stream. It's still not guaranteed to be
related to the future request stream.

Adding a new pindex cache adds complexity to the free path without any
guarantee it benefits anything. The use of such a heuristic should be
driven by a workload demonstrating it's a problem. Granted, half of the
cost of a free operation is due to IRQ enable/disable, but there is no
reason to make it unnecessarily expensive.

> > > 3. I guess that order-0 file/anon page alloc/free is dominant in many
> > > workloads. If this case happens, it invalidates the effect of the
> > > high-order cache in the pcp, since cached high-order pages would also
> > > be freed to the buddy when a burst of order-0 frees happens.
> > >
> >
> > A large burst of order-0 frees will free the high-order cache if it's
> > not being used, but I don't see what your point is or why that is a
> > problem. It is pretty much guaranteed that there will be workloads that
> > benefit from protecting the high-order cache (SLUB-intensive alloc/free
> > workloads) while others suffer (fault-intensive map/unmap workloads).
Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
On Fri, Dec 02, 2016 at 09:21:08AM +0100, Michal Hocko wrote:
> On Fri 02-12-16 15:03:46, Joonsoo Kim wrote:
> [...]
> > > o pcp accounting during free is now confined to free_pcppages_bulk as
> > >   it's impossible for the caller to know exactly how many pages were
> > >   freed. Due to the high-order caches, the number of pages drained for
> > >   a request is no longer precise.
> > >
> > > o The high watermark for per-cpu pages is increased to reduce the
> > >   probability that a single refill causes a drain on the next free.
> [...]
> > I guess that this patch would cause the following problems.
> >
> > 1. If pcp->batch is too small, high-order pages will not be freed
> > easily and will survive longer. Think about the following situation.
> >
> > Batch count: 7
> > MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> > -> order 2...
> >
> > free count: 1 + 1 + 1 + 2 + 4 = 9
> > so order 3 would not be freed.
>
> I guess the second paragraph above in the changelog tries to clarify
> that...

It doesn't fully address my concern; this is a different problem.

> > 2. And, it seems that this logic penalizes high-order pages. One free
> > of a high-order page means 1 << order pages freed rather than just one
> > page. This logic does round-robin to choose the target list, so the
> > amount freed per turn will differ by order.
>
> Yes this is indeed possible. The first paragraph above mentions this
> problem.

Yes, it is mentioned briefly, but it is not easy to notice from it that
this penalty for high-order pages exists.

Thanks.
Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
On Fri, Dec 02, 2016 at 09:04:49AM +0000, Mel Gorman wrote:
> On Fri, Dec 02, 2016 at 03:03:46PM +0900, Joonsoo Kim wrote:
> > > @@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > >  			if (unlikely(isolated_pageblocks))
> > >  				mt = get_pageblock_migratetype(page);
> > >
> > > +			nr_freed += (1 << order);
> > > +			count -= (1 << order);
> > >  			if (bulkfree_pcp_prepare(page))
> > >  				continue;
> > >
> > > -			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > > -			trace_mm_page_pcpu_drain(page, 0, mt);
> > > -		} while (--count && --batch_free && !list_empty(list));
> > > +			__free_one_page(page, page_to_pfn(page), zone, order, mt);
> > > +			trace_mm_page_pcpu_drain(page, order, mt);
> > > +		} while (count > 0 && --batch_free && !list_empty(list));
> > >  	}
> > >  	spin_unlock(&zone->lock);
> > > +	pcp->count -= nr_freed;
> > >  }
> >
> > I guess that this patch would cause the following problems.
> >
> > 1. If pcp->batch is too small, high-order pages will not be freed
> > easily and will survive longer. Think about the following situation.
> >
> > Batch count: 7
> > MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> > -> order 2...
> >
> > free count: 1 + 1 + 1 + 2 + 4 = 9
> > so order 3 would not be freed.
> >
>
> You're relying on the batch count to be 7 where in a lot of cases it's
> 31. Even if low batch counts are common on another platform, or you
> adjusted the other counts to be higher values until they equal 30, it
> would only be for this drain that no order-3 pages were freed. It's not
> a permanent situation.
>
> When or if it gets freed depends on the allocation request stream, but
> the same applies to the existing caches. If a high-order request
> arrives, it'll be used. If all the requests are for the other orders,
> then eventually the frees will hit the high watermark enough that the
> round-robin batch freeing will free the order-3 entry in the cache.

I know that it isn't a permanent situation and it depends on the
workload. However, it is clearly an unfair situation. We don't have any
good reason to cache higher-order freepages longer. Also, a batch count
of 7 means that it is a small system. On this kind of system, there is
no reason to keep high-order freepages longer in the cache.

The other potential problem is that if we change
PAGE_ALLOC_COSTLY_ORDER to 5 in the future, this batch count of 31 also
doesn't guarantee that free_pcppages_bulk() will work fairly, and we
will not notice it easily.

I think that it can be simply solved by maintaining a last pindex in
the pcp. How about it?

> > 2. And, it seems that this logic penalizes high-order pages. One free
> > of a high-order page means 1 << order pages freed rather than just one
> > page. This logic does round-robin to choose the target list, so the
> > amount freed per turn will differ by order. I think that it makes some
> > sense, because high-order pages are less important to cache in the pcp
> > than lower orders, but I'd like to know if it is intended or not. If
> > intended, it deserves a comment.
> >
>
> It's intended, but I'm not sure what else you want me to explain outside
> the code itself in this case. The round-robin nature of the bulk drain
> already doesn't attach any special importance to the migratetype of the
> list, and there is no good reason to assume that high-order pages in the
> cache when the high watermark is reached deserve special protection.

The non-trivial part is that the round-robin approach penalizes
high-order page caching. We usually think of round-robin as fair, but in
this case it isn't. Some people will notice that the amount of freepages
per turn differs, but some may not. This differs from the past, where
the amount of freepages per turn was the same even if the migratetype
was different. I think it deserves a comment, but I don't feel strongly
about it.

> > 3. I guess that order-0 file/anon page alloc/free is dominant in many
> > workloads. If this case happens, it invalidates the effect of the
> > high-order cache in the pcp, since cached high-order pages would also
> > be freed to the buddy when a burst of order-0 frees happens.
> >
>
> A large burst of order-0 frees will free the high-order cache if it's
> not being used, but I don't see what your point is or why that is a
> problem. It is pretty much guaranteed that there will be workloads that
> benefit from protecting the high-order cache (SLUB-intensive alloc/free
> workloads) while others suffer (fault-intensive map/unmap workloads).
>
> What's there at the moment behaves reasonably on a variety of workloads
> across 8 machines.

Yes, I see that this patch improves some workloads. What I'd like to say
is that I found some weaknesses, and if they are fixed, we can get a
better result. This patch implements a unified pcp cache for
Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
On Fri, Dec 02, 2016 at 09:04:49AM +, Mel Gorman wrote: > On Fri, Dec 02, 2016 at 03:03:46PM +0900, Joonsoo Kim wrote: > > > @@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone, > > > int count, > > > if (unlikely(isolated_pageblocks)) > > > mt = get_pageblock_migratetype(page); > > > > > > + nr_freed += (1 << order); > > > + count -= (1 << order); > > > if (bulkfree_pcp_prepare(page)) > > > continue; > > > > > > - __free_one_page(page, page_to_pfn(page), zone, 0, mt); > > > - trace_mm_page_pcpu_drain(page, 0, mt); > > > - } while (--count && --batch_free && !list_empty(list)); > > > + __free_one_page(page, page_to_pfn(page), zone, order, > > > mt); > > > + trace_mm_page_pcpu_drain(page, order, mt); > > > + } while (count > 0 && --batch_free && !list_empty(list)); > > > } > > > spin_unlock(>lock); > > > + pcp->count -= nr_freed; > > > } > > > > I guess that this patch would cause following problems. > > > > 1. If pcp->batch is too small, high order page will not be freed > > easily and survive longer. Think about following situation. > > > > Batch count: 7 > > MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1 > > -> order 2... > > > > free count: 1 + 1 + 1 + 2 + 4 = 9 > > so order 3 would not be freed. > > > > You're relying on the batch count to be 7 where in a lot of cases it's > 31. Even if low batch counts are common on another platform or you adjusted > the other counts to be higher values until they equal 30, it would be for > this drain that no order-3 pages were freed. It's not a permanent situation. > > When or if it gets freed depends on the allocation request stream but the > same applies to the existing caches. If a high-order request arrives, it'll > be used. If all the requests are for the other orders, then eventually > the frees will hit the high watermark enough that the round-robin batch > freeing fill free the order-3 entry in the cache. 
I know that it isn't a permanent situation and it depends on workload. However, it is clearly an unfair situation. We don't have any good reason to cache higher order freepage longer. Even, batch count 7 means that it is a small system. In this kind of system, there is no reason to keep high order freepage longer in the cache. The other potential problem is that if we change PAGE_ALLOC_COSTLY_ORDER to 5 in the future, this 31 batch count also doesn't guarantee that free_pcppages_bulk() will work fairly and we will not notice it easily. I think that it can be simply solved by maintaining a last pindex in pcp. How about it? > > > 2. And, It seems that this logic penalties high order pages. One free > > to high order page means 1 << order pages free rather than just > > one page free. This logic do round-robin to choose the target page so > > amount of freed page will be different by the order. I think that it > > makes some sense because high order page are less important to cache > > in pcp than lower order but I'd like to know if it is intended or not. > > If intended, it deserves the comment. > > > > It's intended but I'm not sure what else you want me to explain outside > the code itself in this case. The round-robin nature of the bulk drain > already doesn't attach any special important to the migrate type of the > list and there is no good reason to assume that high-order pages in the > cache when the high watermark is reached deserve special protection. Non-trivial part is that round-robin approach penalties high-order pages caching. We usually think that round-robin is fair, but, in this case, it isn't. Some people can notice that amount of freepage in turn is different but some may not. It is a different situation in the past that amount of freepage in turn is same even if migratetype is different. I think that it deserve some comment but I don't feel it strongly. > > 3. I guess that order-0 file/anon page alloc/free is dominent in many > > workloads. 
> > If this case happens, it invalidates the effect of the high-order
> > cache in the pcp since cached high-order pages would also be freed to the
> > buddy when a burst of order-0 frees happens.
> >
> A large burst of order-0 frees will free the high-order cache if it's not
> being used but I don't see what your point is or why that is a problem.
> It is pretty much guaranteed that there will be workloads that benefit
> from protecting the high-order cache (SLUB-intensive alloc/free
> workloads) while others suffer (fault-intensive map/unmap workloads).
>
> What's there at the moment behaves reasonably on a variety of workloads
> across 8 machines.

Yes, I see that this patch improves some workloads. What I'd like to say is that I found some weaknesses and, if they are fixed, we can get a better result. This patch implements a unified pcp cache for
Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
On Fri, Dec 02, 2016 at 03:03:46PM +0900, Joonsoo Kim wrote:
> > @@ -1132,14 +1134,17 @@ static void free_pcppages_bulk(struct zone *zone,
> > int count,
> > 			if (unlikely(isolated_pageblocks))
> > 				mt = get_pageblock_migratetype(page);
> >
> > +			nr_freed += (1 << order);
> > +			count -= (1 << order);
> > 			if (bulkfree_pcp_prepare(page))
> > 				continue;
> >
> > -			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > -			trace_mm_page_pcpu_drain(page, 0, mt);
> > -		} while (--count && --batch_free && !list_empty(list));
> > +			__free_one_page(page, page_to_pfn(page), zone, order, mt);
> > +			trace_mm_page_pcpu_drain(page, order, mt);
> > +		} while (count > 0 && --batch_free && !list_empty(list));
> > 	}
> > 	spin_unlock(&zone->lock);
> > +	pcp->count -= nr_freed;
> > }
>
> I guess that this patch would cause the following problems.
>
> 1. If pcp->batch is too small, high-order pages will not be freed
> easily and will survive longer. Think about the following situation.
>
> Batch count: 7
> MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> -> order 2...
>
> free count: 1 + 1 + 1 + 2 + 4 = 9
> so order 3 would not be freed.
>
You're relying on the batch count to be 7 where in a lot of cases it's 31. Even if low batch counts are common on another platform or you adjusted the other counts to be higher values until they equal 30, it would be for this drain that no order-3 pages were freed. It's not a permanent situation.

When or if it gets freed depends on the allocation request stream but the same applies to the existing caches. If a high-order request arrives, it'll be used. If all the requests are for the other orders, then eventually the frees will hit the high watermark enough that the round-robin batch freeing will free the order-3 entry in the cache.

> 2. And, it seems that this logic penalizes high-order pages. One free
> of a high-order page means 1 << order pages freed rather than just
> one page free.
> This logic does round-robin to choose the target page, so
> the amount freed will differ by the order. I think that it
> makes some sense because high-order pages are less important to cache
> in the pcp than lower orders, but I'd like to know if it is intended or not.
> If intended, it deserves a comment.
>
It's intended but I'm not sure what else you want me to explain outside the code itself in this case. The round-robin nature of the bulk drain already doesn't attach any special importance to the migrate type of the list and there is no good reason to assume that high-order pages in the cache when the high watermark is reached deserve special protection.

> 3. I guess that order-0 file/anon page alloc/free is dominant in many
> workloads. If this case happens, it invalidates the effect of the high-order
> cache in the pcp since cached high-order pages would also be freed to the
> buddy when a burst of order-0 frees happens.
>
A large burst of order-0 frees will free the high-order cache if it's not being used but I don't see what your point is or why that is a problem. It is pretty much guaranteed that there will be workloads that benefit from protecting the high-order cache (SLUB-intensive alloc/free workloads) while others suffer (fault-intensive map/unmap workloads).

What's there at the moment behaves reasonably on a variety of workloads across 8 machines.
> > @@ -2589,20 +2595,33 @@ struct page *buffered_rmqueue(struct zone
> > *preferred_zone,
> > 	struct page *page;
> > 	bool cold = ((gfp_flags & __GFP_COLD) != 0);
> >
> > -	if (likely(order == 0)) {
> > +	if (likely(order <= PAGE_ALLOC_COSTLY_ORDER)) {
> > 		struct per_cpu_pages *pcp;
> > 		struct list_head *list;
> >
> > 		local_irq_save(flags);
> > 		do {
> > +			unsigned int pindex;
> > +
> > +			pindex = order_to_pindex(migratetype, order);
> > 			pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > -			list = &pcp->lists[migratetype];
> > +			list = &pcp->lists[pindex];
> > 			if (list_empty(list)) {
> > -				pcp->count += rmqueue_bulk(zone, 0,
> > +				int nr_pages = rmqueue_bulk(zone, order,
> > 					pcp->batch, list,
> > 					migratetype, cold);
>
> Maybe you need to fix rmqueue_bulk(). rmqueue_bulk() allocates batch
> * (1 << order) pages and pcp->count can easily overflow pcp->high
> because list empty here doesn't mean that pcp->count is zero.
>
Potentially a refill can cause a drain on another list. However, I adjusted the high watermark in pageset_set_batch to make it unlikely that a single refill will cause a drain and added a
Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
On Fri 02-12-16 00:22:44, Mel Gorman wrote: > Changelog since v4 > o Avoid pcp->count getting out of sync if struct page gets corrupted > > Changelog since v3 > o Allow high-order atomic allocations to use reserves > > Changelog since v2 > o Correct initialisation to avoid -Woverflow warning > > SLUB has been the default small kernel object allocator for quite some time > but it is not universally used due to performance concerns and a reliance > on high-order pages. The high-order concerns has two major components -- > high-order pages are not always available and high-order page allocations > potentially contend on the zone->lock. This patch addresses some concerns > about the zone lock contention by extending the per-cpu page allocator to > cache high-order pages. The patch makes the following modifications > > o New per-cpu lists are added to cache the high-order pages. This increases > the cache footprint of the per-cpu allocator and overall usage but for > some workloads, this will be offset by reduced contention on zone->lock. > The first MIGRATE_PCPTYPE entries in the list are per-migratetype. The > remaining are high-order caches up to and including > PAGE_ALLOC_COSTLY_ORDER > > o pcp accounting during free is now confined to free_pcppages_bulk as it's > impossible for the caller to know exactly how many pages were freed. > Due to the high-order caches, the number of pages drained for a request > is no longer precise. > > o The high watermark for per-cpu pages is increased to reduce the probability > that a single refill causes a drain on the next free. > > The benefit depends on both the workload and the machine as ultimately the > determining factor is whether cache line bounces on zone->lock or contention > is a problem. The patch was tested on a variety of workloads and machines, > some of which are reported here. > > This is the result from netperf running UDP_STREAM on localhost. 
It was
> selected on the basis that it is slab-intensive and has been the subject
> of previous SLAB vs SLUB comparisons with the caveat that this is not
> testing between two physical hosts.
>
> 2-socket modern machine
>                            4.9.0-rc5             4.9.0-rc5
>                              vanilla             hopcpu-v5
> Hmean    send-64       178.38 (  0.00%)      260.54 ( 46.06%)
> Hmean    send-128      351.49 (  0.00%)      518.56 ( 47.53%)
> Hmean    send-256      671.23 (  0.00%)     1005.72 ( 49.83%)
> Hmean    send-1024    2663.60 (  0.00%)     3880.54 ( 45.69%)
> Hmean    send-2048    5126.53 (  0.00%)     7545.38 ( 47.18%)
> Hmean    send-3312    7949.99 (  0.00%)    11324.34 ( 42.44%)
> Hmean    send-4096    9433.56 (  0.00%)    12818.85 ( 35.89%)
> Hmean    send-8192   15940.64 (  0.00%)    21404.98 ( 34.28%)
> Hmean    send-16384  26699.54 (  0.00%)    32810.08 ( 22.89%)
> Hmean    recv-64       178.38 (  0.00%)      260.52 ( 46.05%)
> Hmean    recv-128      351.49 (  0.00%)      518.53 ( 47.53%)
> Hmean    recv-256      671.20 (  0.00%)     1005.42 ( 49.79%)
> Hmean    recv-1024    2663.45 (  0.00%)     3879.75 ( 45.67%)
> Hmean    recv-2048    5126.26 (  0.00%)     7544.23 ( 47.17%)
> Hmean    recv-3312    7949.50 (  0.00%)    11322.52 ( 42.43%)
> Hmean    recv-4096    9433.04 (  0.00%)    12816.68 ( 35.87%)
> Hmean    recv-8192   15939.64 (  0.00%)    21402.75 ( 34.27%)
> Hmean    recv-16384  26698.44 (  0.00%)    32806.81 ( 22.88%)
>
> 1-socket 6 year old machine
>                            4.9.0-rc5             4.9.0-rc5
>                              vanilla             hopcpu-v4
> Hmean    send-64        87.47 (  0.00%)      127.01 ( 45.21%)
> Hmean    send-128      174.36 (  0.00%)      254.86 ( 46.17%)
> Hmean    send-256      347.52 (  0.00%)      505.91 ( 45.58%)
> Hmean    send-1024    1363.03 (  0.00%)     1962.49 ( 43.98%)
> Hmean    send-2048    2632.68 (  0.00%)     3731.74 ( 41.75%)
> Hmean    send-3312    4123.19 (  0.00%)     5859.08 ( 42.10%)
> Hmean    send-4096    5056.48 (  0.00%)     7058.00 ( 39.58%)
> Hmean    send-8192    8784.22 (  0.00%)    12134.53 ( 38.14%)
> Hmean    send-16384  15081.60 (  0.00%)    19638.90 ( 30.22%)
> Hmean    recv-64        86.19 (  0.00%)      126.34 ( 46.58%)
> Hmean    recv-128      173.93 (  0.00%)      253.51 ( 45.75%)
> Hmean    recv-256      346.19 (  0.00%)      503.34 ( 45.40%)
> Hmean    recv-1024    1358.28 (  0.00%)     1951.63 ( 43.68%)
> Hmean    recv-2048    2623.45 (  0.00%)     3701.67 ( 41.10%)
> Hmean    recv-3312    4108.63 (  0.00%)     5817.75 ( 41.60%)
> Hmean    recv-4096    5037.25 (  0.00%)     7004.79 ( 39.06%)
> Hmean    recv-8192    8762.32 (  0.00%)    12059.83 ( 37.63%)
> Hmean    recv-16384  15042.36 (  0.00%)    19514.33 ( 29.73%)
>
> This is somewhat dramatic but it's also not universal. For example, it was
> observed on an older HP machine using pcc-cpufreq that there was
Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
On Fri 02-12-16 15:03:46, Joonsoo Kim wrote:
[...]
> > o pcp accounting during free is now confined to free_pcppages_bulk as it's
> > impossible for the caller to know exactly how many pages were freed.
> > Due to the high-order caches, the number of pages drained for a request
> > is no longer precise.
> >
> > o The high watermark for per-cpu pages is increased to reduce the
> > probability that a single refill causes a drain on the next free.
[...]
> I guess that this patch would cause the following problems.
>
> 1. If pcp->batch is too small, high-order pages will not be freed
> easily and will survive longer. Think about the following situation.
>
> Batch count: 7
> MIGRATE_UNMOVABLE -> MIGRATE_MOVABLE -> MIGRATE_RECLAIMABLE -> order 1
> -> order 2...
>
> free count: 1 + 1 + 1 + 2 + 4 = 9
> so order 3 would not be freed.

I guess the second paragraph above in the changelog tries to clarify that...

> 2. And, it seems that this logic penalizes high-order pages. One free
> of a high-order page means 1 << order pages freed rather than just
> one page. This logic does round-robin to choose the target page, so
> the amount freed will differ by the order.

Yes, this is indeed possible. The first paragraph above mentions this problem.

> I think that it
> makes some sense because high-order pages are less important to cache
> in the pcp than lower orders, but I'd like to know if it is intended or not.
> If intended, it deserves a comment.
>
> 3. I guess that order-0 file/anon page alloc/free is dominant in many
> workloads. If this case happens, it invalidates the effect of the high-order
> cache in the pcp since cached high-order pages would also be freed to the
> buddy when a burst of order-0 frees happens.

Yes, this is true and I was wondering the same, but I believe this can be enhanced later on. E.g. we can check the order when crossing the pcp->high mark and only free the given order's portion of the batch. I just wouldn't over-optimize at this stage.
--
Michal Hocko
SUSE Labs
Re: [PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
Hello, Mel. I didn't follow up previous discussion so what I raise here would be duplicated. Please let me know the link if it is answered before. On Fri, Dec 02, 2016 at 12:22:44AM +, Mel Gorman wrote: > Changelog since v4 > o Avoid pcp->count getting out of sync if struct page gets corrupted > > Changelog since v3 > o Allow high-order atomic allocations to use reserves > > Changelog since v2 > o Correct initialisation to avoid -Woverflow warning > > SLUB has been the default small kernel object allocator for quite some time > but it is not universally used due to performance concerns and a reliance > on high-order pages. The high-order concerns has two major components -- > high-order pages are not always available and high-order page allocations > potentially contend on the zone->lock. This patch addresses some concerns > about the zone lock contention by extending the per-cpu page allocator to > cache high-order pages. The patch makes the following modifications > > o New per-cpu lists are added to cache the high-order pages. This increases > the cache footprint of the per-cpu allocator and overall usage but for > some workloads, this will be offset by reduced contention on zone->lock. > The first MIGRATE_PCPTYPE entries in the list are per-migratetype. The > remaining are high-order caches up to and including > PAGE_ALLOC_COSTLY_ORDER > > o pcp accounting during free is now confined to free_pcppages_bulk as it's > impossible for the caller to know exactly how many pages were freed. > Due to the high-order caches, the number of pages drained for a request > is no longer precise. > > o The high watermark for per-cpu pages is increased to reduce the probability > that a single refill causes a drain on the next free. > > The benefit depends on both the workload and the machine as ultimately the > determining factor is whether cache line bounces on zone->lock or contention > is a problem. 
The patch was tested on a variety of workloads and machines,
> some of which are reported here.
>
> This is the result from netperf running UDP_STREAM on localhost. It was
> selected on the basis that it is slab-intensive and has been the subject
> of previous SLAB vs SLUB comparisons with the caveat that this is not
> testing between two physical hosts.
>
> 2-socket modern machine
>                            4.9.0-rc5             4.9.0-rc5
>                              vanilla             hopcpu-v5
> Hmean    send-64       178.38 (  0.00%)      260.54 ( 46.06%)
> Hmean    send-128      351.49 (  0.00%)      518.56 ( 47.53%)
> Hmean    send-256      671.23 (  0.00%)     1005.72 ( 49.83%)
> Hmean    send-1024    2663.60 (  0.00%)     3880.54 ( 45.69%)
> Hmean    send-2048    5126.53 (  0.00%)     7545.38 ( 47.18%)
> Hmean    send-3312    7949.99 (  0.00%)    11324.34 ( 42.44%)
> Hmean    send-4096    9433.56 (  0.00%)    12818.85 ( 35.89%)
> Hmean    send-8192   15940.64 (  0.00%)    21404.98 ( 34.28%)
> Hmean    send-16384  26699.54 (  0.00%)    32810.08 ( 22.89%)
> Hmean    recv-64       178.38 (  0.00%)      260.52 ( 46.05%)
> Hmean    recv-128      351.49 (  0.00%)      518.53 ( 47.53%)
> Hmean    recv-256      671.20 (  0.00%)     1005.42 ( 49.79%)
> Hmean    recv-1024    2663.45 (  0.00%)     3879.75 ( 45.67%)
> Hmean    recv-2048    5126.26 (  0.00%)     7544.23 ( 47.17%)
> Hmean    recv-3312    7949.50 (  0.00%)    11322.52 ( 42.43%)
> Hmean    recv-4096    9433.04 (  0.00%)    12816.68 ( 35.87%)
> Hmean    recv-8192   15939.64 (  0.00%)    21402.75 ( 34.27%)
> Hmean    recv-16384  26698.44 (  0.00%)    32806.81 ( 22.88%)
>
> 1-socket 6 year old machine
>                            4.9.0-rc5             4.9.0-rc5
>                              vanilla             hopcpu-v4
> Hmean    send-64        87.47 (  0.00%)      127.01 ( 45.21%)
> Hmean    send-128      174.36 (  0.00%)      254.86 ( 46.17%)
> Hmean    send-256      347.52 (  0.00%)      505.91 ( 45.58%)
> Hmean    send-1024    1363.03 (  0.00%)     1962.49 ( 43.98%)
> Hmean    send-2048    2632.68 (  0.00%)     3731.74 ( 41.75%)
> Hmean    send-3312    4123.19 (  0.00%)     5859.08 ( 42.10%)
> Hmean    send-4096    5056.48 (  0.00%)     7058.00 ( 39.58%)
> Hmean    send-8192    8784.22 (  0.00%)    12134.53 ( 38.14%)
> Hmean    send-16384  15081.60 (  0.00%)    19638.90 ( 30.22%)
> Hmean    recv-64        86.19 (  0.00%)      126.34 ( 46.58%)
> Hmean    recv-128      173.93 (  0.00%)      253.51 ( 45.75%)
> Hmean    recv-256      346.19 (  0.00%)      503.34 ( 45.40%)
> Hmean    recv-1024    1358.28 (  0.00%)     1951.63 ( 43.68%)
> Hmean    recv-2048    2623.45 (  0.00%)     3701.67 ( 41.10%)
> Hmean    recv-3312    4108.63 (  0.00%)     5817.75 ( 41.60%)
> Hmean    recv-4096    5037.25 (  0.00%)     7004.79 ( 39.06%)
> Hmean    recv-8192    8762.32 (  0.00%)    12059.83 ( 37.63%)
> Hmean    recv-16384  15042.36 (  0.00%)
[PATCH 2/2] mm: page_alloc: High-order per-cpu page allocator v5
Changelog since v4
o Avoid pcp->count getting out of sync if struct page gets corrupted

Changelog since v3
o Allow high-order atomic allocations to use reserves

Changelog since v2
o Correct initialisation to avoid -Woverflow warning

SLUB has been the default small kernel object allocator for quite some time but it is not universally used due to performance concerns and a reliance on high-order pages. The high-order concerns have two major components: high-order pages are not always available and high-order page allocations potentially contend on the zone->lock. This patch addresses some concerns about the zone lock contention by extending the per-cpu page allocator to cache high-order pages. The patch makes the following modifications:

o New per-cpu lists are added to cache the high-order pages. This increases
  the cache footprint of the per-cpu allocator and overall usage but for
  some workloads, this will be offset by reduced contention on zone->lock.
  The first MIGRATE_PCPTYPES entries in the list are per-migratetype. The
  remaining are high-order caches up to and including PAGE_ALLOC_COSTLY_ORDER.

o pcp accounting during free is now confined to free_pcppages_bulk as it's
  impossible for the caller to know exactly how many pages were freed.
  Due to the high-order caches, the number of pages drained for a request
  is no longer precise.

o The high watermark for per-cpu pages is increased to reduce the probability
  that a single refill causes a drain on the next free.

The benefit depends on both the workload and the machine as ultimately the determining factor is whether cache line bounces on zone->lock or contention is a problem. The patch was tested on a variety of workloads and machines, some of which are reported here.

This is the result from netperf running UDP_STREAM on localhost.
2-socket modern machine
                                4.9.0-rc5             4.9.0-rc5
                                  vanilla             hopcpu-v5
Hmean    send-64         178.38 (  0.00%)      260.54 ( 46.06%)
Hmean    send-128        351.49 (  0.00%)      518.56 ( 47.53%)
Hmean    send-256        671.23 (  0.00%)     1005.72 ( 49.83%)
Hmean    send-1024      2663.60 (  0.00%)     3880.54 ( 45.69%)
Hmean    send-2048      5126.53 (  0.00%)     7545.38 ( 47.18%)
Hmean    send-3312      7949.99 (  0.00%)    11324.34 ( 42.44%)
Hmean    send-4096      9433.56 (  0.00%)    12818.85 ( 35.89%)
Hmean    send-8192     15940.64 (  0.00%)    21404.98 ( 34.28%)
Hmean    send-16384    26699.54 (  0.00%)    32810.08 ( 22.89%)
Hmean    recv-64         178.38 (  0.00%)      260.52 ( 46.05%)
Hmean    recv-128        351.49 (  0.00%)      518.53 ( 47.53%)
Hmean    recv-256        671.20 (  0.00%)     1005.42 ( 49.79%)
Hmean    recv-1024      2663.45 (  0.00%)     3879.75 ( 45.67%)
Hmean    recv-2048      5126.26 (  0.00%)     7544.23 ( 47.17%)
Hmean    recv-3312      7949.50 (  0.00%)    11322.52 ( 42.43%)
Hmean    recv-4096      9433.04 (  0.00%)    12816.68 ( 35.87%)
Hmean    recv-8192     15939.64 (  0.00%)    21402.75 ( 34.27%)
Hmean    recv-16384    26698.44 (  0.00%)    32806.81 ( 22.88%)

1-socket 6 year old machine
                                4.9.0-rc5             4.9.0-rc5
                                  vanilla             hopcpu-v4
Hmean    send-64          87.47 (  0.00%)      127.01 ( 45.21%)
Hmean    send-128        174.36 (  0.00%)      254.86 ( 46.17%)
Hmean    send-256        347.52 (  0.00%)      505.91 ( 45.58%)
Hmean    send-1024      1363.03 (  0.00%)     1962.49 ( 43.98%)
Hmean    send-2048      2632.68 (  0.00%)     3731.74 ( 41.75%)
Hmean    send-3312      4123.19 (  0.00%)     5859.08 ( 42.10%)
Hmean    send-4096      5056.48 (  0.00%)     7058.00 ( 39.58%)
Hmean    send-8192      8784.22 (  0.00%)    12134.53 ( 38.14%)
Hmean    send-16384    15081.60 (  0.00%)    19638.90 ( 30.22%)
Hmean    recv-64          86.19 (  0.00%)      126.34 ( 46.58%)
Hmean    recv-128        173.93 (  0.00%)      253.51 ( 45.75%)
Hmean    recv-256        346.19 (  0.00%)      503.34 ( 45.40%)
Hmean    recv-1024      1358.28 (  0.00%)     1951.63 ( 43.68%)
Hmean    recv-2048      2623.45 (  0.00%)     3701.67 ( 41.10%)
Hmean    recv-3312      4108.63 (  0.00%)     5817.75 ( 41.60%)
Hmean    recv-4096      5037.25 (  0.00%)     7004.79 ( 39.06%)
Hmean    recv-8192      8762.32 (  0.00%)    12059.83 ( 37.63%)
Hmean    recv-16384    15042.36 (  0.00%)    19514.33 ( 29.73%)

This is somewhat dramatic but it's also not universal.
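For anyone reading these mmtests-style tables for the first time: the bracketed column is the relative gain of the patched kernel over vanilla, with vanilla as the baseline (hence its 0.00%). A quick sketch of the arithmetic, using figures from the tables above (the helper name is mine, not from any harness):

```python
def gain_pct(vanilla: float, patched: float) -> float:
    """Relative improvement of the patched result over vanilla, in percent."""
    return (patched - vanilla) / vanilla * 100.0

# First row of the 2-socket table: Hmean send-64.
print(round(gain_pct(178.38, 260.54), 2))   # 46.06
# Last send row of the 1-socket table: Hmean send-16384.
print(round(gain_pct(15081.60, 19638.90), 2))  # 30.22
```

Note the gain shrinks as message size grows; with larger messages, less time is spent in the allocator relative to the copy, so zone->lock contention matters less.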
For example, on an older HP machine using pcc-cpufreq there was almost no
difference, but pcc-cpufreq is also a known performance hazard. These are
quite different results but illustrate that the benefit of the patch
depends on the CPU. The results are similar for TCP_STREAM on the
two-socket