Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

2018-03-25 Thread Aaron Lu
On Thu, Mar 22, 2018 at 08:17:19AM -0700, Matthew Wilcox wrote:
> On Tue, Mar 13, 2018 at 11:34:53AM +0800, Aaron Lu wrote:
> > I wish there is a data structure that has the flexibility of list while
> > at the same time we can locate the Nth element in the list without the
> > need to iterate. That's what I'm looking for when developing clustered
> > allocation for order 0 pages. In the end, I had to use another place to
> > record where the Nth element is. I hope to send out v2 of that RFC
> > series soon but I'm still collecting data for it. I would appreciate if
> > people could take a look then :-)
> 
> Sorry, I missed this.  There is such a data structure -- the IDR, or
> possibly a bare radix tree, or we can build a better data structure on
> top of the radix tree (I talked about one called the XQueue a while ago).
> 
> The IDR will automatically grow to whatever needed size, it stores
> pointers, you can find out quickly where the last allocated index is,
> you can remove from the middle of the array.  Disadvantage is that it
> requires memory allocation to store the array of pointers, *but* it
> can always hold at least one entry.  So if you have no memory, you can
> always return the one element in your IDR to the free pool and allocate
> from that page.

Thanks for the pointer, will take a look later.
Currently I'm focusing on finding real workloads that have the zone
lock contention issue.


Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

2018-03-22 Thread Matthew Wilcox
On Tue, Mar 13, 2018 at 11:34:53AM +0800, Aaron Lu wrote:
> I wish there is a data structure that has the flexibility of list while
> at the same time we can locate the Nth element in the list without the
> need to iterate. That's what I'm looking for when developing clustered
> allocation for order 0 pages. In the end, I had to use another place to
> record where the Nth element is. I hope to send out v2 of that RFC
> series soon but I'm still collecting data for it. I would appreciate if
> people could take a look then :-)

Sorry, I missed this.  There is such a data structure -- the IDR, or
possibly a bare radix tree, or we can build a better data structure on
top of the radix tree (I talked about one called the XQueue a while ago).

The IDR will automatically grow to whatever size is needed, it stores
pointers, you can quickly find out where the last allocated index is,
and you can remove from the middle of the array.  The disadvantage is
that it requires memory allocation to store the array of pointers, *but* it
can always hold at least one entry.  So if you have no memory, you can
always return the one element in your IDR to the free pool and allocate
from that page.
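
For reference, a minimal sketch of the <linux/idr.h> calls being
described above. The pcp_idr_* wrappers and the idea of indexing pcp
pages this way are purely illustrative assumptions; only DEFINE_IDR(),
idr_alloc(), idr_find() and idr_remove() are the real API:

	#include <linux/idr.h>
	#include <linux/mm.h>

	static DEFINE_IDR(pcp_idr);	/* hypothetical index of pcp pages */

	/* Store a page; the IDR picks and returns the lowest free index. */
	static int pcp_idr_add(struct page *page)
	{
		/* GFP_ATOMIC since callers may run with irqs disabled */
		return idr_alloc(&pcp_idr, page, 0, 0, GFP_ATOMIC);
	}

	/* Locate the page at a given index without walking a list. */
	static struct page *pcp_idr_get(int id)
	{
		return idr_find(&pcp_idr, id);
	}

	/* Remove from the middle; neighbouring entries are untouched. */
	static void pcp_idr_del(int id)
	{
		idr_remove(&pcp_idr, id);
	}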


Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

2018-03-12 Thread Aaron Lu
On Mon, Mar 12, 2018 at 03:22:53PM +0100, Vlastimil Babka wrote:
> On 03/01/2018 07:28 AM, Aaron Lu wrote:
> > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
> > the zone->lock is held and then pages are chosen from PCP's migratetype
> > list. While there is actually no need to do this 'choose part' under
> > lock since it's PCP pages, the only CPU that can touch them is us and
> > irq is also disabled.
> > 
> > Moving this part outside could reduce lock held time and improve
> > performance. Test with will-it-scale/page_fault1 full load:
> > 
> > kernel  Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
> > v4.16-rc2+  9034215        7971818       13667135       15677465
> > this patch  9536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> > 
> > What the test does is: starts $nr_cpu processes and each will repeatedly
> > do the following for 5 minutes:
> > 1 mmap 128M anonymouse space;
> > 2 write access to that space;
> > 3 munmap.
> > The score is the aggregated iteration.
> > 
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
> > 
> > Acked-by: Mel Gorman 
> > Signed-off-by: Aaron Lu 
> > ---
> >  mm/page_alloc.c | 39 +++
> >  1 file changed, 23 insertions(+), 16 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index faa33eac1635..dafdcdec9c1f 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1116,12 +1116,10 @@ static void free_pcppages_bulk(struct zone *zone, 
> > int count,
> > int migratetype = 0;
> > int batch_free = 0;
> > bool isolated_pageblocks;
> > -
> > -   spin_lock(&zone->lock);
> > -   isolated_pageblocks = has_isolate_pageblock(zone);
> > +   struct page *page, *tmp;
> > +   LIST_HEAD(head);
> >  
> > while (count) {
> > -   struct page *page;
> > struct list_head *list;
> >  
> > /*
> > @@ -1143,27 +1141,36 @@ static void free_pcppages_bulk(struct zone *zone, 
> > int count,
> > batch_free = count;
> >  
> > do {
> > -   int mt; /* migratetype of the to-be-freed page */
> > -
> > page = list_last_entry(list, struct page, lru);
> > -   /* must delete as __free_one_page list manipulates */
> > +   /* must delete to avoid corrupting pcp list */
> > list_del(&page->lru);
> 
> Well, since bulkfree_pcp_prepare() doesn't care about page->lru, you
> could maybe use list_move_tail() instead of list_del() +
> list_add_tail()? That avoids temporarily writing poison values.

Good point, except bulkfree_pcp_prepare() could return error and then
the page will need to be removed from the to-be-freed list, like this:

do {
	page = list_last_entry(list, struct page, lru);
	list_move_tail(&page->lru, &head);
	pcp->count--;

	if (bulkfree_pcp_prepare(page))
		list_del(&page->lru);
} while (--count && --batch_free && !list_empty(list));

Considering bulkfree_pcp_prepare() returning error is the rare case,
this list_del() should rarely happen. At the same time, this part is
outside of zone->lock and can hardly impact performance...so I'm not
sure.
 
> Hm actually, you are reversing the list in the process, because page is
> obtained by list_last_entry and you use list_add_tail. That could have
> unintended performance consequences?

True, the order is changed while these to-be-freed pages are in this
temporary list, but they are then iterated and freed one by one from
head to tail, so the order in which they land in free_list is the same
as before the patch (and the same as their order in the pcp list). For
example, if a pcp list holds pages A-B-C-D, they are freed as D, C, B,
A both before and after the patch.

> 
> Also maybe list_cut_position() could be faster than shuffling pages one
> by one? I guess not really, because batch_free will be generally low?

We will need to know where to cut if list_cut_position() is to be used
and to find that out, the list will need to be iterated first. I guess
that's too much trouble.
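
For reference, list_cut_position() can splice everything from the head
of a list up to and including a given entry onto another list in O(1),
but that boundary entry has to be found first, which is the iteration
mentioned above; a sketch with a hypothetical helper (cut_first_n() is
not an existing kernel function):

	/*
	 * Hypothetical helper: move the first nr entries of src onto dst
	 * with a single splice. Caller must ensure nr does not exceed the
	 * list length; finding the boundary still costs the walk.
	 */
	static void cut_first_n(struct list_head *dst, struct list_head *src, int nr)
	{
		struct list_head *pos = src;

		while (nr--)
			pos = pos->next;	/* the unavoidable iteration */

		list_cut_position(dst, src, pos);
	}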

Since this part of the code is per-cpu (outside of zone->lock) and these
pages are in the pcp (meaning their cachelines are not likely to be
remote), I didn't worry too much about iterating them one by one instead
of using list_cut_position(). On the allocation side though, when
manipulating the global free_list under zone->lock, this is a big
problem since pages there are freed from different CPUs and the cache
could be cold for the allocating CPU. That is why I proposed clustered
allocation some time ago as an RFC patch, where list_cut_position() is
so good that it could eliminate the cacheline miss issue since we do not
need to iterate cold pages one by one.

I wish there is a data structure that has the flexibility of list while
at the same time we can locate the Nth element in the list without the
need to iterate. That's what I'm looking for when developing clustered
allocation for order 0 pages. In the end, I had to use another place to
record where the Nth element is. I hope to send out v2 of that RFC
series soon but I'm still collecting data for it. I would appreciate if
people could take a look then :-)


Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

2018-03-12 Thread Vlastimil Babka
On 03/01/2018 07:28 AM, Aaron Lu wrote:
> When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
> the zone->lock is held and then pages are chosen from PCP's migratetype
> list. While there is actually no need to do this 'choose part' under
> lock since it's PCP pages, the only CPU that can touch them is us and
> irq is also disabled.
> 
> Moving this part outside could reduce lock held time and improve
> performance. Test with will-it-scale/page_fault1 full load:
> 
> kernel  Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
> v4.16-rc2+  9034215        7971818       13667135       15677465
> this patch  9536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> 
> What the test does is: starts $nr_cpu processes and each will repeatedly
> do the following for 5 minutes:
> 1 mmap 128M anonymouse space;
> 2 write access to that space;
> 3 munmap.
> The score is the aggregated iteration.
> 
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
> 
> Acked-by: Mel Gorman 
> Signed-off-by: Aaron Lu 
> ---
>  mm/page_alloc.c | 39 +++
>  1 file changed, 23 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index faa33eac1635..dafdcdec9c1f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1116,12 +1116,10 @@ static void free_pcppages_bulk(struct zone *zone, int 
> count,
>   int migratetype = 0;
>   int batch_free = 0;
>   bool isolated_pageblocks;
> -
> - spin_lock(&zone->lock);
> - isolated_pageblocks = has_isolate_pageblock(zone);
> + struct page *page, *tmp;
> + LIST_HEAD(head);
>  
>   while (count) {
> - struct page *page;
>   struct list_head *list;
>  
>   /*
> @@ -1143,27 +1141,36 @@ static void free_pcppages_bulk(struct zone *zone, int 
> count,
>   batch_free = count;
>  
>   do {
> - int mt; /* migratetype of the to-be-freed page */
> -
>   page = list_last_entry(list, struct page, lru);
> - /* must delete as __free_one_page list manipulates */
> + /* must delete to avoid corrupting pcp list */
>   list_del(&page->lru);

Well, since bulkfree_pcp_prepare() doesn't care about page->lru, you
could maybe use list_move_tail() instead of list_del() +
list_add_tail()? That avoids temporarily writing poison values.

Hm actually, you are reversing the list in the process, because page is
obtained by list_last_entry and you use list_add_tail. That could have
unintended performance consequences?

Also maybe list_cut_position() could be faster than shuffling pages one
by one? I guess not really, because batch_free will be generally low?

>   pcp->count--;
>  
> - mt = get_pcppage_migratetype(page);
> - /* MIGRATE_ISOLATE page should not go to pcplists */
> - VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> - /* Pageblock could have been isolated meanwhile */
> - if (unlikely(isolated_pageblocks))
> - mt = get_pageblock_migratetype(page);
> -
>   if (bulkfree_pcp_prepare(page))
>   continue;
>  
> - __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> - trace_mm_page_pcpu_drain(page, 0, mt);
> + list_add_tail(&page->lru, &head);
>   } while (--count && --batch_free && !list_empty(list));
>   }
> +
> + spin_lock(&zone->lock);
> + isolated_pageblocks = has_isolate_pageblock(zone);
> +
> + /*
> +  * Use safe version since after __free_one_page(),
> +  * page->lru.next will not point to original list.
> +  */
> + list_for_each_entry_safe(page, tmp, &head, lru) {
> + int mt = get_pcppage_migratetype(page);
> + /* MIGRATE_ISOLATE page should not go to pcplists */
> + VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> + /* Pageblock could have been isolated meanwhile */
> + if (unlikely(isolated_pageblocks))
> + mt = get_pageblock_migratetype(page);
> +
> + __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> + trace_mm_page_pcpu_drain(page, 0, mt);
> + }
>   spin_unlock(&zone->lock);
>  }
>  
> 



Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

2018-03-02 Thread Dave Hansen
On 03/02/2018 01:23 PM, Andrew Morton wrote:
>> On my Sandybridge desktop, with will-it-scale/page_fault1/single process
>> run to emulate uniprocessor system, the score is(average of 3 runs):
>>
>> base(patch 1/3): 649710 
>> this patch:  653554 +0.6%
> Does that mean we got faster or slower?

Faster.  The unit of measurement here is iterations through a loop.
More iterations are better.



Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

2018-03-02 Thread Andrew Morton
On Fri, 2 Mar 2018 16:01:25 +0800 Aaron Lu  wrote:

> On Thu, Mar 01, 2018 at 04:01:05PM -0800, Andrew Morton wrote:
> > On Thu,  1 Mar 2018 14:28:44 +0800 Aaron Lu  wrote:
> > 
> > > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
> > > the zone->lock is held and then pages are chosen from PCP's migratetype
> > > list. While there is actually no need to do this 'choose part' under
> > > lock since it's PCP pages, the only CPU that can touch them is us and
> > > irq is also disabled.
> > > 
> > > Moving this part outside could reduce lock held time and improve
> > > performance. Test with will-it-scale/page_fault1 full load:
> > > 
> > > kernel  Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
> > > v4.16-rc2+  9034215        7971818       13667135       15677465
> > > this patch  9536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> > > 
> > > What the test does is: starts $nr_cpu processes and each will repeatedly
> > > do the following for 5 minutes:
> > > 1 mmap 128M anonymouse space;
> > > 2 write access to that space;
> > > 3 munmap.
> > > The score is the aggregated iteration.
> > 
> > But it's a loss for uniprocessor systems: it adds more code and adds an
> > additional pass across a list.
> 
> Performance wise, I assume the loss is pretty small and can not
> be measured.
> 
> On my Sandybridge desktop, with will-it-scale/page_fault1/single process
> run to emulate uniprocessor system, the score is(average of 3 runs):
> 
> base(patch 1/3):  649710 
> this patch:   653554 +0.6%

Does that mean we got faster or slower?

> prefetch(patch 3/3):  650336 (in noise range compared to base)
> 
> On 4 sockets Intel Broadwell with will-it-scale/page_fault1/single
> process run:
> 
> base(patch 1/3):  498649
> this patch:   504171 +1.1%
> prefetch(patch 3/3):  506334 +1.5% (compared to base)
> 
> It looks like we don't need to worry too much about performance for
> uniprocessor system.

Well.  We can say that of hundreds of patches.  And we end up with a
fatter and slower kernel than we otherwise would.

Please take a look, see if there's a tidy way of avoiding this. 
Probably there isn't, in which case oh well.  But let's at least try.



Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

2018-03-02 Thread Dave Hansen
On 03/01/2018 11:15 PM, Aaron Lu wrote:
> 
>> I am still quite surprised that this would have such a large impact.
> Most likely due to the cachelines for these page structures are warmed
> up outside of zone->lock.

The workload here is a pretty tight microbenchmark and the single
biggest bottleneck is cache misses on 'struct page'.  It's not memory
bandwidth bound.  So, anything you can do to keep the CPU fed and not
waiting on cache misses will be a win.

There's never going to be a real-world workload that sees this kind of
increase, but the change in the micro isn't super-surprising because it
so directly targets the bottleneck.


Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

2018-03-02 Thread Aaron Lu
On Thu, Mar 01, 2018 at 04:01:05PM -0800, Andrew Morton wrote:
> On Thu,  1 Mar 2018 14:28:44 +0800 Aaron Lu  wrote:
> 
> > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
> > the zone->lock is held and then pages are chosen from PCP's migratetype
> > list. While there is actually no need to do this 'choose part' under
> > lock since it's PCP pages, the only CPU that can touch them is us and
> > irq is also disabled.
> > 
> > Moving this part outside could reduce lock held time and improve
> > performance. Test with will-it-scale/page_fault1 full load:
> > 
> > kernel  Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
> > v4.16-rc2+  9034215        7971818       13667135       15677465
> > this patch  9536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> > 
> > What the test does is: starts $nr_cpu processes and each will repeatedly
> > do the following for 5 minutes:
> > 1 mmap 128M anonymouse space;
> > 2 write access to that space;
> > 3 munmap.
> > The score is the aggregated iteration.
> 
> But it's a loss for uniprocessor systems: it adds more code and adds an
> additional pass across a list.

Performance-wise, I assume the loss is pretty small and cannot
be measured.

On my Sandybridge desktop, with will-it-scale/page_fault1/single process
run to emulate uniprocessor system, the score is(average of 3 runs):

base(patch 1/3):     649710
this patch:          653554 +0.6%
prefetch(patch 3/3): 650336 (in noise range compared to base)

On 4 sockets Intel Broadwell with will-it-scale/page_fault1/single
process run:

base(patch 1/3):     498649
this patch:          504171 +1.1%
prefetch(patch 3/3): 506334 +1.5% (compared to base)

It looks like we don't need to worry too much about performance for
uniprocessor system.


Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

2018-03-01 Thread Huang, Ying
Michal Hocko  writes:

> On Thu 01-03-18 14:28:44, Aaron Lu wrote:
>> When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
>> the zone->lock is held and then pages are chosen from PCP's migratetype
>> list. While there is actually no need to do this 'choose part' under
>> lock since it's PCP pages, the only CPU that can touch them is us and
>> irq is also disabled.
>> 
>> Moving this part outside could reduce lock held time and improve
>> performance. Test with will-it-scale/page_fault1 full load:
>> 
>> kernel  Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
>> v4.16-rc2+  9034215        7971818       13667135       15677465
>> this patch  9536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
>> 
>> What the test does is: starts $nr_cpu processes and each will repeatedly
>> do the following for 5 minutes:
>> 1 mmap 128M anonymouse space;
>> 2 write access to that space;
>> 3 munmap.
>> The score is the aggregated iteration.
>
> Iteration count I assume. I am still quite surprised that this would
> have such a large impact.

The test is run at full load, which means nearly 100 or more processes
will allocate memory in parallel.  According to Amdahl's law, the
performance of a parallel program is dominated by its serial part,
which in this case is the part protected by zone->lock.  So small
changes to the code under zone->lock can make bigger changes to the
overall score.
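
As a rough worked illustration (the numbers below are hypothetical, not
measured): Amdahl's law gives

	speedup(N) = 1 / ((1 - p) + p / N)

so with a parallel fraction p = 0.95 and N = 100 processes the speedup
is about 1 / (0.05 + 0.0095) ~= 16.8, and it can never exceed
1 / 0.05 = 20 no matter how large N grows.  Shrinking the zone->lock
hold time shrinks that serial 5%, which is why a small change there can
move the aggregate score noticeably.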

Best Regards,
Huang, Ying


Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

2018-03-01 Thread Aaron Lu
On Thu, Mar 01, 2018 at 02:55:18PM +0100, Michal Hocko wrote:
> On Thu 01-03-18 14:28:44, Aaron Lu wrote:
> > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
> > the zone->lock is held and then pages are chosen from PCP's migratetype
> > list. While there is actually no need to do this 'choose part' under
> > lock since it's PCP pages, the only CPU that can touch them is us and
> > irq is also disabled.
> > 
> > Moving this part outside could reduce lock held time and improve
> > performance. Test with will-it-scale/page_fault1 full load:
> > 
> > kernel  Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
> > v4.16-rc2+  9034215        7971818       13667135       15677465
> > this patch  9536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> > 
> > What the test does is: starts $nr_cpu processes and each will repeatedly
> > do the following for 5 minutes:
> > 1 mmap 128M anonymouse space;
> > 2 write access to that space;
> > 3 munmap.
> > The score is the aggregated iteration.
> 
> Iteration count I assume.

Correct.

> I am still quite surprised that this would have such a large impact.

Most likely because the cachelines for these page structures are warmed
up outside of zone->lock.


Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

2018-03-01 Thread Andrew Morton
On Thu,  1 Mar 2018 14:28:44 +0800 Aaron Lu  wrote:

> When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
> the zone->lock is held and then pages are chosen from PCP's migratetype
> list. While there is actually no need to do this 'choose part' under
> lock since it's PCP pages, the only CPU that can touch them is us and
> irq is also disabled.
> 
> Moving this part outside could reduce lock held time and improve
> performance. Test with will-it-scale/page_fault1 full load:
> 
> kernel  Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
> v4.16-rc2+  9034215        7971818       13667135       15677465
> this patch  9536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> 
> What the test does is: starts $nr_cpu processes and each will repeatedly
> do the following for 5 minutes:
> 1 mmap 128M anonymouse space;
> 2 write access to that space;
> 3 munmap.
> The score is the aggregated iteration.

But it's a loss for uniprocessor systems: it adds more code and adds an
additional pass across a list.


Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

2018-03-01 Thread Michal Hocko
On Thu 01-03-18 14:28:44, Aaron Lu wrote:
> When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
> the zone->lock is held and then pages are chosen from PCP's migratetype
> list. While there is actually no need to do this 'choose part' under
> lock since it's PCP pages, the only CPU that can touch them is us and
> irq is also disabled.
> 
> Moving this part outside could reduce lock held time and improve
> performance. Test with will-it-scale/page_fault1 full load:
> 
> kernel  Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
> v4.16-rc2+  9034215        7971818       13667135       15677465
> this patch  9536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> 
> What the test does is: starts $nr_cpu processes and each will repeatedly
> do the following for 5 minutes:
> 1 mmap 128M anonymouse space;
> 2 write access to that space;
> 3 munmap.
> The score is the aggregated iteration.

Iteration count I assume. I am still quite surprised that this would
have such a large impact.

> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
> 
> Acked-by: Mel Gorman 
> Signed-off-by: Aaron Lu 

The patch makes sense to me
Acked-by: Michal Hocko 

> ---
>  mm/page_alloc.c | 39 +++
>  1 file changed, 23 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index faa33eac1635..dafdcdec9c1f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1116,12 +1116,10 @@ static void free_pcppages_bulk(struct zone *zone, int 
> count,
>   int migratetype = 0;
>   int batch_free = 0;
>   bool isolated_pageblocks;
> -
> - spin_lock(&zone->lock);
> - isolated_pageblocks = has_isolate_pageblock(zone);
> + struct page *page, *tmp;
> + LIST_HEAD(head);
>  
>   while (count) {
> - struct page *page;
>   struct list_head *list;
>  
>   /*
> @@ -1143,27 +1141,36 @@ static void free_pcppages_bulk(struct zone *zone, int 
> count,
>   batch_free = count;
>  
>   do {
> - int mt; /* migratetype of the to-be-freed page */
> -
>   page = list_last_entry(list, struct page, lru);
> - /* must delete as __free_one_page list manipulates */
> + /* must delete to avoid corrupting pcp list */
>   list_del(&page->lru);
>   pcp->count--;
>  
> - mt = get_pcppage_migratetype(page);
> - /* MIGRATE_ISOLATE page should not go to pcplists */
> - VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> - /* Pageblock could have been isolated meanwhile */
> - if (unlikely(isolated_pageblocks))
> - mt = get_pageblock_migratetype(page);
> -
>   if (bulkfree_pcp_prepare(page))
>   continue;
>  
> - __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> - trace_mm_page_pcpu_drain(page, 0, mt);
> + list_add_tail(&page->lru, &head);
>   } while (--count && --batch_free && !list_empty(list));
>   }
> +
> + spin_lock(&zone->lock);
> + isolated_pageblocks = has_isolate_pageblock(zone);
> +
> + /*
> +  * Use safe version since after __free_one_page(),
> +  * page->lru.next will not point to original list.
> +  */
> + list_for_each_entry_safe(page, tmp, &head, lru) {
> + int mt = get_pcppage_migratetype(page);
> + /* MIGRATE_ISOLATE page should not go to pcplists */
> + VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> + /* Pageblock could have been isolated meanwhile */
> + if (unlikely(isolated_pageblocks))
> + mt = get_pageblock_migratetype(page);
> +
> + __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> + trace_mm_page_pcpu_drain(page, 0, mt);
> + }
>   spin_unlock(&zone->lock);
>  }
>  
> -- 
> 2.14.3
> 

-- 
Michal Hocko
SUSE Labs


[PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

2018-02-28 Thread Aaron Lu
When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
the zone->lock is held and then pages are chosen from PCP's migratetype
list. While there is actually no need to do this 'choose part' under
lock since it's PCP pages, the only CPU that can touch them is us and
irq is also disabled.

Moving this part outside could reduce lock held time and improve
performance. Test with will-it-scale/page_fault1 full load:

kernel  Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
v4.16-rc2+  9034215        7971818       13667135       15677465
this patch  9536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%

What the test does is: starts $nr_cpu processes and each will repeatedly
do the following for 5 minutes:
1 mmap 128M anonymous space;
2 write access to that space;
3 munmap.
The score is the aggregated iteration.

https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c

Acked-by: Mel Gorman 
Signed-off-by: Aaron Lu 
---
 mm/page_alloc.c | 39 +++
 1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index faa33eac1635..dafdcdec9c1f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1116,12 +1116,10 @@ static void free_pcppages_bulk(struct zone *zone, int 
count,
int migratetype = 0;
int batch_free = 0;
bool isolated_pageblocks;
-
-   spin_lock(&zone->lock);
-   isolated_pageblocks = has_isolate_pageblock(zone);
+   struct page *page, *tmp;
+   LIST_HEAD(head);
 
while (count) {
-   struct page *page;
struct list_head *list;
 
/*
@@ -1143,27 +1141,36 @@ static void free_pcppages_bulk(struct zone *zone, int 
count,
batch_free = count;
 
do {
-   int mt; /* migratetype of the to-be-freed page */
-
page = list_last_entry(list, struct page, lru);
-   /* must delete as __free_one_page list manipulates */
+   /* must delete to avoid corrupting pcp list */
list_del(&page->lru);
pcp->count--;
 
-   mt = get_pcppage_migratetype(page);
-   /* MIGRATE_ISOLATE page should not go to pcplists */
-   VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
-   /* Pageblock could have been isolated meanwhile */
-   if (unlikely(isolated_pageblocks))
-   mt = get_pageblock_migratetype(page);
-
if (bulkfree_pcp_prepare(page))
continue;
 
-   __free_one_page(page, page_to_pfn(page), zone, 0, mt);
-   trace_mm_page_pcpu_drain(page, 0, mt);
+   list_add_tail(&page->lru, &head);
} while (--count && --batch_free && !list_empty(list));
}
+
+   spin_lock(&zone->lock);
+   isolated_pageblocks = has_isolate_pageblock(zone);
+
+   /*
+* Use safe version since after __free_one_page(),
+* page->lru.next will not point to original list.
+*/
+   list_for_each_entry_safe(page, tmp, &head, lru) {
+   int mt = get_pcppage_migratetype(page);
+   /* MIGRATE_ISOLATE page should not go to pcplists */
+   VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
+   /* Pageblock could have been isolated meanwhile */
+   if (unlikely(isolated_pageblocks))
+   mt = get_pageblock_migratetype(page);
+
+   __free_one_page(page, page_to_pfn(page), zone, 0, mt);
+   trace_mm_page_pcpu_drain(page, 0, mt);
+   }
spin_unlock(&zone->lock);
 }
 
-- 
2.14.3
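
For reference, a minimal sketch of the loop each page_fault1 worker
repeats, simplified from the linked will-it-scale test; the function
name, the fixed 4096-byte page size and the error handling here are
assumptions, and the real test's iteration counting is omitted:

	#include <stdlib.h>
	#include <sys/mman.h>

	#define MEMSIZE	(128UL * 1024 * 1024)	/* 128M anonymous mapping */

	static void one_iteration(void)
	{
		char *p = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			exit(1);

		/* write to every page so each one is faulted in, then freed */
		for (unsigned long off = 0; off < MEMSIZE; off += 4096)
			p[off] = 1;

		munmap(p, MEMSIZE);
	}

$nr_cpu processes each run this loop repeatedly for 5 minutes and the
reported score is the aggregated iteration count across all of them.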


