Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

2018-03-06 Thread Matthew Wilcox
On Tue, Mar 06, 2018 at 08:27:33PM +0800, Aaron Lu wrote:
> On Tue, Mar 06, 2018 at 08:55:57AM +0100, Vlastimil Babka wrote:
> > So the adjacent line prefetch might be disabled? Could you check bios or
> > the MSR mentioned in
> > https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
> 
> root@lkp-bdw-ep2 ~# rdmsr 0x1a4
> 0

Technically 0x1a4 is per-core, so you should run rdmsr -a 0x1a4 in order to
check all the cores.  But I can't imagine they're being set differently on
each core.

> > instructions (calculated from itlb misses and insns-per-itlb-miss) shows
> > less than 1% increase, so dunno. And the improvement comes from reduced
> > dTLB-load-misses? That makes no sense for order-0 buddy struct pages
> > which always share a page. And the memmap mapping should use huge pages.
> 
> THP is disabled to stress order 0 pages(should have mentioned this in
> patch's description, sorry about this).

THP isn't related to memmap; the kernel uses huge pages (usually the 1G
pages) in order to map its own memory.



Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

2018-03-06 Thread Aaron Lu
On Tue, Mar 06, 2018 at 08:55:57AM +0100, Vlastimil Babka wrote:
> On 03/05/2018 12:41 PM, Aaron Lu wrote:
> > On Fri, Mar 02, 2018 at 06:55:25PM +0100, Vlastimil Babka wrote:
> >> On 03/01/2018 03:00 PM, Michal Hocko wrote:
> >>>
> >>> I am really surprised that this has such a big impact.
> >>
> >> It's even stranger to me. Struct page is 64 bytes these days, exactly a
> >> a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache
> >> line (that forms an aligned 128 bytes block with the one we touch).
> >> Which is exactly a order-0 buddy struct page! Maybe that implicit
> >> prefetching stopped at L2 and explicit goes all the way to L1, can't
> > 
> > The Intel Architecture Optimization Manual section 7.3.2 says:
> > 
> > prefetchT0 - fetch data into all cache levels
> > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> > microarchitectures: 1st, 2nd and 3rd level cache.
> > 
> > prefetchT2 - fetch data into 2nd and 3rd level caches (identical to
> > prefetchT1)
> > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> > microarchitectures: 2nd and 3rd level cache.
> > 
> > prefetchNTA - fetch data into non-temporal cache close to the processor,
> > minimizing cache pollution
> > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> > microarchitectures: must fetch into 3rd level cache with fast replacement.
> > 
> > I tried 'prefetcht0' and 'prefetcht2' instead of the default
> > 'prefetchNTA' on a 2 sockets Intel Skylake, the two ended up with about
> > the same performance number as prefetchNTA. I had expected prefetchT0 to
> > deliver a better score if it was indeed due to L1D since prefetchT2 will
> > not place data into L1 while prefetchT0 will, but looks like it is not
> > the case here.
> > 
> > It feels more like the buddy cacheline isn't in any level of the caches
> > without prefetch for some reason.
> 
> So the adjacent line prefetch might be disabled? Could you check bios or
> the MSR mentioned in
> https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors

root@lkp-bdw-ep2 ~# rdmsr 0x1a4
0

Looks like this feature isn't disabled(the doc you linked says value 1
means disable).

> >> remember. Would that make such a difference? It would be nice to do some
> >> perf tests with cache counters to see what is really going on...
> > 
> > Compare prefetchT2 to no-prefetch, I saw these metrics change:
> > 
> > no-prefetch  change  prefetchT2   metrics
> > \  \
> >stddev stddev
> > 
> >   0.18+0.00.18perf-stat.branch-miss-rate%   
> > 
> >  8.268e+09+3.8%  8.585e+09perf-stat.branch-misses   
> > 
> >  2.333e+10+4.7%  2.443e+10perf-stat.cache-misses
> > 
> >  2.402e+11+5.0%  2.522e+11perf-stat.cache-references
> > 
> >   3.52-1.1%   3.48perf-stat.cpi 
> > 
> >   0.02-0.00.01 ±3%
> > perf-stat.dTLB-load-miss-rate%   
> >  8.677e+08-7.3%  8.048e+08 ±3%perf-stat.dTLB-load-misses
> > 
> >   1.18+0.01.19
> > perf-stat.dTLB-store-miss-rate% 
> >  2.359e+10+6.0%  2.502e+10perf-stat.dTLB-store-misses   
> > 
> >  1.979e+12+5.0%  2.078e+12perf-stat.dTLB-stores 
> > 
> >  6.126e+09   +10.1%  6.745e+09 ±3%perf-stat.iTLB-load-misses
> > 
> >   3464-8.4%   3172 ±3%
> > perf-stat.instructions-per-iTLB-miss
> >   0.28+1.1%   0.29perf-stat.ipc 
> > 
> >  2.929e+09+5.1%  3.077e+09perf-stat.minor-faults
> >  
> >  9.244e+09+4.7%  9.681e+09perf-stat.node-loads  
> > 
> >  2.491e+08+5.8%  2.634e+08perf-stat.node-store-misses   
> > 
> >  6.472e+09+6.1%  6.869e+09perf-stat.node-stores 
> > 
> >  2.929e+09+5.1%  3.077e+09perf-stat.page-faults 
> >  
> >2182469-4.2%2090977perf-stat.path-length
> > 
> > Not sure if this is useful though...
> 
> Looks like most stats increased in absolute values as the work done
> increased and this is a time-limited benchmark? 

Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

2018-03-05 Thread Vlastimil Babka
On 03/05/2018 12:41 PM, Aaron Lu wrote:
> On Fri, Mar 02, 2018 at 06:55:25PM +0100, Vlastimil Babka wrote:
>> On 03/01/2018 03:00 PM, Michal Hocko wrote:
>>>
>>> I am really surprised that this has such a big impact.
>>
>> It's even stranger to me. Struct page is 64 bytes these days, exactly a
>> a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache
>> line (that forms an aligned 128 bytes block with the one we touch).
>> Which is exactly a order-0 buddy struct page! Maybe that implicit
>> prefetching stopped at L2 and explicit goes all the way to L1, can't
> 
> The Intel Architecture Optimization Manual section 7.3.2 says:
> 
> prefetchT0 - fetch data into all cache levels
> Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> microarchitectures: 1st, 2nd and 3rd level cache.
> 
> prefetchT2 - fetch data into 2nd and 3rd level caches (identical to
> prefetchT1)
> Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> microarchitectures: 2nd and 3rd level cache.
> 
> prefetchNTA - fetch data into non-temporal cache close to the processor,
> minimizing cache pollution
> Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> microarchitectures: must fetch into 3rd level cache with fast replacement.
> 
> I tried 'prefetcht0' and 'prefetcht2' instead of the default
> 'prefetchNTA' on a 2 sockets Intel Skylake, the two ended up with about
> the same performance number as prefetchNTA. I had expected prefetchT0 to
> deliver a better score if it was indeed due to L1D since prefetchT2 will
> not place data into L1 while prefetchT0 will, but looks like it is not
> the case here.
> 
> It feels more like the buddy cacheline isn't in any level of the caches
> without prefetch for some reason.

So the adjacent line prefetch might be disabled? Could you check bios or
the MSR mentioned in
https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors

>> remember. Would that make such a difference? It would be nice to do some
>> perf tests with cache counters to see what is really going on...
> 
> Compare prefetchT2 to no-prefetch, I saw these metrics change:
> 
> no-prefetch  change  prefetchT2   metrics
> \  \
>  stddev stddev
> 
>   0.18+0.00.18perf-stat.branch-miss-rate% 
>   
>  8.268e+09+3.8%  8.585e+09perf-stat.branch-misses 
>   
>  2.333e+10+4.7%  2.443e+10perf-stat.cache-misses  
>   
>  2.402e+11+5.0%  2.522e+11perf-stat.cache-references  
>   
>   3.52-1.1%   3.48perf-stat.cpi   
>   
>   0.02-0.00.01 ±3%perf-stat.dTLB-load-miss-rate%  
>  
>  8.677e+08-7.3%  8.048e+08 ±3%perf-stat.dTLB-load-misses  
>   
>   1.18+0.01.19perf-stat.dTLB-store-miss-rate% 
> 
>  2.359e+10+6.0%  2.502e+10perf-stat.dTLB-store-misses 
>   
>  1.979e+12+5.0%  2.078e+12perf-stat.dTLB-stores   
>   
>  6.126e+09   +10.1%  6.745e+09 ±3%perf-stat.iTLB-load-misses  
>   
>   3464-8.4%   3172 ±3%
> perf-stat.instructions-per-iTLB-miss
>   0.28+1.1%   0.29perf-stat.ipc   
>   
>  2.929e+09+5.1%  3.077e+09perf-stat.minor-faults  
>
>  9.244e+09+4.7%  9.681e+09perf-stat.node-loads
>   
>  2.491e+08+5.8%  2.634e+08perf-stat.node-store-misses 
>   
>  6.472e+09+6.1%  6.869e+09perf-stat.node-stores   
>   
>  2.929e+09+5.1%  3.077e+09perf-stat.page-faults   
>
>2182469-4.2%2090977perf-stat.path-length
> 
> Not sure if this is useful though...

Looks like most stats increased in absolute values as the work done
increased and this is a time-limited benchmark? Although number of
instructions (calculated from itlb misses and insns-per-itlb-miss) shows
less than 1% increase, so dunno. And the improvement comes from reduced
dTLB-load-misses? That makes no sense for order-0 buddy struct pages
which always share a page. And the memmap mapping should use huge pages.
BTW what is path-length?



Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

2018-03-05 Thread Aaron Lu
On Mon, Mar 05, 2018 at 07:41:59PM +0800, Aaron Lu wrote:
> On Fri, Mar 02, 2018 at 06:55:25PM +0100, Vlastimil Babka wrote:
> > On 03/01/2018 03:00 PM, Michal Hocko wrote:
> > > On Thu 01-03-18 14:28:45, Aaron Lu wrote:
> > >> When a page is freed back to the global pool, its buddy will be checked
> > >> to see if it's possible to do a merge. This requires accessing buddy's
> > >> page structure and that access could take a long time if it's cache cold.
> > >>
> > >> This patch adds a prefetch to the to-be-freed page's buddy outside of
> > >> zone->lock in hope of accessing buddy's page structure later under
> > >> zone->lock will be faster. Since we *always* do buddy merging and check
> > >> an order-0 page's buddy to try to merge it when it goes into the main
> > >> allocator, the cacheline will always come in, i.e. the prefetched data
> > >> will never be unused.
> > >>
> > >> In the meantime, there are two concerns:
> > >> 1 the prefetch could potentially evict existing cachelines, especially
> > >>   for L1D cache since it is not huge;
> > >> 2 there is some additional instruction overhead, namely calculating
> > >>   buddy pfn twice.
> > >>
> > >> For 1, it's hard to say, this microbenchmark though shows good result but
> > >> the actual benefit of this patch will be workload/CPU dependant;
> > >> For 2, since the calculation is a XOR on two local variables, it's 
> > >> expected
> > >> in many cases that cycles spent will be offset by reduced memory latency
> > >> later. This is especially true for NUMA machines where multiple CPUs are
> > >> contending on zone->lock and the most time consuming part under 
> > >> zone->lock
> > >> is the wait of 'struct page' cacheline of the to-be-freed pages and their
> > >> buddies.
> > >>
> > >> Test with will-it-scale/page_fault1 full load:
> > >>
> > >> kernel  Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
> > >> v4.16-rc2+  90342157971818   13667135   15677465
> > >> patch2/39536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> > >> this patch 10338868 +8.4%  8544477 +2.8% 14839808 +5.5% 17155464 +2.9%
> > >> Note: this patch's performance improvement percent is against patch2/3.
> > > 
> > > I am really surprised that this has such a big impact.
> > 
> > It's even stranger to me. Struct page is 64 bytes these days, exactly a
> > a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache
> > line (that forms an aligned 128 bytes block with the one we touch).
> > Which is exactly a order-0 buddy struct page! Maybe that implicit
> > prefetching stopped at L2 and explicit goes all the way to L1, can't
> 
> The Intel Architecture Optimization Manual section 7.3.2 says:
> 
> prefetchT0 - fetch data into all cache levels
> Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> microarchitectures: 1st, 2nd and 3rd level cache.
> 
> prefetchT2 - fetch data into 2nd and 3rd level caches (identical to
> prefetchT1)
> Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> microarchitectures: 2nd and 3rd level cache.
> 
> prefetchNTA - fetch data into non-temporal cache close to the processor,
> minimizing cache pollution
> Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> microarchitectures: must fetch into 3rd level cache with fast replacement.
> 
> I tried 'prefetcht0' and 'prefetcht2' instead of the default
> 'prefetchNTA' on a 2 sockets Intel Skylake, the two ended up with about
 ~~~
   Correction: should be Broadwell here.

> the same performance number as prefetchNTA. I had expected prefetchT0 to
> deliver a better score if it was indeed due to L1D since prefetchT2 will
> not place data into L1 while prefetchT0 will, but looks like it is not
> the case here.
> 
> It feels more like the buddy cacheline isn't in any level of the caches
> without prefetch for some reason.


Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

2018-03-05 Thread Aaron Lu
On Fri, Mar 02, 2018 at 06:55:25PM +0100, Vlastimil Babka wrote:
> On 03/01/2018 03:00 PM, Michal Hocko wrote:
> > On Thu 01-03-18 14:28:45, Aaron Lu wrote:
> >> When a page is freed back to the global pool, its buddy will be checked
> >> to see if it's possible to do a merge. This requires accessing buddy's
> >> page structure and that access could take a long time if it's cache cold.
> >>
> >> This patch adds a prefetch to the to-be-freed page's buddy outside of
> >> zone->lock in hope of accessing buddy's page structure later under
> >> zone->lock will be faster. Since we *always* do buddy merging and check
> >> an order-0 page's buddy to try to merge it when it goes into the main
> >> allocator, the cacheline will always come in, i.e. the prefetched data
> >> will never be unused.
> >>
> >> In the meantime, there are two concerns:
> >> 1 the prefetch could potentially evict existing cachelines, especially
> >>   for L1D cache since it is not huge;
> >> 2 there is some additional instruction overhead, namely calculating
> >>   buddy pfn twice.
> >>
> >> For 1, it's hard to say, this microbenchmark though shows good result but
> >> the actual benefit of this patch will be workload/CPU dependant;
> >> For 2, since the calculation is a XOR on two local variables, it's expected
> >> in many cases that cycles spent will be offset by reduced memory latency
> >> later. This is especially true for NUMA machines where multiple CPUs are
> >> contending on zone->lock and the most time consuming part under zone->lock
> >> is the wait of 'struct page' cacheline of the to-be-freed pages and their
> >> buddies.
> >>
> >> Test with will-it-scale/page_fault1 full load:
> >>
> >> kernel  Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
> >> v4.16-rc2+  90342157971818   13667135   15677465
> >> patch2/39536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> >> this patch 10338868 +8.4%  8544477 +2.8% 14839808 +5.5% 17155464 +2.9%
> >> Note: this patch's performance improvement percent is against patch2/3.
> > 
> > I am really surprised that this has such a big impact.
> 
> It's even stranger to me. Struct page is 64 bytes these days, exactly a
> a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache
> line (that forms an aligned 128 bytes block with the one we touch).
> Which is exactly a order-0 buddy struct page! Maybe that implicit
> prefetching stopped at L2 and explicit goes all the way to L1, can't

The Intel Architecture Optimization Manual section 7.3.2 says:

prefetchT0 - fetch data into all cache levels
Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
microarchitectures: 1st, 2nd and 3rd level cache.

prefetchT2 - fetch data into 2nd and 3rd level caches (identical to
prefetchT1)
Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
microarchitectures: 2nd and 3rd level cache.

prefetchNTA - fetch data into non-temporal cache close to the processor,
minimizing cache pollution
Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
microarchitectures: must fetch into 3rd level cache with fast replacement.

I tried 'prefetcht0' and 'prefetcht2' instead of the default
'prefetchNTA' on a 2 sockets Intel Skylake, the two ended up with about
the same performance number as prefetchNTA. I had expected prefetchT0 to
deliver a better score if it was indeed due to L1D since prefetchT2 will
not place data into L1 while prefetchT0 will, but looks like it is not
the case here.

It feels more like the buddy cacheline isn't in any level of the caches
without prefetch for some reason.

> remember. Would that make such a difference? It would be nice to do some
> perf tests with cache counters to see what is really going on...

Compare prefetchT2 to no-prefetch, I saw these metrics change:

no-prefetch  change  prefetchT2   metrics
\  \
   stddev stddev

  0.18+0.00.18perf-stat.branch-miss-rate%   

 8.268e+09+3.8%  8.585e+09perf-stat.branch-misses   

 2.333e+10+4.7%  2.443e+10perf-stat.cache-misses

 2.402e+11+5.0%  2.522e+11perf-stat.cache-references

  3.52-1.1%   3.48perf-stat.cpi 

  0.02-0.00.01 ±3%perf-stat.dTLB-load-miss-rate%
   
 8.677e+08-7.3%  8.048e+08 ±3%perf-stat.dTLB-load-misses

  1.18+0.01.19perf-stat.dTLB-store-miss-rate%   
  
 2.359e+10+6.0%  2.502e+10

Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

2018-03-02 Thread Vlastimil Babka
On 03/02/2018 07:00 PM, Dave Hansen wrote:
> On 03/02/2018 09:55 AM, Vlastimil Babka wrote:
>> It's even stranger to me. Struct page is 64 bytes these days, exactly a
>> a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache
>> line (that forms an aligned 128 bytes block with the one we touch).
> 
> I believe that was a behavior that was specific to the Pentium 4
> "Netburst" era.  I don't think the 128-byte line behavior exists on
> modern Intel cpus.

I remember it on Core 2 something (Nehalem IIRC). And this page suggests
up to Broadwell, and it can be disabled. And it's an L2 prefetcher indeed.
https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors




Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

2018-03-02 Thread Dave Hansen
On 03/02/2018 09:55 AM, Vlastimil Babka wrote:
> It's even stranger to me. Struct page is 64 bytes these days, exactly a
> a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache
> line (that forms an aligned 128 bytes block with the one we touch).

I believe that was a behavior that was specific to the Pentium 4
"Netburst" era.  I don't think the 128-byte line behavior exists on
modern Intel cpus.


Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

2018-03-02 Thread Vlastimil Babka
On 03/01/2018 03:00 PM, Michal Hocko wrote:
> On Thu 01-03-18 14:28:45, Aaron Lu wrote:
>> When a page is freed back to the global pool, its buddy will be checked
>> to see if it's possible to do a merge. This requires accessing buddy's
>> page structure and that access could take a long time if it's cache cold.
>>
>> This patch adds a prefetch to the to-be-freed page's buddy outside of
>> zone->lock in hope of accessing buddy's page structure later under
>> zone->lock will be faster. Since we *always* do buddy merging and check
>> an order-0 page's buddy to try to merge it when it goes into the main
>> allocator, the cacheline will always come in, i.e. the prefetched data
>> will never be unused.
>>
>> In the meantime, there are two concerns:
>> 1 the prefetch could potentially evict existing cachelines, especially
>>   for L1D cache since it is not huge;
>> 2 there is some additional instruction overhead, namely calculating
>>   buddy pfn twice.
>>
>> For 1, it's hard to say, this microbenchmark though shows good result but
>> the actual benefit of this patch will be workload/CPU dependant;
>> For 2, since the calculation is a XOR on two local variables, it's expected
>> in many cases that cycles spent will be offset by reduced memory latency
>> later. This is especially true for NUMA machines where multiple CPUs are
>> contending on zone->lock and the most time consuming part under zone->lock
>> is the wait of 'struct page' cacheline of the to-be-freed pages and their
>> buddies.
>>
>> Test with will-it-scale/page_fault1 full load:
>>
>> kernel  Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
>> v4.16-rc2+  90342157971818   13667135   15677465
>> patch2/39536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
>> this patch 10338868 +8.4%  8544477 +2.8% 14839808 +5.5% 17155464 +2.9%
>> Note: this patch's performance improvement percent is against patch2/3.
> 
> I am really surprised that this has such a big impact.

It's even stranger to me. Struct page is 64 bytes these days, exactly a
a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache
line (that forms an aligned 128 bytes block with the one we touch).
Which is exactly a order-0 buddy struct page! Maybe that implicit
prefetching stopped at L2 and explicit goes all the way to L1, can't
remember. Would that make such a difference? It would be nice to do some
perf tests with cache counters to see what is really going on...

Vlastimil

> Is this a win on
> other architectures as well?
>  
>> [changelog stole from Dave Hansen and Mel Gorman's comments]
>> https://lkml.org/lkml/2018/1/24/551
> 
> Please use http://lkml.kernel.org/r/ for references because
> lkml.org is quite unstable. It would be
> http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8...@intel.com
> here.
> 



Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

2018-03-02 Thread Aaron Lu
On Thu, Mar 01, 2018 at 03:00:44PM +0100, Michal Hocko wrote:
> On Thu 01-03-18 14:28:45, Aaron Lu wrote:
> > When a page is freed back to the global pool, its buddy will be checked
> > to see if it's possible to do a merge. This requires accessing buddy's
> > page structure and that access could take a long time if it's cache cold.
> > 
> > This patch adds a prefetch to the to-be-freed page's buddy outside of
> > zone->lock in hope of accessing buddy's page structure later under
> > zone->lock will be faster. Since we *always* do buddy merging and check
> > an order-0 page's buddy to try to merge it when it goes into the main
> > allocator, the cacheline will always come in, i.e. the prefetched data
> > will never be unused.
> > 
> > In the meantime, there are two concerns:
> > 1 the prefetch could potentially evict existing cachelines, especially
> >   for L1D cache since it is not huge;
> > 2 there is some additional instruction overhead, namely calculating
> >   buddy pfn twice.
> > 
> > For 1, it's hard to say, this microbenchmark though shows good result but
> > the actual benefit of this patch will be workload/CPU dependant;
> > For 2, since the calculation is a XOR on two local variables, it's expected
> > in many cases that cycles spent will be offset by reduced memory latency
> > later. This is especially true for NUMA machines where multiple CPUs are
> > contending on zone->lock and the most time consuming part under zone->lock
> > is the wait of 'struct page' cacheline of the to-be-freed pages and their
> > buddies.
> > 
> > Test with will-it-scale/page_fault1 full load:
> > 
> > kernel  Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
> > v4.16-rc2+  90342157971818   13667135   15677465
> > patch2/39536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> > this patch 10338868 +8.4%  8544477 +2.8% 14839808 +5.5% 17155464 +2.9%
> > Note: this patch's performance improvement percent is against patch2/3.
> 
> I am really surprised that this has such a big impact.  Is this a win on
> other architectures as well?

For NUMA machines, I guess so. But I didn't test other archs so can't
say for sure.

>  
> > [changelog stole from Dave Hansen and Mel Gorman's comments]
> > https://lkml.org/lkml/2018/1/24/551
> 
> Please use http://lkml.kernel.org/r/ for references because
> lkml.org is quite unstable. It would be
> http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8...@intel.com
> here.

Good to know this, thanks!


Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

2018-03-02 Thread Aaron Lu
On Thu, Mar 01, 2018 at 04:09:50PM -0800, Andrew Morton wrote:
> On Thu,  1 Mar 2018 14:28:45 +0800 Aaron Lu  wrote:
> 
> > When a page is freed back to the global pool, its buddy will be checked
> > to see if it's possible to do a merge. This requires accessing buddy's
> > page structure and that access could take a long time if it's cache cold.
> > 
> > This patch adds a prefetch to the to-be-freed page's buddy outside of
> > zone->lock in hope of accessing buddy's page structure later under
> > zone->lock will be faster. Since we *always* do buddy merging and check
> > an order-0 page's buddy to try to merge it when it goes into the main
> > allocator, the cacheline will always come in, i.e. the prefetched data
> > will never be unused.
> > 
> > In the meantime, there are two concerns:
> > 1 the prefetch could potentially evict existing cachelines, especially
> >   for L1D cache since it is not huge;
> > 2 there is some additional instruction overhead, namely calculating
> >   buddy pfn twice.
> > 
> > For 1, it's hard to say, this microbenchmark though shows good result but
> > the actual benefit of this patch will be workload/CPU dependant;
> > For 2, since the calculation is a XOR on two local variables, it's expected
> > in many cases that cycles spent will be offset by reduced memory latency
> > later. This is especially true for NUMA machines where multiple CPUs are
> > contending on zone->lock and the most time consuming part under zone->lock
> > is the wait of 'struct page' cacheline of the to-be-freed pages and their
> > buddies.
> > 
> > Test with will-it-scale/page_fault1 full load:
> > 
> > kernel  Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
> > v4.16-rc2+  90342157971818   13667135   15677465
> > patch2/39536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> > this patch 10338868 +8.4%  8544477 +2.8% 14839808 +5.5% 17155464 +2.9%
> > Note: this patch's performance improvement percent is against patch2/3.
> > 
> > ...
> >
> > @@ -1150,6 +1153,18 @@ static void free_pcppages_bulk(struct zone *zone, 
> > int count,
> > continue;
> >  
> > list_add_tail(>lru, );
> > +
> > +   /*
> > +* We are going to put the page back to the global
> > +* pool, prefetch its buddy to speed up later access
> > +* under zone->lock. It is believed the overhead of
> > +* calculating buddy_pfn here can be offset by reduced
> > +* memory latency later.
> > +*/
> > +   pfn = page_to_pfn(page);
> > +   buddy_pfn = __find_buddy_pfn(pfn, 0);
> > +   buddy = page + (buddy_pfn - pfn);
> > +   prefetch(buddy);
> 
> What is the typical list length here?  Maybe it's approximately the pcp
> batch size which is typically 128 pages?

Most of time it is pcp->batch, unless when pcp's pages need to be
all drained like in drain_local_pages(zone).

The pcp->batch has a default value of 31 and its upper limit is 96 for
x86_64. For this test, it is 31 here, I didn't manipulate
/proc/sys/vm/percpu_pagelist_fraction to change it.

With this said, the count here could be pcp->count when pcp's pages
need to be all drained and though pcp->count's default value is
(6*pcp->batch)=186, user can increase that value through the above
mentioned procfs interface and the resulting pcp->count could be too
big for prefetch. Ying also mentioned this today and suggested adding
an upper limit here to avoid prefetching too much. Perhaps just prefetch
the last pcp->batch pages if count here > pcp->batch? Since pcp->batch
has an upper limit, we won't need to worry prefetching too much.

> 
> If so, I'm a bit surprised that it is effective to prefetch 128 page
> frames before using any them for real.  I guess they'll fit in the L2
> cache.   Thoughts?


Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

2018-03-01 Thread Andrew Morton
On Thu,  1 Mar 2018 14:28:45 +0800 Aaron Lu  wrote:

> When a page is freed back to the global pool, its buddy will be checked
> to see if it's possible to do a merge. This requires accessing buddy's
> page structure and that access could take a long time if it's cache cold.
> 
> This patch adds a prefetch to the to-be-freed page's buddy outside of
> zone->lock in hope of accessing buddy's page structure later under
> zone->lock will be faster. Since we *always* do buddy merging and check
> an order-0 page's buddy to try to merge it when it goes into the main
> allocator, the cacheline will always come in, i.e. the prefetched data
> will never be unused.
> 
> In the meantime, there are two concerns:
> 1 the prefetch could potentially evict existing cachelines, especially
>   for L1D cache since it is not huge;
> 2 there is some additional instruction overhead, namely calculating
>   buddy pfn twice.
> 
> For 1, it's hard to say, this microbenchmark though shows good result but
> the actual benefit of this patch will be workload/CPU dependant;
> For 2, since the calculation is a XOR on two local variables, it's expected
> in many cases that cycles spent will be offset by reduced memory latency
> later. This is especially true for NUMA machines where multiple CPUs are
> contending on zone->lock and the most time consuming part under zone->lock
> is the wait of 'struct page' cacheline of the to-be-freed pages and their
> buddies.
> 
> Test with will-it-scale/page_fault1 full load:
> 
> kernel  Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
> v4.16-rc2+  90342157971818   13667135   15677465
> patch2/39536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> this patch 10338868 +8.4%  8544477 +2.8% 14839808 +5.5% 17155464 +2.9%
> Note: this patch's performance improvement percent is against patch2/3.
> 
> ...
>
> @@ -1150,6 +1153,18 @@ static void free_pcppages_bulk(struct zone *zone, int 
> count,
>   continue;
>  
>   list_add_tail(>lru, );
> +
> + /*
> +  * We are going to put the page back to the global
> +  * pool, prefetch its buddy to speed up later access
> +  * under zone->lock. It is believed the overhead of
> +  * calculating buddy_pfn here can be offset by reduced
> +  * memory latency later.
> +  */
> + pfn = page_to_pfn(page);
> + buddy_pfn = __find_buddy_pfn(pfn, 0);
> + buddy = page + (buddy_pfn - pfn);
> + prefetch(buddy);

What is the typical list length here?  Maybe it's approximately the pcp
batch size which is typically 128 pages?

If so, I'm a bit surprised that it is effective to prefetch 128 page
frames before using any them for real.  I guess they'll fit in the L2
cache.   Thoughts?


Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

2018-03-01 Thread Michal Hocko
On Thu 01-03-18 14:28:45, Aaron Lu wrote:
> When a page is freed back to the global pool, its buddy will be checked
> to see if it's possible to do a merge. This requires accessing buddy's
> page structure and that access could take a long time if it's cache cold.
> 
> This patch adds a prefetch to the to-be-freed page's buddy outside of
> zone->lock in hope of accessing buddy's page structure later under
> zone->lock will be faster. Since we *always* do buddy merging and check
> an order-0 page's buddy to try to merge it when it goes into the main
> allocator, the cacheline will always come in, i.e. the prefetched data
> will never be unused.
> 
> In the meantime, there are two concerns:
> 1 the prefetch could potentially evict existing cachelines, especially
>   for L1D cache since it is not huge;
> 2 there is some additional instruction overhead, namely calculating
>   buddy pfn twice.
> 
> For 1, it's hard to say, this microbenchmark though shows good result but
> the actual benefit of this patch will be workload/CPU dependant;
> For 2, since the calculation is a XOR on two local variables, it's expected
> in many cases that cycles spent will be offset by reduced memory latency
> later. This is especially true for NUMA machines where multiple CPUs are
> contending on zone->lock and the most time consuming part under zone->lock
> is the wait of 'struct page' cacheline of the to-be-freed pages and their
> buddies.
> 
> Test with will-it-scale/page_fault1 full load:
> 
> kernel  Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
> v4.16-rc2+  90342157971818   13667135   15677465
> patch2/39536374 +5.6%  8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> this patch 10338868 +8.4%  8544477 +2.8% 14839808 +5.5% 17155464 +2.9%
> Note: this patch's performance improvement percent is against patch2/3.

I am really surprised that this has such a big impact.  Is this a win on
other architectures as well?
 
> [changelog stole from Dave Hansen and Mel Gorman's comments]
> https://lkml.org/lkml/2018/1/24/551

Please use http://lkml.kernel.org/r/ for references because
lkml.org is quite unstable. It would be
http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8...@intel.com
here.
-- 
Michal Hocko
SUSE Labs