Re: [LKP] [lkp] [mm] 795ae7a0de: pixz.throughput -9.1% regression

2016-06-22 Thread Johannes Weiner
Hi,

On Wed, Jun 08, 2016 at 01:37:26PM +0800, Ye Xiaolong wrote:
> On Tue, Jun 07, 2016 at 05:56:27PM -0400, Johannes Weiner wrote:
> >But just to make sure I'm looking at the right code, can you first try
> >the following patch on top of Linus's current tree and see if that
> >gets performance back to normal? It's a partial revert of the
> >watermark changes that singles out the fair zone allocator:
> 
> Seems that this patch doesn't help to get performance back.
> I've attached the comparison result among 3ed3a4f, 795ae7a, v4.7-rc2 and
> 1fe49ba5 ("mm: revert fairness batching to before the watermarks were boosted"),
> with perf profile information.  You can find it by searching for 'perf-profile'.

Sorry for the delay, and thank you for running these. I still can't
reproduce this.

> 3ed3a4f0ddffece9 795ae7a0de6b834a0cc202aa55                   v4.7-rc2 1fe49ba5002a50aefd5b6c4913
> ---------------- -------------------------- -------------------------- --------------------------
>        fail:runs  %reproduction    fail:runs  %reproduction    fail:runs  %reproduction    fail:runs
>            |                |            |              |            |              |            |
>           :4               0%           :7             0%           :4            50%          2:4  kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
>           :4              50%          2:7             0%           :4             0%           :4  kmsg.Spurious_LAPIC_timer_interrupt_on_cpu
>           :4               0%           :7            14%          1:4            25%          1:4  kmsg.igb#:#:#:exceed_max#second
>          %stddev      %change          %stddev      %change          %stddev      %change          %stddev
>              \           |                 \           |                 \           |                 \
>   78505362 ±  0%      -9.2%   71298182 ±  0%     -11.8%   69280014 ±  0%      -9.1%   71350485 ±  0%  pixz.throughput
>    5586220 ±  2%      -1.6%    5498492 ±  2%      +6.5%    5950210 ±  1%      +8.4%    6052963 ±  1%  pixz.time.involuntary_context_switches
>    4582198 ±  2%      -3.6%    4416275 ±  2%      -8.6%    4189304 ±  4%      -8.0%    4214839 ±  0%  pixz.time.minor_page_faults
>       4530 ±  0%      +1.0%       4575 ±  0%      -1.6%       4458 ±  0%      -1.3%       4469 ±  0%  pixz.time.percent_of_cpu_this_job_got
>      92.03 ±  0%      +5.6%      97.23 ± 11%     +31.3%     120.83 ±  1%     +30.4%     119.98 ±  0%  pixz.time.system_time
>      14911 ±  0%      +2.1%      15218 ±  0%      -1.0%      14759 ±  1%      -1.0%      14764 ±  0%  pixz.time.user_time
>    6586930 ±  0%      -8.4%    6033444 ±  1%      -4.4%    6295529 ±  1%      -2.6%    6416460 ±  1%  pixz.time.voluntary_context_switches
>    2179703 ±  4%      +4.8%    2285049 ±  2%     -15.3%    1846752 ± 16%      -8.2%    2000913 ±  4%  softirqs.RCU
>      92.03 ±  0%      +5.6%      97.23 ± 11%     +31.3%     120.83 ±  1%     +30.4%     119.98 ±  0%  time.system_time
>       2237 ±  2%      -2.9%       2172 ±  7%     +16.3%       2601 ±  7%      +8.0%       2416 ±  6%  uptime.idle
>      49869 ±  1%     -12.6%      43583 ±  8%     -18.0%      40917 ±  0%     -16.3%      41728 ±  1%  vmstat.system.cs
>      97890 ±  1%      -0.0%      97848 ±  3%      +7.4%     105143 ±  2%      +6.8%     104518 ±  2%  vmstat.system.in
>     105682 ±  1%      +0.6%     106297 ±  1%     -85.2%      15631 ±  4%     -85.1%      15768 ±  1%  meminfo.Active(file)
>     390126 ±  0%      -0.2%     389529 ±  0%     +23.9%     483296 ±  0%     +23.9%     483194 ±  0%  meminfo.Inactive
>     380750 ±  0%      -0.2%     380141 ±  0%     +24.5%     473891 ±  0%     +24.4%     473760 ±  0%  meminfo.Inactive(file)
>       2401 ±107%     +76.9%       4247 ± 79%     -99.8%       5.75 ± 18%     -99.7%       6.75 ± 39%  numa-numastat.node0.other_node
>    2074670 ±  2%     -11.3%    1840052 ± 11%     -21.1%    1637071 ± 12%     -22.5%    1607724 ±  7%  numa-numastat.node1.local_node
>    2081648 ±  2%     -11.4%    1844923 ± 11%     -21.4%    1637081 ± 12%     -22.8%    1607730 ±  7%  numa-numastat.node1.numa_hit
>       6977 ± 36%     -30.2%       4871 ± 66%     -99.8%      10.50 ± 17%     -99.9%       5.50 ± 20%  numa-numastat.node1.other_node
>   13061458 ± 19%      -3.3%   12634644 ± 24%     +33.5%   17435714 ± 47%     +58.3%   20674526 ± 14%  cpuidle.C1-IVT.time
>     193807 ± 15%     +26.8%     245657 ± 76%    +101.8%     391021 ±  8%    +115.5%     417669 ± 20%  cpuidle.C1-IVT.usage
>  8.866e+08 ±  2%     -15.6%  7.479e+08 ±  6%     +25.0%  1.108e+09 ±  5%     +21.0%  1.073e+09 ±  4%  cpuidle.C6-IVT.time
>      93283 ±  0%     -13.2%      80988 ±  3%    +300.6%     373726 ±121%     +20.8%     112719 ±  1%  cpuidle.C6-IVT.usage
>    8559466 ± 20%     -39.3%    5195127 ± 40%     -98.1%     159481 ±173%    -100.0%      97.50 ± 40%  cpuidle.POLL.time
>     771388 ±  9%     -53.

Re: [LKP] [lkp] [mm] 795ae7a0de: pixz.throughput -9.1% regression

2016-06-07 Thread Ye Xiaolong
On Tue, Jun 07, 2016 at 05:56:27PM -0400, Johannes Weiner wrote:
>On Tue, Jun 07, 2016 at 12:48:17PM +0800, Ye Xiaolong wrote:
>> FYI, below is the comparison info between 3ed3a4f, 795ae7a, v4.7-rc2 and the
>> revert commit (eaa7f0d).
>
>Thanks for running this.
>
>Alas, I still can not make heads or tails of this, or reproduce it
>locally for that matter.
>
>With this test run, there seems to be a significant increase in system time:
>
>>     92.03 ±  0%      +5.6%      97.23 ± 11%     +30.5%     120.08 ±  1%     +30.0%     119.61 ±  0%  pixz.time.system_time
>
>Would it be possible to profile the testruns using perf? Maybe we can
>find out where the kernel is spending the extra time.
>
>But just to make sure I'm looking at the right code, can you first try
>the following patch on top of Linus's current tree and see if that
>gets performance back to normal? It's a partial revert of the
>watermark changes that singles out the fair zone allocator:

Seems that this patch doesn't help to get performance back.
I've attached the comparison result among 3ed3a4f, 795ae7a, v4.7-rc2 and
1fe49ba5 ("mm: revert fairness batching to before the watermarks were boosted"),
with perf profile information.  You can find it by searching for 'perf-profile'.

Thanks,
Xiaolong

>
>From 2015eaad688486d65fcf86185e213fff8506b3fe Mon Sep 17 00:00:00 2001
>From: Johannes Weiner 
>Date: Tue, 7 Jun 2016 17:45:03 -0400
>Subject: [PATCH] mm: revert fairness batching to before the watermarks were
> boosted
>
>Signed-off-by: Johannes Weiner 
>---
> include/linux/mmzone.h | 2 ++
> mm/page_alloc.c        | 6 ++++--
> 2 files changed, 6 insertions(+), 2 deletions(-)
>
>diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>index 02069c2..4565b92 100644
>--- a/include/linux/mmzone.h
>+++ b/include/linux/mmzone.h
>@@ -327,6 +327,8 @@ struct zone {
>   /* zone watermarks, access with *_wmark_pages(zone) macros */
>   unsigned long watermark[NR_WMARK];
> 
>+  unsigned long fairbatch;
>+
>   unsigned long nr_reserved_highatomic;
> 
>   /*
>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>index 6903b69..33387ab 100644
>--- a/mm/page_alloc.c
>+++ b/mm/page_alloc.c
>@@ -2889,7 +2889,7 @@ static void reset_alloc_batches(struct zone *preferred_zone)
> 
>   do {
>   mod_zone_page_state(zone, NR_ALLOC_BATCH,
>-  high_wmark_pages(zone) - low_wmark_pages(zone) -
>+  zone->fairbatch -
>   atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
>   clear_bit(ZONE_FAIR_DEPLETED, &zone->flags);
>   } while (zone++ != preferred_zone);
>@@ -6842,6 +6842,8 @@ static void __setup_per_zone_wmarks(void)
>   zone->watermark[WMARK_MIN] = tmp;
>   }
> 
>+  zone->fairbatch = tmp >> 2;
>+
>   /*
>* Set the kswapd watermarks distance according to the
>* scale factor in proportion to available memory, but
>@@ -6855,7 +6857,7 @@ static void __setup_per_zone_wmarks(void)
>   zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
> 
>   __mod_zone_page_state(zone, NR_ALLOC_BATCH,
>-  high_wmark_pages(zone) - low_wmark_pages(zone) -
>+  zone->fairbatch -
>   atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
> 
>   spin_unlock_irqrestore(&zone->lock, flags);
>-- 
>2.8.2
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/testcase:
  gcc-4.9/performance/x86_64-rhel/100%/debian-x86_64-2015-02-07.cgz/ivb43/pixz

commit: 
  3ed3a4f0ddffece942bb2661924d87be4ce63cb7
  795ae7a0de6b834a0cc202aa55c190ef81496665
  v4.7-rc2
  1fe49ba5002a50aefd5b6c4913e61eff86ac7253

3ed3a4f0ddffece9 795ae7a0de6b834a0cc202aa55                   v4.7-rc2 1fe49ba5002a50aefd5b6c4913
---------------- -------------------------- -------------------------- --------------------------
       fail:runs  %reproduction    fail:runs  %reproduction    fail:runs  %reproduction    fail:runs
           |                |            |              |            |              |            |
          :4               0%           :7             0%           :4            50%          2:4  kmsg.DHCP/BOOTP:Reply_not_for_us,op[#]xid[#]
          :4              50%          2:7             0%           :4             0%           :4  kmsg.Spurious_LAPIC_timer_interrupt_on_cpu
          :4               0%           :7            14%          1:4            25%          1:4  kmsg.igb#:#:#:exceed_max#second
         %stddev      %change          %stddev      %change          %stddev      %change          %stddev
             \           |                 \           |                 \           |                 \
  78505362 ±  0%      -9.2%   71298182 ±  0%     -11.8%   69280014 ±  0%      -9.1%   71350485 ±  0%  pixz.throughput
        55

Re: [LKP] [lkp] [mm] 795ae7a0de: pixz.throughput -9.1% regression

2016-06-07 Thread Johannes Weiner
On Tue, Jun 07, 2016 at 12:48:17PM +0800, Ye Xiaolong wrote:
> FYI, below is the comparison info between 3ed3a4f, 795ae7a, v4.7-rc2 and the
> revert commit (eaa7f0d).

Thanks for running this.

Alas, I still can not make heads or tails of this, or reproduce it
locally for that matter.

With this test run, there seems to be a significant increase in system time:

>     92.03 ±  0%      +5.6%      97.23 ± 11%     +30.5%     120.08 ±  1%     +30.0%     119.61 ±  0%  pixz.time.system_time

Would it be possible to profile the testruns using perf? Maybe we can
find out where the kernel is spending the extra time.
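
(A system-wide call-graph profile, for example running "perf record -a -g" for the duration of the benchmark and then inspecting it with "perf report", would typically be enough to show where the extra system time goes; the LKP results later in this thread expose such data in a 'perf-profile' section.)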

But just to make sure I'm looking at the right code, can you first try
the following patch on top of Linus's current tree and see if that
gets performance back to normal? It's a partial revert of the
watermark changes that singles out the fair zone allocator:

From 2015eaad688486d65fcf86185e213fff8506b3fe Mon Sep 17 00:00:00 2001
From: Johannes Weiner 
Date: Tue, 7 Jun 2016 17:45:03 -0400
Subject: [PATCH] mm: revert fairness batching to before the watermarks were
 boosted

Signed-off-by: Johannes Weiner 
---
 include/linux/mmzone.h | 2 ++
 mm/page_alloc.c        | 6 ++++--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 02069c2..4565b92 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -327,6 +327,8 @@ struct zone {
/* zone watermarks, access with *_wmark_pages(zone) macros */
unsigned long watermark[NR_WMARK];
 
+   unsigned long fairbatch;
+
unsigned long nr_reserved_highatomic;
 
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6903b69..33387ab 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2889,7 +2889,7 @@ static void reset_alloc_batches(struct zone *preferred_zone)
 
do {
mod_zone_page_state(zone, NR_ALLOC_BATCH,
-   high_wmark_pages(zone) - low_wmark_pages(zone) -
+   zone->fairbatch -
atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
clear_bit(ZONE_FAIR_DEPLETED, &zone->flags);
} while (zone++ != preferred_zone);
@@ -6842,6 +6842,8 @@ static void __setup_per_zone_wmarks(void)
zone->watermark[WMARK_MIN] = tmp;
}
 
+   zone->fairbatch = tmp >> 2;
+
/*
 * Set the kswapd watermarks distance according to the
 * scale factor in proportion to available memory, but
@@ -6855,7 +6857,7 @@ static void __setup_per_zone_wmarks(void)
zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
 
__mod_zone_page_state(zone, NR_ALLOC_BATCH,
-   high_wmark_pages(zone) - low_wmark_pages(zone) -
+   zone->fairbatch -
atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
 
spin_unlock_irqrestore(&zone->lock, flags);
-- 
2.8.2
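
For readers skimming the patch, a rough, self-contained sketch follows. It is not part of the original mails, and the helper names and example numbers below are invented; it only illustrates how the fair-zone round-robin batch size compares before and after 795ae7a, which is the quantity that the zone->fairbatch addition above pins back to its old value.

/*
 * Illustrative sketch only, not from the original thread: helper names
 * and the example numbers are made up.  It compares the size of the
 * fair-zone round-robin batch (NR_ALLOC_BATCH) before and after
 * 795ae7a0de6b, i.e. the quantity that zone->fairbatch restores to the
 * pre-795ae7a value in the patch above.
 */
#include <stdio.h>

/* Before 795ae7a: low = min + min/4, high = min + min/2, so the batch
 * (high_wmark - low_wmark) worked out to min/4. */
static unsigned long batch_before(unsigned long min_wmark)
{
	return min_wmark >> 2;
}

/* After 795ae7a: the low-to-high distance becomes
 * max(min/4, managed_pages * watermark_scale_factor / 10000),
 * i.e. 0.1% of the zone at the default scale factor of 10, and the
 * batch grows along with it. */
static unsigned long batch_after(unsigned long min_wmark,
				 unsigned long managed_pages,
				 unsigned int scale_factor)
{
	unsigned long scaled = (unsigned long long)managed_pages * scale_factor / 10000;
	unsigned long old = min_wmark >> 2;

	return scaled > old ? scaled : old;
}

int main(void)
{
	/* Very rough numbers for one ~32GB zone of the 64GB ivb43 box:
	 * 8M managed 4k pages and a min watermark of ~4096 pages. */
	unsigned long managed = 8UL << 20;
	unsigned long min_wmark = 4096;

	printf("batch before 795ae7a: ~%lu pages\n", batch_before(min_wmark));
	printf("batch after  795ae7a: ~%lu pages\n",
	       batch_after(min_wmark, managed, 10));
	return 0;
}

In this sketch, with the default watermark_scale_factor of 10, the batch on a machine of this size grows roughly eightfold, which lines up with the observation elsewhere in the thread that the commit's only allocator-side effect is to enlarge the round-robin batches in the fair zone allocator.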


Re: [LKP] [lkp] [mm] 795ae7a0de: pixz.throughput -9.1% regression

2016-06-06 Thread Ye Xiaolong
On Mon, Jun 06, 2016 at 04:53:07PM +0800, Ye Xiaolong wrote:
>On Fri, Jun 03, 2016 at 06:21:09PM -0400, Johannes Weiner wrote:
>>On Thu, Jun 02, 2016 at 12:07:06PM -0400, Johannes Weiner wrote:
>>> Hi,
>>> 
>>> On Thu, Jun 02, 2016 at 02:45:07PM +0800, kernel test robot wrote:
>>> > FYI, we noticed pixz.throughput -9.1% regression due to commit:
>>> > 
>>> > commit 795ae7a0de6b834a0cc202aa55c190ef81496665 ("mm: scale kswapd 
>>> > watermarks in proportion to memory")
>>> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>>> > 
>>> > in testcase: pixz
>>> > on test machine: ivb43: 48 threads Ivytown Ivy Bridge-EP with 64G memory 
>>> > with following parameters: cpufreq_governor=performance/nr_threads=100%
>>> 
>>> Xiaolong, thanks for the report.
>>> 
>>> It looks like the regression stems from a change in NUMA placement:
>>
>>Scratch that, I was misreading the test results. It's just fewer
>>allocations in general that happen during the fixed testing time.
>>
>>I'm stumped by this report. All this patch does other than affect page
>>reclaim (which is not involved here) is increase the size of the round
>>robin batches in the fair zone allocator. That should *reduce* work in
>>the page allocator, if anything.
>>
>>But I also keep failing to reproduce this - having tried it on the
>>third machine now - neither pixz nor will-it-scale/page_fault1 give me
>>that -8-9% regression:
>>
>>4.5.0-02574-g3ed3a4f:
>>PIXZ-good.log:throughput: 39908733.604941994
>>PIXZ-good.log:throughput: 37067947.25049969
>>PIXZ-good.log:throughput: 38604938.39131216
>>
>>4.5.0-02575-g795ae7a:
>> PIXZ-bad.log:throughput: 39489120.87179377
>> PIXZ-bad.log:throughput: 39307299.288432725
>> PIXZ-bad.log:throughput: 38795994.3329781
>>
>>Is this reliably reproducible with 3ed3a4f vs 795ae7a?
>
>Yes, it can be reliably reproduced in the LKP environment; I've run 3ed3a4f0ddffece9
>4 times and 795ae7a0de6b834a0cc202aa55 7 times.
>
>3ed3a4f0ddffece9 795ae7a0de6b834a0cc202aa55
>---------------- --------------------------
>       fail:runs  %reproduction    fail:runs
>           |                |            |
>          :4              50%          2:7  kmsg.Spurious_LAPIC_timer_interrupt_on_cpu
>         %stddev      %change          %stddev
>             \           |                 \
>  78505362 ±  0%      -9.2%   71298182 ±  0%  pixz.throughput
>
>
>>
>>Could I ask you to retry the test with Linus's current head as well as
>>with 795ae7a reverted on top of it? (It's a clean revert.)
>>
>
>Sure, I have queued the test tasks for them, will update the comparison
>once I get the results.

FYI, below is the comparison info between 3ed3a4f, 795ae7a, v4.7-rc2 and the
revert commit (eaa7f0d).

=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/testcase:
  gcc-4.9/performance/x86_64-rhel/100%/debian-x86_64-2015-02-07.cgz/ivb43/pixz

commit: 
  3ed3a4f0ddffece942bb2661924d87be4ce63cb7
  795ae7a0de6b834a0cc202aa55c190ef81496665
  v4.7-rc2
  eaa7f0d7239a9508c616f208527cd34eb10ec90f

3ed3a4f0ddffece9 795ae7a0de6b834a0cc202aa55                   v4.7-rc2 eaa7f0d7239a9508c616f20852
---------------- -------------------------- -------------------------- --------------------------
       fail:runs  %reproduction    fail:runs  %reproduction    fail:runs  %reproduction    fail:runs
           |                |            |              |            |              |            |
          :4              50%          2:7             0%           :3            33%          1:6  kmsg.Spurious_LAPIC_timer_interrupt_on_cpu
          :4               0%           :7             0%           :3            33%          1:6  kmsg.igb#:#:#:exceed_max#second
         %stddev      %change          %stddev      %change          %stddev      %change          %stddev
             \           |                 \           |                 \           |                 \
  78505362 ±  0%      -9.2%   71298182 ±  0%     -11.4%   69554765 ±  0%      -9.4%   71129419 ±  0%  pixz.throughput
   5586220 ±  2%      -1.6%    5498492 ±  2%      +6.4%    5942217 ±  2%      +9.6%    6122002 ±  0%  pixz.time.involuntary_context_switches
   6560113 ±  0%      -0.6%    6518532 ±  0%      -0.6%    6519065 ±  0%      -1.3%    6475587 ±  0%  pixz.time.maximum_resident_set_size
   4582198 ±  2%      -3.6%    4416275 ±  2%      -9.0%    4167645 ±  4%      -6.7%    4275069 ±  1%  pixz.time.minor_page_faults
      4530 ±  0%      +1.0%       4575 ±  0%      -1.7%       4454 ±  0%      -1.5%       4463 ±  0%  pixz.time.percent_of_cpu_this_job_got
     92.03 ±  0%      +5.6%      97.23 ± 11%     +30.5%     120.08 ±  1%     +30.0%     119.61 ±  0%  pixz.time.system_time
     14911 ±  0%      +2.1%      15218 ±  0%      -1.1%      14753 ±  1%      -1.1%      14743 ±  0%  pixz.time.user_time
   6586930 ±  0%      -8.4%    6033444 ±  1%      -4.4%    6298874 ±  1%