Re: [Qemu-devel] [PATCH 0/6] Reduce compaction scanning and lock contention

2012-09-21 Thread Richard Davies
Hi Mel,

Thank you for this series. I have applied it on a clean 3.6-rc5 and tested,
and it works well for me - the lock contention is (still) gone and
isolate_freepages_block is much reduced.

Here is a typical test with these patches:

# grep -F '[k]' report | head -8
65.20% qemu-kvm  [kernel.kallsyms] [k] clear_page_c
 2.18% qemu-kvm  [kernel.kallsyms] [k] isolate_freepages_block
 1.56% qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock
 1.40% qemu-kvm  [kernel.kallsyms] [k] svm_vcpu_run
 1.38%  swapper  [kernel.kallsyms] [k] default_idle
 1.35% qemu-kvm  [kernel.kallsyms] [k] get_page_from_freelist
 0.74% ksmd  [kernel.kallsyms] [k] memcmp
 0.72% qemu-kvm  [kernel.kallsyms] [k] free_pages_prepare


I did manage to get a couple of runs which were slightly worse, but nothing
like as bad as before. Here are the results:

# grep -F '[k]' report | head -8
45.60%   qemu-kvm  [kernel.kallsyms] [k] clear_page_c
11.26%   qemu-kvm  [kernel.kallsyms] [k] isolate_freepages_block
 3.21%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock
 2.27%   ksmd  [kernel.kallsyms] [k] memcmp
 2.02%    swapper  [kernel.kallsyms] [k] default_idle
 1.58%   qemu-kvm  [kernel.kallsyms] [k] svm_vcpu_run
 1.30%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
 1.09%   qemu-kvm  [kernel.kallsyms] [k] get_page_from_freelist

# grep -F '[k]' report | head -8
61.29%   qemu-kvm  [kernel.kallsyms] [k] clear_page_c
 4.52%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
 2.64%   qemu-kvm  [kernel.kallsyms] [k] copy_page_c
 1.61%    swapper  [kernel.kallsyms] [k] default_idle
 1.57%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock
 1.18%   qemu-kvm  [kernel.kallsyms] [k] get_page_from_freelist
 1.18%   qemu-kvm  [kernel.kallsyms] [k] isolate_freepages_block
 1.11%   qemu-kvm  [kernel.kallsyms] [k] svm_vcpu_run

I will follow up with the detailed traces for these three tests.

Thank you!

Richard.



Re: [Qemu-devel] [PATCH 0/6] Reduce compaction scanning and lock contention

2012-09-21 Thread Mel Gorman
On Fri, Sep 21, 2012 at 10:13:33AM +0100, Richard Davies wrote:
 Hi Mel,
 
 Thank you for this series. I have applied it on a clean 3.6-rc5 and tested,
 and it works well for me - the lock contention is (still) gone and
 isolate_freepages_block is much reduced.
 

Excellent!

 Here is a typical test with these patches:
 
 # grep -F '[k]' report | head -8
 65.20% qemu-kvm  [kernel.kallsyms] [k] clear_page_c
  2.18% qemu-kvm  [kernel.kallsyms] [k] isolate_freepages_block
  1.56% qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock
  1.40% qemu-kvm  [kernel.kallsyms] [k] svm_vcpu_run
  1.38%  swapper  [kernel.kallsyms] [k] default_idle
  1.35% qemu-kvm  [kernel.kallsyms] [k] get_page_from_freelist
  0.74% ksmd  [kernel.kallsyms] [k] memcmp
  0.72% qemu-kvm  [kernel.kallsyms] [k] free_pages_prepare
 

Ok, so that is more or less acceptable. I would like to reduce the scanning
even further but I'll take this as a start -- largely because I do not have
any new good ideas on how it could be reduced further without incurring
a large cost in the page allocator :)

 I did manage to get a couple of runs which were slightly worse, but nothing
 like as bad as before. Here are the results:
 
 # grep -F '[k]' report | head -8
 45.60%   qemu-kvm  [kernel.kallsyms] [k] clear_page_c
 11.26%   qemu-kvm  [kernel.kallsyms] [k] isolate_freepages_block
  3.21%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock
  2.27%   ksmd  [kernel.kallsyms] [k] memcmp
  2.02%    swapper  [kernel.kallsyms] [k] default_idle
  1.58%   qemu-kvm  [kernel.kallsyms] [k] svm_vcpu_run
  1.30%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
  1.09%   qemu-kvm  [kernel.kallsyms] [k] get_page_from_freelist
 
 # grep -F '[k]' report | head -8
 61.29%   qemu-kvm  [kernel.kallsyms] [k] clear_page_c
  4.52%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
  2.64%   qemu-kvm  [kernel.kallsyms] [k] copy_page_c
  1.61%    swapper  [kernel.kallsyms] [k] default_idle
  1.57%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock
  1.18%   qemu-kvm  [kernel.kallsyms] [k] get_page_from_freelist
  1.18%   qemu-kvm  [kernel.kallsyms] [k] isolate_freepages_block
  1.11%   qemu-kvm  [kernel.kallsyms] [k] svm_vcpu_run
 
 

Were the boot times acceptable even when these slightly worse figures
were recorded?

 I will follow up with the detailed traces for these three tests.
 
 Thank you!
 

Thank you for the detailed reporting and the testing, it's much
appreciated. I've already rebased the patches to Andrew's tree and tested
them overnight and the figures look good on my side. I'll update the
changelog and push them shortly.

-- 
Mel Gorman
SUSE Labs



Re: [Qemu-devel] [PATCH 0/6] Reduce compaction scanning and lock contention

2012-09-21 Thread Richard Davies
Mel Gorman wrote:
  I did manage to get a couple of runs which were slightly worse, but nothing
  like as bad as before. Here are the results:
 
  # grep -F '[k]' report | head -8
  45.60%   qemu-kvm  [kernel.kallsyms] [k] clear_page_c
  11.26%   qemu-kvm  [kernel.kallsyms] [k] isolate_freepages_block
   3.21%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock
   2.27%   ksmd  [kernel.kallsyms] [k] memcmp
   2.02%    swapper  [kernel.kallsyms] [k] default_idle
   1.58%   qemu-kvm  [kernel.kallsyms] [k] svm_vcpu_run
   1.30%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
   1.09%   qemu-kvm  [kernel.kallsyms] [k] get_page_from_freelist
 
  # grep -F '[k]' report | head -8
  61.29%   qemu-kvm  [kernel.kallsyms] [k] clear_page_c
   4.52%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
   2.64%   qemu-kvm  [kernel.kallsyms] [k] copy_page_c
   1.61%    swapper  [kernel.kallsyms] [k] default_idle
   1.57%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock
   1.18%   qemu-kvm  [kernel.kallsyms] [k] get_page_from_freelist
   1.18%   qemu-kvm  [kernel.kallsyms] [k] isolate_freepages_block
   1.11%   qemu-kvm  [kernel.kallsyms] [k] svm_vcpu_run

 Were the boot times acceptable even when these slightly worse figures
 were recorded?

Yes, they were 10-20% slower as you might expect from the traces, rather
than several times slower.

 Thank you for the detailed reporting and the testing, it's much
 appreciated. I've already rebased the patches to Andrew's tree and tested
 them overnight and the figures look good on my side. I'll update the
 changelog and push them shortly.

Great. On my side, I'm delighted that senior kernel developers such as you,
Rik and Avi took our bug report seriously and helped fix it!

Thank you,

Richard.



Re: [Qemu-devel] [PATCH 0/6] Reduce compaction scanning and lock contention

2012-09-21 Thread Mel Gorman
On Fri, Sep 21, 2012 at 10:17:01AM +0100, Richard Davies wrote:
 Richard Davies wrote:
  I did manage to get a couple of runs which were slightly worse, but nothing
  like as bad as before. Here are the results:
  
  # grep -F '[k]' report | head -8
  45.60%   qemu-kvm  [kernel.kallsyms] [k] clear_page_c
  11.26%   qemu-kvm  [kernel.kallsyms] [k] isolate_freepages_block
   3.21%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock
   2.27%   ksmd  [kernel.kallsyms] [k] memcmp
   2.02%    swapper  [kernel.kallsyms] [k] default_idle
   1.58%   qemu-kvm  [kernel.kallsyms] [k] svm_vcpu_run
   1.30%   qemu-kvm  [kernel.kallsyms] [k] _raw_spin_lock_irqsave
   1.09%   qemu-kvm  [kernel.kallsyms] [k] get_page_from_freelist
 
 # 
 # captured on: Fri Sep 21 08:17:52 2012
 # os release : 3.6.0-rc5-elastic+
 # perf version : 3.5.2
 # arch : x86_64
 # nrcpus online : 16
 # nrcpus avail : 16
 # cpudesc : AMD Opteron(tm) Processor 6128
 # cpuid : AuthenticAMD,16,9,1
 # total memory : 131973276 kB
 # cmdline : /home/root/bin/perf record -g -a 
 # event : name = cycles, type = 0, config = 0x0, config1 = 0x0,
 #   config2 = 0x0, excl_usr = 0, excl_kern = 0,
 #   id = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 }
 # HEADER_CPU_TOPOLOGY info available, use -I to display
 # HEADER_NUMA_TOPOLOGY info available, use -I to display
 # 
 #
 # Samples: 283K of event 'cycles'
 # Event count (approx.): 109057976176
 #
 # Overhead   Command      Shared Object          Symbol
 # ........  .........  .................  ..............
 #
 45.60%   qemu-kvm  [kernel.kallsyms]  [k] clear_page_c
          |
          --- clear_page_c
             |
             |--93.35%-- do_huge_pmd_anonymous_page
This is unavoidable. If THP was disabled, the cost would still be
incurred, just on base pages instead of huge pages.

 SNIP
 11.26%   qemu-kvm  [kernel.kallsyms]  [k] isolate_freepages_block
          |
          --- isolate_freepages_block
              compaction_alloc
              migrate_pages
              compact_zone
              compact_zone_order
              try_to_compact_pages
              __alloc_pages_direct_compact
              __alloc_pages_nodemask
              alloc_pages_vma
              do_huge_pmd_anonymous_page

And this is showing that we're still spending a lot of time scanning
for free pages to isolate. I do not have a great idea on how this can be
reduced further without interfering with the page allocator.

One idea I considered in the past was to use the buddy lists to find free
pages quickly, but there are problems. The first is that the buddy lists
themselves may need to be searched and, now that the zone lock is not held
during the scan, that would be particularly difficult. The harder problem is
deciding when compaction has finished. I'll put more thought into it over
the weekend and see if something falls out, but I'm not going to hold up
this series waiting for inspiration.
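
To make the buddy-list idea concrete, here is a minimal userspace sketch.
It is only an illustration under the assumptions above, not kernel code,
and every name in it (free_block, free_area, take_free_block) is
hypothetical. Popping a block off a per-order free list is O(1) per
allocation, unlike the linear pfn walk isolate_freepages_block does, but
in the kernel those lists are only stable under the zone lock:

#include <stdio.h>
#include <stdlib.h>

#define MAX_ORDER 4

struct free_block {
    unsigned long pfn;
    struct free_block *next;
};

/* One list of free blocks per order, as in a buddy allocator. */
static struct free_block *free_area[MAX_ORDER + 1];

static void add_free_block(unsigned int order, unsigned long pfn)
{
    struct free_block *b = malloc(sizeof(*b));
    b->pfn = pfn;
    b->next = free_area[order];
    free_area[order] = b;
}

/* Pop the first block of at least min_order: no pfn-by-pfn scanning. */
static long take_free_block(unsigned int min_order)
{
    for (int order = MAX_ORDER; order >= (int)min_order; order--) {
        struct free_block *b = free_area[order];
        if (b) {
            long pfn = (long)b->pfn;
            free_area[order] = b->next;
            free(b);
            return pfn;
        }
    }
    return -1; /* nothing suitable: this is where "when does
                  compaction finish?" becomes the hard question */
}

int main(void)
{
    add_free_block(0, 100);
    add_free_block(2, 512);
    add_free_block(3, 2048);

    printf("isolated free block at pfn %ld\n", take_free_block(2));
    return 0;
}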

  3.21%   qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock
          |
          --- _raw_spin_lock
             |
             |--39.96%-- tdp_page_fault

Nothing very interesting here until...

 |--1.69%-- free_pcppages_bulk
 |          |
 |          |--77.53%-- drain_pages
 |          |          |
 |          |          |--95.77%-- drain_local_pages
 |          |          |          |
 |          |          |          |--97.90%-- generic_smp_call_function_interrupt
 |          |          |          |           smp_call_function_interrupt
 |          |          |          |           call_function_interrupt
 |          |          |          |           |
 |          |          |          |           |--23.37%-- kvm_vcpu_ioctl
 |          |          |          |           |           do_vfs_ioctl
 |          |          |          |           |           sys_ioctl
 |          |          |          |           |           system_call_fastpath
 |          |          |          |           |           ioctl
 |          |          |          |           |           |
 |          |          |          |           |           |--97.22%-- 0x1010006

[Qemu-devel] [PATCH 0/6] Reduce compaction scanning and lock contention

2012-09-20 Thread Mel Gorman
Hi Richard,

This series is following up from your mail at
http://www.spinics.net/lists/kvm/msg80080.html . I am pleased the lock
contention is now reduced but acknowledge that the scanning rates are
stupidly high. Fortunately, I am reasonably confident I know what is
going wrong. If all goes according to plan, this should drastically reduce
the amount of time your workload spends on compaction. I would very much
appreciate it if you dropped the MM patches (i.e. keep the btrfs patches)
and replaced them with this series. I know that Rik's patches are dropped
and this is deliberate. I reimplemented his idea on top of the fifth patch
in this series to cover both the migrate and free scanners. Thanks to Rik,
who discussed on IRC how the idea could be reimplemented; that was very
helpful. Hopefully the patch actually reflects what we discussed :)

Shaohua, I would also appreciate it if you tested this series. I picked up
one of your patches but replaced another, and I want to make sure that the
workload you were investigating is still ok.

===

Richard Davies and Shaohua Li have both reported lock contention problems
in compaction on the zone and LRU locks as well as significant amounts of
time being spent in compaction. It is critical that performance gains from
THP are not offset by the cost of allocating them in the first place. This
series aims to reduce lock contention and scanning rates.

Patch 1 is a fix for c67fe375 (mm: compaction: Abort async compaction if
locks are contended or taking too long) to properly abort in all
cases when contention is detected.

Patch 2 defers acquiring the zone->lru_lock as long as possible.

Patch 3 defers acquiring the zone->lock as long as possible.

Patch 4 reverts Rik's skip-free patches as the core concept gets
reimplemented later and the remaining patches are easier to
understand if this is reverted first.

Patch 5 adds a pageblock-skip bit to the pageblock flags to cache which
pageblocks should be skipped by the migrate and free scanners.
This drastically reduces the amount of scanning compaction has
to do (a toy sketch of the idea follows this list).

Patch 6 reimplements something similar to Rik's idea except it uses the
pageblock-skip information to decide where the scanners should
restart from and does not need to wrap around.
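
To make patches 5 and 6 concrete, here is a minimal userspace sketch of
the skip-bit-plus-cached-restart idea. It is an illustration under stated
assumptions, not the kernel implementation, and all the names (skip_bit,
cached_start, scan_pageblock, compact_once) are hypothetical:

#include <stdio.h>

#define NR_PAGEBLOCKS 16

static unsigned char skip_bit[NR_PAGEBLOCKS]; /* patch 5: 1 = known useless */
static int cached_start;                      /* patch 6: restart position */

/* Pretend scan of one pageblock; only every fourth yields free pages. */
static int scan_pageblock(int pb)
{
    return (pb % 4 == 0) ? 8 : 0;
}

static void compact_once(int want)
{
    int got = 0, scanned = 0, start = cached_start;

    for (int i = 0; i < NR_PAGEBLOCKS && got < want; i++) {
        int pb = (start + i) % NR_PAGEBLOCKS;

        if (skip_bit[pb])
            continue;             /* cached result: do not rescan */

        int n = scan_pageblock(pb);
        scanned++;
        if (n == 0)
            skip_bit[pb] = 1;     /* remember this pageblock was useless */
        got += n;
        cached_start = (pb + 1) % NR_PAGEBLOCKS; /* resume here next time */
    }
    printf("isolated %d pages, scanned %d pageblock(s)\n", got, scanned);
}

int main(void)
{
    compact_once(64);  /* first pass scans all 16 pageblocks */
    compact_once(64);  /* second pass rescans only the 4 useful ones */
    return 0;
}

The second call scans a quarter of the pageblocks of the first; the ftrace
figures below show the same shape of saving at zone scale.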

I tested this on 3.6-rc5 as that was the kernel base that the earlier threads
worked on. It will need a bit of work to rebase on top of Andrew's tree for
merging due to other compaction changes, but it will not be a major problem.
Kernels tested were

vanilla      3.6-rc5
lesslock     Patches 1-3
revert       Patches 1-4
cachefail    Patches 1-5
skipuseless  Patches 1-6

Stress high-order allocation tests looked ok.

STRESS-HIGHALLOC
                  3.6.0-rc5       3.6.0-rc5       3.6.0-rc5       3.6.0-rc5       3.6.0-rc5
                    vanilla        lesslock          revert       cachefail     skipuseless
Pass 1         17.00 ( 0.00%)  19.00 ( 2.00%)  29.00 (12.00%)  24.00 ( 7.00%)  20.00 ( 3.00%)
Pass 2         16.00 ( 0.00%)  19.00 ( 3.00%)  39.00 (23.00%)  37.00 (21.00%)  35.00 (19.00%)
while Rested   88.00 ( 0.00%)  88.00 ( 0.00%)  88.00 ( 0.00%)  85.00 (-3.00%)  86.00 (-2.00%)

Success rates are improved a bit by the series as there are fewer
opportunities to race with other allocation requests if compaction is
scanning less. I recognise the success rates are still low but patches
that tackle parts of that are in Andrew's tree already.

The times to complete the tests did not vary much and are uninteresting,
as were the vmstat statistics, so I will not present them here.

Using ftrace I recorded how much scanning was done by compaction and got this

                           3.6.0-rc5  3.6.0-rc5  3.6.0-rc5  3.6.0-rc5    3.6.0-rc5
                             vanilla   lesslock     revert  cachefail  skipuseless
Total free    scanned      185020625  223313210  744553485   37149462     29231432
Total free    isolated        845094    1174759    4301672     906689       721963
Total free    efficiency     0.0046%    0.0053%    0.0058%    0.0244%      0.0247%
Total migrate scanned      187708506  143133150  428180990   21941574     12288851
Total migrate isolated        714376    1081134    3950098     711357       590552
Total migrate efficiency     0.0038%    0.0076%    0.0092%    0.0324%      0.0481%

The efficiency is worthless because of the nature of the test and the
number of failures.  The really interesting point as far as this patch
series is concerned is the number of pages scanned.
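
For scale, comparing cachefail (patches 1-5) against the vanilla kernel
using the scanned totals above:

    free scanner:    37149462 / 185020625 ~= 0.20, an 80% reduction
    migrate scanner: 21941574 / 187708506 ~= 0.12, an 88% reduction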

Note that reverting Rik's patches massively increases the number of pages
scanned, indicating that those patches really did make a huge difference
to CPU usage.

However, caching which pageblocks should be skipped has a much higher
impact. With patches 1-5 applied, free page scanning is reduced by 80%
in comparison to the vanilla kernel, and migrate scanning by roughly 88%.