Re: [Qemu-devel] [PATCH 0/6] Reduce compaction scanning and lock contention
Hi Mel,

Thank you for this series. I have applied on clean 3.6-rc5 and tested, and it
works well for me - the lock contention is (still) gone and
isolate_freepages_block is much reduced.

Here is a typical test with these patches:

# grep -F '[k]' report | head -8
    65.20%  qemu-kvm  [kernel.kallsyms]  [k] clear_page_c
     2.18%  qemu-kvm  [kernel.kallsyms]  [k] isolate_freepages_block
     1.56%  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock
     1.40%  qemu-kvm  [kernel.kallsyms]  [k] svm_vcpu_run
     1.38%  swapper   [kernel.kallsyms]  [k] default_idle
     1.35%  qemu-kvm  [kernel.kallsyms]  [k] get_page_from_freelist
     0.74%  ksmd      [kernel.kallsyms]  [k] memcmp
     0.72%  qemu-kvm  [kernel.kallsyms]  [k] free_pages_prepare

I did manage to get a couple which were slightly worse, but nothing like as
bad as before. Here are the results:

# grep -F '[k]' report | head -8
    45.60%  qemu-kvm  [kernel.kallsyms]  [k] clear_page_c
    11.26%  qemu-kvm  [kernel.kallsyms]  [k] isolate_freepages_block
     3.21%  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock
     2.27%  ksmd      [kernel.kallsyms]  [k] memcmp
     2.02%  swapper   [kernel.kallsyms]  [k] default_idle
     1.58%  qemu-kvm  [kernel.kallsyms]  [k] svm_vcpu_run
     1.30%  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     1.09%  qemu-kvm  [kernel.kallsyms]  [k] get_page_from_freelist

# grep -F '[k]' report | head -8
    61.29%  qemu-kvm  [kernel.kallsyms]  [k] clear_page_c
     4.52%  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     2.64%  qemu-kvm  [kernel.kallsyms]  [k] copy_page_c
     1.61%  swapper   [kernel.kallsyms]  [k] default_idle
     1.57%  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock
     1.18%  qemu-kvm  [kernel.kallsyms]  [k] get_page_from_freelist
     1.18%  qemu-kvm  [kernel.kallsyms]  [k] isolate_freepages_block
     1.11%  qemu-kvm  [kernel.kallsyms]  [k] svm_vcpu_run

I will follow up with the detailed traces for these three tests.

Thank you!

Richard.
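[Editor's note: for readers unfamiliar with the workflow above, a sketch of how
such a report is produced and filtered. The `perf record -g -a` cmdline is
taken from the report header later in the thread; the sample data below is
invented purely to show what the `grep -F '[k]'` filter matches — `[k]` is
perf's tag for kernel-space symbols, and -F keeps the brackets from being
parsed as a regex character class.]

```shell
# Recorded roughly as in the thread (requires root and the perf tool):
#   perf record -g -a              # system-wide profile with call graphs
#   perf report --stdio > report   # flat text report, one symbol per line

# Invented sample of what such a report contains:
cat > report <<'EOF'
    65.20%  qemu-kvm  [kernel.kallsyms]  [k] clear_page_c
     1.40%  qemu-kvm  qemu-kvm           [.] main_loop_wait
     1.38%  swapper   [kernel.kallsyms]  [k] default_idle
EOF

# Keep only kernel symbols; userspace ([.]) lines are dropped.
grep -F '[k]' report | head -8
```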
Re: [Qemu-devel] [PATCH 0/6] Reduce compaction scanning and lock contention
On Fri, Sep 21, 2012 at 10:13:33AM +0100, Richard Davies wrote:
> Hi Mel,
>
> Thank you for this series. I have applied on clean 3.6-rc5 and tested, and
> it works well for me - the lock contention is (still) gone and
> isolate_freepages_block is much reduced.

Excellent!

> Here is a typical test with these patches:
>
> # grep -F '[k]' report | head -8
>     65.20%  qemu-kvm  [kernel.kallsyms]  [k] clear_page_c
>      2.18%  qemu-kvm  [kernel.kallsyms]  [k] isolate_freepages_block
>      1.56%  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock
>      1.40%  qemu-kvm  [kernel.kallsyms]  [k] svm_vcpu_run
>      1.38%  swapper   [kernel.kallsyms]  [k] default_idle
>      1.35%  qemu-kvm  [kernel.kallsyms]  [k] get_page_from_freelist
>      0.74%  ksmd      [kernel.kallsyms]  [k] memcmp
>      0.72%  qemu-kvm  [kernel.kallsyms]  [k] free_pages_prepare

Ok, so that is more or less acceptable. I would like to reduce the scanning
even further but I'll take this as a start -- largely because I do not have
any new good ideas on how it could be reduced further without incurring a
large cost in the page allocator :)

> I did manage to get a couple which were slightly worse, but nothing like as
> bad as before. Here are the results:
>
> # grep -F '[k]' report | head -8
>     45.60%  qemu-kvm  [kernel.kallsyms]  [k] clear_page_c
>     11.26%  qemu-kvm  [kernel.kallsyms]  [k] isolate_freepages_block
>      3.21%  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock
>      2.27%  ksmd      [kernel.kallsyms]  [k] memcmp
>      2.02%  swapper   [kernel.kallsyms]  [k] default_idle
>      1.58%  qemu-kvm  [kernel.kallsyms]  [k] svm_vcpu_run
>      1.30%  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>      1.09%  qemu-kvm  [kernel.kallsyms]  [k] get_page_from_freelist
>
> # grep -F '[k]' report | head -8
>     61.29%  qemu-kvm  [kernel.kallsyms]  [k] clear_page_c
>      4.52%  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>      2.64%  qemu-kvm  [kernel.kallsyms]  [k] copy_page_c
>      1.61%  swapper   [kernel.kallsyms]  [k] default_idle
>      1.57%  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock
>      1.18%  qemu-kvm  [kernel.kallsyms]  [k] get_page_from_freelist
>      1.18%  qemu-kvm  [kernel.kallsyms]  [k] isolate_freepages_block
>      1.11%  qemu-kvm  [kernel.kallsyms]  [k] svm_vcpu_run

Were the boot times acceptable even when these slightly worse figures were
recorded?

> I will follow up with the detailed traces for these three tests.
>
> Thank you!

Thank you for the detailed reporting and the testing, it's much appreciated.
I've already rebased the patches to Andrew's tree and tested them overnight
and the figures look good on my side. I'll update the changelog and push them
shortly.

-- 
Mel Gorman
SUSE Labs
Re: [Qemu-devel] [PATCH 0/6] Reduce compaction scanning and lock contention
Mel Gorman wrote:
> Richard Davies wrote:
> > I did manage to get a couple which were slightly worse, but nothing like
> > as bad as before. Here are the results:
> >
> > # grep -F '[k]' report | head -8
> >     45.60%  qemu-kvm  [kernel.kallsyms]  [k] clear_page_c
> >     11.26%  qemu-kvm  [kernel.kallsyms]  [k] isolate_freepages_block
> >      3.21%  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock
> >      2.27%  ksmd      [kernel.kallsyms]  [k] memcmp
> >      2.02%  swapper   [kernel.kallsyms]  [k] default_idle
> >      1.58%  qemu-kvm  [kernel.kallsyms]  [k] svm_vcpu_run
> >      1.30%  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >      1.09%  qemu-kvm  [kernel.kallsyms]  [k] get_page_from_freelist
> >
> > # grep -F '[k]' report | head -8
> >     61.29%  qemu-kvm  [kernel.kallsyms]  [k] clear_page_c
> >      4.52%  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >      2.64%  qemu-kvm  [kernel.kallsyms]  [k] copy_page_c
> >      1.61%  swapper   [kernel.kallsyms]  [k] default_idle
> >      1.57%  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock
> >      1.18%  qemu-kvm  [kernel.kallsyms]  [k] get_page_from_freelist
> >      1.18%  qemu-kvm  [kernel.kallsyms]  [k] isolate_freepages_block
> >      1.11%  qemu-kvm  [kernel.kallsyms]  [k] svm_vcpu_run
>
> Were the boot times acceptable even when these slightly worse figures were
> recorded?

Yes, they were 10-20% slower as you might expect from the traces, rather than
a factor slower.

> Thank you for the detailed reporting and the testing, it's much appreciated.
> I've already rebased the patches to Andrew's tree and tested them overnight
> and the figures look good on my side. I'll update the changelog and push
> them shortly.

Great. On my side, I'm delighted that senior kernel developers such as you,
Rik and Avi took our bug report seriously and helped fix it!

Thank you,

Richard.
Re: [Qemu-devel] [PATCH 0/6] Reduce compaction scanning and lock contention
On Fri, Sep 21, 2012 at 10:17:01AM +0100, Richard Davies wrote:
> Richard Davies wrote:
> > I did manage to get a couple which were slightly worse, but nothing like
> > as bad as before. Here are the results:
> >
> > # grep -F '[k]' report | head -8
> >     45.60%  qemu-kvm  [kernel.kallsyms]  [k] clear_page_c
> >     11.26%  qemu-kvm  [kernel.kallsyms]  [k] isolate_freepages_block
> >      3.21%  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock
> >      2.27%  ksmd      [kernel.kallsyms]  [k] memcmp
> >      2.02%  swapper   [kernel.kallsyms]  [k] default_idle
> >      1.58%  qemu-kvm  [kernel.kallsyms]  [k] svm_vcpu_run
> >      1.30%  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
> >      1.09%  qemu-kvm  [kernel.kallsyms]  [k] get_page_from_freelist
>
> #
> # captured on: Fri Sep 21 08:17:52 2012
> # os release : 3.6.0-rc5-elastic+
> # perf version : 3.5.2
> # arch : x86_64
> # nrcpus online : 16
> # nrcpus avail : 16
> # cpudesc : AMD Opteron(tm) Processor 6128
> # cpuid : AuthenticAMD,16,9,1
> # total memory : 131973276 kB
> # cmdline : /home/root/bin/perf record -g -a
> # event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, id = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 }
> # HEADER_CPU_TOPOLOGY info available, use -I to display
> # HEADER_NUMA_TOPOLOGY info available, use -I to display
> #
> # Samples: 283K of event 'cycles'
> # Event count (approx.): 109057976176
> #
> # Overhead  Command      Shared Object  Symbol
> # ........  .......  .................  ......
> #
>     45.60%  qemu-kvm  [kernel.kallsyms]  [k] clear_page_c
>             |
>             --- clear_page_c
>                |
>                |--93.35%-- do_huge_pmd_anonymous_page

This is unavoidable. If THP was disabled, the cost would still be incurred,
just on base pages instead of huge pages.

> <SNIP>
>
>     11.26%  qemu-kvm  [kernel.kallsyms]  [k] isolate_freepages_block
>             |
>             --- isolate_freepages_block
>                 compaction_alloc
>                 migrate_pages
>                 compact_zone
>                 compact_zone_order
>                 try_to_compact_pages
>                 __alloc_pages_direct_compact
>                 __alloc_pages_nodemask
>                 alloc_pages_vma
>                 do_huge_pmd_anonymous_page

And this is showing that we're still spending a lot of time scanning for free
pages to isolate. I do not have a great idea on how this can be reduced
further without interfering with the page allocator. One idea I considered in
the past was using the buddy lists to find free pages quickly, but there is
first the problem that the buddy lists themselves may need to be searched,
and now that the zone lock is not held during the scan it would be
particularly difficult. The harder problem is deciding when compaction
finishes. I'll put more thought into it over the weekend and see if something
falls out but I'm not going to hold up this series waiting for inspiration.

>      3.21%  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock
>             |
>             --- _raw_spin_lock
>                |
>                |--39.96%-- tdp_page_fault

Nothing very interesting here until...

>                |--1.69%-- free_pcppages_bulk
>                |          |
>                |          |--77.53%-- drain_pages
>                |          |          |
>                |          |          |--95.77%-- drain_local_pages
>                |          |          |          |
>                |          |          |          |--97.90%-- generic_smp_call_function_interrupt
>                |          |          |          |          smp_call_function_interrupt
>                |          |          |          |          call_function_interrupt
>                |          |          |          |          |
>                |          |          |          |          |--23.37%-- kvm_vcpu_ioctl
>                |          |          |          |          |          do_vfs_ioctl
>                |          |          |          |          |          sys_ioctl
>                |          |          |          |          |          system_call_fastpath
>                |          |          |          |          |          ioctl
>                |          |          |          |          |          |
>                |          |          |          |          |          |--97.22%-- 0x1010006
>                |          |          |          |          |          |
[Qemu-devel] [PATCH 0/6] Reduce compaction scanning and lock contention
Hi Richard,

This series is following up from your mail at
http://www.spinics.net/lists/kvm/msg80080.html . I am pleased the lock
contention is now reduced but acknowledge that the scanning rates are
stupidly high. Fortunately, I am reasonably confident I know what is going
wrong. If all goes according to plan, this should drastically reduce the
amount of time your workload spends on compaction.

I would very much appreciate it if you dropped the MM patches (i.e. keep the
btrfs patches) and replaced them with this series.

I know that Rik's patches are dropped and this is deliberate. I reimplemented
his idea on top of the fifth patch in this series to cover both the migrate
and free scanners. Thanks to Rik, who discussed on IRC how the idea could be
reimplemented, which was very helpful. Hopefully the patch actually reflects
what we discussed :)

Shaohua, I would also appreciate it if you tested this series. I picked up
one of your patches but replaced another and want to make sure that the
workload you were investigating is still ok.

===

Richard Davies and Shaohua Li have both reported lock contention problems in
compaction on the zone and LRU locks, as well as significant amounts of time
being spent in compaction. It is critical that performance gains from THP are
not offset by the cost of allocating them in the first place. This series
aims to reduce lock contention and scanning rates.

Patch 1 is a fix for c67fe375 (mm: compaction: Abort async compaction if
locks are contended or taking too long) to properly abort in all cases when
contention is detected.

Patch 2 defers acquiring the zone->lru_lock for as long as possible.

Patch 3 defers acquiring the zone->lock for as long as possible.

Patch 4 reverts Rik's skip-free patches as the core concept gets
reimplemented later and the remaining patches are easier to understand if
this is reverted first.

Patch 5 adds a pageblock-skip bit to the pageblock flags to cache what
pageblocks should be skipped by the migrate and free scanners. This
drastically reduces the amount of scanning compaction has to do.

Patch 6 reimplements something similar to Rik's idea, except it uses the
pageblock-skip information to decide where the scanners should restart from
and does not need to wrap around.

I tested this on 3.6-rc5 as that was the kernel base that the earlier threads
worked on. It will need a bit of work to rebase on top of Andrew's tree for
merging due to other compaction changes, but that will not be a major problem.

Kernels tested were

vanilla      3.6-rc5
lesslock     Patches 1-3
revert       Patches 1-4
cachefail    Patches 1-5
skipuseless  Patches 1-6

Stress high-order allocation tests looked ok.

STRESS-HIGHALLOC
                     3.6.0       3.6.0-rc5     3.6.0-rc5     3.6.0-rc5     3.6.0-rc5
               rc5-vanilla        lesslock        revert     cachefail   skipuseless
Pass 1        17.00 ( 0.00%)  19.00 ( 2.00%)  29.00 (12.00%)  24.00 ( 7.00%)  20.00 ( 3.00%)
Pass 2        16.00 ( 0.00%)  19.00 ( 3.00%)  39.00 (23.00%)  37.00 (21.00%)  35.00 (19.00%)
while Rested  88.00 ( 0.00%)  88.00 ( 0.00%)  88.00 ( 0.00%)  85.00 (-3.00%)  86.00 (-2.00%)

Success rates are improved a bit by the series as there are fewer
opportunities to race with other allocation requests if compaction is
scanning less. I recognise the success rates are still low but patches that
tackle parts of that are in Andrew's tree already.

The time to complete the tests did not vary much, and the vmstat statistics
were similarly uninteresting, so I will not present them here.
Using ftrace I recorded how much scanning was done by compaction and got this

                                3.6.0    3.6.0-rc5   3.6.0-rc5   3.6.0-rc5   3.6.0-rc5
                          rc5-vanilla     lesslock      revert   cachefail  skipuseless
Total free    scanned       185020625    223313210   744553485    37149462    29231432
Total free    isolated         845094      1174759     4301672      906689      721963
Total free    efficiency      0.0046%      0.0053%     0.0058%     0.0244%     0.0247%
Total migrate scanned       187708506    143133150   428180990    21941574    12288851
Total migrate isolated         714376      1081134     3950098      711357      590552
Total migrate efficiency     0.0038%      0.0076%     0.0092%     0.0324%     0.0481%

The efficiency is worthless because of the nature of the test and the number
of failures. The really interesting point as far as this patch series is
concerned is the number of pages scanned. Note that reverting Rik's patches
massively increases the number of pages scanned, indicating that those
patches really did make a huge difference to CPU usage.

However, caching what pageblocks should be skipped has a much higher impact.
With patches 1-5 applied, free page scanning is reduced by 80% in comparison
to the vanilla kernel and