Performance counter - vPMU
Hi, I saw a presentation on virtualizing performance counters at http://www.linux-kvm.org/wiki/images/6/6d/Kvm-forum-2011-performance-monitoring.pdf. Has the code been merged? Can I get something to play with/provide feedback? Balbir Singh -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Performance counter - vPMU
On Mon, Nov 14, 2011 at 5:48 PM, Gleb Natapov g...@redhat.com wrote: Not yet merged. You can take it from here https://lkml.org/lkml/2011/11/10/215 Thank you very much, Gleb! Balbir Singh.
Re: [PATCH 0/3] Unmapped page cache control (v5)
* KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com [2011-04-01 16:56:57]: Hi 1) zone reclaim doesn't work if the system has multiple nodes and the workload is file-cache oriented (e.g. file server, web server, mail server, et al), because zone reclaim frees many more pages than zone->pages_min; new page cache requests then consume the nearest node's memory, which triggers the next zone reclaim. Memory utilization is reduced and unnecessary LRU discard increases dramatically. SGI folks added a CPUSET-specific solution in the past (cpuset.memory_spread_page), but global reclaim still has this issue. Zone reclaim is an HPC-workload-specific feature and HPC folks have no motivation to avoid CPUSET. I am afraid you misread the patches and the intent. The intent is to explicitly enable control of unmapped pages; it has nothing specifically to do with multiple nodes at this point. The control is system wide and deliberately enabled by the administrator. Hm. OK, I may have misread. Can you please explain why the de-duplication feature needs to be selectable and disabled by default? Does requiring explicit enablement mean this feature targets a corner-case issue? Yes, because given a selection of choices (including what you mentioned in the review), it would be nice to have this selectable. -- Three Cheers, Balbir
Re: [PATCH 0/3] Unmapped page cache control (v5)
* Andrew Morton a...@linux-foundation.org [2011-03-30 22:32:31]: On Thu, 31 Mar 2011 10:57:03 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: * Andrew Morton a...@linux-foundation.org [2011-03-30 16:36:07]: On Wed, 30 Mar 2011 11:00:26 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: Data from the previous patchsets can be found at https://lkml.org/lkml/2010/11/30/79 It would be nice if the data for the current patchset was present in the current patchset's changelog! Sure, since there were no major changes, I put in a URL. The main change was the documentation update. Well some poor schmuck has to copy and paste the data into the changelog so it's still there in five years time. It's better to carry this info around in the patch's own metadata, and to maintain and update it. Agreed, will do. -- Three Cheers, Balbir
Re: [PATCH 0/3] Unmapped page cache control (v5)
* KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com [2011-04-01 22:21:26]: * KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com [2011-04-01 16:56:57]: Hi 1) zone reclaim doesn't work if the system has multiple nodes and the workload is file-cache oriented (e.g. file server, web server, mail server, et al), because zone reclaim frees many more pages than zone->pages_min; new page cache requests then consume the nearest node's memory, which triggers the next zone reclaim. Memory utilization is reduced and unnecessary LRU discard increases dramatically. SGI folks added a CPUSET-specific solution in the past (cpuset.memory_spread_page), but global reclaim still has this issue. Zone reclaim is an HPC-workload-specific feature and HPC folks have no motivation to avoid CPUSET. I am afraid you misread the patches and the intent. The intent is to explicitly enable control of unmapped pages; it has nothing specifically to do with multiple nodes at this point. The control is system wide and deliberately enabled by the administrator. Hm. OK, I may have misread. Can you please explain why the de-duplication feature needs to be selectable and disabled by default? Does requiring explicit enablement mean this feature targets a corner-case issue? Yes, because given a selection of choices (including what you mentioned in the review), it would be nice to have this selectable. That's not a good answer. :-/ I am afraid I cannot please you with my answers. Who needs the feature and who shouldn't use it? Is it valuable enough for a large enough set of people? That's my question. You can see the use cases documented, including when running Linux as a guest under other hypervisors; today we have a choice of not using the host page cache with cache=none, but nothing the other way round. There are other use cases for embedded folks (in terms of controlling unmapped page cache), please see previous discussions.
-- Three Cheers, Balbir
Re: [PATCH 0/3] Unmapped page cache control (v5)
* KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com [2011-03-31 14:40:33]: The following series implements page cache control, this is a split out version of patch 1 of version 3 of the page cache optimization patches posted earlier at Previous posting http://lwn.net/Articles/425851/ and analysis at http://lwn.net/Articles/419713/ Detailed Description This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario - In a virtualized environment with cache=writethrough, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. - The option is controlled via a boot option and the administrator can selectively turn it on, on a need-to-use basis. A lot of the code is borrowed from zone_reclaim_mode logic for __zone_reclaim(). One might argue that with ballooning and KSM this feature is not very useful, but even with ballooning, we need extra logic to balloon multiple VM machines and it is hard to figure out the correct amount of memory to balloon. With these patches applied, each guest has a sufficient amount of free memory available, that can be easily seen and reclaimed by the balloon driver. The additional memory in the guest can be reused for additional applications or used to start additional guests/balance memory in the host. If anyone thinks this series works, they are just crazy. This patch reintroduces two old issues. 1) zone reclaim doesn't work if the system has multiple nodes and the workload is file-cache oriented (eg file server, web server, mail server, et al),
because zone reclaim frees many more pages than zone->pages_min; new page cache requests then consume the nearest node's memory, which triggers the next zone reclaim. Memory utilization is reduced and unnecessary LRU discard increases dramatically. SGI folks added a CPUSET-specific solution in the past (cpuset.memory_spread_page), but global reclaim still has this issue. Zone reclaim is an HPC-workload-specific feature and HPC folks have no motivation to avoid CPUSET. I am afraid you misread the patches and the intent. The intent is to explicitly enable control of unmapped pages; it has nothing specifically to do with multiple nodes at this point. The control is system wide and carefully enabled by the administrator. 2) Before 2.6.27, the VM had only one LRU and calc_reclaim_mapped() was used to decide whether to filter out mapped pages. It caused a lot of problems for DB servers and large application servers. Because, if the system had a lot of mapped pages, 1) the LRU was churned and the reclaim algorithm degraded, and 2) reclaim latency became terribly slow; hangup detectors would misdetect the state and start to force a reboot. That was a big problem for RHEL5-based banking systems. So, sc->may_unmap should be killed in the future. Don't increase its uses. Can you remove sc->may_unmap without removing zone_reclaim()? The LRU churn can be addressed at the time of isolation, I'll send out an incremental patch for that. But I agree that we now perhaps have to consider a slightly larger VM change (or perhaps not). OK, it's a good opportunity to spell some things out. Historically, Linux MM has had a "free memory is wasted memory" policy, and it worked completely fine. But now we have a few exceptions. 1) RT, embedded and finance systems. They really hope to avoid reclaim latency (ie avoid foreground reclaim completely) and they can accept keeping slightly more free pages around before memory shortage. 2) VM guests. VM host and VM guest naturally make a two-level page cache model,
and the Linux page cache plus two levels don't work well together. It has two issues: 1) it is hard to visualize real memory consumption, which makes it harder to make ballooning work well, and Google wants to visualize memory utilization to pack in more jobs; 2) it is hard to build an in-kernel memory utilization improvement mechanism. And now we have four proposals for utilization-related issues. 1) cleancache (from Oracle) Cleancache requires both hypervisor and guest support. With these patches, Linux can run well under a hypervisor if we know the hypervisor does a lot of the IO and maintains the cache. 2) VirtFS (from IBM) 3) kstaled (from Google) 4) unmapped page reclaim (from you) Probably we can't merge all of them and we need to consolidate the requirements and implementations. cleancache seems most straightforward
Re: [PATCH 0/3] Unmapped page cache control (v5)
* Dave Chinner da...@fromorbit.com [2011-04-01 08:40:33]: On Wed, Mar 30, 2011 at 11:00:26AM +0530, Balbir Singh wrote: The following series implements page cache control, this is a split out version of patch 1 of version 3 of the page cache optimization patches posted earlier at Previous posting http://lwn.net/Articles/425851/ and analysis at http://lwn.net/Articles/419713/ Detailed Description This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario - In a virtualized environment with cache=writethrough, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. What does this do that cache=none for the VMs and using the page cache inside the guest doesn't achieve? That avoids double caching and doesn't require any new complexity inside the host OS to achieve... There was a long discussion on cache=none in the first posting and the downsides/impact on throughput. Please see http://www.mail-archive.com/kvm@vger.kernel.org/msg30655.html -- Three Cheers, Balbir
Re: [PATCH 0/3] Unmapped page cache control (v5)
* Andrew Morton a...@linux-foundation.org [2011-03-30 16:36:07]: On Wed, 30 Mar 2011 11:00:26 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: Data from the previous patchsets can be found at https://lkml.org/lkml/2010/11/30/79 It would be nice if the data for the current patchset was present in the current patchset's changelog! Sure, since there were no major changes, I put in a URL. The main change was the documentation update. -- Three Cheers, Balbir
Re: [PATCH 3/3] Provide control over unmapped pages (v5)
* Andrew Morton a...@linux-foundation.org [2011-03-30 16:35:45]: On Wed, 30 Mar 2011 11:02:38 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: Changelog v4 1. Added documentation for max_unmapped_pages 2. Better #ifdef'ing of max_unmapped_pages and min_unmapped_pages Changelog v2 1. Use a config option to enable the code (Andrew Morton) 2. Explain the magic tunables in the code or at least attempt to explain them (General comment) 3. Hint uses of the boot parameter with unlikely (Andrew Morton) 4. Use better names (balanced is not a good naming convention) Provide control using zone_reclaim() and a boot parameter. The code reuses functionality from zone_reclaim() to isolate unmapped pages and reclaim them as a priority, ahead of other mapped pages. This:

akpm:/usr/src/25> grep '^+#' patches/provide-control-over-unmapped-pages-v5.patch
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#else
+#endif
+#ifdef CONFIG_NUMA
+#else
+#define zone_reclaim_mode 0
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#else /* !CONFIG_UNMAPPED_PAGECACHE_CONTROL */
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#endif
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)

is getting out of control.
What happens if we just make the feature non-configurable? I added the configuration based on review comments I received. If the feature is made non-configurable, it should be easy to remove the #ifdefs or just set the default value to y in the config.

+static int __init unmapped_page_control_parm(char *str)
+{
+	unmapped_page_control = 1;
+	/*
+	 * XXX: Should we tweak swappiness here?
+	 */
+	return 1;
+}
+__setup("unmapped_page_control", unmapped_page_control_parm);

That looks like a pain - it requires a reboot to change the option, which makes testing harder and slower. Methinks you're being a bit virtualization-centric here! :-) The reason for the boot parameter is to ensure that people know what they are doing.

+#else /* !CONFIG_UNMAPPED_PAGECACHE_CONTROL */
+static inline void reclaim_unmapped_pages(int priority,
+				struct zone *zone, struct scan_control *sc)
+{
+	return 0;
+}
+#endif
+
 static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
					struct scan_control *sc)
 {
@@ -2371,6 +2394,12 @@ loop_again:
			shrink_active_list(SWAP_CLUSTER_MAX, zone,
						&sc, priority, 0);

+			/*
+			 * We do unmapped page reclaim once here and once
+			 * below, so that we don't lose out
+			 */
+			reclaim_unmapped_pages(priority, zone, &sc);

Doing this here seems wrong. balance_pgdat() does two passes across the zones. The first pass is a read-only work-out-what-to-do pass and the second pass is a now-reclaim-some-stuff pass. But here we've stuck a do-some-reclaiming operation inside the first, work-out-what-to-do pass. The reason is primarily for balancing; zone watermarks do not give us a good idea of whether unmapped pages are balanced, hence the code.

@@ -2408,6 +2437,11 @@ loop_again:
				continue;

			sc.nr_scanned = 0;
+			/*
+			 * Reclaim unmapped pages upfront, this should be
+			 * really cheap

Comment is mysterious. Why is it cheap? Cheap because we do a quick check to see if unmapped pages exceed a threshold. If selective users enable this functionality (which is expected), the use case is primarily for embedded and virtualization folks; this should be a simple check.

+			 */
+			reclaim_unmapped_pages(priority, zone, &sc);

I dunno, the whole thing seems rather nasty to me. It sticks a magical reclaim-unmapped-pages operation right in the middle of regular page reclaim. This means that reclaim will walk the LRU looking at mapped and unmapped pages. Then it will walk some more, looking at only unmapped pages
[PATCH 0/3] Unmapped page cache control (v5)
The following series implements page cache control, this is a split out version of patch 1 of version 3 of the page cache optimization patches posted earlier at Previous posting http://lwn.net/Articles/425851/ and analysis at http://lwn.net/Articles/419713/ Detailed Description This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario - In a virtualized environment with cache=writethrough, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. - The option is controlled via a boot option and the administrator can selectively turn it on, on a need-to-use basis. A lot of the code is borrowed from zone_reclaim_mode logic for __zone_reclaim(). One might argue that with ballooning and KSM this feature is not very useful, but even with ballooning, we need extra logic to balloon multiple VM machines and it is hard to figure out the correct amount of memory to balloon. With these patches applied, each guest has a sufficient amount of free memory available, that can be easily seen and reclaimed by the balloon driver. The additional memory in the guest can be reused for additional applications or used to start additional guests/balance memory in the host. KSM currently does not de-duplicate host and guest page cache. The goal of this patch is to help automatically balance unmapped page cache when instructed to do so.
The sysctl for min_unmapped_ratio provides further control from within the guest on the amount of unmapped pages to reclaim; a similar max_unmapped_ratio sysctl is added and helps in the decision-making process of when reclaim should occur. This is tunable and set by default to 16 (based on tradeoffs seen between aggressiveness in balancing versus the size of unmapped pages). Distros and administrators can further tweak this for the desired control. Data from the previous patchsets can be found at https://lkml.org/lkml/2010/11/30/79

---

Balbir Singh (3):
 Move zone_reclaim() outside of CONFIG_NUMA
 Refactor zone_reclaim code
 Provide control over unmapped pages

 Documentation/kernel-parameters.txt |    8 ++
 Documentation/sysctl/vm.txt         |   19 +
 include/linux/mmzone.h              |   11 +++
 include/linux/swap.h                |   25 ++-
 init/Kconfig                        |   12 +++
 kernel/sysctl.c                     |   29 ++--
 mm/page_alloc.c                     |   35 +-
 mm/vmscan.c                         |  123 +++
 8 files changed, 229 insertions(+), 33 deletions(-)

-- Three Cheers, Balbir
[PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v5)
This patch moves zone_reclaim() and associated helpers outside CONFIG_NUMA. This infrastructure is reused in the patches for page cache control that follow.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
Reviewed-by: Christoph Lameter c...@linux.com
---
 include/linux/mmzone.h |    4 ++--
 include/linux/swap.h   |    4 ++--
 kernel/sysctl.c        |   16 ++++++++--------
 mm/page_alloc.c        |    6 +++---
 mm/vmscan.c            |    2 --
 5 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 628f07b..59cbed0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -306,12 +306,12 @@ struct zone {
 	 */
 	unsigned long		lowmem_reserve[MAX_NR_ZONES];

-#ifdef CONFIG_NUMA
-	int node;
 	/*
 	 * zone reclaim becomes active if more unmapped pages exist.
 	 */
 	unsigned long		min_unmapped_pages;
+#ifdef CONFIG_NUMA
+	int node;
 	unsigned long		min_slab_pages;
 #endif
 	struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ed6ebe6..ce8f686 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -264,11 +264,11 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;

+extern int sysctl_min_unmapped_ratio;
+extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
 #define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 927fc5a..e3a8ce4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1214,14 +1214,6 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= proc_dointvec_unsigned,
 	},
 #endif
-#ifdef CONFIG_NUMA
-	{
-		.procname	= "zone_reclaim_mode",
-		.data		= &zone_reclaim_mode,
-		.maxlen		= sizeof(zone_reclaim_mode),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_unsigned,
-	},
 	{
 		.procname	= "min_unmapped_ratio",
 		.data		= &sysctl_min_unmapped_ratio,
@@ -1231,6 +1223,14 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
 	},
+#ifdef CONFIG_NUMA
+	{
+		.procname	= "zone_reclaim_mode",
+		.data		= &zone_reclaim_mode,
+		.maxlen		= sizeof(zone_reclaim_mode),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_unsigned,
+	},
 	{
 		.procname	= "min_slab_ratio",
 		.data		= &sysctl_min_slab_ratio,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e1b52a..1d32865 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4249,10 +4249,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone->spanned_pages = size;
 		zone->present_pages = realsize;
-#ifdef CONFIG_NUMA
-		zone->node = nid;
 		zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
						/ 100;
+#ifdef CONFIG_NUMA
+		zone->node = nid;
 		zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
 #endif
 		zone->name = zone_names[j];
@@ -5157,7 +5157,6 @@ int min_free_kbytes_sysctl_handler(ctl_table *table, int write,
 	return 0;
 }

-#ifdef CONFIG_NUMA
 int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
	void __user *buffer, size_t *length, loff_t *ppos)
 {
@@ -5174,6 +5173,7 @@ int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
 	return 0;
 }

+#ifdef CONFIG_NUMA
 int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,
	void __user *buffer, size_t *length, loff_t *ppos)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 060e4c1..4923160 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2874,7 +2874,6 @@ static int __init kswapd_init(void)

 module_init(kswapd_init)

-#ifdef CONFIG_NUMA
 /*
  * Zone reclaim mode
  *
@@ -3084,7 +3083,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	return ret;
 }
-#endif

 /*
  * page_evictable - test whether a page is evictable
[PATCH 2/3] Refactor zone_reclaim code (v5)
Changelog v3 1. Renamed zone_reclaim_unmapped_pages to zone_reclaim_pages

Refactor zone_reclaim(), move reusable functionality outside of zone_reclaim(). Make zone_reclaim_unmapped_pages modular.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
Reviewed-by: Christoph Lameter c...@linux.com
---
 mm/vmscan.c |   35 +++++++++++++++++++++++------------
 1 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4923160..5b24e74 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2949,6 +2949,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
 }

 /*
+ * Helper function to reclaim unmapped pages, we might add something
+ * similar to this for slab cache as well. Currently this function
+ * is shared with __zone_reclaim()
+ */
+static inline void
+zone_reclaim_pages(struct zone *zone, struct scan_control *sc,
+			unsigned long nr_pages)
+{
+	int priority;
+	/*
+	 * Free memory by calling shrink zone with increasing
+	 * priorities until we have enough memory freed.
+	 */
+	priority = ZONE_RECLAIM_PRIORITY;
+	do {
+		shrink_zone(priority, zone, sc);
+		priority--;
+	} while (priority >= 0 && sc->nr_reclaimed < nr_pages);
+}
+
+/*
  * Try to free up some pages from this zone through reclaim.
  */
 static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
@@ -2957,7 +2978,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	struct reclaim_state reclaim_state;
-	int priority;
 	struct scan_control sc = {
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2981,17 +3001,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;

-	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
-		/*
-		 * Free memory by calling shrink zone with increasing
-		 * priorities until we have enough memory freed.
-		 */
-		priority = ZONE_RECLAIM_PRIORITY;
-		do {
-			shrink_zone(priority, zone, &sc);
-			priority--;
-		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
-	}
+	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages)
+		zone_reclaim_pages(zone, &sc, nr_pages);

 	nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
 	if (nr_slab_pages0 > zone->min_slab_pages) {
[PATCH 3/3] Provide control over unmapped pages (v5)
Changelog v4 1. Added documentation for max_unmapped_pages 2. Better #ifdef'ing of max_unmapped_pages and min_unmapped_pages Changelog v2 1. Use a config option to enable the code (Andrew Morton) 2. Explain the magic tunables in the code or at least attempt to explain them (General comment) 3. Hint uses of the boot parameter with unlikely (Andrew Morton) 4. Use better names (balanced is not a good naming convention)

Provide control using zone_reclaim() and a boot parameter. The code reuses functionality from zone_reclaim() to isolate unmapped pages and reclaim them as a priority, ahead of other mapped pages.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
Reviewed-by: Christoph Lameter c...@linux.com
---
 Documentation/kernel-parameters.txt |    8 +++
 Documentation/sysctl/vm.txt         |   19 +++-
 include/linux/mmzone.h              |    7 +++
 include/linux/swap.h                |   25 --
 init/Kconfig                        |   12 +
 kernel/sysctl.c                     |   13 +
 mm/page_alloc.c                     |   29
 mm/vmscan.c                         |   88 +++
 8 files changed, 194 insertions(+), 7 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index d4e67a5..f522c34 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2520,6 +2520,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
			[X86] Set unknown_nmi_panic=1 early on boot.

+	unmapped_page_control
+			[KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
+			is enabled. It controls the amount of unmapped memory
+			that is present in the system. This boot option plus
+			vm.min_unmapped_ratio (sysctl) provide granular control
+			over how much unmapped page cache can exist in the system
+			before kswapd starts reclaiming unmapped page cache pages.
+
 	usbcore.autosuspend=
			[USB] The autosuspend time delay (in seconds) used
			for newly-detected USB devices (default 2). This
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 30289fa..1c722f7 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -381,11 +381,14 @@ and may not be fast.

 min_unmapped_ratio:

-This is available only on NUMA kernels.
+This is available only on NUMA kernels or when unmapped page cache
+control is enabled.

 This is a percentage of the total pages in each zone. Zone reclaim will
 only occur if more than this percentage of pages are in a state that
-zone_reclaim_mode allows to be reclaimed.
+zone_reclaim_mode allows to be reclaimed. If unmapped page cache control
+is enabled, this is the minimum level to which the cache will be shrunk
+down to.

 If zone_reclaim_mode has the value 4 OR'd, then the percentage is
 compared against all file-backed unmapped pages including swapcache pages
 and tmpfs
@@ -396,6 +399,18 @@ The default is 1 percent.

 ==============================================================

+max_unmapped_ratio:
+
+This is available only when unmapped page cache control is enabled.
+
+This is a percentage of the total pages in each zone. Zone reclaim will
+only occur if more than this percentage of pages are in a state and
+unmapped page cache control is enabled.
+
+The default is 16 percent.
+
+==============================================================
+
 mmap_min_addr

 This file indicates the amount of address space which a user process will
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 59cbed0..caa29ad 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -309,7 +309,12 @@ struct zone {
 	/*
 	 * zone reclaim becomes active if more unmapped pages exist.
 	 */
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
 	unsigned long		min_unmapped_pages;
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+	unsigned long		max_unmapped_pages;
+#endif
 #ifdef CONFIG_NUMA
 	int node;
 	unsigned long		min_slab_pages;
@@ -776,6 +781,8 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int,
			void __user *, size_t *, loff_t *);
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
			void __user *, size_t *, loff_t *);
+int sysctl_max_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
+			void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
			void __user *, size_t *, loff_t *);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ce8f686..86cafc5 100644
Re: [PATCH 3/3] Provide control over unmapped pages (v4)
* MinChan Kim minchan@gmail.com [2011-02-10 14:41:44]: I don't know why part of the message gets deleted only when I send it to you. Maybe it's a gmail bug. I hope the mail sends successfully this time. :) On Thu, Feb 10, 2011 at 2:33 PM, Minchan Kim minchan@gmail.com wrote: Sorry for the late response. On Fri, Jan 28, 2011 at 8:18 PM, Balbir Singh bal...@linux.vnet.ibm.com wrote: * MinChan Kim minchan@gmail.com [2011-01-28 16:24:19]: But the assumption for LRU order to change happens only if the page cannot be successfully freed, which means it is in some way active.. and needs to be moved, no? 1. a page held by someone 2. mapped pages 3. active pages 1 is rare so it isn't the problem. Of course, in case of 3, we have to activate it, so no problem. The problem is 2. 2 is a problem, but due to the size aspects not a big one. Like you said, even lumpy reclaim affects it. Maybe the reclaim code could honour may_unmap much earlier. Even if it is, it's a trade-off to get a big contiguous chunk of memory. I don't want to add a new mess. (In addition, lumpy reclaim is being superseded by compaction as time goes by.) What I have in mind for preventing the LRU order from being ignored is to put the page back into its original position instead of at the head of the LRU. Maybe it can help both the lumpy situation and your case. But that's another story. How about this idea? I borrow it from CFLRU[1] - PCFLRU (Page-Cache First LRU). When we allocate a new page for the page cache, we add the page at the LRU's tail. When we map the page cache page into a page table, we rotate the page to the LRU's head. So the inactive list ends up as follows. M.P: mapped page, N.P: non-mapped page HEAD->M.P->M.P->M.P->M.P->N.P->N.P->N.P->N.P->N.P->TAIL The admin can set a threshold window size which determines when to stop reclaiming non-mapped pages contiguously. I think it needs some tweaking of the page cache/page mapping functions, but we can use kswapd/direct reclaim without change. Also, it can change the page reclaim policy totally, but that's just what you want, I think.
I am not sure how this would work; moreover, the idea behind min_unmapped_pages is to keep sufficient unmapped pages around for the FS metadata, and it has been working with the existing code for zone reclaim. What you propose is a more drastic re-org of the LRU and I am not sure I have the appetite for it. -- Three Cheers, Balbir -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
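Minchan's PCFLRU proposal above (add new page-cache pages at the LRU tail, rotate a page to the head once it gets mapped, so that tail-first reclaim sees unmapped cache first) can be sketched as a small standalone list. This is a hypothetical illustration with invented names, not the posted patch:

```c
/* Sketch of the PCFLRU idea: unmapped page cache enters at the tail
 * (the reclaim end); mapping a page promotes it to the head. */
#include <assert.h>
#include <stddef.h>

struct lru_page {
	struct lru_page *prev, *next;
	int mapped;
};

struct lru_list { struct lru_page head; };	/* sentinel node */

static void lru_init(struct lru_list *l)
{
	l->head.prev = l->head.next = &l->head;
}

static void lru_del(struct lru_page *p)
{
	p->prev->next = p->next;
	p->next->prev = p->prev;
}

static void lru_add_head(struct lru_list *l, struct lru_page *p)
{
	p->next = l->head.next;
	p->prev = &l->head;
	l->head.next->prev = p;
	l->head.next = p;
}

static void lru_add_tail(struct lru_list *l, struct lru_page *p)
{
	p->prev = l->head.prev;
	p->next = &l->head;
	l->head.prev->next = p;
	l->head.prev = p;
}

/* add_to_page_cache(): new unmapped cache starts at the reclaim end */
static void pcflru_add_cache(struct lru_list *l, struct lru_page *p)
{
	p->mapped = 0;
	lru_add_tail(l, p);
}

/* page fault maps the page: promote it away from the reclaim end */
static void pcflru_map(struct lru_list *l, struct lru_page *p)
{
	p->mapped = 1;
	lru_del(p);
	lru_add_head(l, p);
}

/* next reclaim candidate: the page at the tail */
static struct lru_page *pcflru_tail(struct lru_list *l)
{
	return l->head.prev == &l->head ? NULL : l->head.prev;
}
```

With this ordering, reclaim scanning from the tail naturally frees none-mapped pages before mapped ones, which is the behavior the thread is debating.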
Re: [PATCH 3/3][RESEND] Provide control over unmapped pages (v4)
On 02/09/2011 05:27 AM, Andrew Morton wrote: On Tue, 01 Feb 2011 22:25:45 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: Changelog v4 1. Add max_unmapped_ratio and use that as the upper limit to check when to shrink the unmapped page cache (Christoph Lameter) Changelog v2 1. Use a config option to enable the code (Andrew Morton) 2. Explain the magic tunables in the code or at least attempt to explain them (General comment) 3. Hint uses of the boot parameter with unlikely (Andrew Morton) 4. Use better names (balanced is not a good naming convention) Provide control using zone_reclaim() and a boot parameter. The code reuses functionality from zone_reclaim() to isolate unmapped pages and reclaim them as a priority, ahead of other mapped pages. A new sysctl for max_unmapped_ratio is provided and set to 16, indicating that once 16% of the total zone pages are unmapped, we start shrinking unmapped page cache. We'll need some documentation for sysctl_max_unmapped_ratio, please. In Documentation/sysctl/vm.txt, I suppose. It will be interesting to find out what this ratio refers to. It appears to be a percentage. We've had problems in the past where 1% was way too much and we had to change the kernel to provide much finer-grained control. Sure, I'll update the Documentation as a part of this patchset. Yes, the current min_unmapped_ratio is a percentage and so is max_unmapped_ratio. min_unmapped_ratio already exists, adding max_ should not affect granularity of control. It will be worth relooking at the granularity based on user feedback and experience. We won't break ABI if we add additional interfaces to help granularity. ... --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -306,7 +306,10 @@ struct zone { /* * zone reclaim becomes active if more unmapped pages exist.
*/ +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA) unsigned long min_unmapped_pages; +unsigned long max_unmapped_pages; +#endif This change breaks the connection between min_unmapped_pages and its documentation, and fails to document max_unmapped_pages. I'll fix that. Also, afaict if CONFIG_NUMA=y and CONFIG_UNMAPPED_PAGE_CONTROL=n, max_unmapped_pages will be present in the kernel image and will appear in /proc but it won't actually do anything. Seems screwed up and misleading. Good catch! In one of the emails Christoph mentioned that max_unmapped_ratio might be helpful even in the general case (but we need to work on that later). For now, I'll fix this and repost. ... +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) +/* + * Routine to reclaim unmapped pages, inspired from the code under + * CONFIG_NUMA that does unmapped page and slab page control by keeping + * min_unmapped_pages in the zone. We currently reclaim just unmapped + * pages, slab control will come in soon, at which point this routine + * should be called reclaim cached pages + */ +unsigned long reclaim_unmapped_pages(int priority, struct zone *zone, +struct scan_control *sc) +{ +if (unlikely(unmapped_page_control) && +(zone_unmapped_file_pages(zone) > zone->min_unmapped_pages)) { +struct scan_control nsc; +unsigned long nr_pages; + +nsc = *sc; + +nsc.swappiness = 0; +nsc.may_writepage = 0; +nsc.may_unmap = 0; +nsc.nr_reclaimed = 0; + +nr_pages = zone_unmapped_file_pages(zone) - +zone->min_unmapped_pages; +/* + * We don't want to be too aggressive with our + * reclaim, it is our best effort to control + * unmapped pages + */ +nr_pages >>= 3; + +zone_reclaim_pages(zone, &nsc, nr_pages); +return nsc.nr_reclaimed; +} +return 0; +} This returns an undocumented ulong which is never used by callers. Good catch! I'll remove the return value, I don't expect it to be used to check how much we could reclaim. Thanks for the review!
-- Three Cheers, Balbir
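The reclaim-target computation in reclaim_unmapped_pages() quoted above can be exercised standalone. A sketch, with a helper name invented for illustration; the key point is that the patch reclaims only an eighth of the excess per invocation (nr_pages >>= 3) so the control stays best-effort rather than aggressive:

```c
/* Compute how many unmapped page-cache pages one invocation of the
 * reclaim routine targets: the excess over min_unmapped_pages,
 * scaled down by 8 to keep the control best-effort. */
#include <assert.h>

static unsigned long unmapped_reclaim_target(unsigned long unmapped_file_pages,
					     unsigned long min_unmapped_pages)
{
	unsigned long nr_pages;

	if (unmapped_file_pages <= min_unmapped_pages)
		return 0;

	nr_pages = unmapped_file_pages - min_unmapped_pages;
	/* go after only an eighth of the excess per pass */
	nr_pages >>= 3;
	return nr_pages;
}
```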
[PATCH 0/3][RESEND] Provide unmapped page cache control (v4)
NOTE: Resending the series with the Reviewed-by tags updated The following series implements page cache control; this is a split-out version of patch 1 of version 3 of the page cache optimization patches posted earlier at Previous posting http://lwn.net/Articles/419564/ The previous few revisions received a lot of comments, and I've tried to address as many of those as possible in this revision. Detailed Description This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario - In a virtualized environment with cache=writethrough, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. - The option is controlled via a boot option and the administrator can selectively turn it on, on a need-to-use basis. A lot of the code is borrowed from the zone_reclaim_mode logic for __zone_reclaim(). One might argue that with ballooning and KSM this feature is not very useful, but even with ballooning, we need extra logic to balloon multiple VMs and it is hard to figure out the correct amount of memory to balloon. With these patches applied, each guest has a sufficient amount of free memory available that can be easily seen and reclaimed by the balloon driver. The additional memory in the guest can be reused for additional applications or used to start additional guests/balance memory in the host. KSM currently does not de-duplicate host and guest page cache. The goal of this patch is to help automatically balance unmapped page cache when instructed to do so.
The sysctl for min_unmapped_ratio provides further control from within the guest on the amount of unmapped pages to reclaim; a similar max_unmapped_ratio sysctl is added and helps in the decision making process of when reclaim should occur. This is tunable and set by default to 16 (based on tradeoffs seen between aggressiveness in balancing versus size of unmapped pages). Distros and administrators can further tweak this for desired control. Data from the previous patchsets can be found at https://lkml.org/lkml/2010/11/30/79 --- Balbir Singh (3): Move zone_reclaim() outside of CONFIG_NUMA Refactor zone_reclaim code Provide control over unmapped pages Documentation/kernel-parameters.txt |8 ++ include/linux/mmzone.h |9 ++- include/linux/swap.h| 23 +-- init/Kconfig| 12 +++ kernel/sysctl.c | 29 ++-- mm/page_alloc.c | 31 - mm/vmscan.c | 122 +++ 7 files changed, 202 insertions(+), 32 deletions(-) -- Balbir Singh -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
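Both ratios are percentages of a zone's present pages; a minimal sketch of that translation (function name invented), matching the default max_unmapped_ratio of 16:

```c
/* Per-zone threshold from a percentage ratio, as the sysctls use it:
 * e.g. min_unmapped_pages = realsize * sysctl_min_unmapped_ratio / 100. */
#include <assert.h>

static unsigned long ratio_to_pages(unsigned long zone_present_pages,
				    int ratio /* percent, 0..100 */)
{
	return zone_present_pages * ratio / 100;
}
```

So on a zone of 1,000,000 present pages, the default max_unmapped_ratio of 16 means reclaim of unmapped page cache is considered once more than 160,000 pages of unmapped file cache accumulate.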
[PATCH 1/3][RESEND] Move zone_reclaim() outside of CONFIG_NUMA (v4)
This patch moves zone_reclaim and associated helpers outside CONFIG_NUMA. This infrastructure is reused in the patches for page cache control that follow. Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com Reviewed-by: Christoph Lameter c...@linux.com --- include/linux/mmzone.h |4 ++-- include/linux/swap.h |4 ++-- kernel/sysctl.c| 18 +- mm/page_alloc.c|6 +++--- mm/vmscan.c|2 -- 5 files changed, 16 insertions(+), 18 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 02ecb01..2485acc 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -303,12 +303,12 @@ struct zone { */ unsigned long lowmem_reserve[MAX_NR_ZONES]; -#ifdef CONFIG_NUMA - int node; /* * zone reclaim becomes active if more unmapped pages exist. */ unsigned long min_unmapped_pages; +#ifdef CONFIG_NUMA + int node; unsigned long min_slab_pages; #endif struct per_cpu_pageset __percpu *pageset; diff --git a/include/linux/swap.h b/include/linux/swap.h index 5e3355a..7b75626 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -255,11 +255,11 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern long vm_total_pages; +extern int sysctl_min_unmapped_ratio; +extern int zone_reclaim(struct zone *, gfp_t, unsigned int); #ifdef CONFIG_NUMA extern int zone_reclaim_mode; -extern int sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; -extern int zone_reclaim(struct zone *, gfp_t, unsigned int); #else #define zone_reclaim_mode 0 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index bc86bb3..12e8f26 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1224,15 +1224,6 @@ static struct ctl_table vm_table[] = { .extra1 = zero, }, #endif -#ifdef CONFIG_NUMA - { - .procname = zone_reclaim_mode, - .data = zone_reclaim_mode, - .maxlen = sizeof(zone_reclaim_mode), - .mode = 0644, - .proc_handler = proc_dointvec, - .extra1 = 
zero, - }, { .procname = min_unmapped_ratio, .data = sysctl_min_unmapped_ratio, @@ -1242,6 +1233,15 @@ static struct ctl_table vm_table[] = { .extra1 = zero, .extra2 = one_hundred, }, +#ifdef CONFIG_NUMA + { + .procname = zone_reclaim_mode, + .data = zone_reclaim_mode, + .maxlen = sizeof(zone_reclaim_mode), + .mode = 0644, + .proc_handler = proc_dointvec, + .extra1 = zero, + }, { .procname = min_slab_ratio, .data = sysctl_min_slab_ratio, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index aede3a4..7b56473 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4167,10 +4167,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat, zone-spanned_pages = size; zone-present_pages = realsize; -#ifdef CONFIG_NUMA - zone-node = nid; zone-min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio) / 100; +#ifdef CONFIG_NUMA + zone-node = nid; zone-min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100; #endif zone-name = zone_names[j]; @@ -5084,7 +5084,6 @@ int min_free_kbytes_sysctl_handler(ctl_table *table, int write, return 0; } -#ifdef CONFIG_NUMA int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { @@ -5101,6 +5100,7 @@ int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write, return 0; } +#ifdef CONFIG_NUMA int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { diff --git a/mm/vmscan.c b/mm/vmscan.c index 47a5096..5899f2f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2868,7 +2868,6 @@ static int __init kswapd_init(void) module_init(kswapd_init) -#ifdef CONFIG_NUMA /* * Zone reclaim mode * @@ -3078,7 +3077,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) return ret; } -#endif /* * page_evictable - test whether a page is evictable -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo 
info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3][RESEND] Refactor zone_reclaim code (v4)
Changelog v3 1. Renamed zone_reclaim_unmapped_pages to zone_reclaim_pages Refactor zone_reclaim, move reusable functionality outside of zone_reclaim. Make zone_reclaim_unmapped_pages modular Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com Reviewed-by: Christoph Lameter c...@linux.com --- mm/vmscan.c | 35 +++ 1 files changed, 23 insertions(+), 12 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 5899f2f..02cc82e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2943,6 +2943,27 @@ static long zone_pagecache_reclaimable(struct zone *zone) } /* + * Helper function to reclaim unmapped pages, we might add something + * similar to this for slab cache as well. Currently this function + * is shared with __zone_reclaim() + */ +static inline void +zone_reclaim_pages(struct zone *zone, struct scan_control *sc, + unsigned long nr_pages) +{ + int priority; + /* +* Free memory by calling shrink zone with increasing +* priorities until we have enough memory freed. +*/ + priority = ZONE_RECLAIM_PRIORITY; + do { + shrink_zone(priority, zone, sc); + priority--; + } while (priority = 0 sc-nr_reclaimed nr_pages); +} + +/* * Try to free up some pages from this zone through reclaim. */ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) @@ -2951,7 +2972,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) const unsigned long nr_pages = 1 order; struct task_struct *p = current; struct reclaim_state reclaim_state; - int priority; struct scan_control sc = { .may_writepage = !!(zone_reclaim_mode RECLAIM_WRITE), .may_unmap = !!(zone_reclaim_mode RECLAIM_SWAP), @@ -2975,17 +2995,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) reclaim_state.reclaimed_slab = 0; p-reclaim_state = reclaim_state; - if (zone_pagecache_reclaimable(zone) zone-min_unmapped_pages) { - /* -* Free memory by calling shrink zone with increasing -* priorities until we have enough memory freed. 
-*/ - priority = ZONE_RECLAIM_PRIORITY; - do { - shrink_zone(priority, zone, &sc); - priority--; - } while (priority >= 0 && sc.nr_reclaimed < nr_pages); - } + if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) + zone_reclaim_pages(zone, &sc, nr_pages); nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE); if (nr_slab_pages0 > zone->min_slab_pages) { -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
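The zone_reclaim_pages() control flow this refactor introduces, shrinking at ZONE_RECLAIM_PRIORITY and walking the priority down toward 0 until nr_pages have been reclaimed, can be checked standalone with a stub shrinker (the stub and the pass count it returns are invented for illustration):

```c
/* Standalone model of the zone_reclaim_pages() priority loop: repeat
 * shrink passes at decreasing priority until either priority drops
 * below 0 or enough pages have been reclaimed. */
#include <assert.h>

#define ZONE_RECLAIM_PRIORITY 4

struct scan_control { unsigned long nr_reclaimed; };

/* stub: each pass at a lower priority frees a few more pages */
static void shrink_zone_stub(int priority, struct scan_control *sc)
{
	sc->nr_reclaimed += 8u << (ZONE_RECLAIM_PRIORITY - priority);
}

static int zone_reclaim_pages_sketch(struct scan_control *sc,
				     unsigned long nr_pages)
{
	int priority = ZONE_RECLAIM_PRIORITY;
	int passes = 0;

	do {
		shrink_zone_stub(priority, sc);
		priority--;
		passes++;
	} while (priority >= 0 && sc->nr_reclaimed < nr_pages);

	return passes;
}
```

A small target is satisfied after a couple of high-priority passes; a target that can never be met still terminates after the priority-0 pass.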
[PATCH 3/3][RESEND] Provide control over unmapped pages (v4)
Changelog v4 1. Add max_unmapped_ratio and use that as the upper limit to check when to shrink the unmapped page cache (Christoph Lameter) Changelog v2 1. Use a config option to enable the code (Andrew Morton) 2. Explain the magic tunables in the code or at-least attempt to explain them (General comment) 3. Hint uses of the boot parameter with unlikely (Andrew Morton) 4. Use better names (balanced is not a good naming convention) Provide control using zone_reclaim() and a boot parameter. The code reuses functionality from zone_reclaim() to isolate unmapped pages and reclaim them as a priority, ahead of other mapped pages. A new sysctl for max_unmapped_ratio is provided and set to 16, indicating 16% of the total zone pages are unmapped, we start shrinking unmapped page cache. Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com Reviewed-by: Christoph Lameter c...@linux.com --- Documentation/kernel-parameters.txt |8 +++ include/linux/mmzone.h |5 ++ include/linux/swap.h| 23 - init/Kconfig| 12 + kernel/sysctl.c | 11 mm/page_alloc.c | 25 ++ mm/vmscan.c | 87 +++ 7 files changed, 166 insertions(+), 5 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index fee5f57..65a4ee6 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2500,6 +2500,14 @@ and is between 256 and 4096 characters. It is defined in the file [X86] Set unknown_nmi_panic=1 early on boot. + unmapped_page_control + [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL + is enabled. It controls the amount of unmapped memory + that is present in the system. This boot option plus + vm.min_unmapped_ratio (sysctl) provide granular control + over how much unmapped page cache can exist in the system + before kswapd starts reclaiming unmapped page cache pages. + usbcore.autosuspend= [USB] The autosuspend time delay (in seconds) used for newly-detected USB devices (default 2). 
This diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 2485acc..18f0f09 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -306,7 +306,10 @@ struct zone { /* * zone reclaim becomes active if more unmapped pages exist. */ +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA) unsigned long min_unmapped_pages; + unsigned long max_unmapped_pages; +#endif #ifdef CONFIG_NUMA int node; unsigned long min_slab_pages; @@ -773,6 +776,8 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); +int sysctl_max_unmapped_ratio_sysctl_handler(struct ctl_table *, int, + void __user *, size_t *, loff_t *); int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); diff --git a/include/linux/swap.h b/include/linux/swap.h index 7b75626..ae62a03 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -255,19 +255,34 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern long vm_total_pages; +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA) extern int sysctl_min_unmapped_ratio; +extern int sysctl_max_unmapped_ratio; + extern int zone_reclaim(struct zone *, gfp_t, unsigned int); -#ifdef CONFIG_NUMA -extern int zone_reclaim_mode; -extern int sysctl_min_slab_ratio; #else -#define zone_reclaim_mode 0 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) { return 0; } #endif +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) +extern bool should_reclaim_unmapped_pages(struct zone *zone); +#else +static inline bool should_reclaim_unmapped_pages(struct zone *zone) +{ + return false; +} +#endif + +#ifdef CONFIG_NUMA +extern int zone_reclaim_mode; +extern int sysctl_min_slab_ratio; +#else +#define zone_reclaim_mode 0 +#endif + extern int 
page_evictable(struct page *page, struct vm_area_struct *vma); extern void scan_mapping_unevictable_pages(struct address_space *); diff --git a/init/Kconfig b/init/Kconfig index 4f6cdbf..2dfbc09 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -828,6 +828,18 @@ config SCHED_AUTOGROUP config MM_OWNER bool +config UNMAPPED_PAGECACHE_CONTROL + bool Provide
Re: [PATCH 3/3] Provide control over unmapped pages (v4)
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2011-01-31 08:58:53]: On Fri, 28 Jan 2011 09:20:02 -0600 (CST) Christoph Lameter c...@linux.com wrote: On Fri, 28 Jan 2011, KAMEZAWA Hiroyuki wrote: I see it as a tradeoff of when to check? add_to_page_cache or when we are want more free memory (due to allocation). It is OK to wakeup kswapd while allocating memory, somehow for this purpose (global page cache), add_to_page_cache or add_to_page_cache_locked does not seem the right place to hook into. I'd be open to comments/suggestions though from others as well. I don't like add hook here. AND I don't want to run kswapd because 'kswapd' has been a sign as there are memory shortage. (reusing code is ok.) How about adding new daemon ? Recently, khugepaged, ksmd works for managing memory. Adding one more daemon for special purpose is not very bad, I think. Then, you can do - wake up without hook - throttle its work. - balance the whole system rather than zone. I think per-node balance is enough... I think we already have enough kernel daemons floating around. They are multiplying in an amazing way. What would be useful is to map all the memory management background stuff into a process. May call this memd instead? Perhaps we can fold khugepaged into kswapd as well etc. Making kswapd slow for whis additional, requested by user, not by system work is good thing ? I think workqueue works enough well, it's scale based on workloads, if using thread is bad. Making it slow is a generic statement, kswapd is supposed to do background reclaim, in this case a special request for unmapped pages, specifically and deliberately requested by the admin via a boot option. -- Three Cheers, Balbir -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] Provide control over unmapped pages (v4)
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2011-01-28 16:56:05]: On Fri, 28 Jan 2011 16:24:19 +0900 Minchan Kim minchan@gmail.com wrote: On Fri, Jan 28, 2011 at 3:48 PM, Balbir Singh bal...@linux.vnet.ibm.com wrote: * MinChan Kim minchan@gmail.com [2011-01-28 14:44:50]: On Fri, Jan 28, 2011 at 11:56 AM, Balbir Singh bal...@linux.vnet.ibm.com wrote: On Thu, Jan 27, 2011 at 4:42 AM, Minchan Kim minchan@gmail.com wrote: [snip] index 7b56473..2ac8549 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1660,6 +1660,9 @@ zonelist_scan:             unsigned long mark;             int ret; +            if (should_reclaim_unmapped_pages(zone)) +                wakeup_kswapd(zone, order, classzone_idx); + Do we really need the check in fastpath? There are lost of caller of alloc_pages. Many of them are not related to mapped pages. Could we move the check into add_to_page_cache_locked? The check is a simple check to see if the unmapped pages need balancing, the reason I placed this check here is to allow other allocations to benefit as well, if there are some unmapped pages to be freed. add_to_page_cache_locked (check under a critical section) is even worse, IMHO. It just moves the overhead from general into specific case(ie, allocates page for just page cache). Another cases(ie, allocates pages for other purpose except page cache, ex device drivers or fs allocation for internal using) aren't affected. So, It would be better. The goal in this patch is to remove only page cache page, isn't it? So I think we could the balance check in add_to_page_cache and trigger reclaim. If we do so, what's the problem? I see it as a tradeoff of when to check? add_to_page_cache or when we are want more free memory (due to allocation). It is OK to wakeup kswapd while allocating memory, somehow for this purpose (global page cache), add_to_page_cache or add_to_page_cache_locked does not seem the right place to hook into. I'd be open to comments/suggestions though from others as well. 
I don't like add hook here. AND I don't want to run kswapd because 'kswapd' has been a sign as there are memory shortage. (reusing code is ok.) How about adding new daemon ? Recently, khugepaged, ksmd works for managing memory. Adding one more daemon for special purpose is not very bad, I think. Then, you can do - wake up without hook - throttle its work. - balance the whole system rather than zone. I think per-node balance is enough... Honestly, I did look at that option, but balancing via kswapd seemed like the best option. Creating a new thread/daemon did not make sense because 1. The control is very lose 2. kswapd can deal with it while balancing other things, in fact imagine kswapd waking up to free memory, but there being other free memory easily available. Parallel reclaim, zone lock contention addition does not help, IMHO. 3. kswapd does not indicate memory shortage per-se, please see min_free_kbytes_sysctl_handler, kswapd is to balance the nodes/zone. If you tune min_free_kbytes and kswapd runs, it does not mean memory shortage on the system             mark = zone-watermark[alloc_flags ALLOC_WMARK_MASK];             if (zone_watermark_ok(zone, order, mark,                   classzone_idx, alloc_flags)) @@ -4167,8 +4170,12 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,         zone-spanned_pages = size;         zone-present_pages = realsize; +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)         zone-min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)                         / 100; +        zone-max_unmapped_pages = (realsize*sysctl_max_unmapped_ratio) +                        / 100; +#endif  #ifdef CONFIG_NUMA         zone-node = nid;         zone-min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100; @@ -5084,6 +5091,7 @@ int min_free_kbytes_sysctl_handler(ctl_table *table, int write,     return 0;  } +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)  int 
sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,     void __user *buffer, size_t *length, loff_t *ppos)  { @@ -5100,6 +5108,23 @@ int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,     return 0;  } +int sysctl_max_unmapped_ratio_sysctl_handler(ctl_table *table, int write, +    void __user *buffer, size_t *length, loff_t *ppos) +{ +    struct zone *zone
Re: [PATCH 3/3] Provide control over unmapped pages (v4)
* MinChan Kim minchan@gmail.com [2011-01-28 16:24:19]: But the assumption for LRU order to change happens only if the page cannot be successfully freed, which means it is in some way active.. and needs to be moved no? 1. holded page by someone 2. mapped pages 3. active pages 1 is rare so it isn't the problem. Of course, in case of 3, we have to activate it so no problem. The problem is 2. 2 is a problem, but due to the size aspects not a big one. Like you said even lumpy reclaim affects it. May be the reclaim code could honour may_unmap much earlier. -- Three Cheers, Balbir
Re: [PATCH 3/3] Provide control over unmapped pages (v4)
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2011-01-28 17:17:44]: On Fri, 28 Jan 2011 13:49:28 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2011-01-28 16:56:05]: BTW, it seems this doesn't work when some apps use huge shmem. How to handle the issue ? Could you elaborate further? == static inline unsigned long zone_unmapped_file_pages(struct zone *zone) { unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED); unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) + zone_page_state(zone, NR_ACTIVE_FILE); /* * It's possible for there to be more file mapped pages than * accounted for by the pages on the file LRU lists because * tmpfs pages accounted for as ANON can also be FILE_MAPPED */ return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0; } Yes, I did :) The word huge confused me. I am not sure if there is an easy accounting fix for this one, though given the approximate nature of the control, I am not sure it would matter very much. But you do have a very good point. -- Three Cheers, Balbir
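The per-zone estimate Kamezawa quotes can be exercised standalone (parameter names invented). The clamp to zero is exactly where large shmem/tmpfs mappings skew the number, since those pages count as FILE_MAPPED yet live on the anon LRU:

```c
/* Approximate unmapped file-cache pages in a zone: file-LRU pages
 * minus mapped file pages, clamped at zero because tmpfs/shmem pages
 * accounted as ANON on the LRU can still be FILE_MAPPED. */
#include <assert.h>

static unsigned long
zone_unmapped_file_pages_sketch(unsigned long nr_inactive_file,
				unsigned long nr_active_file,
				unsigned long nr_file_mapped)
{
	unsigned long file_lru = nr_inactive_file + nr_active_file;

	return (file_lru > nr_file_mapped) ? (file_lru - nr_file_mapped) : 0;
}
```

With heavy shmem use the mapped count can exceed the file LRU, the estimate collapses to zero, and the control stops triggering, which is the accounting limitation conceded in this message.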
Re: [PATCH 3/3] Provide control over unmapped pages (v4)
On Thu, Jan 27, 2011 at 4:42 AM, Minchan Kim minchan@gmail.com wrote: [snip] index 7b56473..2ac8549 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1660,6 +1660,9 @@ zonelist_scan:             unsigned long mark;             int ret; +            if (should_reclaim_unmapped_pages(zone)) +                wakeup_kswapd(zone, order, classzone_idx); + Do we really need the check in fastpath? There are lost of caller of alloc_pages. Many of them are not related to mapped pages. Could we move the check into add_to_page_cache_locked? The check is a simple check to see if the unmapped pages need balancing, the reason I placed this check here is to allow other allocations to benefit as well, if there are some unmapped pages to be freed. add_to_page_cache_locked (check under a critical section) is even worse, IMHO.             mark = zone-watermark[alloc_flags ALLOC_WMARK_MASK];             if (zone_watermark_ok(zone, order, mark,                   classzone_idx, alloc_flags)) @@ -4167,8 +4170,12 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,         zone-spanned_pages = size;         zone-present_pages = realsize; +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)         zone-min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)                         / 100; +        zone-max_unmapped_pages = (realsize*sysctl_max_unmapped_ratio) +                        / 100; +#endif  #ifdef CONFIG_NUMA         zone-node = nid;         zone-min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100; @@ -5084,6 +5091,7 @@ int min_free_kbytes_sysctl_handler(ctl_table *table, int write,     return 0;  } +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)  int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,     void __user *buffer, size_t *length, loff_t *ppos)  { @@ -5100,6 +5108,23 @@ int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,     return 0;  } +int 
sysctl_max_unmapped_ratio_sysctl_handler(ctl_table *table, int write, +    void __user *buffer, size_t *length, loff_t *ppos) +{ +    struct zone *zone; +    int rc; + +    rc = proc_dointvec_minmax(table, write, buffer, length, ppos); +    if (rc) +        return rc; + +    for_each_zone(zone) +        zone-max_unmapped_pages = (zone-present_pages * +                sysctl_max_unmapped_ratio) / 100; +    return 0; +} +#endif +  #ifdef CONFIG_NUMA  int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,     void __user *buffer, size_t *length, loff_t *ppos) diff --git a/mm/vmscan.c b/mm/vmscan.c index 02cc82e..6377411 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -159,6 +159,29 @@ static DECLARE_RWSEM(shrinker_rwsem);  #define scanning_global_lru(sc)     (1)  #endif +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) +static unsigned long reclaim_unmapped_pages(int priority, struct zone *zone, +                        struct scan_control *sc); +static int unmapped_page_control __read_mostly; + +static int __init unmapped_page_control_parm(char *str) +{ +    unmapped_page_control = 1; +    /* +     * XXX: Should we tweak swappiness here? 
+     */ +    return 1; +} +__setup(unmapped_page_control, unmapped_page_control_parm); + +#else /* !CONFIG_UNMAPPED_PAGECACHE_CONTROL */ +static inline unsigned long reclaim_unmapped_pages(int priority, +                struct zone *zone, struct scan_control *sc) +{ +    return 0; +} +#endif +  static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,                          struct scan_control *sc)  { @@ -2359,6 +2382,12 @@ loop_again:                 shrink_active_list(SWAP_CLUSTER_MAX, zone,                             sc, priority, 0); +            /* +             * We do unmapped page reclaim once here and once +             * below, so that we don't lose out +             */ +            reclaim_unmapped_pages(priority, zone, sc); +             if (!zone_watermark_ok_safe(zone, order,                     high_wmark_pages(zone), 0, 0)) {                 end_zone = i; @@ -2396,6 +2425,11 @@ loop_again:                 continue;             sc.nr_scanned = 0; +            /* +             * Reclaim unmapped pages upfront, this should be +             * really cheap +             */ +            reclaim_unmapped_pages(priority, zone,
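A sketch of the trigger being debated above (struct and field names invented for illustration): reclaim is considered only when the unmapped_page_control boot option is set and the zone's unmapped file pages exceed its max_unmapped_pages cap.

```c
/* Hypothetical model of the allocator fast-path / kswapd check: wake
 * reclaim of unmapped page cache only when the admin enabled the boot
 * option and the zone is over its max_unmapped_pages threshold. */
#include <assert.h>
#include <stdbool.h>

struct zone_sketch {
	unsigned long unmapped_file_pages;
	unsigned long max_unmapped_pages;
};

static int unmapped_page_control;	/* set by the boot parameter */

static bool should_reclaim_unmapped_pages_sketch(const struct zone_sketch *z)
{
	if (!unmapped_page_control)
		return false;
	return z->unmapped_file_pages > z->max_unmapped_pages;
}
```

This captures why the feature is inert by default: without the boot parameter the check short-circuits, so the fast-path cost is a single predictable branch.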
Re: [PATCH 3/3] Provide control over unmapped pages (v4)
* Christoph Lameter c...@linux.com [2011-01-26 10:57:37]: Reviewed-by: Christoph Lameter c...@linux.com Thanks for the review! -- Three Cheers, Balbir
Re: [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v4)
* Christoph Lameter c...@linux.com [2011-01-26 10:56:56]: Reviewed-by: Christoph Lameter c...@linux.com Thanks for the review! -- Three Cheers, Balbir
Re: [PATCH 3/3] Provide control over unmapped pages (v4)
* MinChan Kim minchan@gmail.com [2011-01-28 14:44:50]: On Fri, Jan 28, 2011 at 11:56 AM, Balbir Singh bal...@linux.vnet.ibm.com wrote: On Thu, Jan 27, 2011 at 4:42 AM, Minchan Kim minchan@gmail.com wrote: [snip] index 7b56473..2ac8549 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1660,6 +1660,9 @@ zonelist_scan:             unsigned long mark;             int ret; +            if (should_reclaim_unmapped_pages(zone)) +                wakeup_kswapd(zone, order, classzone_idx); + Do we really need the check in the fastpath? There are lots of callers of alloc_pages, and many of them are not related to mapped pages. Could we move the check into add_to_page_cache_locked? The check is a simple check to see if the unmapped pages need balancing; the reason I placed this check here is to allow other allocations to benefit as well, if there are some unmapped pages to be freed. add_to_page_cache_locked (a check under a critical section) is even worse, IMHO. It just moves the overhead from the general path into a specific case (i.e., allocating a page for the page cache). Other cases (i.e., pages allocated for purposes other than page cache, e.g., by device drivers or for filesystem-internal use) aren't affected. So it would be better. The goal in this patch is to remove only page cache pages, isn't it? So I think we could do the balance check in add_to_page_cache and trigger reclaim. If we do so, what's the problem? I see it as a tradeoff of when to check: add_to_page_cache, or when we want more free memory (due to allocation). It is OK to wakeup kswapd while allocating memory; somehow, for this purpose (global page cache), add_to_page_cache or add_to_page_cache_locked does not seem like the right place to hook into. I'd be open to comments/suggestions from others as well.
mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];             if (zone_watermark_ok(zone, order, mark,                   classzone_idx, alloc_flags)) @@ -4167,8 +4170,12 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,         zone->spanned_pages = size;         zone->present_pages = realsize; +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)         zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)                         / 100; +        zone->max_unmapped_pages = (realsize*sysctl_max_unmapped_ratio) +                        / 100; +#endif  #ifdef CONFIG_NUMA         zone->node = nid;         zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100; @@ -5084,6 +5091,7 @@ int min_free_kbytes_sysctl_handler(ctl_table *table, int write,     return 0;  } +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)  int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,     void __user *buffer, size_t *length, loff_t *ppos)  { @@ -5100,6 +5108,23 @@ int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,     return 0;  } +int sysctl_max_unmapped_ratio_sysctl_handler(ctl_table *table, int write, +    void __user *buffer, size_t *length, loff_t *ppos) +{ +    struct zone *zone; +    int rc; + +    rc = proc_dointvec_minmax(table, write, buffer, length, ppos); +    if (rc) +        return rc; + +    for_each_zone(zone) +        zone->max_unmapped_pages = (zone->present_pages * +                sysctl_max_unmapped_ratio) / 100; +    return 0; +} +#endif +  #ifdef CONFIG_NUMA  int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,     void __user *buffer, size_t *length, loff_t *ppos) diff --git a/mm/vmscan.c b/mm/vmscan.c index 02cc82e..6377411 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -159,6 +159,29 @@ static DECLARE_RWSEM(shrinker_rwsem);  #define scanning_global_lru(sc)     (1)  #endif +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) +static unsigned long reclaim_unmapped_pages(int priority, struct zone *zone, +                        struct scan_control *sc); +static int unmapped_page_control __read_mostly; + +static int __init unmapped_page_control_parm(char *str) +{ +    unmapped_page_control = 1; +    /* +     * XXX: Should we tweak swappiness here? +     */ +    return 1; +} +__setup("unmapped_page_control", unmapped_page_control_parm); + +#else /* !CONFIG_UNMAPPED_PAGECACHE_CONTROL */ +static inline unsigned long reclaim_unmapped_pages(int priority, +                struct zone *zone, struct scan_control *sc) +{ +    return 0; +} +#endif +  static struct zone_reclaim_stat
[PATCH 0/3] Unmapped Page Cache Control (v4)
The following series implements page cache control; this is a split-out version of patch 1 of version 3 of the page cache optimization patches posted earlier at Previous posting http://lwn.net/Articles/419564/ The previous few revisions received a lot of comments, and I've tried to address as many of those as possible in this revision. Detailed Description This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario - In a virtualized environment with cache=writethrough, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. - The option is controlled via a boot option and the administrator can selectively turn it on, on a need-to-use basis. A lot of the code is borrowed from the zone_reclaim_mode logic for __zone_reclaim(). One might argue that with ballooning and KSM this feature is not very useful, but even with ballooning, we need extra logic to balloon multiple VMs and it is hard to figure out the correct amount of memory to balloon. With these patches applied, each guest has a sufficient amount of free memory available that can be easily seen and reclaimed by the balloon driver. The additional memory in the guest can be reused for additional applications or used to start additional guests/balance memory in the host. KSM currently does not de-duplicate host and guest page cache. The goal of this patch is to help automatically balance unmapped page cache when instructed to do so.
The sysctl for min_unmapped_ratio provides further control from within the guest on the amount of unmapped pages to reclaim; a similar max_unmapped_ratio sysctl is added and helps in the decision of when reclaim should occur. This is tunable and set by default to 16 (based on tradeoffs seen between aggressiveness in balancing versus the size of unmapped pages). Distros and administrators can further tweak this for desired control. Data from the previous patchsets can be found at https://lkml.org/lkml/2010/11/30/79 --- Balbir Singh (3): Move zone_reclaim() outside of CONFIG_NUMA Refactor zone_reclaim code Provide control over unmapped pages Documentation/kernel-parameters.txt |8 ++ include/linux/mmzone.h |9 ++- include/linux/swap.h| 23 +-- init/Kconfig| 12 +++ kernel/sysctl.c | 29 ++-- mm/page_alloc.c | 31 - mm/vmscan.c | 122 +++ 7 files changed, 202 insertions(+), 32 deletions(-) -- Balbir Singh
[PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v4)
This patch moves zone_reclaim and associated helpers outside CONFIG_NUMA. This infrastructure is reused in the patches for page cache control that follow. Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- include/linux/mmzone.h |4 ++-- include/linux/swap.h |4 ++-- kernel/sysctl.c| 18 +- mm/page_alloc.c|6 +++--- mm/vmscan.c|2 -- 5 files changed, 16 insertions(+), 18 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 02ecb01..2485acc 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -303,12 +303,12 @@ struct zone { */ unsigned long lowmem_reserve[MAX_NR_ZONES]; -#ifdef CONFIG_NUMA - int node; /* * zone reclaim becomes active if more unmapped pages exist. */ unsigned long min_unmapped_pages; +#ifdef CONFIG_NUMA + int node; unsigned long min_slab_pages; #endif struct per_cpu_pageset __percpu *pageset; diff --git a/include/linux/swap.h b/include/linux/swap.h index 5e3355a..7b75626 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -255,11 +255,11 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern long vm_total_pages; +extern int sysctl_min_unmapped_ratio; +extern int zone_reclaim(struct zone *, gfp_t, unsigned int); #ifdef CONFIG_NUMA extern int zone_reclaim_mode; -extern int sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; -extern int zone_reclaim(struct zone *, gfp_t, unsigned int); #else #define zone_reclaim_mode 0 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index bc86bb3..12e8f26 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1224,15 +1224,6 @@ static struct ctl_table vm_table[] = { .extra1 = &zero, }, #endif -#ifdef CONFIG_NUMA - { - .procname = "zone_reclaim_mode", - .data = &zone_reclaim_mode, - .maxlen = sizeof(zone_reclaim_mode), - .mode = 0644, - .proc_handler = proc_dointvec, - .extra1 = &zero, - }, { .procname = "min_unmapped_ratio", .data = &sysctl_min_unmapped_ratio, @@ -1242,6 +1233,15 @@ static struct ctl_table vm_table[] = { .extra1 = &zero, .extra2 = &one_hundred, }, +#ifdef CONFIG_NUMA + { + .procname = "zone_reclaim_mode", + .data = &zone_reclaim_mode, + .maxlen = sizeof(zone_reclaim_mode), + .mode = 0644, + .proc_handler = proc_dointvec, + .extra1 = &zero, + }, { .procname = "min_slab_ratio", .data = &sysctl_min_slab_ratio, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index aede3a4..7b56473 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4167,10 +4167,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat, zone->spanned_pages = size; zone->present_pages = realsize; -#ifdef CONFIG_NUMA - zone->node = nid; zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio) / 100; +#ifdef CONFIG_NUMA + zone->node = nid; zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100; #endif zone->name = zone_names[j]; @@ -5084,7 +5084,6 @@ int min_free_kbytes_sysctl_handler(ctl_table *table, int write, return 0; } -#ifdef CONFIG_NUMA int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { @@ -5101,6 +5100,7 @@ int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write, return 0; } +#ifdef CONFIG_NUMA int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos) { diff --git a/mm/vmscan.c b/mm/vmscan.c index 47a5096..5899f2f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2868,7 +2868,6 @@ static int __init kswapd_init(void) module_init(kswapd_init) -#ifdef CONFIG_NUMA /* * Zone reclaim mode * @@ -3078,7 +3077,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) return ret; } -#endif /* * page_evictable - test whether a page is evictable
[PATCH 1/2] Refactor zone_reclaim code (v4)
Changelog v3 1. Renamed zone_reclaim_unmapped_pages to zone_reclaim_pages Refactor zone_reclaim, move reusable functionality outside of zone_reclaim. Make zone_reclaim_unmapped_pages modular Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com Reviewed-by: Christoph Lameter c...@linux.com --- mm/vmscan.c | 35 +++ 1 files changed, 23 insertions(+), 12 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 5899f2f..02cc82e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2943,6 +2943,27 @@ static long zone_pagecache_reclaimable(struct zone *zone) } /* + * Helper function to reclaim unmapped pages, we might add something + * similar to this for slab cache as well. Currently this function + * is shared with __zone_reclaim() + */ +static inline void +zone_reclaim_pages(struct zone *zone, struct scan_control *sc, + unsigned long nr_pages) +{ + int priority; + /* +* Free memory by calling shrink zone with increasing +* priorities until we have enough memory freed. +*/ + priority = ZONE_RECLAIM_PRIORITY; + do { + shrink_zone(priority, zone, sc); + priority--; + } while (priority >= 0 && sc->nr_reclaimed < nr_pages); +} + +/* * Try to free up some pages from this zone through reclaim. */ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) @@ -2951,7 +2972,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) const unsigned long nr_pages = 1 << order; struct task_struct *p = current; struct reclaim_state reclaim_state; - int priority; struct scan_control sc = { .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE), .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP), @@ -2975,17 +2995,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) reclaim_state.reclaimed_slab = 0; p->reclaim_state = &reclaim_state; - if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) { - /* -* Free memory by calling shrink zone with increasing -* priorities until we have enough memory freed.
-*/ - priority = ZONE_RECLAIM_PRIORITY; - do { - shrink_zone(priority, zone, &sc); - priority--; - } while (priority >= 0 && sc.nr_reclaimed < nr_pages); - } + if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) + zone_reclaim_pages(zone, &sc, nr_pages); nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE); if (nr_slab_pages0 > zone->min_slab_pages) {
[PATCH 3/3] Provide control over unmapped pages (v4)
Changelog v4 1. Add max_unmapped_ratio and use that as the upper limit to check when to shrink the unmapped page cache (Christoph Lameter) Changelog v2 1. Use a config option to enable the code (Andrew Morton) 2. Explain the magic tunables in the code or at least attempt to explain them (General comment) 3. Hint uses of the boot parameter with unlikely (Andrew Morton) 4. Use better names (balanced is not a good naming convention) Provide control using zone_reclaim() and a boot parameter. The code reuses functionality from zone_reclaim() to isolate unmapped pages and reclaim them as a priority, ahead of mapped pages. A new sysctl, max_unmapped_ratio, is provided and defaults to 16: once unmapped pages exceed 16% of the total zone pages, we start shrinking the unmapped page cache. Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- Documentation/kernel-parameters.txt |8 +++ include/linux/mmzone.h |5 ++ include/linux/swap.h| 23 - init/Kconfig| 12 kernel/sysctl.c | 11 mm/page_alloc.c | 25 ++ mm/vmscan.c | 87 +++ 7 files changed, 166 insertions(+), 5 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index fee5f57..65a4ee6 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2500,6 +2500,14 @@ and is between 256 and 4096 characters. It is defined in the file [X86] Set unknown_nmi_panic=1 early on boot. + unmapped_page_control + [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL + is enabled. It controls the amount of unmapped memory + that is present in the system. This boot option plus + vm.min_unmapped_ratio (sysctl) provide granular control + over how much unmapped page cache can exist in the system + before kswapd starts reclaiming unmapped page cache pages. + usbcore.autosuspend= [USB] The autosuspend time delay (in seconds) used for newly-detected USB devices (default 2).
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 2485acc..18f0f09 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -306,7 +306,10 @@ struct zone { /* * zone reclaim becomes active if more unmapped pages exist. */ +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA) unsigned long min_unmapped_pages; + unsigned long max_unmapped_pages; +#endif #ifdef CONFIG_NUMA int node; unsigned long min_slab_pages; @@ -773,6 +776,8 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); +int sysctl_max_unmapped_ratio_sysctl_handler(struct ctl_table *, int, + void __user *, size_t *, loff_t *); int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); diff --git a/include/linux/swap.h b/include/linux/swap.h index 7b75626..ae62a03 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -255,19 +255,34 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern long vm_total_pages; +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA) extern int sysctl_min_unmapped_ratio; +extern int sysctl_max_unmapped_ratio; + extern int zone_reclaim(struct zone *, gfp_t, unsigned int); -#ifdef CONFIG_NUMA -extern int zone_reclaim_mode; -extern int sysctl_min_slab_ratio; #else -#define zone_reclaim_mode 0 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) { return 0; } #endif +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) +extern bool should_reclaim_unmapped_pages(struct zone *zone); +#else +static inline bool should_reclaim_unmapped_pages(struct zone *zone) +{ + return false; +} +#endif + +#ifdef CONFIG_NUMA +extern int zone_reclaim_mode; +extern int sysctl_min_slab_ratio; +#else +#define zone_reclaim_mode 0 +#endif + extern int
page_evictable(struct page *page, struct vm_area_struct *vma); extern void scan_mapping_unevictable_pages(struct address_space *); diff --git a/init/Kconfig b/init/Kconfig index 4f6cdbf..2dfbc09 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -828,6 +828,18 @@ config SCHED_AUTOGROUP config MM_OWNER bool +config UNMAPPED_PAGECACHE_CONTROL + bool "Provide control over unmapped page cache"
Re: [PATCH 1/2] Refactor zone_reclaim code (v4)
* Balbir Singh bal...@linux.vnet.ibm.com [2011-01-25 10:40:09]: Changelog v3 1. Renamed zone_reclaim_unmapped_pages to zone_reclaim_pages Refactor zone_reclaim, move reusable functionality outside of zone_reclaim. Make zone_reclaim_unmapped_pages modular Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com Reviewed-by: Christoph Lameter c...@linux.com I got the patch numbering wrong due to an internet connection going down in the middle of stg mail; restarting with the specified patches goofed up the numbering. I can resend the patches with the correct numbering if desired. This patch should be numbered 2/3 -- Three Cheers, Balbir
Re: [REPOST] [PATCH 3/3] Provide control over unmapped pages (v3)
* Christoph Lameter c...@linux.com [2011-01-21 09:55:17]: On Fri, 21 Jan 2011, Balbir Singh wrote: * Christoph Lameter c...@linux.com [2011-01-20 09:00:09]: On Thu, 20 Jan 2011, Balbir Singh wrote: + unmapped_page_control + [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL + is enabled. It controls the amount of unmapped memory + that is present in the system. This boot option plus + vm.min_unmapped_ratio (sysctl) provide granular control min_unmapped_ratio is there to guarantee that zone reclaim does not reclaim all unmapped pages. What you want here is a max_unmapped_ratio. I thought about that; the logic for reusing min_unmapped_ratio was to keep a limit beyond which unmapped page cache shrinking should stop. Right. That is the role of it. It's a minimum to leave. You want a maximum size of the page cache. In this case we want the maximum to be as small as the minimum, but from a general design perspective maximum does make sense. I think you are suggesting max_unmapped_ratio as the point at which shrinking should begin, right? The role of min_unmapped_ratio is to never reclaim more pagecache if we reach that ratio, even if we have to go off node for an allocation. AFAICT, what you propose is a maximum size of the page cache. If the number of page cache pages goes beyond that, then you trim the page cache in background reclaim. + reclaim_unmapped_pages(priority, zone, sc); + if (!zone_watermark_ok_safe(zone, order, Hmm, okay — that means background reclaim does it. If so, then we also want zone reclaim to be able to work in the background, I think. Anything specific you had in mind? It works for me in testing, but is there anything specific that stands out in your mind that needs to be done? Hmmm. So this would also work in a NUMA configuration, right. Limiting the size of the page cache would avoid zone reclaim through these limits. Page cache size would be limited by the max_unmapped_ratio.
zone_reclaim would only come into play if other allocations make the memory on the node so tight that we would have to evict more page cache pages in direct reclaim. Then zone_reclaim could go down to shrink the page cache size to min_unmapped_ratio. I'll repost with max_unmapped_ratio changes Thanks for the review! -- Three Cheers, Balbir
[REPOST][PATCH 0/3] Unmapped page cache control (v3)
The following series implements page cache control; this is a split-out version of patch 1 of version 3 of the page cache optimization patches posted earlier at Previous posting http://lwn.net/Articles/419564/ The previous few revisions received a lot of comments, and I've tried to address as many of those as possible in this revision. The last series was reviewed by Christoph Lameter. There were comments about overlap with Nick's changes; I don't feel these changes impact Nick's work, and integration can/will be considered as the patches evolve, if need be. Detailed Description This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario - In a virtualized environment with cache=writethrough, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. - The option is controlled via a boot option and the administrator can selectively turn it on, on a need-to-use basis. A lot of the code is borrowed from the zone_reclaim_mode logic for __zone_reclaim(). One might argue that with ballooning and KSM this feature is not very useful, but even with ballooning, we need extra logic to balloon multiple VMs and it is hard to figure out the correct amount of memory to balloon. With these patches applied, each guest has a sufficient amount of free memory available that can be easily seen and reclaimed by the balloon driver. The additional memory in the guest can be reused for additional applications or used to start additional guests/balance memory in the host.
KSM currently does not de-duplicate host and guest page cache. The goal of this patch is to help automatically balance unmapped page cache when instructed to do so. There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO and the number of pages to reclaim when the unmapped_page_control argument is supplied. These numbers were chosen to avoid reaping page cache too frequently, while at the same time providing control. The sysctl for min_unmapped_ratio provides further control from within the guest on the amount of unmapped pages to reclaim. Data from the previous patchsets can be found at https://lkml.org/lkml/2010/11/30/79 Size measurement CONFIG_UNMAPPED_PAGECACHE_CONTROL and CONFIG_NUMA enabled # size mm/built-in.o text data bss dec hex filename 419431 1883047 140888 2443366 254866 mm/built-in.o CONFIG_UNMAPPED_PAGECACHE_CONTROL disabled, CONFIG_NUMA enabled # size mm/built-in.o text data bss dec hex filename 418908 1883023 140888 2442819 254643 mm/built-in.o --- Balbir Singh (3): Move zone_reclaim() outside of CONFIG_NUMA Refactor zone_reclaim code Provide control over unmapped pages Documentation/kernel-parameters.txt |8 ++ include/linux/mmzone.h |4 + include/linux/swap.h| 21 +- init/Kconfig| 12 +++ kernel/sysctl.c | 20 +++-- mm/page_alloc.c |9 ++ mm/vmscan.c | 132 +++ 7 files changed, 175 insertions(+), 31 deletions(-) -- Balbir Singh
[REPOST] [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v3)
This patch moves zone_reclaim and associated helpers outside CONFIG_NUMA. This infrastructure is reused in the patches for page cache control that follow. Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- include/linux/mmzone.h |4 ++-- include/linux/swap.h |4 ++-- kernel/sysctl.c| 18 +- mm/vmscan.c|2 -- 4 files changed, 13 insertions(+), 15 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 4890662..aeede91 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -302,12 +302,12 @@ struct zone { */ unsigned long lowmem_reserve[MAX_NR_ZONES]; -#ifdef CONFIG_NUMA - int node; /* * zone reclaim becomes active if more unmapped pages exist. */ unsigned long min_unmapped_pages; +#ifdef CONFIG_NUMA + int node; unsigned long min_slab_pages; #endif struct per_cpu_pageset __percpu *pageset; diff --git a/include/linux/swap.h b/include/linux/swap.h index 84375e4..ac5c06e 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -253,11 +253,11 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern long vm_total_pages; +extern int sysctl_min_unmapped_ratio; +extern int zone_reclaim(struct zone *, gfp_t, unsigned int); #ifdef CONFIG_NUMA extern int zone_reclaim_mode; -extern int sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; -extern int zone_reclaim(struct zone *, gfp_t, unsigned int); #else #define zone_reclaim_mode 0 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index a00fdef..e40040e 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1211,15 +1211,6 @@ static struct ctl_table vm_table[] = { .extra1 = &zero, }, #endif -#ifdef CONFIG_NUMA - { - .procname = "zone_reclaim_mode", - .data = &zone_reclaim_mode, - .maxlen = sizeof(zone_reclaim_mode), - .mode = 0644, - .proc_handler = proc_dointvec, - .extra1 = &zero, - }, { .procname = "min_unmapped_ratio", .data = &sysctl_min_unmapped_ratio, @@ -1229,6 +1220,15 @@ static struct ctl_table vm_table[] = { .extra1 = &zero, .extra2 = &one_hundred, }, +#ifdef CONFIG_NUMA + { + .procname = "zone_reclaim_mode", + .data = &zone_reclaim_mode, + .maxlen = sizeof(zone_reclaim_mode), + .mode = 0644, + .proc_handler = proc_dointvec, + .extra1 = &zero, + }, { .procname = "min_slab_ratio", .data = &sysctl_min_slab_ratio, diff --git a/mm/vmscan.c b/mm/vmscan.c index 42a4859..e841cae 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2740,7 +2740,6 @@ static int __init kswapd_init(void) module_init(kswapd_init) -#ifdef CONFIG_NUMA /* * Zone reclaim mode * @@ -2950,7 +2949,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) return ret; } -#endif /* * page_evictable - test whether a page is evictable
[REPOST] [PATCH 2/3] Refactor zone_reclaim code (v3)
Changelog v3 1. Renamed zone_reclaim_unmapped_pages to zone_reclaim_pages Refactor zone_reclaim, move reusable functionality outside of zone_reclaim. Make zone_reclaim_unmapped_pages modular Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- mm/vmscan.c | 35 +++ 1 files changed, 23 insertions(+), 12 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index e841cae..3b25423 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone *zone) } /* + * Helper function to reclaim unmapped pages, we might add something + * similar to this for slab cache as well. Currently this function + * is shared with __zone_reclaim() + */ +static inline void +zone_reclaim_pages(struct zone *zone, struct scan_control *sc, + unsigned long nr_pages) +{ + int priority; + /* +* Free memory by calling shrink zone with increasing +* priorities until we have enough memory freed. +*/ + priority = ZONE_RECLAIM_PRIORITY; + do { + shrink_zone(priority, zone, sc); + priority--; + } while (priority >= 0 && sc->nr_reclaimed < nr_pages); +} + +/* * Try to free up some pages from this zone through reclaim. */ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) @@ -2823,7 +2844,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) const unsigned long nr_pages = 1 << order; struct task_struct *p = current; struct reclaim_state reclaim_state; - int priority; struct scan_control sc = { .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE), .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP), @@ -2847,17 +2867,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) reclaim_state.reclaimed_slab = 0; p->reclaim_state = &reclaim_state; - if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) { - /* -* Free memory by calling shrink zone with increasing -* priorities until we have enough memory freed.
-*/ - priority = ZONE_RECLAIM_PRIORITY; - do { - shrink_zone(priority, zone, &sc); - priority--; - } while (priority >= 0 && sc.nr_reclaimed < nr_pages); - } + if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) + zone_reclaim_pages(zone, &sc, nr_pages); nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE); if (nr_slab_pages0 > zone->min_slab_pages) {
[REPOST] [PATCH 3/3] Provide control over unmapped pages (v3)
Changelog v2 1. Use a config option to enable the code (Andrew Morton) 2. Explain the magic tunables in the code or at least attempt to explain them (General comment) 3. Hint uses of the boot parameter with unlikely (Andrew Morton) 4. Use better names (balanced is not a good naming convention) Provide control using zone_reclaim() and a boot parameter. The code reuses functionality from zone_reclaim() to isolate unmapped pages and reclaim them as a priority, ahead of mapped pages. Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- Documentation/kernel-parameters.txt |8 +++ include/linux/swap.h| 21 ++-- init/Kconfig| 12 kernel/sysctl.c |2 + mm/page_alloc.c |9 +++ mm/vmscan.c | 97 +++ 6 files changed, 142 insertions(+), 7 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index dd8fe2b..f52b0bd 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2515,6 +2515,14 @@ and is between 256 and 4096 characters. It is defined in the file [X86] Set unknown_nmi_panic=1 early on boot. + unmapped_page_control + [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL + is enabled. It controls the amount of unmapped memory + that is present in the system. This boot option plus + vm.min_unmapped_ratio (sysctl) provide granular control + over how much unmapped page cache can exist in the system + before kswapd starts reclaiming unmapped page cache pages. + usbcore.autosuspend= [USB] The autosuspend time delay (in seconds) used for newly-detected USB devices (default 2).
This diff --git a/include/linux/swap.h b/include/linux/swap.h index ac5c06e..773d7e5 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -253,19 +253,32 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern long vm_total_pages; +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA) extern int sysctl_min_unmapped_ratio; extern int zone_reclaim(struct zone *, gfp_t, unsigned int); -#ifdef CONFIG_NUMA -extern int zone_reclaim_mode; -extern int sysctl_min_slab_ratio; #else -#define zone_reclaim_mode 0 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) { return 0; } #endif +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) +extern bool should_reclaim_unmapped_pages(struct zone *zone); +#else +static inline bool should_reclaim_unmapped_pages(struct zone *zone) +{ + return false; +} +#endif + +#ifdef CONFIG_NUMA +extern int zone_reclaim_mode; +extern int sysctl_min_slab_ratio; +#else +#define zone_reclaim_mode 0 +#endif + extern int page_evictable(struct page *page, struct vm_area_struct *vma); extern void scan_mapping_unevictable_pages(struct address_space *); diff --git a/init/Kconfig b/init/Kconfig index 3eb22ad..78c9169 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -782,6 +782,18 @@ endif # NAMESPACES config MM_OWNER bool +config UNMAPPED_PAGECACHE_CONTROL + bool Provide control over unmapped page cache + default n + help + This option adds support for controlling unmapped page cache + via a boot parameter (unmapped_page_control). The boot parameter + with sysctl (vm.min_unmapped_ratio) control the total number + of unmapped pages in the system. This feature is useful if + you want to limit the amount of unmapped page cache or want + to reduce page cache duplication in a virtualized environment. 
+ If unsure say 'N' + config SYSFS_DEPRECATED bool enable deprecated sysfs features to support old userspace tools depends on SYSFS diff --git a/kernel/sysctl.c b/kernel/sysctl.c index e40040e..ab2c60a 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1211,6 +1211,7 @@ static struct ctl_table vm_table[] = { .extra1 = zero, }, #endif +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA) { .procname = min_unmapped_ratio, .data = sysctl_min_unmapped_ratio, @@ -1220,6 +1221,7 @@ static struct ctl_table vm_table[] = { .extra1 = zero, .extra2 = one_hundred, }, +#endif #ifdef CONFIG_NUMA { .procname = zone_reclaim_mode, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1845a97..1c9fbab 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1662,6 +1662,9 @@ zonelist_scan: unsigned long mark
Re: [REPOST] [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v3)
* Christoph Lameter c...@linux.com [2011-01-20 08:49:27]: On Thu, 20 Jan 2011, Balbir Singh wrote: --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -253,11 +253,11 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern long vm_total_pages; +extern int sysctl_min_unmapped_ratio; +extern int zone_reclaim(struct zone *, gfp_t, unsigned int); #ifdef CONFIG_NUMA extern int zone_reclaim_mode; -extern int sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; -extern int zone_reclaim(struct zone *, gfp_t, unsigned int); #else #define zone_reclaim_mode 0 So the end result of this patch is that zone reclaim is compiled into vmscan.o even on !NUMA configurations but since zone_reclaim_mode == 0 noone can ever call that code? The third patch, fixes this with the introduction of a config (cut-copy-paste below). If someone were to bisect to this point, what you say is correct. +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA) extern int sysctl_min_unmapped_ratio; extern int zone_reclaim(struct zone *, gfp_t, unsigned int); -#ifdef CONFIG_NUMA -extern int zone_reclaim_mode; -extern int sysctl_min_slab_ratio; #else -#define zone_reclaim_mode 0 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) { return 0; } #endif Thanks for the review! -- Three Cheers, Balbir -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [REPOST] [PATCH 2/3] Refactor zone_reclaim code (v3)
* Christoph Lameter c...@linux.com [2011-01-20 08:50:40]: Reviewed-by: Christoph Lameter c...@linux.com Thanks for the review! -- Three Cheers, Balbir -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [REPOST] [PATCH 3/3] Provide control over unmapped pages (v3)
* Christoph Lameter c...@linux.com [2011-01-20 09:00:09]: On Thu, 20 Jan 2011, Balbir Singh wrote: + unmapped_page_control + [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL + is enabled. It controls the amount of unmapped memory + that is present in the system. This boot option plus + vm.min_unmapped_ratio (sysctl) provide granular control min_unmapped_ratio is there to guarantee that zone reclaim does not reclaim all unmapped pages. What you want here is a max_unmapped_ratio. I thought about that, the logic for reusing min_unmapped_ratio was to keep a limit beyond which unmapped page cache shrinking should stop. I think you are suggesting max_unmapped_ratio as the point at which shrinking should begin, right? { @@ -2297,6 +2320,12 @@ loop_again: shrink_active_list(SWAP_CLUSTER_MAX, zone, &sc, priority, 0); + /* + * We do unmapped page reclaim once here and once + * below, so that we don't lose out + */ + reclaim_unmapped_pages(priority, zone, &sc); + if (!zone_watermark_ok_safe(zone, order, H. Okay that means background reclaim does it. If so then we also want zone reclaim to be able to work in the background I think. Anything specific you had in mind, works for me in testing, but is there anything specific that stands out in your mind that needs to be done? Thanks for the review! -- Three Cheers, Balbir -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
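The min- vs. max-ratio distinction debated above can be sketched as two thresholds on a zone's unmapped page count. This is a hypothetical userspace model: max_unmapped_ratio was only proposed in this thread, never merged, and the struct and function names below are illustrative, not kernel code.

```c
#include <assert.h>

/* Model of the two thresholds: the proposed max would be the trigger
 * at which unmapped page cache reclaim begins; the existing min is the
 * floor at which reclaim must stop, so a zone is never stripped of all
 * of its unmapped page cache. */
struct zone_model {
    unsigned long unmapped_pages;
    unsigned long min_unmapped_pages; /* floor, from vm.min_unmapped_ratio */
    unsigned long max_unmapped_pages; /* trigger, from proposed max_unmapped_ratio */
};

static int should_start_reclaim(const struct zone_model *z)
{
    return z->unmapped_pages > z->max_unmapped_pages;
}

static int should_stop_reclaim(const struct zone_model *z)
{
    return z->unmapped_pages <= z->min_unmapped_pages;
}
```

Under this reading, Balbir's patch reuses the min threshold for both roles, which is what Christoph is objecting to.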
Re: [PATCH 3/3] Provide control over unmapped pages (v2)
* MinChan Kim minchan@gmail.com [2010-12-14 20:02:45]: + if (should_reclaim_unmapped_pages(zone)) + wakeup_kswapd(zone, order); I think we can put the logic into zone_watermark_okay. I did some checks and zone_watermark_ok is used in several places for a generic check like this -- for example prior to zone_reclaim(), if in get_page_from_freelist() we skip zones based on the return value. The compaction code uses it as well, the impact would be deeper. The compaction code uses it to check whether an allocation will succeed or not, I don't want unmapped page control to impact that. + /* + * We do unmapped page reclaim once here and once + * below, so that we don't lose out + */ + reclaim_unmapped_pages(priority, zone, &sc); It can make unnecessary stir of lru pages. How about this? zone_watermark_ok returns ZONE_UNMAPPED_PAGE_FULL. wakeup_kswapd(..., please reclaim unmapped page cache). If kswapd is woke up by unmapped page full, kswapd sets up sc with unmap = 0. If the kswapd try to reclaim unmapped page, shrink_page_list doesn't rotate non-unmapped pages. With may_unmap set to 0 and may_writepage set to 0, I don't think this should be a major problem, like I said this code is already enabled if zone_reclaim_mode != 0 and CONFIG_NUMA is set.
+unsigned long reclaim_unmapped_pages(int priority, struct zone *zone, + struct scan_control *sc) +{ + if (unlikely(unmapped_page_control) && + (zone_unmapped_file_pages(zone) > zone->min_unmapped_pages)) { + struct scan_control nsc; + unsigned long nr_pages; + + nsc = *sc; + + nsc.swappiness = 0; + nsc.may_writepage = 0; + nsc.may_unmap = 0; + nsc.nr_reclaimed = 0; This logic can be put in zone_reclaim_unmapped_pages. Now that I refactored the code and called it zone_reclaim_pages, I expect the correct sc to be passed to it. This code is reused between zone_reclaim() and reclaim_unmapped_pages(). In the former, zone_reclaim does the setup. If we want really this, how about the new cache lru idea as Kame suggests? For example, add_to_page_cache_lru adds the page into cache lru. page_add_file_rmap moves the page into inactive file. page_remove_rmap moves the page into lru cache, again. We can count the unmapped pages and if the size exceeds limit, we can wake up kswapd. whenever the memory pressure happens, first of all, reclaimer try to reclaim cache lru. We already have a file LRU and that has active/inactive lists, I don't think a special mapped/unmapped list makes sense at this point. -- Three Cheers, Balbir -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
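The scan_control setup quoted above can be modeled in isolation. The struct below is a reduced stand-in for the kernel's scan_control, not the real thing; the field names follow the quoted patch, and the point is just how the derived control clears swappiness, may_writepage and may_unmap so that only clean, unmapped file pages are reclaim candidates.

```c
#include <assert.h>

/* Reduced stand-in for the kernel's scan_control. */
struct scan_control_model {
    int swappiness;
    int may_writepage;
    int may_unmap;
    unsigned long nr_reclaimed;
};

/* Derive a scan control for unmapped page cache reclaim: copy the
 * caller's control, then clear the knobs so anon pages are untouched
 * (swappiness = 0), dirty pages are skipped rather than written back
 * (may_writepage = 0), and mapped pages are left alone (may_unmap = 0). */
static struct scan_control_model
unmapped_scan_control(const struct scan_control_model *sc)
{
    struct scan_control_model nsc = *sc;

    nsc.swappiness = 0;
    nsc.may_writepage = 0;
    nsc.may_unmap = 0;
    nsc.nr_reclaimed = 0;
    return nsc;
}
```

This mirrors the argument Balbir makes in the reply: with may_unmap and may_writepage both zero, the pass only touches clean unmapped page cache.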
Re: [PATCH 2/3] Refactor zone_reclaim (v2)
* MinChan Kim minchan@gmail.com [2010-12-15 07:38:42]: On Tue, Dec 14, 2010 at 8:45 PM, Balbir Singh bal...@linux.vnet.ibm.com wrote: * MinChan Kim minchan@gmail.com [2010-12-14 19:01:26]: Hi Balbir, On Fri, Dec 10, 2010 at 11:31 PM, Balbir Singh bal...@linux.vnet.ibm.com wrote: Move reusable functionality outside of zone_reclaim. Make zone_reclaim_unmapped_pages modular Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- mm/vmscan.c | 35 +++ 1 files changed, 23 insertions(+), 12 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index e841cae..4e2ad05 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone *zone) } /* + * Helper function to reclaim unmapped pages, we might add something + * similar to this for slab cache as well. Currently this function + * is shared with __zone_reclaim() + */ +static inline void +zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc, + unsigned long nr_pages) +{ + int priority; + /* + * Free memory by calling shrink zone with increasing + * priorities until we have enough memory freed. + */ + priority = ZONE_RECLAIM_PRIORITY; + do { + shrink_zone(priority, zone, sc); + priority--; + } while (priority >= 0 && sc->nr_reclaimed < nr_pages); +} As I said previous version, zone_reclaim_unmapped_pages doesn't have any functions related to reclaim unmapped pages. The scan control point has the right arguments for implementing reclaim of unmapped pages. I mean you should set up scan_control setup in this function. Current zone_reclaim_unmapped_pages doesn't have any specific routine related to reclaim unmapped pages. Otherwise, change the function name with just zone_reclaim_pages. I think you don't want it. Done, I renamed the function to zone_reclaim_pages. Thanks!
-- Three Cheers, Balbir -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] Unmapped Page Control (v3)
The following series implements page cache control; this is a split-out version of patch 1 of version 3 of the page cache optimization patches posted earlier at Previous posting http://lwn.net/Articles/419564/ For those with LWN.net access, there is a detailed coverage of the patchset at http://lwn.net/Articles/419713/ The previous few revisions received a lot of comments, I've tried to address as many of those as possible in this revision. An earlier series was reviewed by Christoph Lameter. There were comments on overlap with Nick's changes. I don't feel these changes impact Nick's work and integration can/will be considered as the patches evolve, if need be. Detailed Description This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario - In a virtualized environment with cache=writethrough, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. - The option is controlled via a boot option and the administrator can selectively turn it on, on a need-to-use basis. A lot of the code is borrowed from zone_reclaim_mode logic for __zone_reclaim(). One might argue that with ballooning and KSM this feature is not very useful, but even with ballooning, we need extra logic to balloon multiple VM machines and it is hard to figure out the correct amount of memory to balloon. With these patches applied, each guest has a sufficient amount of free memory available, that can be easily seen and reclaimed by the balloon driver.
The additional memory in the guest can be reused for additional applications or used to start additional guests/balance memory in the host. KSM currently does not de-duplicate host and guest page cache. The goal of this patch is to help automatically balance unmapped page cache when instructed to do so. There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO and the number of pages to reclaim when the unmapped_page_control argument is supplied. These numbers were chosen to avoid aggressiveness in reaping page cache ever so frequently, at the same time providing control. The sysctl for min_unmapped_ratio provides further control from within the guest on the amount of unmapped pages to reclaim. Data from the previous patchsets can be found at https://lkml.org/lkml/2010/11/30/79 Size measurement CONFIG_UNMAPPED_PAGECACHE_CONTROL and CONFIG_NUMA enabled # size mm/built-in.o text data bss dec hex filename 419431 1883047 140888 2443366 254866 mm/built-in.o CONFIG_UNMAPPED_PAGECACHE_CONTROL disabled, CONFIG_NUMA enabled # size mm/built-in.o text data bss dec hex filename 418908 1883023 140888 2442819 254643 mm/built-in.o --- Balbir Singh (3): Move zone_reclaim() outside of CONFIG_NUMA Refactor zone_reclaim code Provide control over unmapped pages Documentation/kernel-parameters.txt |8 ++ include/linux/mmzone.h |4 + include/linux/swap.h| 21 +- init/Kconfig| 12 +++ kernel/sysctl.c | 20 +++-- mm/page_alloc.c |9 ++ mm/vmscan.c | 132 +++ 7 files changed, 175 insertions(+), 31 deletions(-) -- Balbir Singh -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v3)
This patch moves zone_reclaim and associated helpers outside CONFIG_NUMA. This infrastructure is reused in the patches for page cache control that follow. Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- include/linux/mmzone.h |4 ++-- include/linux/swap.h |4 ++-- kernel/sysctl.c| 18 +- mm/vmscan.c|2 -- 4 files changed, 13 insertions(+), 15 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 4890662..aeede91 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -302,12 +302,12 @@ struct zone { */ unsigned long lowmem_reserve[MAX_NR_ZONES]; -#ifdef CONFIG_NUMA - int node; /* * zone reclaim becomes active if more unmapped pages exist. */ unsigned long min_unmapped_pages; +#ifdef CONFIG_NUMA + int node; unsigned long min_slab_pages; #endif struct per_cpu_pageset __percpu *pageset; diff --git a/include/linux/swap.h b/include/linux/swap.h index 84375e4..ac5c06e 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -253,11 +253,11 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern long vm_total_pages; +extern int sysctl_min_unmapped_ratio; +extern int zone_reclaim(struct zone *, gfp_t, unsigned int); #ifdef CONFIG_NUMA extern int zone_reclaim_mode; -extern int sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; -extern int zone_reclaim(struct zone *, gfp_t, unsigned int); #else #define zone_reclaim_mode 0 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index a00fdef..e40040e 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1211,15 +1211,6 @@ static struct ctl_table vm_table[] = { .extra1 = &zero, }, #endif -#ifdef CONFIG_NUMA - { - .procname = "zone_reclaim_mode", - .data = &zone_reclaim_mode, - .maxlen = sizeof(zone_reclaim_mode), - .mode = 0644, - .proc_handler = proc_dointvec, - .extra1 = &zero, - }, { .procname = "min_unmapped_ratio", .data =
&sysctl_min_unmapped_ratio, @@ -1229,6 +1220,15 @@ static struct ctl_table vm_table[] = { .extra1 = &zero, .extra2 = &one_hundred, }, +#ifdef CONFIG_NUMA + { + .procname = "zone_reclaim_mode", + .data = &zone_reclaim_mode, + .maxlen = sizeof(zone_reclaim_mode), + .mode = 0644, + .proc_handler = proc_dointvec, + .extra1 = &zero, + }, { .procname = "min_slab_ratio", .data = &sysctl_min_slab_ratio, diff --git a/mm/vmscan.c b/mm/vmscan.c index 42a4859..e841cae 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2740,7 +2740,6 @@ static int __init kswapd_init(void) module_init(kswapd_init) -#ifdef CONFIG_NUMA /* * Zone reclaim mode * @@ -2950,7 +2949,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) return ret; } -#endif /* * page_evictable - test whether a page is evictable -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] Refactor zone_reclaim code (v3)
Changelog v3 1. Renamed zone_reclaim_unmapped_pages to zone_reclaim_pages Refactor zone_reclaim, move reusable functionality outside of zone_reclaim. Make zone_reclaim_unmapped_pages modular Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- mm/vmscan.c | 35 +++ 1 files changed, 23 insertions(+), 12 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index e841cae..3b25423 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone *zone) } /* + * Helper function to reclaim unmapped pages, we might add something + * similar to this for slab cache as well. Currently this function + * is shared with __zone_reclaim() + */ +static inline void +zone_reclaim_pages(struct zone *zone, struct scan_control *sc, + unsigned long nr_pages) +{ + int priority; + /* + * Free memory by calling shrink zone with increasing + * priorities until we have enough memory freed. + */ + priority = ZONE_RECLAIM_PRIORITY; + do { + shrink_zone(priority, zone, sc); + priority--; + } while (priority >= 0 && sc->nr_reclaimed < nr_pages); +} + +/* * Try to free up some pages from this zone through reclaim. */ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) @@ -2823,7 +2844,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) const unsigned long nr_pages = 1 << order; struct task_struct *p = current; struct reclaim_state reclaim_state; - int priority; struct scan_control sc = { .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE), .may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP), @@ -2847,17 +2867,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) reclaim_state.reclaimed_slab = 0; p->reclaim_state = &reclaim_state; - if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) { - /* - * Free memory by calling shrink zone with increasing - * priorities until we have enough memory freed.
- */ - priority = ZONE_RECLAIM_PRIORITY; - do { - shrink_zone(priority, zone, &sc); - priority--; - } while (priority >= 0 && sc.nr_reclaimed < nr_pages); - } + if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) + zone_reclaim_pages(zone, &sc, nr_pages); nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE); if (nr_slab_pages0 > zone->min_slab_pages) { -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
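The descending-priority loop factored out in this patch can be modeled in userspace. In the sketch below, shrink_zone_model is a stand-in that pretends each pass frees a few more pages as the priority value drops (a lower value means a wider, more aggressive scan); ZONE_RECLAIM_PRIORITY = 4 follows the kernel's definition in mm/vmscan.c of that era, and everything else is illustrative.

```c
#include <assert.h>

#define ZONE_RECLAIM_PRIORITY 4

struct scan_control_model {
    unsigned long nr_reclaimed;
};

/* Stand-in for shrink_zone(): pretend each pass frees progressively
 * more pages as priority decreases (1 page at priority 4, up to
 * 5 pages at priority 0). */
static void shrink_zone_model(int priority, struct scan_control_model *sc)
{
    sc->nr_reclaimed += (unsigned long)(ZONE_RECLAIM_PRIORITY - priority + 1);
}

/* The loop shape of zone_reclaim_pages(): keep scanning at ever more
 * aggressive priorities until nr_pages are reclaimed or priority
 * drops below zero. */
static unsigned long zone_reclaim_pages_model(struct scan_control_model *sc,
                                              unsigned long nr_pages)
{
    int priority = ZONE_RECLAIM_PRIORITY;

    do {
        shrink_zone_model(priority, sc);
        priority--;
    } while (priority >= 0 && sc->nr_reclaimed < nr_pages);

    return sc->nr_reclaimed;
}
```

Note the two exit conditions: the loop can overshoot the target slightly (it checks only between passes), and it gives up after the priority-0 pass even if the target was not met.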
[PATCH 3/3] Provide control over unmapped pages (v3)
Changelog v2 1. Use a config option to enable the code (Andrew Morton) 2. Explain the magic tunables in the code or at-least attempt to explain them (General comment) 3. Hint uses of the boot parameter with unlikely (Andrew Morton) 4. Use better names (balanced is not a good naming convention) Provide control using zone_reclaim() and a boot parameter. The code reuses functionality from zone_reclaim() to isolate unmapped pages and reclaim them as a priority, ahead of other mapped pages. Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- Documentation/kernel-parameters.txt |8 +++ include/linux/swap.h| 21 ++-- init/Kconfig| 12 kernel/sysctl.c |2 + mm/page_alloc.c |9 +++ mm/vmscan.c | 97 +++ 6 files changed, 142 insertions(+), 7 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index dd8fe2b..f52b0bd 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2515,6 +2515,14 @@ and is between 256 and 4096 characters. It is defined in the file [X86] Set unknown_nmi_panic=1 early on boot. + unmapped_page_control + [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL + is enabled. It controls the amount of unmapped memory + that is present in the system. This boot option plus + vm.min_unmapped_ratio (sysctl) provide granular control + over how much unmapped page cache can exist in the system + before kswapd starts reclaiming unmapped page cache pages. + usbcore.autosuspend= [USB] The autosuspend time delay (in seconds) used for newly-detected USB devices (default 2). 
diff --git a/include/linux/swap.h b/include/linux/swap.h index ac5c06e..773d7e5 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -253,19 +253,32 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern long vm_total_pages; +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA) extern int sysctl_min_unmapped_ratio; extern int zone_reclaim(struct zone *, gfp_t, unsigned int); -#ifdef CONFIG_NUMA -extern int zone_reclaim_mode; -extern int sysctl_min_slab_ratio; #else -#define zone_reclaim_mode 0 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) { return 0; } #endif +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) +extern bool should_reclaim_unmapped_pages(struct zone *zone); +#else +static inline bool should_reclaim_unmapped_pages(struct zone *zone) +{ + return false; +} +#endif + +#ifdef CONFIG_NUMA +extern int zone_reclaim_mode; +extern int sysctl_min_slab_ratio; +#else +#define zone_reclaim_mode 0 +#endif + extern int page_evictable(struct page *page, struct vm_area_struct *vma); extern void scan_mapping_unevictable_pages(struct address_space *); diff --git a/init/Kconfig b/init/Kconfig index 3eb22ad..78c9169 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -782,6 +782,18 @@ endif # NAMESPACES config MM_OWNER bool +config UNMAPPED_PAGECACHE_CONTROL + bool "Provide control over unmapped page cache" + default n + help + This option adds support for controlling unmapped page cache + via a boot parameter (unmapped_page_control). The boot parameter + with sysctl (vm.min_unmapped_ratio) control the total number + of unmapped pages in the system. This feature is useful if + you want to limit the amount of unmapped page cache or want + to reduce page cache duplication in a virtualized environment.
+ If unsure say 'N' + config SYSFS_DEPRECATED bool "enable deprecated sysfs features to support old userspace tools" depends on SYSFS diff --git a/kernel/sysctl.c b/kernel/sysctl.c index e40040e..ab2c60a 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -1211,6 +1211,7 @@ static struct ctl_table vm_table[] = { .extra1 = &zero, }, #endif +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA) { .procname = "min_unmapped_ratio", .data = &sysctl_min_unmapped_ratio, @@ -1220,6 +1221,7 @@ static struct ctl_table vm_table[] = { .extra1 = &zero, .extra2 = &one_hundred, }, +#endif #ifdef CONFIG_NUMA { .procname = "zone_reclaim_mode", diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1845a97..1c9fbab 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1662,6 +1662,9 @@ zonelist_scan: unsigned long mark
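As a rough illustration of how the vm.min_unmapped_ratio sysctl mentioned throughout this series becomes the per-zone min_unmapped_pages threshold: the sysctl is a percentage of the zone's pages, converted with integer arithmetic when it is written (the helper below is a userspace sketch of that percentage calculation, not the kernel function itself).

```c
#include <assert.h>

/* vm.min_unmapped_ratio is a percentage of the zone's pages; the
 * resulting page count is the floor below which unmapped page cache
 * reclaim stops for that zone (integer division, so it rounds down). */
static unsigned long min_unmapped_pages(unsigned long zone_pages,
                                        unsigned long min_unmapped_ratio)
{
    return zone_pages * min_unmapped_ratio / 100;
}
```

So on a zone of 262144 pages (1 GB of 4 KB pages) with the default ratio of 1, reclaim leaves roughly 2621 pages (about 10 MB) of unmapped page cache untouched.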
Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting
* Rik van Riel r...@redhat.com [2010-12-13 12:02:51]: On 12/11/2010 08:57 AM, Balbir Singh wrote: If the vcpu holding the lock runs more and is capped, the timeslice transfer is a heuristic that will not help. That indicates you really need the cap to be per guest, and not per VCPU. Yes, I personally think so too, but I suspect there needs to be a larger agreement on the semantics. The VCPU semantics in terms of power apply to each VCPU as opposed to the entire system (per guest). Having one VCPU spin on a lock (and achieve nothing), because the other one cannot give up the lock due to hitting its CPU cap could lead to showstoppingly bad performance. Yes, that seems right! -- Three Cheers, Balbir -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] Refactor zone_reclaim (v2)
* MinChan Kim minchan@gmail.com [2010-12-14 19:01:26]: Hi Balbir, On Fri, Dec 10, 2010 at 11:31 PM, Balbir Singh bal...@linux.vnet.ibm.com wrote: Move reusable functionality outside of zone_reclaim. Make zone_reclaim_unmapped_pages modular Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- mm/vmscan.c | 35 +++ 1 files changed, 23 insertions(+), 12 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index e841cae..4e2ad05 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone *zone) } /* + * Helper function to reclaim unmapped pages, we might add something + * similar to this for slab cache as well. Currently this function + * is shared with __zone_reclaim() + */ +static inline void +zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc, + unsigned long nr_pages) +{ + int priority; + /* + * Free memory by calling shrink zone with increasing + * priorities until we have enough memory freed. + */ + priority = ZONE_RECLAIM_PRIORITY; + do { + shrink_zone(priority, zone, sc); + priority--; + } while (priority >= 0 && sc->nr_reclaimed < nr_pages); +} As I said previous version, zone_reclaim_unmapped_pages doesn't have any functions related to reclaim unmapped pages. The function name is rather strange. It would be better to add scan_control setup in function inner to reclaim only unmapped pages. OK, that is an idea worth looking at, I'll revisit this function. Thanks for the review! -- Three Cheers, Balbir -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting
* Avi Kivity a...@redhat.com [2010-12-11 09:31:24]: On 12/10/2010 07:03 AM, Balbir Singh wrote: Scheduler people, please flame me with anything I may have done wrong, so I can do it right for a next version :) This is a good problem statement, there are other things to consider as well 1. If a hard limit feature is enabled underneath, donating the timeslice would probably not make too much sense in that case What's the alternative? Consider a two vcpu guest with a 50% hard cap. Suppose the workload involves ping-ponging within the guest. If the scheduler decides to schedule the vcpus without any overlap, then the throughput will be dictated by the time slice. If we allow donation, throughput is limited by context switch latency. If the vcpu holding the lock runs more and is capped, the timeslice transfer is a heuristic that will not help. -- Three Cheers, Balbir -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting
* Avi Kivity a...@redhat.com [2010-12-13 13:57:37]: On 12/11/2010 03:57 PM, Balbir Singh wrote: * Avi Kivity a...@redhat.com [2010-12-11 09:31:24]: On 12/10/2010 07:03 AM, Balbir Singh wrote: Scheduler people, please flame me with anything I may have done wrong, so I can do it right for a next version :) This is a good problem statement, there are other things to consider as well 1. If a hard limit feature is enabled underneath, donating the timeslice would probably not make too much sense in that case What's the alternative? Consider a two vcpu guest with a 50% hard cap. Suppose the workload involves ping-ponging within the guest. If the scheduler decides to schedule the vcpus without any overlap, then the throughput will be dictated by the time slice. If we allow donation, throughput is limited by context switch latency. If the vcpu holding the lock runs more and is capped, the timeslice transfer is a heuristic that will not help. Why not? as long as we shift the cap as well. Shifting the cap would break it, no? Anyway, that is something for us to keep track of as we add additional heuristics, not a show stopper. -- Three Cheers, Balbir -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] Provide unmapped page cache control (v2)
The following series implements page cache control; this is a split-out version of patch 1 of version 3 of the page cache optimization patches posted earlier at Previous posting https://lkml.org/lkml/2010/11/30/79 The previous revision received a lot of comments, I've tried to address as many of those as possible in this revision. The last series was reviewed by Christoph Lameter. There were comments on overlap with Nick's changes. I don't feel these changes impact Nick's work and integration can/will be considered as the patches evolve, if need be. Detailed Description This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario - In a virtualized environment with cache=writethrough, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. - The option is controlled via a boot option and the administrator can selectively turn it on, on a need-to-use basis. A lot of the code is borrowed from zone_reclaim_mode logic for __zone_reclaim(). One might argue that with ballooning and KSM this feature is not very useful, but even with ballooning, we need extra logic to balloon multiple VM machines and it is hard to figure out the correct amount of memory to balloon. With these patches applied, each guest has a sufficient amount of free memory available, that can be easily seen and reclaimed by the balloon driver. The additional memory in the guest can be reused for additional applications or used to start additional guests/balance memory in the host.
KSM currently does not de-duplicate host and guest page cache. The goal of this patch is to help automatically balance unmapped page cache when instructed to do so.

There are some magic numbers in use in the code: UNMAPPED_PAGE_RATIO and the number of pages to reclaim when the unmapped_page_control argument is supplied. These numbers were chosen to avoid reaping page cache too aggressively or too frequently, while still providing control. The sysctl for min_unmapped_ratio provides further control from within the guest on the amount of unmapped pages to reclaim.

Data from the previous patchsets can be found at https://lkml.org/lkml/2010/11/30/79

Size measurement

CONFIG_UNMAPPED_PAGECACHE_CONTROL and CONFIG_NUMA enabled
# size mm/built-in.o
   text     data    bss     dec    hex filename
 419431  1883047 140888 2443366 254866 mm/built-in.o

CONFIG_UNMAPPED_PAGECACHE_CONTROL disabled, CONFIG_NUMA enabled
# size mm/built-in.o
   text     data    bss     dec    hex filename
 418908  1883023 140888 2442819 254643 mm/built-in.o

---

Balbir Singh (3):
      Move zone_reclaim() outside of CONFIG_NUMA
      Refactor zone_reclaim, move reusable functionality outside
      Provide control over unmapped pages

 Documentation/kernel-parameters.txt |    8 ++
 include/linux/mmzone.h              |    4 +
 include/linux/swap.h                |   21 +-
 init/Kconfig                        |   12 +++
 kernel/sysctl.c                     |   20 +++--
 mm/page_alloc.c                     |    9 ++
 mm/vmscan.c                         |  132 +++
 7 files changed, 175 insertions(+), 31 deletions(-)

-- 
Three Cheers,
Balbir
[PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v2)
Changelog v2
- Moved sysctl for min_unmapped_ratio as well

This patch moves zone_reclaim and associated helpers outside CONFIG_NUMA. This infrastructure is reused in the patches for page cache control that follow.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 include/linux/mmzone.h |    4 ++--
 include/linux/swap.h   |    4 ++--
 kernel/sysctl.c        |   18 +-
 mm/vmscan.c            |    2 --
 4 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4890662..aeede91 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -302,12 +302,12 @@ struct zone {
 	 */
 	unsigned long lowmem_reserve[MAX_NR_ZONES];
 
-#ifdef CONFIG_NUMA
-	int node;
 	/*
 	 * zone reclaim becomes active if more unmapped pages exist.
 	 */
 	unsigned long min_unmapped_pages;
+#ifdef CONFIG_NUMA
+	int node;
 	unsigned long min_slab_pages;
 #endif
 	struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 84375e4..ac5c06e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -253,11 +253,11 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
+extern int sysctl_min_unmapped_ratio;
+extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
 #define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index a00fdef..e40040e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1211,15 +1211,6 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 #endif
-#ifdef CONFIG_NUMA
-	{
-		.procname	= "zone_reclaim_mode",
-		.data		= &zone_reclaim_mode,
-		.maxlen		= sizeof(zone_reclaim_mode),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
-		.extra1		= &zero,
-	},
 	{
 		.procname	= "min_unmapped_ratio",
 		.data		= &sysctl_min_unmapped_ratio,
@@ -1229,6 +1220,15 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
 	},
+#ifdef CONFIG_NUMA
+	{
+		.procname	= "zone_reclaim_mode",
+		.data		= &zone_reclaim_mode,
+		.maxlen		= sizeof(zone_reclaim_mode),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+		.extra1		= &zero,
+	},
 	{
 		.procname	= "min_slab_ratio",
 		.data		= &sysctl_min_slab_ratio,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 42a4859..e841cae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2740,7 +2740,6 @@ static int __init kswapd_init(void)
 module_init(kswapd_init)
 
-#ifdef CONFIG_NUMA
 /*
  * Zone reclaim mode
  *
@@ -2950,7 +2949,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	return ret;
 }
-#endif
 
 /*
  * page_evictable - test whether a page is evictable
--
[PATCH 2/3] Refactor zone_reclaim (v2)
Move reusable functionality outside of zone_reclaim. Make zone_reclaim_unmapped_pages modular.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 mm/vmscan.c |   35 +++++++++++++++++++++++------------
 1 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e841cae..4e2ad05 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
 }
 
 /*
+ * Helper function to reclaim unmapped pages, we might add something
+ * similar to this for slab cache as well. Currently this function
+ * is shared with __zone_reclaim()
+ */
+static inline void
+zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc,
+				unsigned long nr_pages)
+{
+	int priority;
+	/*
+	 * Free memory by calling shrink zone with increasing
+	 * priorities until we have enough memory freed.
+	 */
+	priority = ZONE_RECLAIM_PRIORITY;
+	do {
+		shrink_zone(priority, zone, sc);
+		priority--;
+	} while (priority >= 0 && sc->nr_reclaimed < nr_pages);
+}
+
+/*
  * Try to free up some pages from this zone through reclaim.
  */
 static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
@@ -2823,7 +2844,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	struct reclaim_state reclaim_state;
-	int priority;
 	struct scan_control sc = {
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2847,17 +2867,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
-		/*
-		 * Free memory by calling shrink zone with increasing
-		 * priorities until we have enough memory freed.
-		 */
-		priority = ZONE_RECLAIM_PRIORITY;
-		do {
-			shrink_zone(priority, zone, &sc);
-			priority--;
-		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
-	}
+	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages)
+		zone_reclaim_unmapped_pages(zone, &sc, nr_pages);
 
 	nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
 	if (nr_slab_pages0 > zone->min_slab_pages) {
--
[PATCH 3/3] Provide control over unmapped pages (v2)
Changelog v2 1. Use a config option to enable the code (Andrew Morton) 2. Explain the magic tunables in the code or at-least attempt to explain them (General comment) 3. Hint uses of the boot parameter with unlikely (Andrew Morton) 4. Use better names (balanced is not a good naming convention) 5. Updated Documentation/kernel-parameters.txt (Andrew Morton) Provide control using zone_reclaim() and a boot parameter. The code reuses functionality from zone_reclaim() to isolate unmapped pages and reclaim them as a priority, ahead of other mapped pages. Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- Documentation/kernel-parameters.txt |8 +++ include/linux/swap.h| 21 ++-- init/Kconfig| 12 kernel/sysctl.c |2 + mm/page_alloc.c |9 +++ mm/vmscan.c | 97 +++ 6 files changed, 142 insertions(+), 7 deletions(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index dd8fe2b..f52b0bd 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2515,6 +2515,14 @@ and is between 256 and 4096 characters. It is defined in the file [X86] Set unknown_nmi_panic=1 early on boot. + unmapped_page_control + [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL + is enabled. It controls the amount of unmapped memory + that is present in the system. This boot option plus + vm.min_unmapped_ratio (sysctl) provide granular control + over how much unmapped page cache can exist in the system + before kswapd starts reclaiming unmapped page cache pages. + usbcore.autosuspend= [USB] The autosuspend time delay (in seconds) used for newly-detected USB devices (default 2). 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ac5c06e..773d7e5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -253,19 +253,32 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
 extern int sysctl_min_unmapped_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
-#ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
-extern int sysctl_min_slab_ratio;
 #else
-#define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
 {
 	return 0;
 }
 #endif
 
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+extern bool should_reclaim_unmapped_pages(struct zone *zone);
+#else
+static inline bool should_reclaim_unmapped_pages(struct zone *zone)
+{
+	return false;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+extern int zone_reclaim_mode;
+extern int sysctl_min_slab_ratio;
+#else
+#define zone_reclaim_mode 0
+#endif
+
 extern int page_evictable(struct page *page, struct vm_area_struct *vma);
 extern void scan_mapping_unevictable_pages(struct address_space *);
diff --git a/init/Kconfig b/init/Kconfig
index 3eb22ad..78c9169 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -782,6 +782,18 @@ endif # NAMESPACES
 config MM_OWNER
 	bool
 
+config UNMAPPED_PAGECACHE_CONTROL
+	bool "Provide control over unmapped page cache"
+	default n
+	help
+	  This option adds support for controlling unmapped page cache
+	  via a boot parameter (unmapped_page_control). The boot parameter
+	  with sysctl (vm.min_unmapped_ratio) control the total number
+	  of unmapped pages in the system. This feature is useful if
+	  you want to limit the amount of unmapped page cache or want
+	  to reduce page cache duplication in a virtualized environment.
+	  If unsure say 'N'
+
 config SYSFS_DEPRECATED
 	bool "enable deprecated sysfs features to support old userspace tools"
 	depends on SYSFS
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e40040e..ab2c60a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1211,6 +1211,7 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 #endif
+#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)
 	{
 		.procname	= "min_unmapped_ratio",
 		.data		= &sysctl_min_unmapped_ratio,
@@ -1220,6 +1221,7 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 		.extra2		= &one_hundred,
 	},
+#endif
 #ifdef CONFIG_NUMA
 	{
 		.procname	= "zone_reclaim_mode",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1845a97..1c9fbab 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1662,6 +1662,9
Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting
* Rik van Riel r...@redhat.com [2010-12-02 14:41:29]:

When running SMP virtual machines, it is possible for one VCPU to be spinning on a spinlock, while the VCPU that holds the spinlock is not currently running, because the host scheduler preempted it to run something else.

Both Intel and AMD CPUs have a feature that detects when a virtual CPU is spinning on a lock and will trap to the host. The current KVM code sleeps for a bit whenever that happens, which results in eg. a 64 VCPU Windows guest taking forever and a bit to boot up. This is because the VCPU holding the lock is actually running and not sleeping, so the pause is counter-productive.

In other workloads a pause can also be counter-productive, with spinlock detection resulting in one guest giving up its CPU time to the others. Instead of spinning, it ends up simply not running much at all.

This patch series aims to fix that, by having a VCPU that spins give the remainder of its timeslice to another VCPU in the same guest before yielding the CPU - one that is runnable but got preempted, hopefully the lock holder.

Scheduler people, please flame me with anything I may have done wrong, so I can do it right for a next version :)

This is a good problem statement. There are other things to consider as well:

1. If a hard limit feature is enabled underneath, donating the timeslice would probably not make much sense in that case.

2. The implicit assumption is that spinning is bad, but for locks held for short durations, that assumption is not true. I presume from the problem statement above that the hardware does the detection of when to pause, but that is not always correct, as you suggest above.

3. With respect to donating timeslices, don't scheduler cgroups and job isolation address that problem today?

-- 
Three Cheers,
Balbir
Re: [PATCH 3/3] Provide control over unmapped pages
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-12-02 11:50:36]:

On Thu, 2 Dec 2010 10:22:16 +0900 (JST) KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com wrote:

On Tue, 30 Nov 2010, Andrew Morton wrote:

+#define UNMAPPED_PAGE_RATIO 16

Well. Giving 16 a name didn't really clarify anything. Attentive readers will want to know what this does, why 16 was chosen, and what the effects of changing it will be.

The meaning is analogous to the other zone reclaim ratios. But yes, it should be justified and defined.

Reviewed-by: Christoph Lameter c...@linux.com

So you're OK with shoving all this flotsam into 100,000,000 cellphones? This was a pretty outrageous patchset!

This is a feature that has been requested over and over for years. Using /proc/sys/vm/drop_caches for fixing situations where one simply has too many page cache pages is not so much fun in the long run.

I'm not against a page cache limitation feature at all. But this is too ugly and too destructive to the fast path. I hope this patch reduces the negative impact more. And I think min_mapped_unmapped_pages is ugly. It should be unmapped_pagecache_limit or similar, because it is a limitation feature.

The feature will now be enabled with a CONFIG and a boot parameter. I find that changing the naming convention now, when it is already in use and well known, is not a good idea. The name of the boot parameter can be changed, of course.

-- 
Three Cheers,
Balbir
[PATCH 0/3] Series short description
The following series implements page cache control; this is a split-out version of patch 1 of version 3 of the page cache optimization patches posted earlier at http://www.mail-archive.com/kvm@vger.kernel.org/msg43654.html. Christoph Lameter recommended splitting out patch 1, which is what this series does.

Detailed Description

This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario:

- In a virtualized environment with cache=writethrough, we see double caching: one copy in the host and one in the guest. As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages.

- The option is controlled via a boot option and the administrator can selectively turn it on, on a need-to-use basis.

A lot of the code is borrowed from the zone_reclaim_mode logic in __zone_reclaim(). One might argue that with ballooning and KSM this feature is not very useful, but even with ballooning, we need extra logic to balloon multiple VMs, and it is hard to figure out the correct amount of memory to balloon. With these patches applied, each guest has a sufficient amount of free memory available that can be easily seen and reclaimed by the balloon driver. The additional memory in the guest can be reused for additional applications or used to start additional guests/balance memory in the host. KSM currently does not de-duplicate host and guest page cache. The goal of this patch is to help automatically balance unmapped page cache when instructed to do so.
There are some magic numbers in use in the code: UNMAPPED_PAGE_RATIO and the number of pages to reclaim when the unmapped_page_control argument is supplied. These numbers were chosen to avoid reaping page cache too aggressively or too frequently, while still providing control. The sysctl for min_unmapped_ratio provides further control from within the guest on the amount of unmapped pages to reclaim.

For a single VM, running kernbench:

Enabled
Optimal load -j 8 run number 1...
Optimal load -j 8 run number 2...
Optimal load -j 8 run number 3...
Optimal load -j 8 run number 4...
Optimal load -j 8 run number 5...
Average Optimal load -j 8 Run (std deviation):
Elapsed Time 273.726 (1.2683)
User Time 190.014 (0.589941)
System Time 298.758 (1.72574)
Percent CPU 178 (0)
Context Switches 119953 (865.74)
Sleeps 38758 (795.074)

Disabled
Optimal load -j 8 run number 1...
Optimal load -j 8 run number 2...
Optimal load -j 8 run number 3...
Optimal load -j 8 run number 4...
Optimal load -j 8 run number 5...
Average Optimal load -j 8 Run (std deviation):
Elapsed Time 272.672 (0.453178)
User Time 189.7 (0.718157)
System Time 296.77 (0.845606)
Percent CPU 178 (0)
Context Switches 118822 (277.434)
Sleeps 37542.8 (545.922)

More data on the test results with the earlier patch is at http://www.mail-archive.com/kvm@vger.kernel.org/msg43655.html

---

Balbir Singh (3):
      Move zone_reclaim() outside of CONFIG_NUMA
      Refactor zone_reclaim, move reusable functionality outside
      Provide control over unmapped pages

 include/linux/mmzone.h |    4 +-
 include/linux/swap.h   |    5 +-
 mm/page_alloc.c        |    7 ++-
 mm/vmscan.c            |  109 +++
 4 files changed, 104 insertions(+), 21 deletions(-)

-- 
Balbir
[PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA
This patch moves zone_reclaim and associated helpers outside CONFIG_NUMA. This infrastructure is reused in the patches for page cache control that follow.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 include/linux/mmzone.h |    4 ++--
 mm/vmscan.c            |    2 --
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4890662..aeede91 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -302,12 +302,12 @@ struct zone {
 	 */
 	unsigned long lowmem_reserve[MAX_NR_ZONES];
 
-#ifdef CONFIG_NUMA
-	int node;
 	/*
 	 * zone reclaim becomes active if more unmapped pages exist.
 	 */
 	unsigned long min_unmapped_pages;
+#ifdef CONFIG_NUMA
+	int node;
 	unsigned long min_slab_pages;
 #endif
 	struct per_cpu_pageset __percpu *pageset;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8cc90d5..325443a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2644,7 +2644,6 @@ static int __init kswapd_init(void)
 module_init(kswapd_init)
 
-#ifdef CONFIG_NUMA
 /*
  * Zone reclaim mode
  *
@@ -2854,7 +2853,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	return ret;
 }
-#endif
 
 /*
  * page_evictable - test whether a page is evictable
--
[PATCH 2/3] Refactor zone_reclaim
Refactor zone_reclaim, move reusable functionality outside of zone_reclaim. Make zone_reclaim_unmapped_pages modular.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 mm/vmscan.c |   35 +++++++++++++++++++++++------------
 1 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 325443a..0ac444f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2719,6 +2719,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
 }
 
 /*
+ * Helper function to reclaim unmapped pages, we might add something
+ * similar to this for slab cache as well. Currently this function
+ * is shared with __zone_reclaim()
+ */
+static inline void
+zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc,
+				unsigned long nr_pages)
+{
+	int priority;
+	/*
+	 * Free memory by calling shrink zone with increasing
+	 * priorities until we have enough memory freed.
+	 */
+	priority = ZONE_RECLAIM_PRIORITY;
+	do {
+		shrink_zone(priority, zone, sc);
+		priority--;
+	} while (priority >= 0 && sc->nr_reclaimed < nr_pages);
+}
+
+/*
  * Try to free up some pages from this zone through reclaim.
  */
 static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
@@ -2727,7 +2748,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	const unsigned long nr_pages = 1 << order;
 	struct task_struct *p = current;
 	struct reclaim_state reclaim_state;
-	int priority;
 	struct scan_control sc = {
 		.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2751,17 +2771,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
-		/*
-		 * Free memory by calling shrink zone with increasing
-		 * priorities until we have enough memory freed.
-		 */
-		priority = ZONE_RECLAIM_PRIORITY;
-		do {
-			shrink_zone(priority, zone, &sc);
-			priority--;
-		} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
-	}
+	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages)
+		zone_reclaim_unmapped_pages(zone, &sc, nr_pages);
 
 	nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
 	if (nr_slab_pages0 > zone->min_slab_pages) {
--
[PATCH 3/3] Provide control over unmapped pages
Provide control using zone_reclaim() and a boot parameter. The code reuses functionality from zone_reclaim() to isolate unmapped pages and reclaim them as a priority, ahead of other mapped pages. Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- include/linux/swap.h |5 ++- mm/page_alloc.c |7 +++-- mm/vmscan.c | 72 +- 3 files changed, 79 insertions(+), 5 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index eba53e7..78b0830 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -252,11 +252,12 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern long vm_total_pages; -#ifdef CONFIG_NUMA -extern int zone_reclaim_mode; extern int sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; extern int zone_reclaim(struct zone *, gfp_t, unsigned int); +extern bool should_balance_unmapped_pages(struct zone *zone); +#ifdef CONFIG_NUMA +extern int zone_reclaim_mode; #else #define zone_reclaim_mode 0 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 62b7280..4228da3 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1662,6 +1662,9 @@ zonelist_scan: unsigned long mark; int ret; + if (should_balance_unmapped_pages(zone)) + wakeup_kswapd(zone, order); + mark = zone-watermark[alloc_flags ALLOC_WMARK_MASK]; if (zone_watermark_ok(zone, order, mark, classzone_idx, alloc_flags)) @@ -4136,10 +4139,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat, zone-spanned_pages = size; zone-present_pages = realsize; -#ifdef CONFIG_NUMA - zone-node = nid; zone-min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio) / 100; +#ifdef CONFIG_NUMA + zone-node = nid; zone-min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100; #endif zone-name = zone_names[j]; diff --git a/mm/vmscan.c b/mm/vmscan.c index 0ac444f..98950f4 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -145,6 +145,21 @@ static 
DECLARE_RWSEM(shrinker_rwsem); #define scanning_global_lru(sc)(1) #endif +static unsigned long balance_unmapped_pages(int priority, struct zone *zone, + struct scan_control *sc); +static int unmapped_page_control __read_mostly; + +static int __init unmapped_page_control_parm(char *str) +{ + unmapped_page_control = 1; + /* +* XXX: Should we tweak swappiness here? +*/ + return 1; +} +__setup(unmapped_page_control, unmapped_page_control_parm); + + static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone, struct scan_control *sc) { @@ -2223,6 +2238,12 @@ loop_again: shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0); + /* +* We do unmapped page balancing once here and once +* below, so that we don't lose out +*/ + balance_unmapped_pages(priority, zone, sc); + if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone), 0, 0)) { end_zone = i; @@ -2258,6 +2279,11 @@ loop_again: continue; sc.nr_scanned = 0; + /* +* Balance unmapped pages upfront, this should be +* really cheap +*/ + balance_unmapped_pages(priority, zone, sc); /* * Call soft limit reclaim before calling shrink_zone. @@ -2491,7 +2517,8 @@ void wakeup_kswapd(struct zone *zone, int order) pgdat-kswapd_max_order = order; if (!waitqueue_active(pgdat-kswapd_wait)) return; - if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0)) + if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0) + !should_balance_unmapped_pages(zone)) return; trace_mm_vmscan_wakeup_kswapd(pgdat-node_id, zone_idx(zone), order); @@ -2740,6 +2767,49 @@ zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc, } /* + * Routine to balance unmapped pages, inspired from the code under + * CONFIG_NUMA that does unmapped page and slab page control by keeping + * min_unmapped_pages
Re: [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA
* Balbir Singh bal...@linux.vnet.ibm.com [2010-12-01 10:04:08]: * Andrew Morton a...@linux-foundation.org [2010-11-30 14:23:38]: On Tue, 30 Nov 2010 15:45:12 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: This patch moves zone_reclaim and associated helpers outside CONFIG_NUMA. This infrastructure is reused in the patches for page cache control that follow. Thereby adding a nice dollop of bloat to everyone's kernel. I don't think that is justifiable given that the audience for this feature is about eight people :( How's about CONFIG_UNMAPPED_PAGECACHE_CONTROL? OK, I'll add the config, but this code is enabled under CONFIG_NUMA today, so the bloat I agree is more for non NUMA users. I'll make CONFIG_UNMAPPED_PAGECACHE_CONTROL default if CONFIG_NUMA is set. Also this patch instantiates sysctl_min_unmapped_ratio and sysctl_min_slab_ratio on non-NUMA builds but fails to make those tunables actually tunable in procfs. Changes to sysctl.c are needed. Oh! yeah.. I missed it while refactoring, my fault. Reviewed-by: Christoph Lameter c...@linux.com My local MTA failed to deliver the message, trying again. -- Three Cheers, Balbir -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] Refactor zone_reclaim
* Balbir Singh bal...@linux.vnet.ibm.com [2010-12-01 10:16:34]: * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-12-01 10:23:29]: On Tue, 30 Nov 2010 15:45:55 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: Refactor zone_reclaim, move reusable functionality outside of zone_reclaim. Make zone_reclaim_unmapped_pages modular Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com Why is this min_mapped_pages based on zone (IOW, per-zone) ? Kamezawa-San, this has been the design before the refactoring (it is based on zone_reclaim_mode and reclaim based on top of that). I am reusing bits of existing technology. The advantage of it being per-zone is that it integrates well with kswapd. My local MTA failed to deliver the message, trying again. -- Three Cheers, Balbir -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] Provide control over unmapped pages
* Balbir Singh bal...@linux.vnet.ibm.com [2010-12-01 10:24:21]: * Andrew Morton a...@linux-foundation.org [2010-11-30 14:25:09]: On Tue, 30 Nov 2010 15:46:31 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: Provide control using zone_reclaim() and a boot parameter. The code reuses functionality from zone_reclaim() to isolate unmapped pages and reclaim them as a priority, ahead of other mapped pages. Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- include/linux/swap.h |5 ++- mm/page_alloc.c |7 +++-- mm/vmscan.c | 72 +- 3 files changed, 79 insertions(+), 5 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index eba53e7..78b0830 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -252,11 +252,12 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern long vm_total_pages; -#ifdef CONFIG_NUMA -extern int zone_reclaim_mode; extern int sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; This change will need to be moved into the first patch. OK, will do, thanks for pointing it out extern int zone_reclaim(struct zone *, gfp_t, unsigned int); +extern bool should_balance_unmapped_pages(struct zone *zone); +#ifdef CONFIG_NUMA +extern int zone_reclaim_mode; #else #define zone_reclaim_mode 0 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 62b7280..4228da3 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1662,6 +1662,9 @@ zonelist_scan: unsigned long mark; int ret; + if (should_balance_unmapped_pages(zone)) + wakeup_kswapd(zone, order); gack, this is on the page allocator fastpath, isn't it? So 99.% of the world's machines end up doing a pointless call to a pointless function which pointlessly tests a pointless global and pointlessly returns? All because of some whacky KSM thing? 
The speed and space overhead of this code should be *zero* if !CONFIG_UNMAPPED_PAGECACHE_CONTROL and should be minimal if CONFIG_UNMAPPED_PAGECACHE_CONTROL=y. The way to do the latter is to inline the test of unmapped_page_control into callers and only if it is true (and use unlikely(), please) do we call into the KSM gunk. Will do, should_balance_unmapped_pages() will be a made a no-op in the absence of CONFIG_UNMAPPED_PAGECACHE_CONTROL --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -145,6 +145,21 @@ static DECLARE_RWSEM(shrinker_rwsem); #define scanning_global_lru(sc) (1) #endif +static unsigned long balance_unmapped_pages(int priority, struct zone *zone, + struct scan_control *sc); +static int unmapped_page_control __read_mostly; + +static int __init unmapped_page_control_parm(char *str) +{ + unmapped_page_control = 1; + /* + * XXX: Should we tweak swappiness here? + */ + return 1; +} +__setup(unmapped_page_control, unmapped_page_control_parm); aw c'mon guys, everybody knows that when you add a kernel parameter you document it in Documentation/kernel-parameters.txt. Will do - feeling silly on missing it out, that is where reviews help. static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone, struct scan_control *sc) { @@ -2223,6 +2238,12 @@ loop_again: shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0); + /* + * We do unmapped page balancing once here and once + * below, so that we don't lose out + */ + balance_unmapped_pages(priority, zone, sc); + if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone), 0, 0)) { end_zone = i; @@ -2258,6 +2279,11 @@ loop_again: continue; sc.nr_scanned = 0; + /* + * Balance unmapped pages upfront, this should be + * really cheap + */ + balance_unmapped_pages(priority, zone, sc); More unjustifiable overhead on a commonly-executed codepath. Will refactor with a CONFIG suggested above. /* * Call soft limit reclaim before calling shrink_zone. 
@@ -2491,7 +2517,8 @@ void wakeup_kswapd(struct zone *zone, int order) pgdat->kswapd_max_order = order
Re: [PATCH 3/3] Provide control over unmapped pages
* Balbir Singh bal...@linux.vnet.ibm.com [2010-12-01 10:46:32]: * KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com [2010-12-01 09:14:13]: Provide control using zone_reclaim() and a boot parameter. The code reuses functionality from zone_reclaim() to isolate unmapped pages and reclaim them as a priority, ahead of other mapped pages. Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- include/linux/swap.h |5 ++- mm/page_alloc.c |7 +++-- mm/vmscan.c | 72 +- 3 files changed, 79 insertions(+), 5 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index eba53e7..78b0830 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -252,11 +252,12 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern long vm_total_pages; -#ifdef CONFIG_NUMA -extern int zone_reclaim_mode; extern int sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; extern int zone_reclaim(struct zone *, gfp_t, unsigned int); +extern bool should_balance_unmapped_pages(struct zone *zone); +#ifdef CONFIG_NUMA +extern int zone_reclaim_mode; #else #define zone_reclaim_mode 0 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 62b7280..4228da3 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1662,6 +1662,9 @@ zonelist_scan: unsigned long mark; int ret; + if (should_balance_unmapped_pages(zone)) + wakeup_kswapd(zone, order); + You don't have to add extra branch into fast path. 
mark = zone-watermark[alloc_flags ALLOC_WMARK_MASK]; if (zone_watermark_ok(zone, order, mark, classzone_idx, alloc_flags)) @@ -4136,10 +4139,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat, zone-spanned_pages = size; zone-present_pages = realsize; -#ifdef CONFIG_NUMA - zone-node = nid; zone-min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio) / 100; +#ifdef CONFIG_NUMA + zone-node = nid; zone-min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100; #endif zone-name = zone_names[j]; diff --git a/mm/vmscan.c b/mm/vmscan.c index 0ac444f..98950f4 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -145,6 +145,21 @@ static DECLARE_RWSEM(shrinker_rwsem); #define scanning_global_lru(sc) (1) #endif +static unsigned long balance_unmapped_pages(int priority, struct zone *zone, + struct scan_control *sc); +static int unmapped_page_control __read_mostly; + +static int __init unmapped_page_control_parm(char *str) +{ + unmapped_page_control = 1; + /* + * XXX: Should we tweak swappiness here? + */ + return 1; +} +__setup(unmapped_page_control, unmapped_page_control_parm); + + static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone, struct scan_control *sc) { @@ -2223,6 +2238,12 @@ loop_again: shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0); + /* + * We do unmapped page balancing once here and once + * below, so that we don't lose out + */ + balance_unmapped_pages(priority, zone, sc); You can't invoke any reclaim from here. It is in zone balancing detection phase. It mean your code reclaim pages from zones which has lots free pages too. The goal is to check not only for zone_watermark_ok, but also to see if unmapped pages are way higher than expected values. 
+ if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone), 0, 0)) { end_zone = i; @@ -2258,6 +2279,11 @@ loop_again: continue; sc.nr_scanned = 0; + /* + * Balance unmapped pages upfront, this should be + * really cheap + */ + balance_unmapped_pages(priority, zone, sc); This code break page-cache/slab balancing logic. And this is conflict against Nick's per-zone slab effort. OK, cc'ing Nick for comments. Plus, high-order + priority=5 reclaim Simon's case. (see Free memory never fully used, swapping threads) OK, this path should
Re: [PATCH 3/3] Provide control over unmapped pages
* Balbir Singh bal...@linux.vnet.ibm.com [2010-12-01 10:48:16]: * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-12-01 10:32:54]: On Tue, 30 Nov 2010 15:46:31 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: Provide control using zone_reclaim() and a boot parameter. The code reuses functionality from zone_reclaim() to isolate unmapped pages and reclaim them as a priority, ahead of other mapped pages. Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- include/linux/swap.h |5 ++- mm/page_alloc.c |7 +++-- mm/vmscan.c | 72 +- 3 files changed, 79 insertions(+), 5 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index eba53e7..78b0830 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -252,11 +252,12 @@ extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern long vm_total_pages; -#ifdef CONFIG_NUMA -extern int zone_reclaim_mode; extern int sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; extern int zone_reclaim(struct zone *, gfp_t, unsigned int); +extern bool should_balance_unmapped_pages(struct zone *zone); +#ifdef CONFIG_NUMA +extern int zone_reclaim_mode; #else #define zone_reclaim_mode 0 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 62b7280..4228da3 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1662,6 +1662,9 @@ zonelist_scan: unsigned long mark; int ret; + if (should_balance_unmapped_pages(zone)) + wakeup_kswapd(zone, order); + Hm, I'm not sure the final vision of this feature. Does this reclaiming feature can't be called directly via balloon driver just before alloc_page() ? That is a separate patch, this is a boot paramter based control approach. Do you need to keep page caches small even when there are free memory on host ? The goal is to avoid duplication, as you know page cache fills itself to consume as much memory as possible. 
The host generally does not have a lot of free memory in a consolidated environment. My local MTA failed to deliver the message, trying again. -- Three Cheers, Balbir
Re: [RFC][PATCH 1/3] Linux/Guest unmapped page cache control
* Christoph Lameter c...@linux.com [2010-11-03 09:35:33]: On Fri, 29 Oct 2010, Balbir Singh wrote: A lot of the code is borrowed from zone_reclaim_mode logic for __zone_reclaim(). One might argue that with ballooning and KSM this feature is not very useful, but even with ballooning, Interesting use of zone reclaim. I am having a difficult time reviewing the patch since you move and modify functions at the same time. Could you separate that out a bit? Sure, I'll split it out into more readable bits and repost the mm versions first. +#define UNMAPPED_PAGE_RATIO 16 Maybe come up with a scheme that allows better configuration of the minimum? I think in some settings we may want an absolute limit and in others a fraction of something (total zone size or working set?) Are you suggesting a sysctl or a computation based on zone size and limit, etc? I understand it to be the latter. +bool should_balance_unmapped_pages(struct zone *zone) +{ + if (unmapped_page_control && + (zone_unmapped_file_pages(zone) > + UNMAPPED_PAGE_RATIO * zone->min_unmapped_pages)) + return true; + return false; +} Thanks for your review. -- Three Cheers, Balbir
[RFC][PATCH 0/3] KVM page cache optimization (v3)
This is version 3 of the page cache control patches From: Balbir Singh bal...@linux.vnet.ibm.com This series has three patches, the first controls the amount of unmapped page cache usage via a boot parameter and sysctl. The second patch controls page and slab cache via the balloon driver. Both the patches make heavy use of the zone_reclaim() functionality already present in the kernel. The last patch in the series is against QEmu to make the ballooning hint optional. V2 was posted a long time back (see http://lwn.net/Articles/391293/) One of the review suggestions was to make the hint optional (discussed in the community call as well). I'd appreciate any test results with the patches. TODO 1. libvirt exploits for optional hint page-cache-control balloon-page-cache provide-memory-hint-during-ballooning --- b/balloon.c | 18 +++- b/balloon.h |4 b/drivers/virtio/virtio_balloon.c | 17 +++ b/hmp-commands.hx |7 + b/hw/virtio-balloon.c | 14 ++- b/hw/virtio-balloon.h |3 b/include/linux/gfp.h |8 + b/include/linux/mmzone.h |2 b/include/linux/swap.h|3 b/include/linux/virtio_balloon.h |3 b/mm/page_alloc.c |9 +- b/mm/vmscan.c | 162 -- b/qmp-commands.hx |7 - include/linux/swap.h |9 -- mm/page_alloc.c |3 mm/vmscan.c |2 16 files changed, 202 insertions(+), 69 deletions(-) -- Three Cheers, Balbir -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC][PATCH 1/3] Linux/Guest unmapped page cache control
Selectively control Unmapped Page Cache (nospam version) From: Balbir Singh bal...@linux.vnet.ibm.com This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario - In a virtualized environment with cache=writethrough, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. - The option is controlled via a boot option and the administrator can selectively turn it on, on a need-to-use basis. A lot of the code is borrowed from zone_reclaim_mode logic for __zone_reclaim(). One might argue that with ballooning and KSM this feature is not very useful, but even with ballooning, we need extra logic to balloon multiple VMs and it is hard to figure out the correct amount of memory to balloon. With these patches applied, each guest has a sufficient amount of free memory available, that can be easily seen and reclaimed by the balloon driver. The additional memory in the guest can be reused for additional applications or used to start additional guests/balance memory in the host. KSM currently does not de-duplicate host and guest page cache. The goal of this patch is to help automatically balance unmapped page cache when instructed to do so. There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO and the number of pages to reclaim when the unmapped_page_control argument is supplied. These numbers were chosen to avoid aggressiveness in reaping page cache ever so frequently, at the same time providing control.
The sysctl for min_unmapped_ratio provides further control from within the guest on the amount of unmapped pages to reclaim. Host Usage without boot parameter (memory in KB) MemFree Cached Time 19900 292912 137 17540 296196 139 17900 296124 141 19356 296660 141 Host usage: (memory in KB) RSS Cache mapped swap 2788664 781884 3780359536 Guest Usage with boot parameter (memory in KB) - Memfree Cached Time 244824 74828 144 237840 81764 143 235880 83044 138 239312 80092 148 Host usage: (memory in KB) RSS Cache mapped swap 2700184 958012 334848 398412 TODOS - 1. Balance slab cache as well Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- include/linux/mmzone.h |2 - include/linux/swap.h |3 + mm/page_alloc.c|9 ++- mm/vmscan.c| 162 4 files changed, 132 insertions(+), 44 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 3984c4e..a591a7a 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -300,12 +300,12 @@ struct zone { */ unsigned long lowmem_reserve[MAX_NR_ZONES]; + unsigned long min_unmapped_pages; #ifdef CONFIG_NUMA int node; /* * zone reclaim becomes active if more unmapped pages exist. 
*/ - unsigned long min_unmapped_pages; unsigned long min_slab_pages; #endif struct per_cpu_pageset __percpu *pageset; diff --git a/include/linux/swap.h b/include/linux/swap.h index 7cdd633..5d29097 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -251,10 +251,11 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages); extern int vm_swappiness; extern int remove_mapping(struct address_space *mapping, struct page *page); extern long vm_total_pages; +extern bool should_balance_unmapped_pages(struct zone *zone); +extern int sysctl_min_unmapped_ratio; #ifdef CONFIG_NUMA extern int zone_reclaim_mode; -extern int sysctl_min_unmapped_ratio; extern int sysctl_min_slab_ratio; extern int zone_reclaim(struct zone *, gfp_t, unsigned int); #else diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f12ad18..d8fe29f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1642,6 +1642,9 @@ zonelist_scan: unsigned long mark; int ret; + if (should_balance_unmapped_pages(zone)) + wakeup_kswapd(zone, order); + mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK]; if (zone_watermark_ok(zone, order, mark, classzone_idx, alloc_flags)) @@ -4101,10 +4104,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat, zone->spanned_pages = size; zone->present_pages = realsize; -#ifdef CONFIG_NUMA
[RFC][PATCH 2/3] Linux/Guest cooperative unmapped page cache control
Balloon unmapped page cache pages first From: Balbir Singh bal...@linux.vnet.ibm.com This patch builds on the ballooning infrastructure by ballooning unmapped page cache pages first. It looks for low hanging fruit first and tries to reclaim clean unmapped pages first. This patch brings zone_reclaim() and other dependencies out of CONFIG_NUMA and then reuses the zone_reclaim_mode logic if __GFP_FREE_CACHE is passed in the gfp_mask. The virtio balloon driver has been changed to use __GFP_FREE_CACHE. During fill_balloon(), the driver looks for hints provided by the hypervisor to reclaim cached memory. By default the hint is off and can be turned on by passing an argument that specifies that we intend to reclaim cached memory. Tests: Test 1 -- I ran a simple filter function that kept frequently ballon a single VM running kernbench. The VM was configured with 2GB of memory and 2 VCPUs. The filter function was a triangular wave function that ballooned the VM under study from 500MB to 1500MB using a triangular wave function continously. The run times of the VM with and without changes are shown below. The run times showed no significant impact of the changes. Withchanges Elapsed Time 223.86 (1.52822) User Time 191.01 (0.65395) System Time 199.468 (2.43616) Percent CPU 174 (1) Context Switches 103182 (595.05) Sleeps 39107.6 (1505.67) Without changes Elapsed Time 225.526 (2.93102) User Time 193.53 (3.53626) System Time 199.832 (3.26281) Percent CPU 173.6 (1.14018) Context Switches 103744 (1311.53) Sleeps 39383.2 (831.865) The key advantage was that it resulted in lesser RSS usage in the host and more cached usage, indicating that the caching had been pushed towards the host. The guest cached memory usage was lower and free memory in the guest was also higher. Test 2 -- I ran kernbench under the memory overcommit manager (6 VM's with 2 vCPUs, 2GB) with KSM and ksmtuned enabled. memory overcommit manager details are at http://github.com/aglitke/mom/wiki. 
The command line for kernbench was kernbench -M. The tests showed the following Withchanges Elapsed Time 842.936 (12.2247) Elapsed Time 844.266 (25.8047) Elapsed Time 844.696 (11.2433) Elapsed Time 846.08 (14.0249) Elapsed Time 838.58 (7.44609) Elapsed Time 842.362 (4.37463) Withoutchanges Elapsed Time 837.604 (14.1311) Elapsed Time 839.322 (17.1772) Elapsed Time 843.744 (9.21541) Elapsed Time 842.592 (7.48622) Elapsed Time 844.272 (25.486) Elapsed Time 838.858 (7.5044) General observations 1. Free memory in each of guests was higher with changes. The additional free memory was of the order of 120MB per VM 2. Cached memory in each guest was lesser with changes 3. Host free memory was almost constant (independent of changes) 4. Host anonymous memory usage was lesser with the changes The goal of this patch is to free up memory locked up in duplicated cache contents and (1) above shows that we are able to successfully free it up. Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- drivers/virtio/virtio_balloon.c | 17 +++-- include/linux/gfp.h |8 +++- include/linux/swap.h|9 +++-- include/linux/virtio_balloon.h |3 +++ mm/page_alloc.c |3 ++- mm/vmscan.c |2 +- 6 files changed, 31 insertions(+), 11 deletions(-) diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index 0f1da45..70f97ea 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -99,12 +99,24 @@ static void tell_host(struct virtio_balloon *vb, struct virtqueue *vq) static void fill_balloon(struct virtio_balloon *vb, size_t num) { + u32 reclaim_cache_first; + int err; + gfp_t mask = GFP_HIGHUSER | __GFP_NORETRY | __GFP_NOMEMALLOC | + __GFP_NOWARN; + + err = virtio_config_val(vb-vdev, VIRTIO_BALLOON_F_BALLOON_HINT, + offsetof(struct virtio_balloon_config, + reclaim_cache_first), + reclaim_cache_first); + + if (!err reclaim_cache_first) + mask |= __GFP_FREE_CACHE; + /* We can only do one array worth at a time. 
*/ num = min(num, ARRAY_SIZE(vb->pfns)); for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns++) { - struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY | - __GFP_NOMEMALLOC | __GFP_NOWARN); + struct page *page = alloc_page(mask); if (!page) { if (printk_ratelimit()) dev_printk(KERN_INFO, &vb->vdev->dev, @@ -358,6 +370,7 @@ static void __devexit virtballoon_remove(struct virtio_device *vdev) static unsigned int features[] = { VIRTIO_BALLOON_F_MUST_TELL_HOST, VIRTIO_BALLOON_F_STATS_VQ, + VIRTIO_BALLOON_F_BALLOON_HINT, }; static struct
[RFC][PATCH 3/3] QEmu changes to provide balloon hint
Provide memory hint during ballooning From: Balbir Singh bal...@linux.vnet.ibm.com This patch adds an optional hint to the qemu monitor balloon command. The hint tells the guest operating system to consider a class of memory during reclaim. Currently the supported hint is cached memory. The design is generic and can be extended to provide other hints in the future if required. Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- balloon.c | 18 ++ balloon.h |4 +++- hmp-commands.hx |7 +-- hw/virtio-balloon.c | 15 +++ hw/virtio-balloon.h |3 +++ qmp-commands.hx |7 --- 6 files changed, 40 insertions(+), 14 deletions(-) diff --git a/balloon.c b/balloon.c index 0021fef..b2bdda5 100644 --- a/balloon.c +++ b/balloon.c @@ -41,11 +41,13 @@ void qemu_add_balloon_handler(QEMUBalloonEvent *func, void *opaque) qemu_balloon_event_opaque = opaque; } -int qemu_balloon(ram_addr_t target, MonitorCompletion cb, void *opaque) +int qemu_balloon(ram_addr_t target, bool reclaim_cache_first, + MonitorCompletion cb, void *opaque) { if (qemu_balloon_event) { trace_balloon_event(qemu_balloon_event_opaque, target); -qemu_balloon_event(qemu_balloon_event_opaque, target, cb, opaque); +qemu_balloon_event(qemu_balloon_event_opaque, target, + reclaim_cache_first, cb, opaque); return 1; } else { return 0; @@ -55,7 +57,7 @@ int qemu_balloon(ram_addr_t target, MonitorCompletion cb, void *opaque) int qemu_balloon_status(MonitorCompletion cb, void *opaque) { if (qemu_balloon_event) { -qemu_balloon_event(qemu_balloon_event_opaque, 0, cb, opaque); +qemu_balloon_event(qemu_balloon_event_opaque, 0, 0, cb, opaque); return 1; } else { return 0; @@ -131,13 +133,21 @@ int do_balloon(Monitor *mon, const QDict *params, MonitorCompletion cb, void *opaque) { int ret; +int val; +const char *cache_hint; +int reclaim_cache_first = 0; if (kvm_enabled() !kvm_has_sync_mmu()) { qerror_report(QERR_KVM_MISSING_CAP, synchronous MMU, balloon); return -1; } -ret = qemu_balloon(qdict_get_int(params, value), cb, opaque); 
+val = qdict_get_int(params, value); +cache_hint = qdict_get_try_str(params, hint); +if (cache_hint) +reclaim_cache_first = 1; + +ret = qemu_balloon(val, reclaim_cache_first, cb, opaque); if (ret == 0) { qerror_report(QERR_DEVICE_NOT_ACTIVE, balloon); return -1; diff --git a/balloon.h b/balloon.h index d478e28..65d68c1 100644 --- a/balloon.h +++ b/balloon.h @@ -17,11 +17,13 @@ #include monitor.h typedef void (QEMUBalloonEvent)(void *opaque, ram_addr_t target, +bool reclaim_cache_first, MonitorCompletion cb, void *cb_data); void qemu_add_balloon_handler(QEMUBalloonEvent *func, void *opaque); -int qemu_balloon(ram_addr_t target, MonitorCompletion cb, void *opaque); +int qemu_balloon(ram_addr_t target, bool reclaim_cache_first, + MonitorCompletion cb, void *opaque); int qemu_balloon_status(MonitorCompletion cb, void *opaque); diff --git a/hmp-commands.hx b/hmp-commands.hx index 81999aa..80e42aa 100644 --- a/hmp-commands.hx +++ b/hmp-commands.hx @@ -925,8 +925,8 @@ ETEXI { .name = balloon, -.args_type = value:M, -.params = target, +.args_type = value:M,hint:s?, +.params = target [cache], .help = request VM to change its memory allocation (in MB), .user_print = monitor_user_noop, .mhandler.cmd_async = do_balloon, @@ -937,6 +937,9 @@ STEXI @item balloon @var{value} @findex balloon Request VM to change its memory allocation to @var{value} (in MB). +An optional @var{hint} can be specified to indicate if the guest +should reclaim from the cached memory in the guest first. The +...@var{hint} may be ignored by the guest. 
ETEXI { diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c index 8adddea..e363507 100644 --- a/hw/virtio-balloon.c +++ b/hw/virtio-balloon.c @@ -44,6 +44,7 @@ typedef struct VirtIOBalloon size_t stats_vq_offset; MonitorCompletion *stats_callback; void *stats_opaque_callback_data; +uint32_t reclaim_cache_first; } VirtIOBalloon; static VirtIOBalloon *to_virtio_balloon(VirtIODevice *vdev) @@ -181,8 +182,11 @@ static void virtio_balloon_get_config(VirtIODevice *vdev, uint8_t *config_data) config.num_pages = cpu_to_le32(dev->num_pages); config.actual = cpu_to_le32(dev->actual); - -memcpy(config_data, &config, 8); +if (vdev->guest_features & (1 << VIRTIO_BALLOON_F_BALLOON_HINT)) { +config.reclaim_cache_first = cpu_to_le32(dev->reclaim_cache_first); +memcpy(config_data, &config, 12); +} else +memcpy(config_data, &config, 8
Re: [PATCH] kvm: add oom notifier for virtio balloon
* Dave Young hidave.darks...@gmail.com [2010-10-05 20:45:21]: Balloon could cause guest memory oom killing and panic. Add an oom notifier to leak some memory and retry filling the balloon after 5 minutes. At the same time add a mutex to protect balloon operations because we need to leak the balloon in the oom notifier and give back the freed value. Thanks Anthony Liguori for his suggestion about inflate retrying. Sometimes it will cause an endless inflate/oom/delay loop, so I think the next step is to add an option to do noretry-when-oom ballooning. Signed-off-by: Dave Young hidave.darks...@gmail.com Won't __GFP_NORETRY prevent OOM? Could you please describe how you tested the patch? -- Three Cheers, Balbir
Re: [PATCH] kvm: add oom notifier for virtio balloon
* Dave Young hidave.darks...@gmail.com [2010-10-08 21:33:02]: On Fri, Oct 8, 2010 at 9:09 PM, Balbir Singh bal...@linux.vnet.ibm.com wrote: * Dave Young hidave.darks...@gmail.com [2010-10-05 20:45:21]: Balloon could cause guest memory oom killing and panic. Add an oom notifier to leak some memory and retry filling the balloon after 5 minutes. At the same time add a mutex to protect balloon operations because we need to leak the balloon in the oom notifier and give back the freed value. Thanks Anthony Liguori for his suggestion about inflate retrying. Sometimes it will cause an endless inflate/oom/delay loop, so I think the next step is to add an option to do noretry-when-oom ballooning. Signed-off-by: Dave Young hidave.darks...@gmail.com Won't __GFP_NORETRY prevent OOM? Could you please describe how you tested the patch? I have not tried __GFP_NORETRY, it should work, but the balloon thread will keep wasting CPU trying to allocate. To test the patch, just balloon to less than the minimal memory. I use balloon 30 in the qemu monitor to limit slackware guest memory usage. The normal memory used is ~40M. Actually we need to differentiate which process caused the oom. If it is the balloon thread we should just stop ballooning; if it is others we can do something like this patch, e.g. retry ballooning after 5 minutes. Ideally the balloon thread should never OOM with __GFP_NORETRY (IIRC). The other situation should be dealt with, we should free up any pages we have. I wonder if the timeout should be a sysctl tunable.
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-16 14:39:02]: We're talking about an environment which we're always trying to optimize. Imagine that we're always trying to consolidate guests on to smaller numbers of hosts. We're effectively in a state where we _always_ want new guests. If this came at no cost to the guests, you'd be right. But at some point guest performance will be hit by this, so the advantage gained from freeing memory will be balanced by the disadvantage. Also, memory is not the only resource. At some point you become cpu bound; at that point freeing memory doesn't help and in fact may increase your cpu load. We'll probably need control over other resources as well, but IMHO memory is the most precious because it is non-renewable. -- Three Cheers, Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-15 09:58:33]: On 06/14/2010 08:45 PM, Balbir Singh wrote: There are two decisions that need to be made: - how much memory a guest should be given - given some guest memory, what's the best use for it The first question can perhaps be answered by looking at guest I/O rates and giving more memory to more active guests. The second question is hard, but not any different than running non-virtualized - except if we can detect sharing or duplication. In this case, dropping a duplicated page is worthwhile, while dropping a shared page provides no benefit. I think there is another way of looking at it, give some free memory 1. Can the guest run more applications or run faster That's my second question. How to best use this memory. More applications == drop the page from cache, faster == keep page in cache. All we need is to select the right page to drop. Do we need to drop to the granularity of the page to drop? I think figuring out the class of pages and making sure that we don't write our own reclaim logic, but work with what we have to identify the class of pages is a good start. 2. Can the host potentially get this memory via ballooning or some other means to start newer guest instances Well, we already have ballooning. The question is can we improve the eviction algorithm. I think the answer to 1 and 2 is yes. How the patch helps answer either question, I'm not sure. I don't think preferential dropping of unmapped page cache is the answer. Preferential dropping as selected by the host, that knows about the setup and if there is duplication involved. While we use the term preferential dropping, remember it is still via LRU and we don't always succeed. It is a best effort (if you can and the unmapped pages are not highly referenced) scenario. How can the host tell if there is duplication? It may know it has some pagecache, but it has no idea whether or to what extent guest pagecache duplicates host pagecache. 
Well it is possible in host user space, I for example use memory cgroup and through the stats I have a good idea of how much is duplicated. I am of course making an assumption with my setup of the cached mode, that the data in the guest page cache and page cache in the cgroup will be duplicated to a large extent. I did some trivial experiments like drop the data from the guest and look at the cost of bringing it in and dropping the data from both guest and host and look at the cost. I could see a difference. Unfortunately, I did not save the data, so I'll need to redo the experiment. Those tell you how to balance going after the different classes of things that we can reclaim. Again, this is useless when ballooning is being used. But, I'm thinking of a more general mechanism to force the system to both have MemFree _and_ be acting as if it is under memory pressure. If there is no memory pressure on the host, there is no reason for the guest to pretend it is under pressure. If there is memory pressure on the host, it should share the pain among its guests by applying the balloon. So I don't think voluntarily dropping cache is a good direction. There are two situations 1. Voluntarily drop cache, if it was set up to do so (the host knows that it caches that information anyway) It doesn't, really. The host only has aggregate information about itself, and no information about the guest. Dropping duplicate pages would be good if we could identify them. Even then, it's better to drop the page from the host, not the guest, unless we know the same page is cached by multiple guests. On the exact pages to drop, please see my comments above on the class of pages to drop. There are reasons for wanting to get the host to cache the data Unless the guest is using cache = none, the data will still hit the host page cache The host can do a better job of optimizing the writeouts But why would the guest voluntarily drop the cache?
If there is no memory pressure, dropping caches increases cpu overhead and latency even if the data is still cached on the host. So, there are basically two approaches 1. First patch, proactive - enabled by a boot option 2. When ballooned, we try to (please NOTE try to) reclaim cached pages first. Failing which, we go after regular pages in the alloc_page() call in the balloon driver. 2. Drop the cache on either a special balloon option, again the host knows it caches that very same information, so it prefers to free that up first. Dropping in response to pressure is good. I'm just not convinced the patch helps in selecting the correct page to drop. That is why I've presented data on the experiments I've run and provided more arguments to backup the approach. -- Three Cheers, Balbir -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-15 10:12:44]: On 06/14/2010 08:16 PM, Balbir Singh wrote: * Dave Hansen d...@linux.vnet.ibm.com [2010-06-14 10:09:31]: On Mon, 2010-06-14 at 22:28 +0530, Balbir Singh wrote: If you've got duplicate pages and you know that they are duplicated and can be retrieved at a lower cost, why wouldn't we go after them first? I agree with this in theory. But, the guest lacks the information about what is truly duplicated and what the costs are for itself and/or the host to recreate it. Unmapped page cache may be the best proxy that we have at the moment for easy to recreate, but I think it's still too poor a match to make these patches useful. That is why the policy (in the next set) will come from the host. As to whether the data is truly duplicated, my experiments show up to 60% of the page cache is duplicated. Isn't that incredibly workload dependent? We can't expect the host admin to know whether duplication will occur or not. I was referring to cache = (policy) we use based on the setup. I don't think the duplication is too workload specific. Moreover, we could use aggressive policies and restrict page cache usage or do it selectively on ballooning. We could also add other options to make the ballooning option truly optional, so that the system management software decides. -- Three Cheers, Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-15 12:44:31]: On 06/15/2010 10:49 AM, Balbir Singh wrote: All we need is to select the right page to drop. Do we need to drop to the granularity of the page to drop? I think figuring out the class of pages and making sure that we don't write our own reclaim logic, but work with what we have to identify the class of pages, is a good start. Well, the class of pages is 'pages that are duplicated on the host'. Unmapped page cache pages are 'pages that might be duplicated on the host'. IMO, that's not close enough. Agreed, but what happens in reality with the code is that it drops not-so-frequently-used cache (still reusing the reclaim mechanism), but prioritizing cached memory. How can the host tell if there is duplication? It may know it has some pagecache, but it has no idea whether or to what extent guest pagecache duplicates host pagecache. Well, it is possible in host user space; I, for example, use a memory cgroup and through the stats I have a good idea of how much is duplicated. I am of course making an assumption with my setup of the cached mode, that the data in the guest page cache and the page cache in the cgroup will be duplicated to a large extent. I did some trivial experiments, like dropping the data from the guest and looking at the cost of bringing it in, then dropping the data from both guest and host and looking at the cost. I could see a difference. Unfortunately, I did not save the data, so I'll need to redo the experiment. I'm sure we can detect it experimentally, but how do we do it programmatically at run time (without dropping all the pages)? Situations change, and I don't think we can infer from a few experiments that we'll have a similar amount of sharing. The cost of an incorrect decision is too high IMO (not that I think the kernel always chooses the right pages now, but I'd like to avoid regressions from the unvirtualized state).
btw, when running with a disk controller that has a very large cache, we might also see duplication between guest and host. So, if this is a good idea, it shouldn't be enabled just for virtualization, but for any situation where we have a sizeable cache behind us. It depends: once the disk controller has the cache and the pages in the guest are not-so-frequently-used, we can drop them. Please remember we still use the LRU to identify these pages. It doesn't, really. The host only has aggregate information about itself, and no information about the guest. Dropping duplicate pages would be good if we could identify them. Even then, it's better to drop the page from the host, not the guest, unless we know the same page is cached by multiple guests. On the exact pages to drop, please see my comments above on the class of pages to drop. Well, we disagree about that. There is some value in dropping duplicated pages (not always), but that's not what the patch does. It drops unmapped pagecache pages, which may or may not be duplicated. There are reasons for wanting to get the host to cache the data. There are also reasons to get the guest to cache the data - it's more efficient to access it in the guest. Unless the guest is using cache=none, the data will still hit the host page cache. The host can do a better job of optimizing the writeouts. True, especially for non-raw storage. But even there we have to fsync all the time to keep the metadata right. But why would the guest voluntarily drop the cache? If there is no memory pressure, dropping caches increases cpu overhead and latency even if the data is still cached on the host. So, there are basically two approaches: 1. First patch, proactive - enabled by a boot option. 2. When ballooned, we try to (please NOTE try to) reclaim cached pages first. Failing which, we go after regular pages in the alloc_page() call in the balloon driver.
Doesn't that mean you may evict a recently used mapped page ahead of a least recently used unmapped page, just in the hope that it is double-cached? Maybe we need the guest and host to talk to each other about which pages to keep. Yeah.. I guess that falls into the domain of CMM. 2. Drop the cache on either a special balloon option; again, the host knows it caches that very same information, so it prefers to free that up first. Dropping in response to pressure is good. I'm just not convinced the patch helps in selecting the correct page to drop. That is why I've presented data on the experiments I've run and provided more arguments to back up the approach. I'm still unconvinced, sorry. The reason for making this optional is to let the administrators decide how they want to use the memory in the system. In some situations it might be a big no-no to waste memory, in some cases it might be acceptable. -- Three Cheers, Balbir
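The two-step balloon policy discussed above (try unmapped page cache first, fall back to any page as alloc_page() in the balloon driver would) can be sketched in userspace C. The page structure, field names, and function are invented for illustration; this is not the actual balloon driver code:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace sketch (not the real balloon driver) of the two-pass
 * policy: when the host balloons the guest, satisfy the request from
 * unmapped page cache first, falling back to other pages only if
 * that is not enough.  Types and names are illustrative.
 */
enum page_type { PAGE_FREE, PAGE_UNMAPPED_CACHE, PAGE_MAPPED };

struct fake_page {
    enum page_type type;
    int reclaimed;
};

/* One balloon request: reclaim 'want' pages, preferring unmapped cache. */
static int balloon_reclaim(struct fake_page *lru, size_t n, int want)
{
    int got = 0;
    size_t i;

    /* Pass 1: unmapped page cache only ("try to" - may not succeed). */
    for (i = 0; i < n && got < want; i++) {
        if (!lru[i].reclaimed && lru[i].type == PAGE_UNMAPPED_CACHE) {
            lru[i].reclaimed = 1;
            got++;
        }
    }
    /* Pass 2: fall back to any remaining page. */
    for (i = 0; i < n && got < want; i++) {
        if (!lru[i].reclaimed) {
            lru[i].reclaimed = 1;
            got++;
        }
    }
    return got;
}
```

The point of the sketch is only the ordering: cache pages are consumed before any mapped page is touched.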
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-15 12:54:31]: On 06/15/2010 10:52 AM, Balbir Singh wrote: That is why the policy (in the next set) will come from the host. As to whether the data is truly duplicated, my experiments show up to 60% of the page cache is duplicated. Isn't that incredibly workload dependent? We can't expect the host admin to know whether duplication will occur or not. I was referring to the cache= (policy) we use based on the setup. I don't think the duplication is too workload specific. Moreover, we could use aggressive policies and restrict page cache usage or do it selectively on ballooning. We could also add other options to make the ballooning option truly optional, so that the system management software decides. Consider a read-only workload that exactly fits in guest cache. Without trimming, the guest will keep hitting its own cache, and the host will see no access to the cache at all. So the host (assuming it is under even low pressure) will evict those pages, and the guest will happily use its own cache. If we start to trim, the guest will have to go to disk. That's the best case. Now for the worst case. A random access workload that misses the cache on both guest and host. Now every page is duplicated, and trimming guest pages allows the host to increase its cache, and potentially reduce misses. In this case trimming duplicated pages works. Real life will see a mix of this. Often used pages won't be duplicated, and less often used pages may see some duplication, especially if the host cache portion dedicated to the guest is bigger than the guest cache. I can see that trimming duplicate pages helps, but (a) I'd like to be sure they are duplicates and (b) often trimming them from the host is better than trimming them from the guest. Let's see the behaviour with these patches. The first patch is a proactive approach to keep more memory around. Enabling the parameter implies we are OK paying the cost of some overhead.
My data shows that leaves a significant amount of free memory with a small 5% (in my case) overhead. This brings us back to what you can do with free memory. The second patch shows no overhead and selectively tries to use free cache to return back on memory pressure (as indicated by the balloon driver). We've discussed the reasons for doing this: 1. In the situations where cache is duplicated this should benefit us. Your contention is that we need to be specific about the duplication. That falls under the realm of CMM. 2. In the case of slab cache, duplication does not matter; it is a free page that should ideally be reclaimed ahead of mapped pages. If the slab grows, it will get another new page. What is the cost of (1)? In the worst case, we select a non-duplicated page, but for us to select it, it should be inactive; in that case we do I/O to bring back the page. Trimming from the guest is worthwhile if the pages are not used very often (but enough that caching them in the host is worth it) and if the host cache can serve more than one guest. If we can identify those pages, we don't risk degrading best-case workloads (as defined above). (note ksm to some extent identifies those pages, though it is a bit expensive, and doesn't share with the host pagecache). I see that you are hinting towards finding exact duplicates; I don't know if the cost and complexity justify it. I hope more users can try the patches with and without the boot parameter and provide additional feedback. -- Three Cheers, Balbir
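The memcg-based duplication estimate Balbir describes earlier in the thread can be approximated with simple arithmetic. The formula below is my assumption of one cheap upper bound (overlap of the two cache sizes), not the method actually used in the experiments:

```c
#include <assert.h>

/*
 * Back-of-the-envelope sketch: if the host caches a guest's disk
 * through one memory cgroup, the duplicated fraction of guest page
 * cache is at most the overlap of the two cache sizes.  This is an
 * assumed bound for illustration, not the thread's measured method.
 */
static int duplication_pct(long guest_cache_kb, long host_cgroup_cache_kb)
{
    long overlap;

    if (guest_cache_kb <= 0)
        return 0;
    overlap = host_cgroup_cache_kb < guest_cache_kb ?
              host_cgroup_cache_kb : guest_cache_kb;
    return (int)(overlap * 100 / guest_cache_kb);
}
```

In practice the inputs would come from the guest's /proc/meminfo Cached field and the host cgroup's memory.stat cache counter; actual duplication can only be lower than this bound.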
Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-06-14 09:28:19]: On Mon, 14 Jun 2010 00:01:45 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: * Balbir Singh bal...@linux.vnet.ibm.com [2010-06-08 21:21:46]: Selectively control Unmapped Page Cache (nospam version) From: Balbir Singh bal...@linux.vnet.ibm.com This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario - In a virtualized environment with cache=writethrough, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. - The option is controlled via a boot option and the administrator can selectively turn it on, on a need to use basis. A lot of the code is borrowed from zone_reclaim_mode logic for __zone_reclaim(). One might argue that with ballooning and KSM this feature is not very useful, but even with ballooning, we need extra logic to balloon multiple VM machines and it is hard to figure out the correct amount of memory to balloon. With these patches applied, each guest has a sufficient amount of free memory available, that can be easily seen and reclaimed by the balloon driver. The additional memory in the guest can be reused for additional applications or used to start additional guests/balance memory in the host. KSM currently does not de-duplicate host and guest page cache. The goal of this patch is to help automatically balance unmapped page cache when instructed to do so.
There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO and the number of pages to reclaim when the unmapped_page_control argument is supplied. These numbers were chosen to avoid aggressively reaping page cache ever so frequently, while at the same time providing control. The sysctl for min_unmapped_ratio provides further control from within the guest on the amount of unmapped pages to reclaim. Are there any major objections to this patch? This kind of patch needs measurement of how well it works. - How did you measure the effect of the patch? kernbench is not enough, of course. I can run other benchmarks as well, I will do so. - Why don't you believe LRU? And if LRU doesn't work well, should it be fixed by a knob rather than a generic approach? - No side effects? I believe in LRU, just that the problem I am trying to solve is of using double the memory for caching the same data (consider kvm running in cache=writethrough or writeback mode; both the hypervisor and the guest OS maintain a page cache of the same data). As the VMs grow, the overhead is substantial. In my runs I found up to 60% duplication in some cases. - Linux vm guys tend to say, free memory is bad memory. ok, for what is the free memory created by your patch used? IOW, I can't see the benefit. If free memory that your patch created will be used for another page-cache, it will be dropped soon by your patch itself. Free memory is good for cases when you want to do more in the same system. I agree that in a bare metal environment that might be partially true. I don't have a problem with frequently used data being cached, but I am targeting a consolidated environment at the moment. Moreover, the administrator has control via a boot option, so it is non-intrusive in many ways. If your patch just drops duplicated pages that are no longer necessary for other kvm guests, I agree your patch may increase the available size of page-caches. But you just drop unmapped pages.
Unmapped and unused pages are the best targets; I plan to add slab cache control later. -- Three Cheers, Balbir
Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-06-14 16:00:21]: On Mon, 14 Jun 2010 12:19:55 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: - Why don't you believe LRU? And if LRU doesn't work well, should it be fixed by a knob rather than a generic approach? - No side effects? I believe in LRU, just that the problem I am trying to solve is of using double the memory for caching the same data (consider kvm running in cache=writethrough or writeback mode; both the hypervisor and the guest OS maintain a page cache of the same data). As the VMs grow, the overhead is substantial. In my runs I found up to 60% duplication in some cases. - Linux vm guys tend to say, free memory is bad memory. ok, for what is the free memory created by your patch used? IOW, I can't see the benefit. If free memory that your patch created will be used for another page-cache, it will be dropped soon by your patch itself. Free memory is good for cases when you want to do more in the same system. I agree that in a bare metal environment that might be partially true. I don't have a problem with frequently used data being cached, but I am targeting a consolidated environment at the moment. Moreover, the administrator has control via a boot option, so it is non-intrusive in many ways. It sounds like what you want is not to improve performance but to make sizing the system easier and to help admins. Right? Right, to avoid using double the memory to cache the same data. From a performance perspective, I don't see any advantage to dropping caches which can be dropped easily; it just uses CPUs for a purpose that may not be necessary. It is not that easy: in a virtualized environment, you don't directly reclaim, but use a mechanism like ballooning, and that too requires smart software to decide where to balloon from. This patch (optionally, if enabled) optimizes that by 1. Reducing double caching 2. Not requiring newer smarts or management software to monitor and balloon 3.
Allows better estimation of free memory by avoiding double caching 4. Allows immediate use of free memory for other applications or startup of newer guest instances. -- Three Cheers, Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-14 11:09:44]: On 06/11/2010 07:56 AM, Balbir Singh wrote: Just to be clear, let's say we have a mapped page (say of /sbin/init) that's been unreferenced since _just_ after the system booted. We also have an unmapped page cache page of a file often used at runtime, say one from /etc/resolv.conf or /etc/passwd. Which page will be preferred for eviction with this patch set? In this case the order is as follows: 1. First we pick free pages if any 2. If we don't have free pages, we go after unmapped page cache and slab cache 3. If that fails as well, we go after regular memory In the scenario that you describe, we'll not be able to easily free up the frequently referenced page from /etc/*. The code will move on to step 3 and do its regular reclaim. Still it seems to me you are subverting the normal order of reclaim. I don't see why an unmapped page cache or slab cache item should be evicted before a mapped page. Certainly the cost of rebuilding a dentry compared to the gain from evicting it, is much higher than that of reestablishing a mapped page. We subvert it to avoid memory duplication; the word subverting is overloaded, so let me try to reason a bit. First let me explain the problem. Memory is a precious resource in a consolidated environment. We don't want to waste memory via page cache duplication (cache=writethrough and cache=writeback mode). Now here is what we are trying to do: 1. A slab page will not be freed until the entire page is free (all slabs have been kfree'd, so to speak). Normal reclaim will definitely free this page, but a lot of it depends on how frequently we are scanning the LRU list and when this page got added. 2. In the case of page cache (specifically unmapped page cache), there is duplication already, so why not go after unmapped page caches when the system is under memory pressure?
In the case of 1, we don't force a dentry to be freed, but rather a freed page in the slab cache to be reclaimed ahead of forcing reclaim of mapped pages. Does the problem statement make sense? If so, do you agree with 1 and 2? Is there major concern about subverting regular reclaim? Does subverting it make sense in the duplicated scenario? -- Three Cheers, Balbir
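The eviction preference in steps 1-3 above can be sketched as ordinary C rather than kernel reclaim code; the enum names and rank values are illustrative, and lower rank means reclaimed earlier:

```c
#include <assert.h>

/*
 * Sketch of the proposed reclaim ordering, not the kernel's actual
 * reclaim code.  Lower rank = reclaimed earlier.
 */
enum mem_class { MEM_FREE, MEM_UNMAPPED_CACHE, MEM_FREE_SLAB, MEM_MAPPED };

static int reclaim_rank(enum mem_class c)
{
    switch (c) {
    case MEM_FREE:           return 0; /* step 1: free pages first      */
    case MEM_UNMAPPED_CACHE:           /* step 2: possibly duplicated   */
    case MEM_FREE_SLAB:      return 1; /* step 2: empty slab pages      */
    case MEM_MAPPED:         return 2; /* step 3: regular reclaim       */
    }
    return 3;
}
```

Note that within rank 1 the patch still relies on the LRU, so a hot unmapped cache page is not forcibly dropped.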
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-14 15:40:28]: On 06/14/2010 11:48 AM, Balbir Singh wrote: In this case the order is as follows: 1. First we pick free pages if any 2. If we don't have free pages, we go after unmapped page cache and slab cache 3. If that fails as well, we go after regular memory In the scenario that you describe, we'll not be able to easily free up the frequently referenced page from /etc/*. The code will move on to step 3 and do its regular reclaim. Still it seems to me you are subverting the normal order of reclaim. I don't see why an unmapped page cache or slab cache item should be evicted before a mapped page. Certainly the cost of rebuilding a dentry compared to the gain from evicting it, is much higher than that of reestablishing a mapped page. Subverting to avoid memory duplication; the word subverting is overloaded, Right, should have used a different one. let me try to reason a bit. First let me explain the problem. Memory is a precious resource in a consolidated environment. We don't want to waste memory via page cache duplication (cache=writethrough and cache=writeback mode). Now here is what we are trying to do: 1. A slab page will not be freed until the entire page is free (all slabs have been kfree'd, so to speak). Normal reclaim will definitely free this page, but a lot of it depends on how frequently we are scanning the LRU list and when this page got added. 2. In the case of page cache (specifically unmapped page cache), there is duplication already, so why not go after unmapped page caches when the system is under memory pressure? In the case of 1, we don't force a dentry to be freed, but rather a freed page in the slab cache to be reclaimed ahead of forcing reclaim of mapped pages. Sounds like this should be done unconditionally, then. An empty slab page is worth less than an unmapped pagecache page at all times, no? In a consolidated environment, even at the cost of some CPU to run shrinkers, I think potentially yes.
Does the problem statement make sense? If so, do you agree with 1 and 2? Is there major concern about subverting regular reclaim? Does subverting it make sense in the duplicated scenario? In the case of 2, how do you know there is duplication? You know the guest caches the page, but you have no information about the host. Since the page is cached in the guest, the host doesn't see it referenced, and is likely to drop it. True, that is why the first patch is controlled via a boot parameter that the host can pass. For the second patch, I think we'll need something like a balloon size cache command, with the cache argument being optional. If there is no duplication, then you may have dropped a recently-used page and will likely cause a major fault soon. Yes, agreed. -- Three Cheers, Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Dave Hansen d...@linux.vnet.ibm.com [2010-06-14 08:12:56]: On Mon, 2010-06-14 at 14:18 +0530, Balbir Singh wrote: 1. A slab page will not be freed until the entire page is free (all slabs have been kfree'd so to speak). Normal reclaim will definitely free this page, but a lot of it depends on how frequently we are scanning the LRU list and when this page got added. You don't have to be freeing entire slab pages for the reclaim to have been useful. You could just be making space so that _future_ allocations fill in the slab holes you just created. You may not be freeing pages, but you're reducing future system pressure. If unmapped page cache is the easiest thing to evict, then it should be the first thing that goes when a balloon request comes in, which is the case this patch is trying to handle. If it isn't the easiest thing to evict, then we _shouldn't_ evict it. Like I said earlier, a lot of that works correctly as you said, but it is also an idealization. If you've got duplicate pages and you know that they are duplicated and can be retrieved at a lower cost, why wouldn't we go after them first? -- Three Cheers, Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Dave Hansen d...@linux.vnet.ibm.com [2010-06-14 10:09:31]: On Mon, 2010-06-14 at 22:28 +0530, Balbir Singh wrote: If you've got duplicate pages and you know that they are duplicated and can be retrieved at a lower cost, why wouldn't we go after them first? I agree with this in theory. But, the guest lacks the information about what is truly duplicated and what the costs are for itself and/or the host to recreate it. Unmapped page cache may be the best proxy that we have at the moment for easy to recreate, but I think it's still too poor a match to make these patches useful. That is why the policy (in the next set) will come from the host. As to whether the data is truly duplicated, my experiments show up to 60% of the page cache is duplicated. The first patch today is again enabled by the host. Both of them are expected to be useful in the cache != none case. The data I have shows more details including the performance and overhead. -- Three Cheers, Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-14 18:34:58]: On 06/14/2010 06:12 PM, Dave Hansen wrote: On Mon, 2010-06-14 at 14:18 +0530, Balbir Singh wrote: 1. A slab page will not be freed until the entire page is free (all slabs have been kfree'd so to speak). Normal reclaim will definitely free this page, but a lot of it depends on how frequently we are scanning the LRU list and when this page got added. You don't have to be freeing entire slab pages for the reclaim to have been useful. You could just be making space so that _future_ allocations fill in the slab holes you just created. You may not be freeing pages, but you're reducing future system pressure. Depends. If you've evicted something that will be referenced soon, you're increasing system pressure. I don't think slab pages care about being referenced soon, they are either allocated or freed. A page is just a storage unit for the data structure. A new one can be allocated on demand. -- Three Cheers, Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-14 19:34:00]: On 06/14/2010 06:55 PM, Dave Hansen wrote: On Mon, 2010-06-14 at 18:44 +0300, Avi Kivity wrote: On 06/14/2010 06:33 PM, Dave Hansen wrote: At the same time, I see what you're trying to do with this. It really can be an alternative to ballooning if we do it right, since ballooning would probably evict similar pages. Although it would only work in idle guests, what about a knob that the host can turn to just get the guest to start running reclaim? Isn't the knob in this proposal the balloon? AFAICT, the idea here is to change how the guest reacts to being ballooned, but the trigger itself would not change. I think the patch was made on the following assumptions: 1. Guests will keep filling their memory with relatively worthless page cache that they don't really need. 2. When they do this, it hurts the overall system with no real gain for anyone. In the case of a ballooned guest, they _won't_ keep filling memory. The balloon will prevent them. So, I guess I was just going down the path of considering if this would be useful without ballooning in place. To me, it's really hard to justify _with_ ballooning in place. There are two decisions that need to be made: - how much memory a guest should be given - given some guest memory, what's the best use for it The first question can perhaps be answered by looking at guest I/O rates and giving more memory to more active guests. The second question is hard, but not any different than running non-virtualized - except if we can detect sharing or duplication. In this case, dropping a duplicated page is worthwhile, while dropping a shared page provides no benefit. I think there is another way of looking at it: given some free memory, 1. Can the guest run more applications or run faster? 2. Can the host potentially get this memory via ballooning or some other means to start newer guest instances? I think the answer to 1 and 2 is yes.
How the patch helps answer either question, I'm not sure. I don't think preferential dropping of unmapped page cache is the answer. Preferential dropping as selected by the host, which knows about the setup and whether there is duplication involved. While we use the term preferential dropping, remember it is still via LRU and we don't always succeed. It is a best effort (if you can and the unmapped pages are not highly referenced) scenario. My issue is that changing the type of object being preferentially reclaimed just changes the type of workload that would prematurely suffer from reclaim. In this case, workloads that use a lot of unmapped pagecache would suffer. btw, aren't /proc/sys/vm/swappiness and vfs_cache_pressure similar knobs? Those tell you how to balance going after the different classes of things that we can reclaim. Again, this is useless when ballooning is being used. But, I'm thinking of a more general mechanism to force the system to both have MemFree _and_ be acting as if it is under memory pressure. If there is no memory pressure on the host, there is no reason for the guest to pretend it is under pressure. If there is memory pressure on the host, it should share the pain among its guests by applying the balloon. So I don't think voluntarily dropping cache is a good direction. There are two situations: 1. Voluntarily drop cache, if it was set up to do so (the host knows that it caches that information anyway) 2. Drop the cache on either a special balloon option; again, the host knows it caches that very same information, so it prefers to free that up first. -- Three Cheers, Balbir
Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control
* Balbir Singh bal...@linux.vnet.ibm.com [2010-06-08 21:21:46]: Selectively control Unmapped Page Cache (nospam version) From: Balbir Singh bal...@linux.vnet.ibm.com This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario - In a virtualized environment with cache=writethrough, we see double caching - (one in the host and one in the guest). As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages. - The option is controlled via a boot option and the administrator can selectively turn it on, on a need to use basis. A lot of the code is borrowed from zone_reclaim_mode logic for __zone_reclaim(). One might argue that with ballooning and KSM this feature is not very useful, but even with ballooning, we need extra logic to balloon multiple VM machines and it is hard to figure out the correct amount of memory to balloon. With these patches applied, each guest has a sufficient amount of free memory available, that can be easily seen and reclaimed by the balloon driver. The additional memory in the guest can be reused for additional applications or used to start additional guests/balance memory in the host. KSM currently does not de-duplicate host and guest page cache. The goal of this patch is to help automatically balance unmapped page cache when instructed to do so. There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO and the number of pages to reclaim when the unmapped_page_control argument is supplied.
These numbers were chosen to avoid aggressiveness in reaping page cache ever so frequently, at the same time providing control. The sysctl for min_unmapped_ratio provides further control from within the guest on the amount of unmapped pages to reclaim. Are there any major objections to this patch? -- Three Cheers, Balbir
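The trigger described above can be sketched as a simple threshold check. The value 16 for UNMAPPED_PAGE_RATIO is an assumed placeholder (the real constant lives in the patch), and the function name is invented for illustration:

```c
#include <assert.h>

/*
 * Sketch of the trim trigger: reclaim unmapped page cache only when it
 * exceeds both a ratio-derived bound and the floor set from the guest
 * via the min_unmapped_ratio sysctl.  UNMAPPED_PAGE_RATIO's real value
 * is in the patch; 16 is assumed here.
 */
#define UNMAPPED_PAGE_RATIO 16 /* assumed for illustration */

static int should_trim_unmapped(long zone_pages, long unmapped_pages,
                                long min_unmapped_pages)
{
    long bound = zone_pages / UNMAPPED_PAGE_RATIO;

    if (bound < min_unmapped_pages)
        bound = min_unmapped_pages;
    return unmapped_pages > bound;
}
```

Keeping both bounds is what prevents the "reaping page cache ever so frequently" behaviour: small amounts of cache never trip the check, and the guest admin can raise the floor further.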
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-06-11 14:05:53]: On Fri, 11 Jun 2010 10:16:32 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-06-11 10:54:41]: On Thu, 10 Jun 2010 17:07:32 -0700 Dave Hansen d...@linux.vnet.ibm.com wrote: On Thu, 2010-06-10 at 19:55 +0530, Balbir Singh wrote: I'm not sure victimizing unmapped cache pages is a good idea. Shouldn't page selection use the LRU for recency information instead of the cost of guest reclaim? Dropping a frequently used unmapped cache page can be more expensive than dropping an unused text page that was loaded as part of some executable's initialization and forgotten. We victimize the unmapped cache only if it is unused (in LRU order). We don't force the issue too much. We also have free slab cache to go after. Just to be clear, let's say we have a mapped page (say of /sbin/init) that's been unreferenced since _just_ after the system booted. We also have an unmapped page cache page of a file often used at runtime, say one from /etc/resolv.conf or /etc/passwd. Hmm. I'm not a fan of estimating working set size by calculation based on some numbers without considering history or feedback. Can't we use some kind of feedback algorithm such as hi-low-watermark, random walk or GA (or something smarter) to detect the size? Could you please clarify at what level you are suggesting size detection? I assume it is outside the OS, right? OS includes kernel and system programs ;) I can think of both an in-kernel and a userspace approach, and they should complement each other. An example of a kernel-based approach is: 1. add a shrinker callback(A) for balloon-driver-for-guest as guest kswapd. 2. add a shrinker callback(B) for balloon-driver-for-host as host kswapd. (I guess current balloon driver is only for host. Please imagine.) (A) increases free memory in Guest. (B) increases free memory in Host.
This is an example of feedback-based memory resizing between host and guest. I think (B) is necessary at least before considering complicated things.

B is left to the hypervisor and the memory policy running on it. My patches address Linux running as a guest, with a Linux hypervisor at the moment, but that can be extended to other balloon drivers as well.

To implement something clever, (A) and (B) should take into account how frequently memory reclaim in the guest (which requires some I/O) happens.

Yes, I think the policy in the hypervisor needs to look at those details as well.

If doing it outside the kernel, I think using memcg is better than depending on the balloon driver. But a co-operative balloon and memcg may show us something good.

Yes, agreed. Co-operative is better; if there is no co-operation then memcg might be used for enforcement.

-- Three Cheers, Balbir
Re: [PATCH RFC] KVM: busy-spin detector
* Marcelo Tosatti mtosa...@redhat.com [2010-06-10 23:25:51]:

The following patch implements a simple busy-spin detector. It considers a vcpu as busy-spinning if there are two consecutive exits due to external interrupt on the same RIP, and sleeps for 100us in that case. It is very likely that if the vcpu is making progress it will either exit for other reasons or change RIP. The percentage numbers below represent improvement in kernel build time in comparison with mainline (RHEL 5.4 guest).

Interesting approach. Is there a reason to tie it in with pause loop exits? Can't we do something more generic or even para-virtish?

-- Three Cheers, Balbir
Re: [PATCH RFC] KVM: busy-spin detector
* Huang, Zhiteng zhiteng.hu...@intel.com [2010-06-11 23:03:25]:

A PLE-like design may be more generic than para-virtish when it comes to Windows guests.

Hmm, sounds reasonable.

Is this busy-spin actually a Lock Holder Preemption problem?

Yep, I was hinting towards solving that problem.

-- Three Cheers, Balbir
Re: [PATCH RFC] KVM: busy-spin detector
* Marcelo Tosatti mtosa...@redhat.com [2010-06-11 14:46:27]:

Interesting approach. Is there a reason to tie it in with pause loop exits?

Hum, I don't see any. PLE exits provide the same detection, but more accurately.

Can't we do something more generic or even para-virtish?

This is pretty generic already? Or what do you mean? The advantage is it does not require paravirt modifications in the guest (at the expense of guessing what the guest is doing).

Agreed, but one needs to depend on newer hardware to get this feature to work.

-- Three Cheers, Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-10 12:43:11]:

On 06/08/2010 06:51 PM, Balbir Singh wrote:

Balloon unmapped page cache pages first

From: Balbir Singh bal...@linux.vnet.ibm.com

This patch builds on the ballooning infrastructure by ballooning unmapped page cache pages first. It looks for low-hanging fruit first and tries to reclaim clean unmapped pages first.

I'm not sure victimizing unmapped cache pages is a good idea. Shouldn't page selection use the LRU for recency information instead of the cost of guest reclaim? Dropping a frequently used unmapped cache page can be more expensive than dropping an unused text page that was loaded as part of some executable's initialization and forgotten.

We victimize the unmapped cache only if it is unused (in LRU order). We don't force the issue too much. We also have free slab cache to go after.

Many workloads have many unmapped cache pages, for example static web serving and the all-important kernel build.

I've tested kernbench; you can see the results in the original posting, and there is no observable overhead as a result of the patch in my run. The key advantage was that it resulted in lower RSS usage in the host and more cached usage, indicating that the caching had been pushed towards the host. The guest cached memory usage was lower and free memory in the guest was also higher.

Caching in the host is only helpful if the cache can be shared, otherwise it's better to cache in the guest.

Hmm, so we would need a balloon cache hint from the monitor, so that it is not unconditional?

Overall my results show the following:

1. No drastic reduction of guest unmapped cache, just sufficient to show lower RSS in the host. More freeable memory (as in cached memory + free memory) visible on the host.
2. No significant impact on the benchmark (numbers) running in the guest.
-- Three Cheers, Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-06-11 10:54:41]:

On Thu, 10 Jun 2010 17:07:32 -0700 Dave Hansen d...@linux.vnet.ibm.com wrote:

On Thu, 2010-06-10 at 19:55 +0530, Balbir Singh wrote:

I'm not sure victimizing unmapped cache pages is a good idea. Shouldn't page selection use the LRU for recency information instead of the cost of guest reclaim? Dropping a frequently used unmapped cache page can be more expensive than dropping an unused text page that was loaded as part of some executable's initialization and forgotten.

We victimize the unmapped cache only if it is unused (in LRU order). We don't force the issue too much. We also have free slab cache to go after.

Just to be clear, let's say we have a mapped page (say of /sbin/init) that's been unreferenced since _just_ after the system booted. We also have an unmapped page cache page of a file often used at runtime, say one from /etc/resolv.conf or /etc/passwd.

Hmm. I'm not a fan of estimating working set size by calculation based on some numbers without considering history or feedback. Can't we use some kind of feedback algorithm, such as a hi-low watermark, random walk or GA (or something smarter), to detect the size?

Could you please clarify at what level you are suggesting size detection? I assume it is outside the OS, right?

-- Three Cheers, Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Dave Hansen d...@linux.vnet.ibm.com [2010-06-10 17:07:32]:

On Thu, 2010-06-10 at 19:55 +0530, Balbir Singh wrote:

I'm not sure victimizing unmapped cache pages is a good idea. Shouldn't page selection use the LRU for recency information instead of the cost of guest reclaim? Dropping a frequently used unmapped cache page can be more expensive than dropping an unused text page that was loaded as part of some executable's initialization and forgotten.

We victimize the unmapped cache only if it is unused (in LRU order). We don't force the issue too much. We also have free slab cache to go after.

Just to be clear, let's say we have a mapped page (say of /sbin/init) that's been unreferenced since _just_ after the system booted. We also have an unmapped page cache page of a file often used at runtime, say one from /etc/resolv.conf or /etc/passwd. Which page will be preferred for eviction with this patch set?

In this case the order is as follows:

1. First we pick free pages, if any.
2. If we don't have free pages, we go after unmapped page cache and slab cache.
3. If that fails as well, we go after regular memory.

In the scenario that you describe, we'll not be able to easily free up the frequently referenced page from /etc/*. The code will move on to step 3 and do its regular reclaim.

-- Three Cheers, Balbir
[RFC/T/D][PATCH 0/2] KVM page cache optimization (v2)
This is version 2 of the page cache control patches for KVM. This series has two patches: the first controls the amount of unmapped page cache usage via a boot parameter and sysctl; the second controls page and slab cache via the balloon driver. Both patches make heavy use of the zone_reclaim() functionality already present in the kernel.

page-cache-control
balloon-page-cache

-- Three Cheers, Balbir
[RFC][PATCH 1/2] Linux/Guest unmapped page cache control
Selectively control Unmapped Page Cache (nospam version)

From: Balbir Singh bal...@linux.vnet.ibm.com

This patch implements unmapped page cache control via preferred page cache reclaim. The current patch hooks into kswapd and reclaims page cache if the user has requested unmapped page control. This is useful in the following scenario:

- In a virtualized environment with cache=writethrough, we see double caching - one in the host and one in the guest. As we try to scale guests, cache usage across the system grows. The goal of this patch is to reclaim page cache when Linux is running as a guest and get the host to hold the page cache and manage it. There might be temporary duplication, but in the long run, memory in the guests would be used for mapped pages.
- The option is controlled via a boot option and the administrator can selectively turn it on, on a need-to-use basis. A lot of the code is borrowed from the zone_reclaim_mode logic for __zone_reclaim().

One might argue that with ballooning and KSM this feature is not very useful, but even with ballooning, we need extra logic to balloon multiple VMs and it is hard to figure out the correct amount of memory to balloon. With these patches applied, each guest has a sufficient amount of free memory available that can be easily seen and reclaimed by the balloon driver. The additional memory in the guest can be reused for additional applications or used to start additional guests/balance memory in the host. KSM currently does not de-duplicate host and guest page cache.

The goal of this patch is to help automatically balance unmapped page cache when instructed to do so. There are some magic numbers in use in the code: UNMAPPED_PAGE_RATIO and the number of pages to reclaim when the unmapped_page_control argument is supplied. These numbers were chosen to avoid reaping page cache too aggressively or too frequently, while still providing control.
The sysctl for min_unmapped_ratio provides further control from within the guest on the amount of unmapped pages to reclaim. The patch is applied against mmotm feb-11-2010.

Data: Usage without boot parameter (memory in KB)

MemFree  Cached  Time
19900    292912  137
17540    296196  139
17900    296124  141
19356    296660  141

Host usage: (memory in KB)

RSS      Cache   mapped  swap
2788664  781884  3780    359536

Guest usage with boot parameter (memory in KB)

MemFree  Cached  Time
244824   74828   144
237840   81764   143
235880   83044   138
239312   80092   148

Host usage: (memory in KB)

RSS      Cache   mapped  swap
2700184  958012  334848  398412

TODOs:
1. Balance slab cache as well
2. Invoke the balance routines from the balloon driver

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 include/linux/mmzone.h |    2 -
 include/linux/swap.h   |    3 +
 mm/page_alloc.c        |    9 ++-
 mm/vmscan.c            |  165
 4 files changed, 134 insertions(+), 45 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..9f96b6d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -293,12 +293,12 @@ struct zone {
 	 */
 	unsigned long		lowmem_reserve[MAX_NR_ZONES];
+	unsigned long		min_unmapped_pages;
 #ifdef CONFIG_NUMA
 	int node;
 	/*
	 * zone reclaim becomes active if more unmapped pages exist.
	 */
-	unsigned long		min_unmapped_pages;
 	unsigned long		min_slab_pages;
 #endif
 	struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ff4acea..f92f1ee 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -251,10 +251,11 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
+extern bool should_balance_unmapped_pages(struct zone *zone);
+extern int sysctl_min_unmapped_ratio;

 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 431214b..fee9420 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1641,6 +1641,9 @@ zonelist_scan:
 		unsigned long mark;
 		int ret;

+		if (should_balance_unmapped_pages(zone))
+			wakeup_kswapd(zone, order);
+
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 		if (zone_watermark_ok(zone, order, mark,
 				      classzone_idx, alloc_flags))
@@ -4069,10 +4072,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat
[RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
Balloon unmapped page cache pages first

From: Balbir Singh bal...@linux.vnet.ibm.com

This patch builds on the ballooning infrastructure by ballooning unmapped page cache pages first. It looks for low-hanging fruit and tries to reclaim clean unmapped pages first. This patch brings zone_reclaim() and other dependencies out of CONFIG_NUMA and then reuses the zone_reclaim_mode logic if __GFP_FREE_CACHE is passed in the gfp_mask. The virtio balloon driver has been changed to use __GFP_FREE_CACHE.

Tests: I ran a simple filter function that kept frequently ballooning a single VM running kernbench. The VM was configured with 2GB of memory and 2 VCPUs. The filter function was a triangular wave that continuously ballooned the VM under study between 500MB and 1500MB. The run times of the VM with and without the changes are shown below; they show no significant impact from the changes.

With changes
Elapsed Time     223.86   (1.52822)
User Time        191.01   (0.65395)
System Time      199.468  (2.43616)
Percent CPU      174      (1)
Context Switches 103182   (595.05)
Sleeps           39107.6  (1505.67)

Without changes
Elapsed Time     225.526  (2.93102)
User Time        193.53   (3.53626)
System Time      199.832  (3.26281)
Percent CPU      173.6    (1.14018)
Context Switches 103744   (1311.53)
Sleeps           39383.2  (831.865)

The key advantage was that it resulted in lower RSS usage in the host and more cached usage, indicating that the caching had been pushed towards the host. The guest cached memory usage was lower and free memory in the guest was also higher.
Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 drivers/virtio/virtio_balloon.c |    3 ++-
 include/linux/gfp.h             |    8 +++-
 include/linux/swap.h            |    9 +++--
 mm/page_alloc.c                 |    3 ++-
 mm/vmscan.c                     |    2 +-
 5 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 0f1da45..609a9c2 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -104,7 +104,8 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
 	for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns++) {
 		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
-					       __GFP_NOMEMALLOC | __GFP_NOWARN);
+					       __GFP_NOMEMALLOC | __GFP_NOWARN |
+					       __GFP_FREE_CACHE);
 		if (!page) {
 			if (printk_ratelimit())
 				dev_printk(KERN_INFO, &vb->vdev->dev,
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 975609c..9048259 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -61,12 +61,18 @@ struct vm_area_struct;
 #endif

 /*
+ * While allocating pages, try to free cache pages first. Note the
+ * heavy dependency on zone_reclaim_mode logic
+ */
+#define __GFP_FREE_CACHE ((__force gfp_t)0x400000u)	/* Free cache first */
+
+/*
  * This may seem redundant, but it's a way of annotating false positives vs.
  * allocations that simply cannot be supported (e.g. page tables).
  */
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)

-#define __GFP_BITS_SHIFT 22	/* Room for 22 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 23	/* Room for 23 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

 /* This equals 0, but use constants in case they ever change */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f92f1ee..f77c603 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -254,16 +254,13 @@ extern long vm_total_pages;
 extern bool should_balance_unmapped_pages(struct zone *zone);
 extern int sysctl_min_unmapped_ratio;
-#ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
 extern int sysctl_min_slab_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
+
+#ifdef CONFIG_NUMA
+extern int zone_reclaim_mode;
 #else
 #define zone_reclaim_mode 0
-static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
-{
-	return 0;
-}
 #endif

 extern int page_evictable(struct page *page, struct vm_area_struct *vma);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fee9420..d977b36 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1649,7 +1649,8 @@ zonelist_scan:
 				       classzone_idx, alloc_flags))
 			goto try_this_zone;

-		if (zone_reclaim_mode == 0)
+		if (zone_reclaim_mode == 0 &&
+		    !(gfp_mask & __GFP_FREE_CACHE))
 			goto this_zone_full;

 		ret = zone_reclaim(zone, gfp_mask, order);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 27bc536..393bee5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2624,6 +2624,7 @@ module_init(kswapd_init)
 * the watermarks
Re: KVM and the OOM-Killer
* Athanasius k...@miggy.org [2010-05-14 08:33:34]:

On Thu, May 13, 2010 at 01:20:31PM +0100, James Stevens wrote:

We have a KVM host with 48Gb of RAM and run about 20 KVM clients on it. After some time - different time depending on the kernel version - the VM host kernel will start OOM-killing the VM clients, even when there is lots of free RAM (10Gb) and free SWAP (34Gb).

It seems going to a 64-bit kernel is what you want, but I thought it worth mentioning the available method to say try not to OOM-kill *this* process:

echo -16 > /proc/<pid>/oom_adj

A lot of this is being changed, but not yet committed. There are patches out there to deal with the lowmem issue. Meanwhile, do follow the suggestions on oom_adj and moving to 64 bit.

-- Three Cheers, Balbir
Re: KVM and the OOM-Killer
* James Stevens james.stev...@jrcs.co.uk [2010-05-14 09:10:19]:

echo -16 > /proc/<pid>/oom_adj

Thanks for that - yes, I know about oom_adj, but it doesn't (totally) work. udevd has a default of -17 and it got killed anyway. Also, the only thing this server runs is VMs, so if they can't be killed the oom-killer will just run through everything else (syslogd, sshd, klogd, udevd, hald, agetty etc.) - so on balance it's a case of which is worse? Without those daemons the system can become inaccessible and could become unstable, so on balance it may be better to let it kill the VMs. My current work-around is:

sync; echo 3 > /proc/sys/vm/drop_caches

Have you looked at memory cgroups and using them with limits for the VMs?

-- Three Cheers, Balbir
Re: KVM and the OOM-Killer
* James Stevens james.stev...@jrcs.co.uk [2010-05-14 09:43:04]:

Have you looked at memory cgroups and using them with limits for the VMs?

The problem was *NOT* that my VMs exhausted all memory. I know that is what normally triggers the oom-killer, but you have to understand that mine was a very different scenario, hence I wanted to bring it to people's attention. I had about 10Gb of *FREE* HIGH and 34GB of *FREE* SWAP when the oom-killer was activated - yep, didn't make sense to me either. If you want to study the logs :-

I understand. You could potentially encapsulate everything else - except your VMs - in a small cgroup and frequently reclaim from there using the memory cgroup. If drop_caches works for you, that is good too. I am surprised that cache allocations are causing lowmem exhaustion.

-- Three Cheers, Balbir
[PATCH][RESEND]Fix GFP flags passed from the virtio balloon driver
Fix GFP flags passed from the virtio balloon driver

From: Balbir Singh bal...@linux.vnet.ibm.com

The virtio balloon driver can dig into the reservation pools of the OS to satisfy a balloon request. This is not advisable, and other balloon drivers (drivers/xen/balloon.c) avoid this as well. The patch also avoids printing a warning if the allocation fails.

Comments?

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 drivers/virtio/virtio_balloon.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 369f2ee..f8ffe8c 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -102,7 +102,8 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
 	num = min(num, ARRAY_SIZE(vb->pfns));
 	for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns++) {
-		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY);
+		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
+					       __GFP_NOMEMALLOC | __GFP_NOWARN);
 		if (!page) {
 			if (printk_ratelimit())
 				dev_printk(KERN_INFO, &vb->vdev->dev,

-- Three Cheers, Balbir