Performance counter - vPMU

2011-11-14 Thread Balbir Singh
Hi,

I saw a presentation on virtualizing performance counters at
http://www.linux-kvm.org/wiki/images/6/6d/Kvm-forum-2011-performance-monitoring.pdf.
Has the code been merged? Can I get something to play with/provide
feedback?

Balbir Singh


Re: Performance counter - vPMU

2011-11-14 Thread Balbir Singh
On Mon, Nov 14, 2011 at 5:48 PM, Gleb Natapov g...@redhat.com wrote:

 Not yet merged. You can take it from here https://lkml.org/lkml/2011/11/10/215

Thank you very much, Gleb!

Balbir Singh.


Re: [PATCH 0/3] Unmapped page cache control (v5)

2011-04-01 Thread Balbir Singh
* KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com [2011-04-01 16:56:57]:

 Hi
 
   1) zone reclaim doesn't work if the system has multiple nodes and the
  workload is file cache oriented (e.g. file server, web server, mail
  server, et al), because zone reclaim frees many more pages than
  zone->pages_min; new page cache requests then consume the nearest
  node's memory, which triggers the next zone reclaim. Memory
  utilization is reduced and unnecessary LRU discard increases
  dramatically.

  SGI folks added a CPUSET specific solution in the past
  (cpuset.memory_spread_page), but global reclaim still has this issue.
  zone reclaim is an HPC workload specific feature and HPC folks have
  no reason not to use CPUSET.
  
  I am afraid you misread the patches and the intent. The intent is to
  explicitly enable control of unmapped pages; it has nothing
  specifically to do with multiple nodes at this point. The control is
  system wide and carefully enabled by the administrator.
 
 Hm. OK, I may have misread.
 Can you please explain why the de-duplication feature needs to be
 selectable and disabled by default? Does "explicitly enable" mean this
 feature is meant to address a corner case issue?


Yes, because given a selection of choices (including what you
mentioned in the review), it would be nice to have
this selectable.

-- 
Three Cheers,
Balbir


Re: [PATCH 0/3] Unmapped page cache control (v5)

2011-04-01 Thread Balbir Singh
* Andrew Morton a...@linux-foundation.org [2011-03-30 22:32:31]:

 On Thu, 31 Mar 2011 10:57:03 +0530 Balbir Singh bal...@linux.vnet.ibm.com 
 wrote:
 
  * Andrew Morton a...@linux-foundation.org [2011-03-30 16:36:07]:
  
   On Wed, 30 Mar 2011 11:00:26 +0530
   Balbir Singh bal...@linux.vnet.ibm.com wrote:
   
Data from the previous patchsets can be found at
https://lkml.org/lkml/2010/11/30/79
   
   It would be nice if the data for the current patchset was present in
   the current patchset's changelog!
  
  
  Sure, since there were no major changes, I put in a URL. The main
  change was the documentation update. 
 
 Well some poor schmuck has to copy and paste the data into the
 changelog so it's still there in five years time.  It's better to carry
 this info around in the patch's own metadata, and to maintain
 and update it.


Agreed, will do. 

-- 
Three Cheers,
Balbir


Re: [PATCH 0/3] Unmapped page cache control (v5)

2011-04-01 Thread Balbir Singh
* KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com [2011-04-01 22:21:26]:

  * KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com [2011-04-01 16:56:57]:
  
   Hi
   
  1) zone reclaim doesn't work if the system has multiple nodes and the
 workload is file cache oriented (e.g. file server, web server, mail
 server, et al), because zone reclaim frees many more pages than
 zone->pages_min; new page cache requests then consume the nearest
 node's memory, which triggers the next zone reclaim. Memory
 utilization is reduced and unnecessary LRU discard increases
 dramatically.
 
 SGI folks added a CPUSET specific solution in the past
 (cpuset.memory_spread_page), but global reclaim still has this issue.
 zone reclaim is an HPC workload specific feature and HPC folks have
 no reason not to use CPUSET.

I am afraid you misread the patches and the intent. The intent is to
explicitly enable control of unmapped pages; it has nothing
specifically to do with multiple nodes at this point. The control is
system wide and carefully enabled by the administrator.
   
   Hm. OK, I may have misread.
   Can you please explain why the de-duplication feature needs to be
   selectable and disabled by default? Does "explicitly enable" mean this
   feature is meant to address a corner case issue?
  
  Yes, because given a selection of choices (including what you
  mentioned in the review), it would be nice to have
  this selectable.
 
 That's not a good answer. :-/

I am afraid I cannot please you with my answers.

 Who needs the feature and who shouldn't use it? Is it valuable enough
 for a large enough group of people? That's my question.
 

You can see the use cases documented, including running Linux as
a guest under other hypervisors. Today we have a choice of not using
the host page cache with cache=none, but nothing the other way round.
There are other use cases for embedded folks (in terms of controlling
unmapped page cache); please see previous discussions.
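For illustration only, a minimal sketch of the two configurations being
contrasted; the disk image name and the rest of the QEMU invocation are
placeholders, and unmapped_page_control is the boot parameter added by
this series:

    # Today: avoid host-side caching per disk, the guest keeps its own cache
    qemu-system-x86_64 ... -drive file=guest.img,cache=none

    # With these patches: keep host-side caching (cache=writethrough) and
    # ask the guest kernel to shrink its unmapped page cache instead
    qemu-system-x86_64 ... -drive file=guest.img,cache=writethrough
    # guest kernel command line: ... unmapped_page_control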

-- 
Three Cheers,
Balbir


Re: [PATCH 0/3] Unmapped page cache control (v5)

2011-03-31 Thread Balbir Singh
* KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com [2011-03-31 14:40:33]:

  
  The following series implements page cache control,
  this is a split out version of patch 1 of version 3 of the
  page cache optimization patches posted earlier at
  Previous posting http://lwn.net/Articles/425851/ and analysis
  at http://lwn.net/Articles/419713/
  
  Detailed Description
  
  This patch implements unmapped page cache control via preferred
  page cache reclaim. The current patch hooks into kswapd and reclaims
  page cache if the user has requested for unmapped page control.
  This is useful in the following scenario
  - In a virtualized environment with cache=writethrough, we see
double caching - (one in the host and one in the guest). As
we try to scale guests, cache usage across the system grows.
The goal of this patch is to reclaim page cache when Linux is running
as a guest and get the host to hold the page cache and manage it.
There might be temporary duplication, but in the long run, memory
in the guests would be used for mapped pages.
 
  - The option is controlled via a boot option and the administrator
can selectively turn it on, on a need to use basis.
  
  A lot of the code is borrowed from zone_reclaim_mode logic for
  __zone_reclaim(). One might argue that with ballooning and
  KSM this feature is not very useful, but even with ballooning,
  we need extra logic to balloon multiple VMs and it is hard
  to figure out the correct amount of memory to balloon. With these
  patches applied, each guest has a sufficient amount of free memory
  available, that can be easily seen and reclaimed by the balloon driver.
  The additional memory in the guest can be reused for additional
  applications or used to start additional guests/balance memory in
  the host.
 
 If anyone thinks this series works, they are just crazy. This patch
 reintroduces two old issues.
 
 1) zone reclaim doesn't work if the system has multiple nodes and the
workload is file cache oriented (e.g. file server, web server, mail server,
et al), because zone reclaim frees many more pages than zone->pages_min;
new page cache requests then consume the nearest node's memory, which
triggers the next zone reclaim. Memory utilization is reduced and
unnecessary LRU discard increases dramatically.
 
SGI folks added a CPUSET specific solution in the past
(cpuset.memory_spread_page), but global reclaim still has this issue.
zone reclaim is an HPC workload specific feature and HPC folks have no
reason not to use CPUSET.


I am afraid you misread the patches and the intent. The intent is to
explicitly enable control of unmapped pages; it has nothing
specifically to do with multiple nodes at this point. The control is
system wide and carefully enabled by the administrator.
 
 2) Before 2.6.27, the VM had only one LRU and calc_reclaim_mapped() was used
to decide whether to filter out mapped pages. It caused a lot of problems
for DB servers and large application servers, because if the system has a
lot of mapped pages, 1) the LRU was churned and reclaim stopped behaving
like LRU, and 2) reclaim latency became terribly slow, so hangup detectors
misdetected the state and started forcing reboots. That was a big problem
for RHEL5 based banking systems.
So sc->may_unmap should be killed in the future. Don't add new users.
 

Can you remove sc->may_unmap without removing zone_reclaim()? The LRU
churn can be addressed at the time of isolation; I'll send out an
incremental patch for that.
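A hypothetical sketch of what "addressing the churn at isolation time"
could look like (this is not the posted incremental patch; the helper
name is made up, only page_mapped() and sc->may_unmap are existing
names):

    /* Reject mapped pages during isolation when the caller asked for
     * unmapped pages only, so they are never pulled off the LRU and
     * rotated back, which is what churns the list. */
    static bool isolation_allowed(struct page *page, struct scan_control *sc)
    {
    	if (!sc->may_unmap && page_mapped(page))
    		return false;	/* leave the mapped page where it is */
    	return true;
    }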

 
 But I agree that we now have to consider a somewhat large VM change, perhaps
 (or perhaps not). OK, it's a good opportunity to spell some things out.
 Historically, Linux MM has had a "free memory is wasted memory" policy, and
 it worked completely fine. But now we have a few exceptions.
 
 1) RT, embedded and finance systems. They really hope to avoid reclaim
latency (i.e. avoid foreground reclaim completely) and they can accept
keeping slightly more free pages around before memory shortage hits.
 
 2) VM guest
A VM host and a VM guest naturally make a two level page cache model, and
the Linux page cache plus two levels doesn't work well. It has two issues:
1) it is hard to visualize real memory consumption, which makes it harder
   to make ballooning work well. And Google wants to visualize memory
   utilization to pack in more jobs.
2) it is hard to build an in-kernel memory utilization improvement mechanism.
 
 
 And now we have four proposals for these utilization related issues.
 
 1) cleancache (from Oracle)

Cleancache requires both hypervisor and guest support. With these
patches, Linux can run well under a hypervisor if we know the hypervisor
does a lot of the IO and maintains the cache.

 2) VirtFS (from IBM)
 3) kstaled (from Google)
 4) unmapped page reclaim (from you)
 
 Probably, we can't merge all of them and we need to consolidate the
 requirements and implementations.
 
 
 cleancache seems most straightforward 

Re: [PATCH 0/3] Unmapped page cache control (v5)

2011-03-31 Thread Balbir Singh
* Dave Chinner da...@fromorbit.com [2011-04-01 08:40:33]:

 On Wed, Mar 30, 2011 at 11:00:26AM +0530, Balbir Singh wrote:
  
  The following series implements page cache control,
  this is a split out version of patch 1 of version 3 of the
  page cache optimization patches posted earlier at
  Previous posting http://lwn.net/Articles/425851/ and analysis
  at http://lwn.net/Articles/419713/
  
  Detailed Description
  
  This patch implements unmapped page cache control via preferred
  page cache reclaim. The current patch hooks into kswapd and reclaims
  page cache if the user has requested for unmapped page control.
  This is useful in the following scenario
  - In a virtualized environment with cache=writethrough, we see
double caching - (one in the host and one in the guest). As
we try to scale guests, cache usage across the system grows.
The goal of this patch is to reclaim page cache when Linux is running
as a guest and get the host to hold the page cache and manage it.
There might be temporary duplication, but in the long run, memory
in the guests would be used for mapped pages.
 
 What does this do that cache=none for the VMs and using the page
 cache inside the guest doesn't achieve? That avoids double caching
 and doesn't require any new complexity inside the host OS to
 achieve...


There was a long discussion on cache=none in the first posting and the
downsides/impact on throughput. Please see
http://www.mail-archive.com/kvm@vger.kernel.org/msg30655.html 

-- 
Three Cheers,
Balbir


Re: [PATCH 0/3] Unmapped page cache control (v5)

2011-03-30 Thread Balbir Singh
* Andrew Morton a...@linux-foundation.org [2011-03-30 16:36:07]:

 On Wed, 30 Mar 2011 11:00:26 +0530
 Balbir Singh bal...@linux.vnet.ibm.com wrote:
 
  Data from the previous patchsets can be found at
  https://lkml.org/lkml/2010/11/30/79
 
 It would be nice if the data for the current patchset was present in
 the current patchset's changelog!


Sure, since there were no major changes, I put in a URL. The main
change was the documentation update. 

-- 
Three Cheers,
Balbir


Re: [PATCH 3/3] Provide control over unmapped pages (v5)

2011-03-30 Thread Balbir Singh
* Andrew Morton a...@linux-foundation.org [2011-03-30 16:35:45]:

 On Wed, 30 Mar 2011 11:02:38 +0530
 Balbir Singh bal...@linux.vnet.ibm.com wrote:
 
  Changelog v4
  1. Added documentation for max_unmapped_pages
  2. Better #ifdef'ing of max_unmapped_pages and min_unmapped_pages
  
  Changelog v2
  1. Use a config option to enable the code (Andrew Morton)
  2. Explain the magic tunables in the code or at-least attempt
 to explain them (General comment)
  3. Hint uses of the boot parameter with unlikely (Andrew Morton)
  4. Use better names (balanced is not a good naming convention)
  
  Provide control using zone_reclaim() and a boot parameter. The
  code reuses functionality from zone_reclaim() to isolate unmapped
  pages and reclaim them as a priority, ahead of other mapped pages.
  
 
 This:
 
 akpm:/usr/src/25 grep '^+#' 
 patches/provide-control-over-unmapped-pages-v5.patch 
 +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
 +#endif
 +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
 +#endif
 +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
 +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
 +#endif
 +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
 +#else
 +#endif
 +#ifdef CONFIG_NUMA
 +#else
 +#define zone_reclaim_mode 0
 +#endif
 +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
 +#endif
 +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
 +#endif
 +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
 +#endif
 +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
 +#endif
 +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
 +#endif
 +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
 +#endif
 +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
 +#else /* !CONFIG_UNMAPPED_PAGECACHE_CONTROL */
 +#endif
 +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
 +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
 +#endif
 +#endif
 +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
 
 is getting out of control.  What happens if we just make the feature
 non-configurable?


I added the configuration based on review comments I received. If the
feature is made non-configurable, it should be easy to remove them or
just set the default value to y in the config.
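For what it's worth, a minimal sketch of the second option being
suggested (keeping the config option but defaulting it on); the prompt
text is illustrative, only the symbol name comes from the posted
init/Kconfig hunk:

    config UNMAPPED_PAGECACHE_CONTROL
    	bool "Provide unmapped page cache control"
    	default y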
 
  +static int __init unmapped_page_control_parm(char *str)
  +{
  +   unmapped_page_control = 1;
  +   /*
  +* XXX: Should we tweak swappiness here?
  +*/
  +   return 1;
  +}
   +__setup("unmapped_page_control", unmapped_page_control_parm);
 
 That looks like a pain - it requires a reboot to change the option,
 which makes testing harder and slower.  Methinks you're being a bit
 virtualization-centric here!

:-) The reason for the boot parameter is to ensure that people know
what they are doing.

 
  +#else /* !CONFIG_UNMAPPED_PAGECACHE_CONTROL */
  +static inline void reclaim_unmapped_pages(int priority,
  +   struct zone *zone, struct scan_control *sc)
  +{
  +   return 0;
  +}
  +#endif
  +
   static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
struct scan_control *sc)
   {
  @@ -2371,6 +2394,12 @@ loop_again:
  shrink_active_list(SWAP_CLUSTER_MAX, zone,
   &sc, priority, 0);
   
  +   /*
  +* We do unmapped page reclaim once here and once
  +* below, so that we don't lose out
  +*/
   +   reclaim_unmapped_pages(priority, zone, &sc);
 
 Doing this here seems wrong.  balance_pgdat() does two passes across
 the zones.  The first pass is a read-only work-out-what-to-do pass and
 the second pass is a now-reclaim-some-stuff pass.  But here we've stuck
 a do-some-reclaiming operation inside the first, work-out-what-to-do pass.


The reason is primarily balancing: zone watermarks do not give us
a good idea of whether unmapped pages are balanced, hence the code.
 
 
  @@ -2408,6 +2437,11 @@ loop_again:
  continue;
   
  sc.nr_scanned = 0;
  +   /*
  +* Reclaim unmapped pages upfront, this should be
  +* really cheap
 
 Comment is mysterious.  Why is it cheap?

Cheap because we do a quick check to see if unmapped pages exceed a
threshold. Since only selective users are expected to enable this
functionality (the use case is primarily embedded and virtualization
folks), this should remain a simple check.
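For reference, a minimal sketch of the quick check being described,
modelled on the guard at the top of reclaim_unmapped_pages() in the v4
posting later in this digest; the helper name unmapped_reclaim_needed()
is hypothetical, while unmapped_page_control, zone_unmapped_file_pages()
and zone->min_unmapped_pages are the names used in that posting:

    /* Sketch: skip all reclaim work unless the feature was enabled at
     * boot and this zone's unmapped page cache exceeds its floor. */
    static bool unmapped_reclaim_needed(struct zone *zone)
    {
    	if (!unmapped_page_control)		/* boot parameter not set */
    		return false;
    	return zone_unmapped_file_pages(zone) > zone->min_unmapped_pages;
    }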

 
  +*/
   +   reclaim_unmapped_pages(priority, zone, &sc);
 
 
 I dunno, the whole thing seems rather nasty to me.
 
 It sticks a magical reclaim-unmapped-pages operation right in the
 middle of regular page reclaim.  This means that reclaim will walk the
 LRU looking at mapped and unmapped pages.  Then it will walk some more,
 looking at only unmapped pages

[PATCH 0/3] Unmapped page cache control (v5)

2011-03-29 Thread Balbir Singh

The following series implements page cache control.
This is a split-out version of patch 1 of version 3 of the
page cache optimization patches posted earlier.
Previous posting: http://lwn.net/Articles/425851/, analysis
at http://lwn.net/Articles/419713/

Detailed Description

This patch implements unmapped page cache control via preferred
page cache reclaim. The current patch hooks into kswapd and reclaims
page cache if the user has requested unmapped page control.
This is useful in the following scenario:
- In a virtualized environment with cache=writethrough, we see
  double caching - (one in the host and one in the guest). As
  we try to scale guests, cache usage across the system grows.
  The goal of this patch is to reclaim page cache when Linux is running
  as a guest and get the host to hold the page cache and manage it.
  There might be temporary duplication, but in the long run, memory
  in the guests would be used for mapped pages.
- The option is controlled via a boot option and the administrator
  can selectively turn it on, on a need to use basis.

A lot of the code is borrowed from zone_reclaim_mode logic for
__zone_reclaim(). One might argue that with ballooning and
KSM this feature is not very useful, but even with ballooning,
we need extra logic to balloon multiple VMs and it is hard
to figure out the correct amount of memory to balloon. With these
patches applied, each guest has a sufficient amount of free memory
available, that can be easily seen and reclaimed by the balloon driver.
The additional memory in the guest can be reused for additional
applications or used to start additional guests/balance memory in
the host.

KSM currently does not de-duplicate host and guest page cache. The goal
of this patch is to help automatically balance unmapped page cache when
instructed to do so.

The sysctl for min_unmapped_ratio provides further control from
within the guest on the amount of unmapped pages to reclaim; a similar
max_unmapped_ratio sysctl is added and helps in the decision making
process of when reclaim should occur. This is tunable and set by
default to 16 (based on tradeoffs seen between aggressiveness in
balancing versus size of unmapped pages). Distros and administrators
can further tweak this for desired control.
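To illustrate how the knobs fit together (the values are only examples;
the sysctl and boot parameter names come from the patches in this
series), the feature is enabled once at boot and the ratios are then
tuned at runtime from inside the guest:

    # guest kernel command line: enable unmapped page cache control
    ... unmapped_page_control

    # runtime tuning (percent of total pages in each zone)
    sysctl -w vm.min_unmapped_ratio=1    # floor the cache is shrunk down to
    sysctl -w vm.max_unmapped_ratio=16   # ceiling above which reclaim starts

With the per-zone conversion used in the patches
(zone->max_unmapped_pages = present pages * ratio / 100), a zone of
1,000,000 pages with the default ratio of 16 starts reclaiming once
roughly 160,000 pages (about 625 MB with 4 KB pages) of unmapped page
cache accumulate.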

Data from the previous patchsets can be found at
https://lkml.org/lkml/2010/11/30/79

---

Balbir Singh (3):
  Move zone_reclaim() outside of CONFIG_NUMA
  Refactor zone_reclaim code
  Provide control over unmapped pages


 Documentation/kernel-parameters.txt |8 ++
 Documentation/sysctl/vm.txt |   19 +
 include/linux/mmzone.h  |   11 +++
 include/linux/swap.h|   25 ++-
 init/Kconfig|   12 +++
 kernel/sysctl.c |   29 ++--
 mm/page_alloc.c |   35 +-
 mm/vmscan.c |  123 +++
 8 files changed, 229 insertions(+), 33 deletions(-)

-- 
Three Cheers,
Balbir


[PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v5)

2011-03-29 Thread Balbir Singh
This patch moves zone_reclaim and associated helpers
outside CONFIG_NUMA. This infrastructure is reused
in the patches for page cache control that follow.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
Reviewed-by: Christoph Lameter c...@linux.com
---
 include/linux/mmzone.h |4 ++--
 include/linux/swap.h   |4 ++--
 kernel/sysctl.c|   16 
 mm/page_alloc.c|6 +++---
 mm/vmscan.c|2 --
 5 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 628f07b..59cbed0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -306,12 +306,12 @@ struct zone {
 */
unsigned long   lowmem_reserve[MAX_NR_ZONES];
 
-#ifdef CONFIG_NUMA
-   int node;
/*
 * zone reclaim becomes active if more unmapped pages exist.
 */
unsigned long   min_unmapped_pages;
+#ifdef CONFIG_NUMA
+   int node;
unsigned long   min_slab_pages;
 #endif
struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ed6ebe6..ce8f686 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -264,11 +264,11 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
+extern int sysctl_min_unmapped_ratio;
+extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
 #define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 927fc5a..e3a8ce4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1214,14 +1214,6 @@ static struct ctl_table vm_table[] = {
.proc_handler   = proc_dointvec_unsigned,
},
 #endif
-#ifdef CONFIG_NUMA
-   {
-   .procname   = "zone_reclaim_mode",
-   .data   = &zone_reclaim_mode,
-   .maxlen = sizeof(zone_reclaim_mode),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec_unsigned,
-   },
{
.procname   = "min_unmapped_ratio",
.data   = &sysctl_min_unmapped_ratio,
@@ -1231,6 +1223,14 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &one_hundred,
},
+#ifdef CONFIG_NUMA
+   {
+   .procname   = "zone_reclaim_mode",
+   .data   = &zone_reclaim_mode,
+   .maxlen = sizeof(zone_reclaim_mode),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_unsigned,
+   },
{
.procname   = "min_slab_ratio",
.data   = &sysctl_min_slab_ratio,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e1b52a..1d32865 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4249,10 +4249,10 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat,
 
zone->spanned_pages = size;
zone->present_pages = realsize;
-#ifdef CONFIG_NUMA
-   zone->node = nid;
zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
/ 100;
+#ifdef CONFIG_NUMA
+   zone->node = nid;
zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
 #endif
zone->name = zone_names[j];
@@ -5157,7 +5157,6 @@ int min_free_kbytes_sysctl_handler(ctl_table *table, int 
write,
return 0;
 }
 
-#ifdef CONFIG_NUMA
 int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
 {
@@ -5174,6 +5173,7 @@ int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table 
*table, int write,
return 0;
 }
 
+#ifdef CONFIG_NUMA
 int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 060e4c1..4923160 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2874,7 +2874,6 @@ static int __init kswapd_init(void)
 
 module_init(kswapd_init)
 
-#ifdef CONFIG_NUMA
 /*
  * Zone reclaim mode
  *
@@ -3084,7 +3083,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, 
unsigned int order)
 
return ret;
 }
-#endif
 
 /*
  * page_evictable - test whether a page is evictable



[PATCH 2/3] Refactor zone_reclaim code (v5)

2011-03-29 Thread Balbir Singh
Changelog v3
1. Renamed zone_reclaim_unmapped_pages to zone_reclaim_pages

Refactor zone_reclaim, move reusable functionality outside
of zone_reclaim. Make zone_reclaim_unmapped_pages modular

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
Reviewed-by: Christoph Lameter c...@linux.com
---
 mm/vmscan.c |   35 +++
 1 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4923160..5b24e74 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2949,6 +2949,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
 }
 
 /*
+ * Helper function to reclaim unmapped pages, we might add something
+ * similar to this for slab cache as well. Currently this function
+ * is shared with __zone_reclaim()
+ */
+static inline void
+zone_reclaim_pages(struct zone *zone, struct scan_control *sc,
+   unsigned long nr_pages)
+{
+   int priority;
+   /*
+* Free memory by calling shrink zone with increasing
+* priorities until we have enough memory freed.
+*/
+   priority = ZONE_RECLAIM_PRIORITY;
+   do {
+   shrink_zone(priority, zone, sc);
+   priority--;
+   } while (priority >= 0 && sc->nr_reclaimed < nr_pages);
+}
+
+/*
  * Try to free up some pages from this zone through reclaim.
  */
 static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int 
order)
@@ -2957,7 +2978,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t 
gfp_mask, unsigned int order)
const unsigned long nr_pages = 1 << order;
struct task_struct *p = current;
struct reclaim_state reclaim_state;
-   int priority;
struct scan_control sc = {
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2981,17 +3001,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t 
gfp_mask, unsigned int order)
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
 
-   if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
-   /*
-* Free memory by calling shrink zone with increasing
-* priorities until we have enough memory freed.
-*/
-   priority = ZONE_RECLAIM_PRIORITY;
-   do {
-   shrink_zone(priority, zone, sc);
-   priority--;
-   } while (priority >= 0 && sc.nr_reclaimed < nr_pages);
-   }
+   if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages)
+   zone_reclaim_pages(zone, &sc, nr_pages);
 
nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
if (nr_slab_pages0 > zone->min_slab_pages) {



[PATCH 3/3] Provide control over unmapped pages (v5)

2011-03-29 Thread Balbir Singh
Changelog v4
1. Added documentation for max_unmapped_pages
2. Better #ifdef'ing of max_unmapped_pages and min_unmapped_pages

Changelog v2
1. Use a config option to enable the code (Andrew Morton)
2. Explain the magic tunables in the code or at-least attempt
   to explain them (General comment)
3. Hint uses of the boot parameter with unlikely (Andrew Morton)
4. Use better names (balanced is not a good naming convention)

Provide control using zone_reclaim() and a boot parameter. The
code reuses functionality from zone_reclaim() to isolate unmapped
pages and reclaim them as a priority, ahead of other mapped pages.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
Reviewed-by: Christoph Lameter c...@linux.com
---
 Documentation/kernel-parameters.txt |8 +++
 Documentation/sysctl/vm.txt |   19 +++-
 include/linux/mmzone.h  |7 +++
 include/linux/swap.h|   25 --
 init/Kconfig|   12 +
 kernel/sysctl.c |   13 +
 mm/page_alloc.c |   29 
 mm/vmscan.c |   88 +++
 8 files changed, 194 insertions(+), 7 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index d4e67a5..f522c34 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2520,6 +2520,14 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
[X86]
Set unknown_nmi_panic=1 early on boot.
 
+   unmapped_page_control
+   [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
+   is enabled. It controls the amount of unmapped memory
+   that is present in the system. This boot option plus
+   vm.min_unmapped_ratio (sysctl) provide granular control
+   over how much unmapped page cache can exist in the 
system
+   before kswapd starts reclaiming unmapped page cache 
pages.
+
usbcore.autosuspend=
[USB] The autosuspend time delay (in seconds) used
for newly-detected USB devices (default 2).  This
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 30289fa..1c722f7 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -381,11 +381,14 @@ and may not be fast.
 
 min_unmapped_ratio:
 
-This is available only on NUMA kernels.
+This is available only on NUMA kernels or when unmapped page cache
+control is enabled.
 
 This is a percentage of the total pages in each zone. Zone reclaim will
 only occur if more than this percentage of pages are in a state that
-zone_reclaim_mode allows to be reclaimed.
+zone_reclaim_mode allows to be reclaimed. If unmapped page cache control
+is enabled, this is the minimum level to which the cache will be shrunk.
 
 If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared
 against all file-backed unmapped pages including swapcache pages and tmpfs
@@ -396,6 +399,18 @@ The default is 1 percent.
 
 ==
 
+max_unmapped_ratio:
+
+This is available only when unmapped page cache control is enabled.
+
+This is a percentage of the total pages in each zone. Zone reclaim will
+only occur if more than this percentage of pages are in an unmapped state
+and unmapped page cache control is enabled.
+
+The default is 16 percent.
+
+==
+
 mmap_min_addr
 
 This file indicates the amount of address space  which a user process will
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 59cbed0..caa29ad 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -309,7 +309,12 @@ struct zone {
/*
 * zone reclaim becomes active if more unmapped pages exist.
 */
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
unsigned long   min_unmapped_pages;
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+   unsigned long   max_unmapped_pages;
+#endif
 #ifdef CONFIG_NUMA
int node;
unsigned long   min_slab_pages;
@@ -776,6 +781,8 @@ int percpu_pagelist_fraction_sysctl_handler(struct 
ctl_table *, int,
void __user *, size_t *, loff_t *);
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
+int sysctl_max_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
+   void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ce8f686..86cafc5 100644

Re: [PATCH 3/3] Provide control over unmapped pages (v4)

2011-02-14 Thread Balbir Singh
* MinChan Kim minchan@gmail.com [2011-02-10 14:41:44]:

 I don't know why part of the message gets deleted only when I send to you.
 Maybe it's a gmail bug.
 
 I hope the mail goes through this time. :)
 
 On Thu, Feb 10, 2011 at 2:33 PM, Minchan Kim minchan@gmail.com wrote:
  Sorry for late response.
 
  On Fri, Jan 28, 2011 at 8:18 PM, Balbir Singh bal...@linux.vnet.ibm.com 
  wrote:
  * MinChan Kim minchan@gmail.com [2011-01-28 16:24:19]:
 
  
   But the assumption for LRU order to change happens only if the page
   cannot be successfully freed, which means it is in some way active..
   and needs to be moved no?
 
  1. pages held by someone
  2. mapped pages
  3. active pages
 
  1 is rare so it isn't the problem.
  Of course, in case of 3, we have to activate it so no problem.
  The problem is 2.
 
 
  2 is a problem, but due to the size aspects not a big one. Like you
  said, even lumpy reclaim affects it. Maybe the reclaim code could
  honour may_unmap much earlier.
 
  Even if it is, it's a trade-off to get big contiguous memory. I
  don't want to add a new mess. (In addition, lumpy reclaim is being
  weakened by compaction as time goes by.)
  What I have in mind for preventing the LRU from being ignored is to put
  the page back into its original position instead of the head of the LRU.
  Maybe it can help both the lumpy situation and your case. But that's
  another story.
 
  How about the idea?
 
  I borrow the idea from CFLRU[1]
  - PCFLRU (Page-Cache First LRU)

  When we allocate a new page for the page cache, we add the page at the
  LRU's tail.
  When we map the page cache page into a page table, we rotate the page to
  the LRU's head.

  So the inactive list ends up looking like this:

  M.P : mapped page
  N.P : non-mapped page

  HEAD-M.P-M.P-M.P-M.P-N.P-N.P-N.P-N.P-N.P-TAIL

  The admin can set a threshold window size which determines when to stop
  reclaiming non-mapped pages contiguously.

  I think it needs some tweaking of the page cache/page mapping functions,
  but we can use kswapd/direct reclaim without change.

  Also, it can change the page reclaim policy totally, but that's just what
  you want, I think.
 

I am not sure how this would work; moreover, the idea behind
min_unmapped_pages is to keep sufficient unmapped pages around for the
FS metadata, and it has been working with the existing code for zone
reclaim. What you propose is a more drastic re-org of the LRU and I am
not sure I have the appetite for it.
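For reference, a minimal sketch of the PCFLRU idea quoted above. The
helper names are hypothetical and 'inactive' stands in for the zone's
real inactive file LRU list; this only illustrates the described
behaviour and is not code from any posted patch.

    #include <linux/list.h>
    #include <linux/mm_types.h>

    /* New, unmapped page cache page: queue it at the tail so it is the
     * first candidate for reclaim. */
    static void pcflru_add_page_cache(struct page *page,
    				      struct list_head *inactive)
    {
    	list_add_tail(&page->lru, inactive);
    }

    /* The page just got mapped into a page table: rotate it to the head
     * so a window of unmapped pages stays at the tail. */
    static void pcflru_page_mapped(struct page *page,
    			           struct list_head *inactive)
    {
    	list_move(&page->lru, inactive);
    }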

-- 
Three Cheers,
Balbir


Re: [PATCH 3/3][RESEND] Provide control over unmapped pages (v4)

2011-02-09 Thread Balbir Singh
On 02/09/2011 05:27 AM, Andrew Morton wrote:
 On Tue, 01 Feb 2011 22:25:45 +0530
 Balbir Singh bal...@linux.vnet.ibm.com wrote:
 
 Changelog v4
 1. Add max_unmapped_ratio and use that as the upper limit
 to check when to shrink the unmapped page cache (Christoph
 Lameter)

 Changelog v2
 1. Use a config option to enable the code (Andrew Morton)
 2. Explain the magic tunables in the code or at-least attempt
to explain them (General comment)
 3. Hint uses of the boot parameter with unlikely (Andrew Morton)
 4. Use better names (balanced is not a good naming convention)

 Provide control using zone_reclaim() and a boot parameter. The
 code reuses functionality from zone_reclaim() to isolate unmapped
 pages and reclaim them as a priority, ahead of other mapped pages.
  A new sysctl for max_unmapped_ratio is provided and set to 16,
  indicating that when 16% of the total zone pages are unmapped, we start
  shrinking unmapped page cache.
 
 We'll need some documentation for sysctl_max_unmapped_ratio, please. 
 In Documentation/sysctl/vm.txt, I suppose.
 
  It will be interesting to find out what this ratio refers to.  It
  appears to be a percentage.  We've had problems in the past where 1% was
  way too much and we had to change the kernel to provide much
 finer-grained control.
 

Sure, I'll update the Documentation as a part of this patchset. Yes, the current
min_unmapped_ratio is a percentage and so is max_unmapped_ratio. 
min_unmapped_ratio
already exists, adding max_ should not affect granularity of control. It
will be worth relooking at the granularity based on user feedback and
experience. We won't break ABI if we add additional interfaces to help
granularity.



 ...

 --- a/include/linux/mmzone.h
 +++ b/include/linux/mmzone.h
 @@ -306,7 +306,10 @@ struct zone {
  /*
   * zone reclaim becomes active if more unmapped pages exist.
   */
 +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)
  unsigned long   min_unmapped_pages;
 +unsigned long   max_unmapped_pages;
 +#endif
 
 This change breaks the connection between min_unmapped_pages and its
 documentation, and fails to document max_unmapped_pages.
 

I'll fix that

 Also, afacit if CONFIG_NUMA=y and CONFIG_UNMAPPED_PAGE_CONTROL=n,
 max_unmapped_pages will be present in the kernel image and will appear
 in /proc but it won't actually do anything.  Seems screwed up and
 misleading.
 


Good catch! In one of the emails Christoph mentioned that max_unmapped_ratio
might be helpful even in the general case (but we need to work on that later).
For now, I'll fix this and repost.


 ...

 +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
 +/*
 + * Routine to reclaim unmapped pages, inspired from the code under
 + * CONFIG_NUMA that does unmapped page and slab page control by keeping
 + * min_unmapped_pages in the zone. We currently reclaim just unmapped
 + * pages, slab control will come in soon, at which point this routine
 + * should be called reclaim cached pages
 + */
  +unsigned long reclaim_unmapped_pages(int priority, struct zone *zone,
  +					struct scan_control *sc)
  +{
  +	if (unlikely(unmapped_page_control) &&
  +		(zone_unmapped_file_pages(zone) > zone->min_unmapped_pages)) {
  +		struct scan_control nsc;
  +		unsigned long nr_pages;
  +
  +		nsc = *sc;
  +
  +		nsc.swappiness = 0;
  +		nsc.may_writepage = 0;
  +		nsc.may_unmap = 0;
  +		nsc.nr_reclaimed = 0;
  +
  +		nr_pages = zone_unmapped_file_pages(zone) -
  +				zone->min_unmapped_pages;
  +		/*
  +		 * We don't want to be too aggressive with our
  +		 * reclaim, it is our best effort to control
  +		 * unmapped pages
  +		 */
  +		nr_pages >>= 3;
  +
  +		zone_reclaim_pages(zone, &nsc, nr_pages);
  +		return nsc.nr_reclaimed;
  +	}
  +	return 0;
  +}
 
 This returns an undocumented ulong which is never used by callers.
 

Good catch! I'll remove the return value; I don't expect it to be used
to check how much we could reclaim.

Thanks for the review!

-- 
Three Cheers,
Balbir


[PATCH 0/3][RESEND] Provide unmapped page cache control (v4)

2011-02-01 Thread Balbir Singh
NOTE: Resending the series with the Reviewed-by tags updated

The following series implements page cache control.
This is a split-out version of patch 1 of version 3 of the
page cache optimization patches posted earlier.
Previous posting: http://lwn.net/Articles/419564/

The previous few revisions received a lot of comments; I've tried to
address as many of those as possible in this revision.

Detailed Description

This patch implements unmapped page cache control via preferred
page cache reclaim. The current patch hooks into kswapd and reclaims
page cache if the user has requested unmapped page control.
This is useful in the following scenario:
- In a virtualized environment with cache=writethrough, we see
  double caching - (one in the host and one in the guest). As
  we try to scale guests, cache usage across the system grows.
  The goal of this patch is to reclaim page cache when Linux is running
  as a guest and get the host to hold the page cache and manage it.
  There might be temporary duplication, but in the long run, memory
  in the guests would be used for mapped pages.
- The option is controlled via a boot option and the administrator
  can selectively turn it on, on a need to use basis.

A lot of the code is borrowed from zone_reclaim_mode logic for
__zone_reclaim(). One might argue that with ballooning and
KSM this feature is not very useful, but even with ballooning,
we need extra logic to balloon multiple VMs and it is hard
to figure out the correct amount of memory to balloon. With these
patches applied, each guest has a sufficient amount of free memory
available, that can be easily seen and reclaimed by the balloon driver.
The additional memory in the guest can be reused for additional
applications or used to start additional guests/balance memory in
the host.

KSM currently does not de-duplicate host and guest page cache. The goal
of this patch is to help automatically balance unmapped page cache when
instructed to do so.

The sysctl for min_unmapped_ratio provides further control from
within the guest on the amount of unmapped pages to reclaim; a similar
max_unmapped_ratio sysctl is added and helps in the decision making
process of when reclaim should occur. This is tunable and set by
default to 16 (based on tradeoffs seen between aggressiveness in
balancing versus size of unmapped pages). Distros and administrators
can further tweak this for desired control.

Data from the previous patchsets can be found at
https://lkml.org/lkml/2010/11/30/79

---

Balbir Singh (3):
  Move zone_reclaim() outside of CONFIG_NUMA
  Refactor zone_reclaim code
  Provide control over unmapped pages


 Documentation/kernel-parameters.txt |8 ++
 include/linux/mmzone.h  |9 ++-
 include/linux/swap.h|   23 +--
 init/Kconfig|   12 +++
 kernel/sysctl.c |   29 ++--
 mm/page_alloc.c |   31 -
 mm/vmscan.c |  122 +++
 7 files changed, 202 insertions(+), 32 deletions(-)

-- 
Balbir Singh


[PATCH 1/3][RESEND] Move zone_reclaim() outside of CONFIG_NUMA (v4)

2011-02-01 Thread Balbir Singh
This patch moves zone_reclaim and associated helpers
outside CONFIG_NUMA. This infrastructure is reused
in the patches for page cache control that follow.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
Reviewed-by: Christoph Lameter c...@linux.com
---
 include/linux/mmzone.h |4 ++--
 include/linux/swap.h   |4 ++--
 kernel/sysctl.c|   18 +-
 mm/page_alloc.c|6 +++---
 mm/vmscan.c|2 --
 5 files changed, 16 insertions(+), 18 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 02ecb01..2485acc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -303,12 +303,12 @@ struct zone {
 */
unsigned long   lowmem_reserve[MAX_NR_ZONES];
 
-#ifdef CONFIG_NUMA
-   int node;
/*
 * zone reclaim becomes active if more unmapped pages exist.
 */
unsigned long   min_unmapped_pages;
+#ifdef CONFIG_NUMA
+   int node;
unsigned long   min_slab_pages;
 #endif
struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5e3355a..7b75626 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -255,11 +255,11 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
+extern int sysctl_min_unmapped_ratio;
+extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
 #define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index bc86bb3..12e8f26 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1224,15 +1224,6 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
},
 #endif
-#ifdef CONFIG_NUMA
-   {
-   .procname   = "zone_reclaim_mode",
-   .data   = &zone_reclaim_mode,
-   .maxlen = sizeof(zone_reclaim_mode),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec,
-   .extra1 = &zero,
-   },
{
.procname   = "min_unmapped_ratio",
.data   = &sysctl_min_unmapped_ratio,
@@ -1242,6 +1233,15 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &one_hundred,
},
+#ifdef CONFIG_NUMA
+   {
+   .procname   = "zone_reclaim_mode",
+   .data   = &zone_reclaim_mode,
+   .maxlen = sizeof(zone_reclaim_mode),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   .extra1 = &zero,
+   },
{
.procname   = "min_slab_ratio",
.data   = &sysctl_min_slab_ratio,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aede3a4..7b56473 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4167,10 +4167,10 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat,
 
zone->spanned_pages = size;
zone->present_pages = realsize;
-#ifdef CONFIG_NUMA
-   zone->node = nid;
zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
/ 100;
+#ifdef CONFIG_NUMA
+   zone->node = nid;
zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
 #endif
zone->name = zone_names[j];
@@ -5084,7 +5084,6 @@ int min_free_kbytes_sysctl_handler(ctl_table *table, int 
write,
return 0;
 }
 
-#ifdef CONFIG_NUMA
 int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
 {
@@ -5101,6 +5100,7 @@ int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table 
*table, int write,
return 0;
 }
 
+#ifdef CONFIG_NUMA
 int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 47a5096..5899f2f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2868,7 +2868,6 @@ static int __init kswapd_init(void)
 
 module_init(kswapd_init)
 
-#ifdef CONFIG_NUMA
 /*
  * Zone reclaim mode
  *
@@ -3078,7 +3077,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, 
unsigned int order)
 
return ret;
 }
-#endif
 
 /*
  * page_evictable - test whether a page is evictable



[PATCH 2/3][RESEND] Refactor zone_reclaim code (v4)

2011-02-01 Thread Balbir Singh
Changelog v3
1. Renamed zone_reclaim_unmapped_pages to zone_reclaim_pages

Refactor zone_reclaim, move reusable functionality outside
of zone_reclaim. Make zone_reclaim_unmapped_pages modular

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
Reviewed-by: Christoph Lameter c...@linux.com
---
 mm/vmscan.c |   35 +++
 1 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5899f2f..02cc82e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2943,6 +2943,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
 }
 
 /*
+ * Helper function to reclaim unmapped pages, we might add something
+ * similar to this for slab cache as well. Currently this function
+ * is shared with __zone_reclaim()
+ */
+static inline void
+zone_reclaim_pages(struct zone *zone, struct scan_control *sc,
+   unsigned long nr_pages)
+{
+   int priority;
+   /*
+* Free memory by calling shrink zone with increasing
+* priorities until we have enough memory freed.
+*/
+   priority = ZONE_RECLAIM_PRIORITY;
+   do {
+   shrink_zone(priority, zone, sc);
+   priority--;
+   } while (priority >= 0 && sc->nr_reclaimed < nr_pages);
+}
+
+/*
  * Try to free up some pages from this zone through reclaim.
  */
 static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int 
order)
@@ -2951,7 +2972,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t 
gfp_mask, unsigned int order)
const unsigned long nr_pages = 1 << order;
struct task_struct *p = current;
struct reclaim_state reclaim_state;
-   int priority;
struct scan_control sc = {
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2975,17 +2995,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t 
gfp_mask, unsigned int order)
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
 
-   if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
-   /*
-* Free memory by calling shrink zone with increasing
-* priorities until we have enough memory freed.
-*/
-   priority = ZONE_RECLAIM_PRIORITY;
-   do {
-   shrink_zone(priority, zone, sc);
-   priority--;
-   } while (priority >= 0 && sc.nr_reclaimed < nr_pages);
-   }
+   if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages)
+   zone_reclaim_pages(zone, &sc, nr_pages);
 
nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
if (nr_slab_pages0 > zone->min_slab_pages) {



[PATCH 3/3][RESEND] Provide control over unmapped pages (v4)

2011-02-01 Thread Balbir Singh
Changelog v4
1. Add max_unmapped_ratio and use that as the upper limit
to check when to shrink the unmapped page cache (Christoph
Lameter)

Changelog v2
1. Use a config option to enable the code (Andrew Morton)
2. Explain the magic tunables in the code or at-least attempt
   to explain them (General comment)
3. Hint uses of the boot parameter with unlikely (Andrew Morton)
4. Use better names (balanced is not a good naming convention)

Provide control using zone_reclaim() and a boot parameter. The
code reuses functionality from zone_reclaim() to isolate unmapped
pages and reclaim them as a priority, ahead of other mapped pages.
A new sysctl for max_unmapped_ratio is provided and set to 16,
indicating that when 16% of the total zone pages are unmapped, we start
shrinking unmapped page cache.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
Reviewed-by: Christoph Lameter c...@linux.com
---
 Documentation/kernel-parameters.txt |8 +++
 include/linux/mmzone.h  |5 ++
 include/linux/swap.h|   23 -
 init/Kconfig|   12 +
 kernel/sysctl.c |   11 
 mm/page_alloc.c |   25 ++
 mm/vmscan.c |   87 +++
 7 files changed, 166 insertions(+), 5 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index fee5f57..65a4ee6 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2500,6 +2500,14 @@ and is between 256 and 4096 characters. It is defined in 
the file
[X86]
Set unknown_nmi_panic=1 early on boot.
 
+   unmapped_page_control
+   [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
+   is enabled. It controls the amount of unmapped memory
+   that is present in the system. This boot option plus
+   vm.min_unmapped_ratio (sysctl) provide granular control
+   over how much unmapped page cache can exist in the 
system
+   before kswapd starts reclaiming unmapped page cache 
pages.
+
usbcore.autosuspend=
[USB] The autosuspend time delay (in seconds) used
for newly-detected USB devices (default 2).  This
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2485acc..18f0f09 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -306,7 +306,10 @@ struct zone {
/*
 * zone reclaim becomes active if more unmapped pages exist.
 */
+#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)
unsigned long   min_unmapped_pages;
+   unsigned long   max_unmapped_pages;
+#endif
 #ifdef CONFIG_NUMA
int node;
unsigned long   min_slab_pages;
@@ -773,6 +776,8 @@ int percpu_pagelist_fraction_sysctl_handler(struct 
ctl_table *, int,
void __user *, size_t *, loff_t *);
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
+int sysctl_max_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
+   void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7b75626..ae62a03 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -255,19 +255,34 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
 extern int sysctl_min_unmapped_ratio;
+extern int sysctl_max_unmapped_ratio;
+
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
-#ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
-extern int sysctl_min_slab_ratio;
 #else
-#define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
 {
return 0;
 }
 #endif
 
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+extern bool should_reclaim_unmapped_pages(struct zone *zone);
+#else
+static inline bool should_reclaim_unmapped_pages(struct zone *zone)
+{
+   return false;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+extern int zone_reclaim_mode;
+extern int sysctl_min_slab_ratio;
+#else
+#define zone_reclaim_mode 0
+#endif
+
 extern int page_evictable(struct page *page, struct vm_area_struct *vma);
 extern void scan_mapping_unevictable_pages(struct address_space *);
 
diff --git a/init/Kconfig b/init/Kconfig
index 4f6cdbf..2dfbc09 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -828,6 +828,18 @@ config SCHED_AUTOGROUP
 config MM_OWNER
bool
 
+config UNMAPPED_PAGECACHE_CONTROL
+   bool Provide

Re: [PATCH 3/3] Provide control over unmapped pages (v4)

2011-01-31 Thread Balbir Singh
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2011-01-31 08:58:53]:

 On Fri, 28 Jan 2011 09:20:02 -0600 (CST)
 Christoph Lameter c...@linux.com wrote:
 
  On Fri, 28 Jan 2011, KAMEZAWA Hiroyuki wrote:
  
  I see it as a tradeoff of when to check: add_to_page_cache or when we
  want more free memory (due to allocation). It is OK to wake up
  kswapd while allocating memory; somehow, for this purpose (global page
  cache), add_to_page_cache or add_to_page_cache_locked does not seem
  the right place to hook into. I'd be open to comments/suggestions
  from others as well, though.
  
   I don't like adding a hook here.
   AND I don't want to run kswapd, because 'kswapd' has been a sign that
   there is memory shortage. (Reusing code is ok.)
  
   How about adding a new daemon? Recently, khugepaged and ksmd work on
   managing memory. Adding one more daemon for a special purpose is not
   very bad, I think. Then, you can
- wake up without a hook
- throttle its work
- balance the whole system rather than a zone.
   I think per-node balance is enough...
  
  
  I think we already have enough kernel daemons floating around. They are
  multiplying in an amazing way. What would be useful is to map all
   the memory management background stuff into a process. Maybe call this memd
   instead? Perhaps we can fold khugepaged into kswapd as well, etc.
  
 
  Is making kswapd slower for this additional work, requested by the user
  rather than by the system, a good thing? I think workqueues work well
  enough; they scale based on workload, if using a thread is bad.


"Making it slow" is a generic statement; kswapd
is supposed to do background reclaim, and in this case it is a special
request for unmapped pages, specifically and deliberately requested by the
admin via a boot option.
 
-- 
Three Cheers,
Balbir


Re: [PATCH 3/3] Provide control over unmapped pages (v4)

2011-01-28 Thread Balbir Singh
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2011-01-28 16:56:05]:

 On Fri, 28 Jan 2011 16:24:19 +0900
 Minchan Kim minchan@gmail.com wrote:
 
  On Fri, Jan 28, 2011 at 3:48 PM, Balbir Singh bal...@linux.vnet.ibm.com 
  wrote:
   * MinChan Kim minchan@gmail.com [2011-01-28 14:44:50]:
  
   On Fri, Jan 28, 2011 at 11:56 AM, Balbir Singh
   bal...@linux.vnet.ibm.com wrote:
On Thu, Jan 27, 2011 at 4:42 AM, Minchan Kim minchan@gmail.com 
wrote:
[snip]
   
index 7b56473..2ac8549 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1660,6 +1660,9 @@ zonelist_scan:
                       unsigned long mark;
                       int ret;
   
+                       if (should_reclaim_unmapped_pages(zone))
+                               wakeup_kswapd(zone, order, 
classzone_idx);
+
   
Do we really need the check in fastpath?
There are lost of caller of alloc_pages.
Many of them are not related to mapped pages.
Could we move the check into add_to_page_cache_locked?
   
The check is a simple check to see if the unmapped pages need
balancing, the reason I placed this check here is to allow other
allocations to benefit as well, if there are some unmapped pages to be
freed. add_to_page_cache_locked (check under a critical section) is
even worse, IMHO.
  
   It just moves the overhead from general into specific case(ie,
   allocates page for just page cache).
   Another cases(ie, allocates pages for other purpose except page cache,
   ex device drivers or fs allocation for internal using) aren't
   affected.
   So, It would be better.
  
   The goal in this patch is to remove only page cache page, isn't it?
   So I think we could the balance check in add_to_page_cache and trigger 
   reclaim.
   If we do so, what's the problem?
  
  
   I see it as a tradeoff of when to check? add_to_page_cache or when we
   are want more free memory (due to allocation). It is OK to wakeup
   kswapd while allocating memory, somehow for this purpose (global page
   cache), add_to_page_cache or add_to_page_cache_locked does not seem
   the right place to hook into. I'd be open to comments/suggestions
   though from others as well.
 
 I don't like add hook here.
 AND I don't want to run kswapd because 'kswapd' has been a sign as
 there are memory shortage. (reusing code is ok.)
 
 How about adding new daemon ? Recently, khugepaged, ksmd works for
 managing memory. Adding one more daemon for special purpose is not
 very bad, I think. Then, you can do
  - wake up without hook
  - throttle its work.
  - balance the whole system rather than zone.
I think per-node balance is enough...
 
 
 
 


Honestly, I did look at that option, but balancing via kswapd seemed
like the best option. Creating a new thread/daemon did not make sense
because

1. The control is very loose
2. kswapd can deal with it while balancing other things; in fact,
imagine kswapd waking up to free memory while other free memory is
easily available. Adding parallel reclaim and zone lock contention
does not help, IMHO.
3. kswapd does not indicate memory shortage per se; please see
min_free_kbytes_sysctl_handler, kswapd's job is to balance the nodes/zones.
If you tune min_free_kbytes and kswapd runs, it does not mean there is
a memory shortage on the system

 
 
 
  
   
   
   
                       mark = zone-watermark[alloc_flags  
ALLOC_WMARK_MASK];
                       if (zone_watermark_ok(zone, order, mark,
                                   classzone_idx, alloc_flags))
@@ -4167,8 +4170,12 @@ static void __paginginit 
free_area_init_core(struct pglist_data *pgdat,
   
               zone-spanned_pages = size;
               zone-present_pages = realsize;
+#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)
               zone-min_unmapped_pages = 
(realsize*sysctl_min_unmapped_ratio)
                                               / 100;
+               zone-max_unmapped_pages = 
(realsize*sysctl_max_unmapped_ratio)
+                                               / 100;
+#endif
 #ifdef CONFIG_NUMA
               zone-node = nid;
               zone-min_slab_pages = (realsize * 
sysctl_min_slab_ratio) / 100;
@@ -5084,6 +5091,7 @@ int min_free_kbytes_sysctl_handler(ctl_table 
*table, int write,
       return 0;
 }
   
+#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)
 int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int 
write,
       void __user *buffer, size_t *length, loff_t *ppos)
 {
@@ -5100,6 +5108,23 @@ int 
sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
       return 0;
 }
   
+int sysctl_max_unmapped_ratio_sysctl_handler(ctl_table *table, int 
write,
+       void __user *buffer, size_t *length, loff_t *ppos)
+{
+       struct zone *zone

Re: [PATCH 3/3] Provide control over unmapped pages (v4)

2011-01-28 Thread Balbir Singh
* MinChan Kim minchan@gmail.com [2011-01-28 16:24:19]:

 
  But the assumption for LRU order to change happens only if the page
  cannot be successfully freed, which means it is in some way active..
  and needs to be moved no?
 
 1. holded page by someone
 2. mapped pages
 3. active pages
 
 1 is rare so it isn't the problem.
 Of course, in case of 3, we have to activate it so no problem.
 The problem is 2.


2 is a problem, but due to the size aspects not a big one. Like you
said, even lumpy reclaim affects it. Maybe the reclaim code could
honour may_unmap much earlier.

-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] Provide control over unmapped pages (v4)

2011-01-28 Thread Balbir Singh
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2011-01-28 17:17:44]:

 On Fri, 28 Jan 2011 13:49:28 +0530
 Balbir Singh bal...@linux.vnet.ibm.com wrote:
 
  * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2011-01-28 16:56:05]:
 
   BTW, it seems this doesn't work when some apps use huge shmem.
   How to handle the issue ?
  
  
  Could you elaborate further? 
  
 ==
 static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
 {
 unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
 unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
 zone_page_state(zone, NR_ACTIVE_FILE);
 
 /*
  * It's possible for there to be more file mapped pages than
  * accounted for by the pages on the file LRU lists because
  * tmpfs pages accounted for as ANON can also be FILE_MAPPED
  */
 return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0;
 }

Yes, I did :) The word huge confused me. I am not sure if there is an
easy accounting fix for this one, though given the approximate nature
of the control, I am not sure it would matter very much. But you do
have a very good point.
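
For reference, the trigger check in the series is roughly the
following -- a sketch only, assuming zone_unmapped_file_pages() above
and the max_unmapped_pages threshold from patch 3/3; the shmem pages
accounted as ANON simply never show up in it, which is why the control
stays approximate:

bool should_reclaim_unmapped_pages(struct zone *zone)
{
	/* nothing to do unless the admin asked for it at boot */
	if (likely(!unmapped_page_control))
		return false;

	/* trigger once unmapped page cache grows past the upper limit */
	return zone_unmapped_file_pages(zone) > zone->max_unmapped_pages;
}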

-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] Provide control over unmapped pages (v4)

2011-01-27 Thread Balbir Singh
On Thu, Jan 27, 2011 at 4:42 AM, Minchan Kim minchan@gmail.com wrote:
[snip]

 index 7b56473..2ac8549 100644
 --- a/mm/page_alloc.c
 +++ b/mm/page_alloc.c
 @@ -1660,6 +1660,9 @@ zonelist_scan:
                        unsigned long mark;
                        int ret;

 +                       if (should_reclaim_unmapped_pages(zone))
 +                               wakeup_kswapd(zone, order, classzone_idx);
 +

 Do we really need the check in fastpath?
 There are lost of caller of alloc_pages.
 Many of them are not related to mapped pages.
 Could we move the check into add_to_page_cache_locked?

The check is a simple check to see if the unmapped pages need
balancing; the reason I placed it here is to allow other
allocations to benefit as well, if there are some unmapped pages to be
freed. add_to_page_cache_locked (a check under a critical section) is
even worse, IMHO.



                        mark = zone-watermark[alloc_flags  
 ALLOC_WMARK_MASK];
                        if (zone_watermark_ok(zone, order, mark,
                                    classzone_idx, alloc_flags))
 @@ -4167,8 +4170,12 @@ static void __paginginit free_area_init_core(struct 
 pglist_data *pgdat,

                zone-spanned_pages = size;
                zone-present_pages = realsize;
 +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)
                zone-min_unmapped_pages = 
 (realsize*sysctl_min_unmapped_ratio)
                                                / 100;
 +               zone-max_unmapped_pages = 
 (realsize*sysctl_max_unmapped_ratio)
 +                                               / 100;
 +#endif
  #ifdef CONFIG_NUMA
                zone-node = nid;
                zone-min_slab_pages = (realsize * sysctl_min_slab_ratio) / 
 100;
 @@ -5084,6 +5091,7 @@ int min_free_kbytes_sysctl_handler(ctl_table *table, 
 int write,
        return 0;
  }

 +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)
  int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
        void __user *buffer, size_t *length, loff_t *ppos)
  {
 @@ -5100,6 +5108,23 @@ int 
 sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
        return 0;
  }

 +int sysctl_max_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
 +       void __user *buffer, size_t *length, loff_t *ppos)
 +{
 +       struct zone *zone;
 +       int rc;
 +
 +       rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
 +       if (rc)
 +               return rc;
 +
 +       for_each_zone(zone)
 +               zone-max_unmapped_pages = (zone-present_pages *
 +                               sysctl_max_unmapped_ratio) / 100;
 +       return 0;
 +}
 +#endif
 +
  #ifdef CONFIG_NUMA
  int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,
        void __user *buffer, size_t *length, loff_t *ppos)
 diff --git a/mm/vmscan.c b/mm/vmscan.c
 index 02cc82e..6377411 100644
 --- a/mm/vmscan.c
 +++ b/mm/vmscan.c
 @@ -159,6 +159,29 @@ static DECLARE_RWSEM(shrinker_rwsem);
  #define scanning_global_lru(sc)        (1)
  #endif

 +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
 +static unsigned long reclaim_unmapped_pages(int priority, struct zone *zone,
 +                                               struct scan_control *sc);
 +static int unmapped_page_control __read_mostly;
 +
 +static int __init unmapped_page_control_parm(char *str)
 +{
 +       unmapped_page_control = 1;
 +       /*
 +        * XXX: Should we tweak swappiness here?
 +        */
 +       return 1;
 +}
  +__setup("unmapped_page_control", unmapped_page_control_parm);
 +
 +#else /* !CONFIG_UNMAPPED_PAGECACHE_CONTROL */
 +static inline unsigned long reclaim_unmapped_pages(int priority,
 +                               struct zone *zone, struct scan_control *sc)
 +{
 +       return 0;
 +}
 +#endif
 +
  static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
                                                  struct scan_control *sc)
  {
 @@ -2359,6 +2382,12 @@ loop_again:
                                shrink_active_list(SWAP_CLUSTER_MAX, zone,
                                                        sc, priority, 0);

 +                       /*
 +                        * We do unmapped page reclaim once here and once
 +                        * below, so that we don't lose out
 +                        */
 +                       reclaim_unmapped_pages(priority, zone, sc);
 +
                        if (!zone_watermark_ok_safe(zone, order,
                                        high_wmark_pages(zone), 0, 0)) {
                                end_zone = i;
 @@ -2396,6 +2425,11 @@ loop_again:
                                continue;

                        sc.nr_scanned = 0;
 +                       /*
 +                        * Reclaim unmapped pages upfront, this should be
 +                        * really cheap
 +                        */
 +                       reclaim_unmapped_pages(priority, zone, 

Re: [PATCH 3/3] Provide control over unmapped pages (v4)

2011-01-27 Thread Balbir Singh
* Christoph Lameter c...@linux.com [2011-01-26 10:57:37]:

 
 Reviewed-by: Christoph Lameter c...@linux.com


Thanks for the review! 

-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v4)

2011-01-27 Thread Balbir Singh
* Christoph Lameter c...@linux.com [2011-01-26 10:56:56]:

 
 Reviewed-by: Christoph Lameter c...@linux.com


Thanks for the review! 

-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] Provide control over unmapped pages (v4)

2011-01-27 Thread Balbir Singh
* MinChan Kim minchan@gmail.com [2011-01-28 14:44:50]:

 On Fri, Jan 28, 2011 at 11:56 AM, Balbir Singh
 bal...@linux.vnet.ibm.com wrote:
  On Thu, Jan 27, 2011 at 4:42 AM, Minchan Kim minchan@gmail.com wrote:
  [snip]
 
  index 7b56473..2ac8549 100644
  --- a/mm/page_alloc.c
  +++ b/mm/page_alloc.c
  @@ -1660,6 +1660,9 @@ zonelist_scan:
                         unsigned long mark;
                         int ret;
 
  +                       if (should_reclaim_unmapped_pages(zone))
  +                               wakeup_kswapd(zone, order, classzone_idx);
  +
 
  Do we really need the check in fastpath?
  There are lost of caller of alloc_pages.
  Many of them are not related to mapped pages.
  Could we move the check into add_to_page_cache_locked?
 
  The check is a simple check to see if the unmapped pages need
  balancing, the reason I placed this check here is to allow other
  allocations to benefit as well, if there are some unmapped pages to be
  freed. add_to_page_cache_locked (check under a critical section) is
  even worse, IMHO.
 
 It just moves the overhead from general into specific case(ie,
 allocates page for just page cache).
 Another cases(ie, allocates pages for other purpose except page cache,
 ex device drivers or fs allocation for internal using) aren't
 affected.
 So, It would be better.
 
 The goal in this patch is to remove only page cache page, isn't it?
 So I think we could the balance check in add_to_page_cache and trigger 
 reclaim.
 If we do so, what's the problem?


I see it as a tradeoff of when to check: add_to_page_cache or when we
want more free memory (due to allocation). It is OK to wake up
kswapd while allocating memory; somehow, for this purpose (global page
cache), add_to_page_cache or add_to_page_cache_locked does not seem
the right place to hook into. I'd be open to comments/suggestions
from others as well, though.
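
For the record, the alternative being suggested would look roughly
like the sketch below. This is NOT part of the series and not
something I'm proposing -- the helper name is made up and it assumes
the v4 three-argument wakeup_kswapd() -- it is only meant to make the
tradeoff concrete:

/*
 * Hypothetical hook at page cache insertion time: the trigger check
 * would run for every page added to the page cache instead of in the
 * allocator fast path.
 */
static inline void unmapped_pagecache_balance(struct page *page)
{
	struct zone *zone = page_zone(page);

	if (should_reclaim_unmapped_pages(zone))
		wakeup_kswapd(zone, 0, zone_idx(zone));
}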
 
 
 
 
                         mark = zone-watermark[alloc_flags  
  ALLOC_WMARK_MASK];
                         if (zone_watermark_ok(zone, order, mark,
                                     classzone_idx, alloc_flags))
  @@ -4167,8 +4170,12 @@ static void __paginginit 
  free_area_init_core(struct pglist_data *pgdat,
 
                 zone-spanned_pages = size;
                 zone-present_pages = realsize;
  +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)
                 zone-min_unmapped_pages = 
  (realsize*sysctl_min_unmapped_ratio)
                                                 / 100;
  +               zone-max_unmapped_pages = 
  (realsize*sysctl_max_unmapped_ratio)
  +                                               / 100;
  +#endif
   #ifdef CONFIG_NUMA
                 zone-node = nid;
                 zone-min_slab_pages = (realsize * sysctl_min_slab_ratio) 
  / 100;
  @@ -5084,6 +5091,7 @@ int min_free_kbytes_sysctl_handler(ctl_table 
  *table, int write,
         return 0;
   }
 
  +#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)
   int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
         void __user *buffer, size_t *length, loff_t *ppos)
   {
  @@ -5100,6 +5108,23 @@ int 
  sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
         return 0;
   }
 
  +int sysctl_max_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
  +       void __user *buffer, size_t *length, loff_t *ppos)
  +{
  +       struct zone *zone;
  +       int rc;
  +
  +       rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
  +       if (rc)
  +               return rc;
  +
  +       for_each_zone(zone)
  +               zone-max_unmapped_pages = (zone-present_pages *
  +                               sysctl_max_unmapped_ratio) / 100;
  +       return 0;
  +}
  +#endif
  +
   #ifdef CONFIG_NUMA
   int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,
         void __user *buffer, size_t *length, loff_t *ppos)
  diff --git a/mm/vmscan.c b/mm/vmscan.c
  index 02cc82e..6377411 100644
  --- a/mm/vmscan.c
  +++ b/mm/vmscan.c
  @@ -159,6 +159,29 @@ static DECLARE_RWSEM(shrinker_rwsem);
   #define scanning_global_lru(sc)        (1)
   #endif
 
  +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
  +static unsigned long reclaim_unmapped_pages(int priority, struct zone 
  *zone,
  +                                               struct scan_control *sc);
  +static int unmapped_page_control __read_mostly;
  +
  +static int __init unmapped_page_control_parm(char *str)
  +{
  +       unmapped_page_control = 1;
  +       /*
  +        * XXX: Should we tweak swappiness here?
  +        */
  +       return 1;
  +}
  +__setup(unmapped_page_control, unmapped_page_control_parm);
  +
  +#else /* !CONFIG_UNMAPPED_PAGECACHE_CONTROL */
  +static inline unsigned long reclaim_unmapped_pages(int priority,
  +                               struct zone *zone, struct scan_control 
  *sc)
  +{
  +       return 0;
  +}
  +#endif
  +
   static struct zone_reclaim_stat

[PATCH 0/3] Unmapped Page Cache Control (v4)

2011-01-24 Thread Balbir Singh
The following series implements page cache control,
this is a split out version of patch 1 of version 3 of the
page cache optimization patches posted earlier at
Previous posting http://lwn.net/Articles/419564/

The previous few revisions received a lot of comments; I've tried to
address as many of those as possible in this revision.

Detailed Description

This patch implements unmapped page cache control via preferred
page cache reclaim. The current patch hooks into kswapd and reclaims
page cache if the user has requested for unmapped page control.
This is useful in the following scenario
- In a virtualized environment with cache=writethrough, we see
  double caching - (one in the host and one in the guest). As
  we try to scale guests, cache usage across the system grows.
  The goal of this patch is to reclaim page cache when Linux is running
  as a guest and get the host to hold the page cache and manage it.
  There might be temporary duplication, but in the long run, memory
  in the guests would be used for mapped pages.
- The option is controlled via a boot option and the administrator
  can selectively turn it on, on a need to use basis.

A lot of the code is borrowed from zone_reclaim_mode logic for
__zone_reclaim(). One might argue that the with ballooning and
KSM this feature is not very useful, but even with ballooning,
we need extra logic to balloon multiple VM machines and it is hard
to figure out the correct amount of memory to balloon. With these
patches applied, each guest has a sufficient amount of free memory
available, that can be easily seen and reclaimed by the balloon driver.
The additional memory in the guest can be reused for additional
applications or used to start additional guests/balance memory in
the host.

KSM currently does not de-duplicate host and guest page cache. The goal
of this patch is to help automatically balance unmapped page cache when
instructed to do so.

The sysctl for min_unmapped_ratio provides further control from
within the guest on the amount of unmapped pages to reclaim. A similar
max_unmapped_ratio sysctl is added and helps in the decision making
process of when reclaim should occur. This is tunable and set by
default to 16 (based on tradeoffs seen between aggressiveness in
balancing versus size of unmapped pages). Distros and administrators
can further tweak this for desired control.
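
As a rough illustration of what those defaults mean in pages (my
numbers, not from the patch, using the same ratio-to-pages conversion
done in free_area_init_core() and assuming the kernel's existing
default of 1 for vm.min_unmapped_ratio):

/*
 * Illustration only: per-zone thresholds for a 4GiB zone with 4KiB
 * pages, computed as pages = present_pages * ratio / 100.
 */
void example_unmapped_thresholds(void)
{
	unsigned long present_pages = 1 << 20;			/* ~4GiB of 4KiB pages   */
	unsigned long min_unmapped = present_pages * 1 / 100;	/* 10485 pages, ~41MiB   */
	unsigned long max_unmapped = present_pages * 16 / 100;	/* 167772 pages, ~655MiB */
}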

Data from the previous patchsets can be found at
https://lkml.org/lkml/2010/11/30/79


---

Balbir Singh (3):
  Move zone_reclaim() outside of CONFIG_NUMA
  Refactor zone_reclaim code
  Provide control over unmapped pages


 Documentation/kernel-parameters.txt |8 ++
 include/linux/mmzone.h  |9 ++-
 include/linux/swap.h|   23 +--
 init/Kconfig|   12 +++
 kernel/sysctl.c |   29 ++--
 mm/page_alloc.c |   31 -
 mm/vmscan.c |  122 +++
 7 files changed, 202 insertions(+), 32 deletions(-)

-- 
Balbir Singh
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v4)

2011-01-24 Thread Balbir Singh
This patch moves zone_reclaim and associated helpers
outside CONFIG_NUMA. This infrastructure is reused
in the patches for page cache control that follow.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 include/linux/mmzone.h |4 ++--
 include/linux/swap.h   |4 ++--
 kernel/sysctl.c|   18 +-
 mm/page_alloc.c|6 +++---
 mm/vmscan.c|2 --
 5 files changed, 16 insertions(+), 18 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 02ecb01..2485acc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -303,12 +303,12 @@ struct zone {
 */
unsigned long   lowmem_reserve[MAX_NR_ZONES];
 
-#ifdef CONFIG_NUMA
-   int node;
/*
 * zone reclaim becomes active if more unmapped pages exist.
 */
unsigned long   min_unmapped_pages;
+#ifdef CONFIG_NUMA
+   int node;
unsigned long   min_slab_pages;
 #endif
struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5e3355a..7b75626 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -255,11 +255,11 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
+extern int sysctl_min_unmapped_ratio;
+extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
 #define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index bc86bb3..12e8f26 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1224,15 +1224,6 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
},
 #endif
-#ifdef CONFIG_NUMA
-   {
-   .procname   = "zone_reclaim_mode",
-   .data   = &zone_reclaim_mode,
-   .maxlen = sizeof(zone_reclaim_mode),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec,
-   .extra1 = &zero,
-   },
{
.procname   = "min_unmapped_ratio",
.data   = &sysctl_min_unmapped_ratio,
@@ -1242,6 +1233,15 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &one_hundred,
},
+#ifdef CONFIG_NUMA
+   {
+   .procname   = "zone_reclaim_mode",
+   .data   = &zone_reclaim_mode,
+   .maxlen = sizeof(zone_reclaim_mode),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   .extra1 = &zero,
+   },
{
.procname   = "min_slab_ratio",
.data   = &sysctl_min_slab_ratio,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aede3a4..7b56473 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4167,10 +4167,10 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat,
 
zone->spanned_pages = size;
zone->present_pages = realsize;
-#ifdef CONFIG_NUMA
-   zone->node = nid;
zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
/ 100;
+#ifdef CONFIG_NUMA
+   zone->node = nid;
zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
 #endif
zone-name = zone_names[j];
@@ -5084,7 +5084,6 @@ int min_free_kbytes_sysctl_handler(ctl_table *table, int 
write,
return 0;
 }
 
-#ifdef CONFIG_NUMA
 int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
 {
@@ -5101,6 +5100,7 @@ int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table 
*table, int write,
return 0;
 }
 
+#ifdef CONFIG_NUMA
 int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 47a5096..5899f2f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2868,7 +2868,6 @@ static int __init kswapd_init(void)
 
 module_init(kswapd_init)
 
-#ifdef CONFIG_NUMA
 /*
  * Zone reclaim mode
  *
@@ -3078,7 +3077,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, 
unsigned int order)
 
return ret;
 }
-#endif
 
 /*
  * page_evictable - test whether a page is evictable

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] Refactor zone_reclaim code (v4)

2011-01-24 Thread Balbir Singh
Changelog v3
1. Renamed zone_reclaim_unmapped_pages to zone_reclaim_pages

Refactor zone_reclaim, move reusable functionality outside
of zone_reclaim. Make zone_reclaim_unmapped_pages modular

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
Reviewed-by: Christoph Lameter c...@linux.com
---
 mm/vmscan.c |   35 +++
 1 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5899f2f..02cc82e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2943,6 +2943,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
 }
 
 /*
+ * Helper function to reclaim unmapped pages, we might add something
+ * similar to this for slab cache as well. Currently this function
+ * is shared with __zone_reclaim()
+ */
+static inline void
+zone_reclaim_pages(struct zone *zone, struct scan_control *sc,
+   unsigned long nr_pages)
+{
+   int priority;
+   /*
+* Free memory by calling shrink zone with increasing
+* priorities until we have enough memory freed.
+*/
+   priority = ZONE_RECLAIM_PRIORITY;
+   do {
+   shrink_zone(priority, zone, sc);
+   priority--;
+   } while (priority >= 0 && sc->nr_reclaimed < nr_pages);
+}
+
+/*
  * Try to free up some pages from this zone through reclaim.
  */
 static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int 
order)
@@ -2951,7 +2972,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t 
gfp_mask, unsigned int order)
const unsigned long nr_pages = 1  order;
struct task_struct *p = current;
struct reclaim_state reclaim_state;
-   int priority;
struct scan_control sc = {
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2975,17 +2995,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t 
gfp_mask, unsigned int order)
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
 
-   if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
-   /*
-* Free memory by calling shrink zone with increasing
-* priorities until we have enough memory freed.
-*/
-   priority = ZONE_RECLAIM_PRIORITY;
-   do {
-   shrink_zone(priority, zone, &sc);
-   priority--;
-   } while (priority >= 0 && sc.nr_reclaimed < nr_pages);
-   }
+   if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages)
+   zone_reclaim_pages(zone, &sc, nr_pages);
 
nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
if (nr_slab_pages0 > zone->min_slab_pages) {

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] Provide control over unmapped pages (v4)

2011-01-24 Thread Balbir Singh
Changelog v4
1. Add max_unmapped_ratio and use that as the upper limit
to check when to shrink the unmapped page cache (Christoph
Lameter)

Changelog v2
1. Use a config option to enable the code (Andrew Morton)
2. Explain the magic tunables in the code or at-least attempt
   to explain them (General comment)
3. Hint uses of the boot parameter with unlikely (Andrew Morton)
4. Use better names (balanced is not a good naming convention)

Provide control using zone_reclaim() and a boot parameter. The
code reuses functionality from zone_reclaim() to isolate unmapped
pages and reclaim them as a priority, ahead of other mapped pages.
A new sysctl for max_unmapped_ratio is provided and set to 16,
indicating that once unmapped page cache exceeds 16% of the total zone
pages, we start shrinking it.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 Documentation/kernel-parameters.txt |8 +++
 include/linux/mmzone.h  |5 ++
 include/linux/swap.h|   23 -
 init/Kconfig|   12 +
 kernel/sysctl.c |   11 
 mm/page_alloc.c |   25 ++
 mm/vmscan.c |   87 +++
 7 files changed, 166 insertions(+), 5 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index fee5f57..65a4ee6 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2500,6 +2500,14 @@ and is between 256 and 4096 characters. It is defined in 
the file
[X86]
Set unknown_nmi_panic=1 early on boot.
 
+   unmapped_page_control
+   [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
+   is enabled. It controls the amount of unmapped memory
+   that is present in the system. This boot option plus
+   vm.min_unmapped_ratio (sysctl) provide granular control
+   over how much unmapped page cache can exist in the system
+   before kswapd starts reclaiming unmapped page cache pages.
+
usbcore.autosuspend=
[USB] The autosuspend time delay (in seconds) used
for newly-detected USB devices (default 2).  This
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2485acc..18f0f09 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -306,7 +306,10 @@ struct zone {
/*
 * zone reclaim becomes active if more unmapped pages exist.
 */
+#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)
unsigned long   min_unmapped_pages;
+   unsigned long   max_unmapped_pages;
+#endif
 #ifdef CONFIG_NUMA
int node;
unsigned long   min_slab_pages;
@@ -773,6 +776,8 @@ int percpu_pagelist_fraction_sysctl_handler(struct 
ctl_table *, int,
void __user *, size_t *, loff_t *);
 int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
+int sysctl_max_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
+   void __user *, size_t *, loff_t *);
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7b75626..ae62a03 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -255,19 +255,34 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
 extern int sysctl_min_unmapped_ratio;
+extern int sysctl_max_unmapped_ratio;
+
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
-#ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
-extern int sysctl_min_slab_ratio;
 #else
-#define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
 {
return 0;
 }
 #endif
 
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+extern bool should_reclaim_unmapped_pages(struct zone *zone);
+#else
+static inline bool should_reclaim_unmapped_pages(struct zone *zone)
+{
+   return false;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+extern int zone_reclaim_mode;
+extern int sysctl_min_slab_ratio;
+#else
+#define zone_reclaim_mode 0
+#endif
+
 extern int page_evictable(struct page *page, struct vm_area_struct *vma);
 extern void scan_mapping_unevictable_pages(struct address_space *);
 
diff --git a/init/Kconfig b/init/Kconfig
index 4f6cdbf..2dfbc09 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -828,6 +828,18 @@ config SCHED_AUTOGROUP
 config MM_OWNER
bool
 
+config UNMAPPED_PAGECACHE_CONTROL
+   bool "Provide control over unmapped page cache"

Re: [PATCH 1/2] Refactor zone_reclaim code (v4)

2011-01-24 Thread Balbir Singh
* Balbir Singh bal...@linux.vnet.ibm.com [2011-01-25 10:40:09]:

 Changelog v3
 1. Renamed zone_reclaim_unmapped_pages to zone_reclaim_pages
 
 Refactor zone_reclaim, move reusable functionality outside
 of zone_reclaim. Make zone_reclaim_unmapped_pages modular
 
 Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
 Reviewed-by: Christoph Lameter c...@linux.com

I got the patch numbering wrong due to an internet connection going down
in the middle of stg mail; restarting with specified patches goofed up
the numbering. I can resend the patches with the correct numbering if
desired. This patch should be numbered 2/3

-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [REPOST] [PATCH 3/3] Provide control over unmapped pages (v3)

2011-01-23 Thread Balbir Singh
* Christoph Lameter c...@linux.com [2011-01-21 09:55:17]:

 On Fri, 21 Jan 2011, Balbir Singh wrote:
 
  * Christoph Lameter c...@linux.com [2011-01-20 09:00:09]:
 
   On Thu, 20 Jan 2011, Balbir Singh wrote:
  
+   unmapped_page_control
+   [KNL] Available if 
CONFIG_UNMAPPED_PAGECACHE_CONTROL
+   is enabled. It controls the amount of unmapped 
memory
+   that is present in the system. This boot option 
plus
+   vm.min_unmapped_ratio (sysctl) provide granular 
control
  
   min_unmapped_ratio is there to guarantee that zone reclaim does not
   reclaim all unmapped pages.
  
   What you want here is a max_unmapped_ratio.
  
 
  I thought about that, the logic for reusing min_unmapped_ratio was to
  keep a limit beyond which unmapped page cache shrinking should stop.
 
 Right. That is the role of it. Its a minimum to leave. You want a maximum
 size of the pagte cache.

In this case we want the maximum to be as small as the minimum, but
from a general design perspective maximum does make sense.

 
  I think you are suggesting max_unmapped_ratio as the point at which
  shrinking should begin, right?
 
 The role of min_unmapped_ratio is to never reclaim more pagecache if we
 reach that ratio even if we have to go off node for an allocation.
 
 AFAICT What you propose is a maximum size of the page cache. If the number
 of page cache pages goes beyond that then you trim the page cache in
 background reclaim.
 
+   reclaim_unmapped_pages(priority, zone, sc);
+
if (!zone_watermark_ok_safe(zone, order,
  
   H. Okay that means background reclaim does it. If so then we also want
   zone reclaim to be able to work in the background I think.
 
  Anything specific you had in mind, works for me in testing, but is
  there anything specific that stands out in your mind that needs to be
  done?
 
 Hmmm. So this would also work in a NUMA configuration, right. Limiting the
 sizes of the page cache would avoid zone reclaim through these limit. Page
 cache size would be limited by the max_unmapped_ratio.
 
 zone_reclaim only would come into play if other allocations make the
 memory on the node so tight that we would have to evict more page
 cache pages in direct reclaim.
 Then zone_reclaim could go down to shrink the page cache size to
 min_unmapped_ratio.


I'll repost with max_unmapped_ratio changes
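
To make sure we mean the same thing, the split I intend for the
repost is roughly the following -- just a sketch; the helper name is
made up for illustration:

/*
 * max_unmapped_pages: trigger, background trimming starts once the
 *                     unmapped page cache grows past it.
 * min_unmapped_pages: floor, reclaim never shrinks the unmapped page
 *                     cache below it (unchanged from zone_reclaim).
 */
static unsigned long unmapped_pages_over_limit(struct zone *zone)
{
	unsigned long unmapped = zone_unmapped_file_pages(zone);

	if (unmapped <= zone->max_unmapped_pages)
		return 0;	/* below the trigger, nothing to do */

	/* trim back towards the floor, never below it */
	return unmapped - zone->min_unmapped_pages;
}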

Thanks for the review! 

-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[REPOST][PATCH 0/3] Unmapped page cache control (v3)

2011-01-20 Thread Balbir Singh

The following series implements page cache control,
this is a split out version of patch 1 of version 3 of the
page cache optimization patches posted earlier at
Previous posting http://lwn.net/Articles/419564/

The previous few revisions received a lot of comments; I've tried to
address as many of those as possible in this revision. The
last series was reviewed by Christoph Lameter.

There were comments on overlap with Nick's changes. I don't feel
these changes impact Nick's work, and integration can/will be
considered as the patches evolve, if need be.

Detailed Description

This patch implements unmapped page cache control via preferred
page cache reclaim. The current patch hooks into kswapd and reclaims
page cache if the user has requested for unmapped page control.
This is useful in the following scenario
- In a virtualized environment with cache=writethrough, we see
  double caching - (one in the host and one in the guest). As
  we try to scale guests, cache usage across the system grows.
  The goal of this patch is to reclaim page cache when Linux is running
  as a guest and get the host to hold the page cache and manage it.
  There might be temporary duplication, but in the long run, memory
  in the guests would be used for mapped pages.
- The option is controlled via a boot option and the administrator
  can selectively turn it on, on a need to use basis.

A lot of the code is borrowed from zone_reclaim_mode logic for
__zone_reclaim(). One might argue that the with ballooning and
KSM this feature is not very useful, but even with ballooning,
we need extra logic to balloon multiple VM machines and it is hard
to figure out the correct amount of memory to balloon. With these
patches applied, each guest has a sufficient amount of free memory
available, that can be easily seen and reclaimed by the balloon driver.
The additional memory in the guest can be reused for additional
applications or used to start additional guests/balance memory in
the host.

KSM currently does not de-duplicate host and guest page cache. The goal
of this patch is to help automatically balance unmapped page cache when
instructed to do so.

There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO
and the number of pages to reclaim when unmapped_page_control argument
is supplied. These numbers were chosen to avoid reaping page cache too
aggressively or too frequently, while at the same time providing control.

The sysctl for min_unmapped_ratio provides further control from
within the guest on the amount of unmapped pages to reclaim.

Data from the previous patchsets can be found at
https://lkml.org/lkml/2010/11/30/79

Size measurement

CONFIG_UNMAPPED_PAGECACHE_CONTROL and CONFIG_NUMA enabled
# size mm/built-in.o 
   textdata bss dec hex filename
 419431 1883047  140888 2443366  254866 mm/built-in.o

CONFIG_UNMAPPED_PAGECACHE_CONTROL disabled, CONFIG_NUMA enabled
# size mm/built-in.o 
   textdata bss dec hex filename
 418908 1883023  140888 2442819  254643 mm/built-in.o


---

Balbir Singh (3):
  Move zone_reclaim() outside of CONFIG_NUMA
  Refactor zone_reclaim code
  Provide control over unmapped pages


 Documentation/kernel-parameters.txt |8 ++
 include/linux/mmzone.h  |4 +
 include/linux/swap.h|   21 +-
 init/Kconfig|   12 +++
 kernel/sysctl.c |   20 +++--
 mm/page_alloc.c |9 ++
 mm/vmscan.c |  132 +++
 7 files changed, 175 insertions(+), 31 deletions(-)

-- 
Balbir Singh
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[REPOST] [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v3)

2011-01-20 Thread Balbir Singh
This patch moves zone_reclaim and associated helpers
outside CONFIG_NUMA. This infrastructure is reused
in the patches for page cache control that follow.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 include/linux/mmzone.h |4 ++--
 include/linux/swap.h   |4 ++--
 kernel/sysctl.c|   18 +-
 mm/vmscan.c|2 --
 4 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4890662..aeede91 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -302,12 +302,12 @@ struct zone {
 */
unsigned long   lowmem_reserve[MAX_NR_ZONES];
 
-#ifdef CONFIG_NUMA
-   int node;
/*
 * zone reclaim becomes active if more unmapped pages exist.
 */
unsigned long   min_unmapped_pages;
+#ifdef CONFIG_NUMA
+   int node;
unsigned long   min_slab_pages;
 #endif
struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 84375e4..ac5c06e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -253,11 +253,11 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
+extern int sysctl_min_unmapped_ratio;
+extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
 #define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index a00fdef..e40040e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1211,15 +1211,6 @@ static struct ctl_table vm_table[] = {
.extra1 = zero,
},
 #endif
-#ifdef CONFIG_NUMA
-   {
-   .procname   = zone_reclaim_mode,
-   .data   = zone_reclaim_mode,
-   .maxlen = sizeof(zone_reclaim_mode),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec,
-   .extra1 = zero,
-   },
{
.procname   = min_unmapped_ratio,
.data   = sysctl_min_unmapped_ratio,
@@ -1229,6 +1220,15 @@ static struct ctl_table vm_table[] = {
.extra1 = zero,
.extra2 = one_hundred,
},
+#ifdef CONFIG_NUMA
+   {
+   .procname   = zone_reclaim_mode,
+   .data   = zone_reclaim_mode,
+   .maxlen = sizeof(zone_reclaim_mode),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   .extra1 = zero,
+   },
{
.procname   = min_slab_ratio,
.data   = sysctl_min_slab_ratio,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 42a4859..e841cae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2740,7 +2740,6 @@ static int __init kswapd_init(void)
 
 module_init(kswapd_init)
 
-#ifdef CONFIG_NUMA
 /*
  * Zone reclaim mode
  *
@@ -2950,7 +2949,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, 
unsigned int order)
 
return ret;
 }
-#endif
 
 /*
  * page_evictable - test whether a page is evictable

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[REPOST] [PATCH 2/3] Refactor zone_reclaim code (v3)

2011-01-20 Thread Balbir Singh
Changelog v3
1. Renamed zone_reclaim_unmapped_pages to zone_reclaim_pages

Refactor zone_reclaim, move reusable functionality outside
of zone_reclaim. Make zone_reclaim_unmapped_pages modular

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 mm/vmscan.c |   35 +++
 1 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e841cae..3b25423 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
 }
 
 /*
+ * Helper function to reclaim unmapped pages, we might add something
+ * similar to this for slab cache as well. Currently this function
+ * is shared with __zone_reclaim()
+ */
+static inline void
+zone_reclaim_pages(struct zone *zone, struct scan_control *sc,
+   unsigned long nr_pages)
+{
+   int priority;
+   /*
+* Free memory by calling shrink zone with increasing
+* priorities until we have enough memory freed.
+*/
+   priority = ZONE_RECLAIM_PRIORITY;
+   do {
+   shrink_zone(priority, zone, sc);
+   priority--;
+   } while (priority >= 0 && sc->nr_reclaimed < nr_pages);
+}
+
+/*
  * Try to free up some pages from this zone through reclaim.
  */
 static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int 
order)
@@ -2823,7 +2844,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t 
gfp_mask, unsigned int order)
const unsigned long nr_pages = 1  order;
struct task_struct *p = current;
struct reclaim_state reclaim_state;
-   int priority;
struct scan_control sc = {
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2847,17 +2867,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t 
gfp_mask, unsigned int order)
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
 
-   if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
-   /*
-* Free memory by calling shrink zone with increasing
-* priorities until we have enough memory freed.
-*/
-   priority = ZONE_RECLAIM_PRIORITY;
-   do {
-   shrink_zone(priority, zone, &sc);
-   priority--;
-   } while (priority >= 0 && sc.nr_reclaimed < nr_pages);
-   }
+   if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages)
+   zone_reclaim_pages(zone, &sc, nr_pages);
 
nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
if (nr_slab_pages0 > zone->min_slab_pages) {

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[REPOST] [PATCH 3/3] Provide control over unmapped pages (v3)

2011-01-20 Thread Balbir Singh
Changelog v2
1. Use a config option to enable the code (Andrew Morton)
2. Explain the magic tunables in the code or at-least attempt
   to explain them (General comment)
3. Hint uses of the boot parameter with unlikely (Andrew Morton)
4. Use better names (balanced is not a good naming convention)

Provide control using zone_reclaim() and a boot parameter. The
code reuses functionality from zone_reclaim() to isolate unmapped
pages and reclaim them as a priority, ahead of other mapped pages.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 Documentation/kernel-parameters.txt |8 +++
 include/linux/swap.h|   21 ++--
 init/Kconfig|   12 
 kernel/sysctl.c |2 +
 mm/page_alloc.c |9 +++
 mm/vmscan.c |   97 +++
 6 files changed, 142 insertions(+), 7 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index dd8fe2b..f52b0bd 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2515,6 +2515,14 @@ and is between 256 and 4096 characters. It is defined in 
the file
[X86]
Set unknown_nmi_panic=1 early on boot.
 
+   unmapped_page_control
+   [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
+   is enabled. It controls the amount of unmapped memory
+   that is present in the system. This boot option plus
+   vm.min_unmapped_ratio (sysctl) provide granular control
+   over how much unmapped page cache can exist in the 
system
+   before kswapd starts reclaiming unmapped page cache 
pages.
+
usbcore.autosuspend=
[USB] The autosuspend time delay (in seconds) used
for newly-detected USB devices (default 2).  This
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ac5c06e..773d7e5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -253,19 +253,32 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
 extern int sysctl_min_unmapped_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
-#ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
-extern int sysctl_min_slab_ratio;
 #else
-#define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
 {
return 0;
 }
 #endif
 
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+extern bool should_reclaim_unmapped_pages(struct zone *zone);
+#else
+static inline bool should_reclaim_unmapped_pages(struct zone *zone)
+{
+   return false;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+extern int zone_reclaim_mode;
+extern int sysctl_min_slab_ratio;
+#else
+#define zone_reclaim_mode 0
+#endif
+
 extern int page_evictable(struct page *page, struct vm_area_struct *vma);
 extern void scan_mapping_unevictable_pages(struct address_space *);
 
diff --git a/init/Kconfig b/init/Kconfig
index 3eb22ad..78c9169 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -782,6 +782,18 @@ endif # NAMESPACES
 config MM_OWNER
bool
 
+config UNMAPPED_PAGECACHE_CONTROL
+   bool "Provide control over unmapped page cache"
+   default n
+   help
+ This option adds support for controlling unmapped page cache
+ via a boot parameter (unmapped_page_control). The boot parameter
+ with sysctl (vm.min_unmapped_ratio) control the total number
+ of unmapped pages in the system. This feature is useful if
+ you want to limit the amount of unmapped page cache or want
+ to reduce page cache duplication in a virtualized environment.
+ If unsure say 'N'
+
 config SYSFS_DEPRECATED
bool enable deprecated sysfs features to support old userspace tools
depends on SYSFS
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e40040e..ab2c60a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1211,6 +1211,7 @@ static struct ctl_table vm_table[] = {
.extra1 = zero,
},
 #endif
+#if defined(CONFIG_UNMAPPED_PAGE_CONTROL) || defined(CONFIG_NUMA)
{
.procname   = min_unmapped_ratio,
.data   = sysctl_min_unmapped_ratio,
@@ -1220,6 +1221,7 @@ static struct ctl_table vm_table[] = {
.extra1 = zero,
.extra2 = one_hundred,
},
+#endif
 #ifdef CONFIG_NUMA
{
.procname   = zone_reclaim_mode,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1845a97..1c9fbab 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1662,6 +1662,9 @@ zonelist_scan:
unsigned long mark

Re: [REPOST] [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v3)

2011-01-20 Thread Balbir Singh
* Christoph Lameter c...@linux.com [2011-01-20 08:49:27]:

 On Thu, 20 Jan 2011, Balbir Singh wrote:
 
  --- a/include/linux/swap.h
  +++ b/include/linux/swap.h
  @@ -253,11 +253,11 @@ extern int vm_swappiness;
   extern int remove_mapping(struct address_space *mapping, struct page 
  *page);
   extern long vm_total_pages;
 
  +extern int sysctl_min_unmapped_ratio;
  +extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
   #ifdef CONFIG_NUMA
   extern int zone_reclaim_mode;
  -extern int sysctl_min_unmapped_ratio;
   extern int sysctl_min_slab_ratio;
  -extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
   #else
   #define zone_reclaim_mode 0
 
 So the end result of this patch is that zone reclaim is compiled
 into vmscan.o even on !NUMA configurations but since zone_reclaim_mode ==
 0 noone can ever call that code?


The third patch fixes this with the introduction of a config option
(cut-copy-paste below). If someone were to bisect to this point, what
you say is correct.

+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) ||
defined(CONFIG_NUMA)
 extern int sysctl_min_unmapped_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
-#ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
-extern int sysctl_min_slab_ratio;
 #else
-#define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned
int order)
 {
return 0;
 }
 #endif

Thanks for the review! 

-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [REPOST] [PATCH 2/3] Refactor zone_reclaim code (v3)

2011-01-20 Thread Balbir Singh
* Christoph Lameter c...@linux.com [2011-01-20 08:50:40]:

 
 Reviewed-by: Christoph Lameter c...@linux.com


Thanks for the review! 

-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [REPOST] [PATCH 3/3] Provide control over unmapped pages (v3)

2011-01-20 Thread Balbir Singh
* Christoph Lameter c...@linux.com [2011-01-20 09:00:09]:

 On Thu, 20 Jan 2011, Balbir Singh wrote:
 
  +   unmapped_page_control
  +   [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
  +   is enabled. It controls the amount of unmapped memory
  +   that is present in the system. This boot option plus
  +   vm.min_unmapped_ratio (sysctl) provide granular control
 
 min_unmapped_ratio is there to guarantee that zone reclaim does not
 reclaim all unmapped pages.
 
 What you want here is a max_unmapped_ratio.


I thought about that, the logic for reusing min_unmapped_ratio was to
keep a limit beyond which unmapped page cache shrinking should stop.
I think you are suggesting max_unmapped_ratio as the point at which
shrinking should begin, right?
 
 
   {
  @@ -2297,6 +2320,12 @@ loop_again:
  shrink_active_list(SWAP_CLUSTER_MAX, zone,
  sc, priority, 0);
 
  +   /*
  +* We do unmapped page reclaim once here and once
  +* below, so that we don't lose out
  +*/
  +   reclaim_unmapped_pages(priority, zone, sc);
  +
  if (!zone_watermark_ok_safe(zone, order,
 
 H. Okay that means background reclaim does it. If so then we also want
 zone reclaim to be able to work in the background I think.

Anything specific you had in mind, works for me in testing, but is
there anything specific that stands out in your mind that needs to be
done?

Thanks for the review!
 

-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] Provide control over unmapped pages (v2)

2010-12-23 Thread Balbir Singh
* MinChan Kim minchan@gmail.com [2010-12-14 20:02:45]:

  +                       if (should_reclaim_unmapped_pages(zone))
  +                               wakeup_kswapd(zone, order);
 
 I think we can put the logic into zone_watermark_okay.


I did some checks and zone_watermark_ok is used in several places for
a generic check like this -- for example prior to zone_reclaim() in
get_page_from_freelist(), where we skip zones based on the return value.
The compaction code uses it as well, so the impact would be deeper: it
uses it to check whether an allocation will succeed or not, and I don't
want unmapped page control to impact that.
 
  +                       /*
  +                        * We do unmapped page reclaim once here and once
  +                        * below, so that we don't lose out
  +                        */
  +                       reclaim_unmapped_pages(priority, zone, sc);
 
 It can make unnecessary stir of lru pages.
 How about this?
 zone_watermark_ok returns ZONE_UNMAPPED_PAGE_FULL.
 wakeup_kswapd(..., please reclaim unmapped page cache).
 If kswapd is woke up by unmapped page full, kswapd sets up sc with unmap = 0.
 If the kswapd try to reclaim unmapped page, shrink_page_list doesn't
 rotate non-unmapped pages.

With may_unmap set to 0 and may_writepage set to 0, I don't think this
should be a major problem; like I said, this code is already enabled if
zone_reclaim_mode != 0 and CONFIG_NUMA is set.

  +unsigned long reclaim_unmapped_pages(int priority, struct zone *zone,
  +                                               struct scan_control *sc)
  +{
  +       if (unlikely(unmapped_page_control) &&
  +               (zone_unmapped_file_pages(zone) >
  zone->min_unmapped_pages)) {
  +               struct scan_control nsc;
  +               unsigned long nr_pages;
  +
  +               nsc = *sc;
  +
  +               nsc.swappiness = 0;
  +               nsc.may_writepage = 0;
  +               nsc.may_unmap = 0;
  +               nsc.nr_reclaimed = 0;
 
 This logic can be put in zone_reclaim_unmapped_pages.
 

Now that I refactored the code and called it zone_reclaim_pages, I
expect the correct sc to be passed to it. This code is reused between
zone_reclaim() and reclaim_unmapped_pages(). In the former,
zone_reclaim does the setup.
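
Putting it together, the shared path now looks roughly like this -- a
sketch based on the snippet above, with the names from the current
series; the trigger (min_unmapped_pages in v2) and the reclaim target
computation are simplified here and not the final code:

unsigned long reclaim_unmapped_pages(int priority, struct zone *zone,
					struct scan_control *sc)
{
	struct scan_control nsc;
	unsigned long nr_pages;

	if (likely(!unmapped_page_control) ||
	    zone_unmapped_file_pages(zone) <= zone->min_unmapped_pages)
		return 0;

	nsc = *sc;
	nsc.swappiness = 0;	/* leave anonymous pages alone */
	nsc.may_writepage = 0;	/* clean page cache only, no writeback */
	nsc.may_unmap = 0;	/* do not touch mapped pages */
	nsc.nr_reclaimed = 0;

	/* how far above the limit we are -- target simplified here */
	nr_pages = zone_unmapped_file_pages(zone) - zone->min_unmapped_pages;

	zone_reclaim_pages(zone, &nsc, nr_pages);
	return nsc.nr_reclaimed;
}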

 If we want really this, how about the new cache lru idea as Kame suggests?
 For example, add_to_page_cache_lru adds the page into cache lru.
 page_add_file_rmap moves the page into inactive file.
 page_remove_rmap moves the page into lru cache, again.
 We can count the unmapped pages and if the size exceeds limit, we can
 wake up kswapd.
 whenever the memory pressure happens, first of all, reclaimer try to
 reclaim cache lru.

We already have a file LRU and that has active/inactive lists; I don't
think a special mapped/unmapped list makes sense at this point.


-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] Refactor zone_reclaim (v2)

2010-12-23 Thread Balbir Singh
* MinChan Kim minchan@gmail.com [2010-12-15 07:38:42]:

 On Tue, Dec 14, 2010 at 8:45 PM, Balbir Singh bal...@linux.vnet.ibm.com 
 wrote:
  * MinChan Kim minchan@gmail.com [2010-12-14 19:01:26]:
 
  Hi Balbir,
 
  On Fri, Dec 10, 2010 at 11:31 PM, Balbir Singh
  bal...@linux.vnet.ibm.com wrote:
   Move reusable functionality outside of zone_reclaim.
   Make zone_reclaim_unmapped_pages modular
  
   Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
   ---
    mm/vmscan.c |   35 +++
    1 files changed, 23 insertions(+), 12 deletions(-)
  
   diff --git a/mm/vmscan.c b/mm/vmscan.c
   index e841cae..4e2ad05 100644
   --- a/mm/vmscan.c
   +++ b/mm/vmscan.c
   @@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct 
   zone *zone)
    }
  
    /*
   + * Helper function to reclaim unmapped pages, we might add something
   + * similar to this for slab cache as well. Currently this function
   + * is shared with __zone_reclaim()
   + */
   +static inline void
   +zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc,
   +                               unsigned long nr_pages)
   +{
   +       int priority;
   +       /*
   +        * Free memory by calling shrink zone with increasing
   +        * priorities until we have enough memory freed.
   +        */
   +       priority = ZONE_RECLAIM_PRIORITY;
   +       do {
   +               shrink_zone(priority, zone, sc);
   +               priority--;
   +               } while (priority >= 0 && sc->nr_reclaimed < nr_pages);
   +}
 
  As I said in the previous version, zone_reclaim_unmapped_pages doesn't do
  anything specific to reclaiming unmapped pages.
 
  The scan_control passed in carries the right arguments for implementing
  reclaim of unmapped pages.
 
 I mean you should do the scan_control setup in this function.
 The current zone_reclaim_unmapped_pages doesn't have any routine specific
 to reclaiming unmapped pages.
 Otherwise, change the function name to just zone_reclaim_pages, though I
 suspect you don't want that.

Done, I renamed the function to zone_reclaim_pages.

Thanks!

-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] Unmapped Page Control (v3)

2010-12-23 Thread Balbir Singh
The following series implements page cache control.
It is a split-out version of patch 1 of version 3 of the
page cache optimization patches posted earlier.
Previous posting: http://lwn.net/Articles/419564/

For those with LWN.net access, there is a detailed coverage
of the patchset at http://lwn.net/Articles/419713/

The previous few revisions received a lot of comments; I've tried to
address as many of them as possible in this revision. An earlier
series was reviewed by Christoph Lameter.

There were comments about overlap with Nick's changes. I don't
feel these changes impact Nick's work; integration can/will be
considered as the patches evolve, if need be.

Detailed Description

This patch implements unmapped page cache control via preferred
page cache reclaim. The current patch hooks into kswapd and reclaims
page cache if the user has requested for unmapped page control.
This is useful in the following scenario
- In a virtualized environment with cache=writethrough, we see
  double caching - (one in the host and one in the guest). As
  we try to scale guests, cache usage across the system grows.
  The goal of this patch is to reclaim page cache when Linux is running
  as a guest and get the host to hold the page cache and manage it.
  There might be temporary duplication, but in the long run, memory
  in the guests would be used for mapped pages.
- The option is controlled via a boot option and the administrator
  can selectively turn it on, on a need to use basis.

A lot of the code is borrowed from the zone_reclaim_mode logic in
__zone_reclaim(). One might argue that with ballooning and
KSM this feature is not very useful, but even with ballooning,
we need extra logic to balloon multiple VMs and it is hard
to figure out the correct amount of memory to balloon. With these
patches applied, each guest has a sufficient amount of free memory
available, that can be easily seen and reclaimed by the balloon driver.
The additional memory in the guest can be reused for additional
applications or used to start additional guests/balance memory in
the host.

KSM currently does not de-duplicate host and guest page cache. The goal
of this patch is to help automatically balance unmapped page cache when
instructed to do so.

There are some magic numbers in use in the code: UNMAPPED_PAGE_RATIO
and the number of pages to reclaim when the unmapped_page_control argument
is supplied. These numbers were chosen to avoid reaping page cache too
aggressively or too frequently, while at the same time providing control.

The sysctl for min_unmapped_ratio provides further control from
within the guest on the amount of unmapped pages to reclaim.
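
As a rough illustration of how the two knobs combine (a sketch based on the
helpers referenced in this series, not a verbatim copy of the patch): kswapd
is only asked to act when the boot parameter is set and the zone's unmapped
file pages exceed the per-zone threshold derived from vm.min_unmapped_ratio:

    /* sketch only; names follow the helpers used in this series */
    static unsigned long zone_unmapped_file_pages(struct zone *zone)
    {
            unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
            unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
                                     zone_page_state(zone, NR_ACTIVE_FILE);

            /* mapped pages may be accounted on a different zone's LRU */
            return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0;
    }

    bool should_reclaim_unmapped_pages(struct zone *zone)
    {
            /* min_unmapped_pages = zone size * vm.min_unmapped_ratio / 100 */
            return unlikely(unmapped_page_control) &&
                    zone_unmapped_file_pages(zone) > zone->min_unmapped_pages;
    }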

Data from the previous patchsets can be found at
https://lkml.org/lkml/2010/11/30/79

Size measurement

CONFIG_UNMAPPED_PAGECACHE_CONTROL and CONFIG_NUMA enabled
# size mm/built-in.o 
   text    data     bss     dec     hex filename
 419431 1883047  140888 2443366  254866 mm/built-in.o

CONFIG_UNMAPPED_PAGECACHE_CONTROL disabled, CONFIG_NUMA enabled
# size mm/built-in.o 
   text    data     bss     dec     hex filename
 418908 1883023  140888 2442819  254643 mm/built-in.o


---

Balbir Singh (3):
  Move zone_reclaim() outside of CONFIG_NUMA
  Refactor zone_reclaim code
  Provide control over unmapped pages


 Documentation/kernel-parameters.txt |8 ++
 include/linux/mmzone.h  |4 +
 include/linux/swap.h|   21 +-
 init/Kconfig|   12 +++
 kernel/sysctl.c |   20 +++--
 mm/page_alloc.c |9 ++
 mm/vmscan.c |  132 +++
 7 files changed, 175 insertions(+), 31 deletions(-)

-- 
Balbir Singh
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v3)

2010-12-23 Thread Balbir Singh
This patch moves zone_reclaim and associated helpers
outside CONFIG_NUMA. This infrastructure is reused
in the patches for page cache control that follow.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 include/linux/mmzone.h |4 ++--
 include/linux/swap.h   |4 ++--
 kernel/sysctl.c|   18 +-
 mm/vmscan.c|2 --
 4 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4890662..aeede91 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -302,12 +302,12 @@ struct zone {
 */
unsigned long   lowmem_reserve[MAX_NR_ZONES];
 
-#ifdef CONFIG_NUMA
-   int node;
/*
 * zone reclaim becomes active if more unmapped pages exist.
 */
unsigned long   min_unmapped_pages;
+#ifdef CONFIG_NUMA
+   int node;
unsigned long   min_slab_pages;
 #endif
struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 84375e4..ac5c06e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -253,11 +253,11 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
+extern int sysctl_min_unmapped_ratio;
+extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
 #define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index a00fdef..e40040e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1211,15 +1211,6 @@ static struct ctl_table vm_table[] = {
.extra1 = zero,
},
 #endif
-#ifdef CONFIG_NUMA
-   {
-   .procname   = zone_reclaim_mode,
-   .data   = zone_reclaim_mode,
-   .maxlen = sizeof(zone_reclaim_mode),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec,
-   .extra1 = zero,
-   },
{
.procname   = min_unmapped_ratio,
.data   = sysctl_min_unmapped_ratio,
@@ -1229,6 +1220,15 @@ static struct ctl_table vm_table[] = {
.extra1 = zero,
.extra2 = one_hundred,
},
+#ifdef CONFIG_NUMA
+   {
+   .procname   = zone_reclaim_mode,
+   .data   = zone_reclaim_mode,
+   .maxlen = sizeof(zone_reclaim_mode),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   .extra1 = zero,
+   },
{
.procname   = min_slab_ratio,
.data   = sysctl_min_slab_ratio,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 42a4859..e841cae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2740,7 +2740,6 @@ static int __init kswapd_init(void)
 
 module_init(kswapd_init)
 
-#ifdef CONFIG_NUMA
 /*
  * Zone reclaim mode
  *
@@ -2950,7 +2949,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, 
unsigned int order)
 
return ret;
 }
-#endif
 
 /*
  * page_evictable - test whether a page is evictable

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] Refactor zone_reclaim code (v3)

2010-12-23 Thread Balbir Singh
Changelog v3
1. Renamed zone_reclaim_unmapped_pages to zone_reclaim_pages

Refactor zone_reclaim, move reusable functionality outside
of zone_reclaim. Make zone_reclaim_unmapped_pages modular

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 mm/vmscan.c |   35 +++
 1 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e841cae..3b25423 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
 }
 
 /*
+ * Helper function to reclaim unmapped pages, we might add something
+ * similar to this for slab cache as well. Currently this function
+ * is shared with __zone_reclaim()
+ */
+static inline void
+zone_reclaim_pages(struct zone *zone, struct scan_control *sc,
+   unsigned long nr_pages)
+{
+   int priority;
+   /*
+* Free memory by calling shrink zone with increasing
+* priorities until we have enough memory freed.
+*/
+   priority = ZONE_RECLAIM_PRIORITY;
+   do {
+   shrink_zone(priority, zone, sc);
+   priority--;
+   } while (priority >= 0 && sc->nr_reclaimed < nr_pages);
+}
+
+/*
  * Try to free up some pages from this zone through reclaim.
  */
 static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int 
order)
@@ -2823,7 +2844,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t 
gfp_mask, unsigned int order)
const unsigned long nr_pages = 1 << order;
struct task_struct *p = current;
struct reclaim_state reclaim_state;
-   int priority;
struct scan_control sc = {
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2847,17 +2867,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t 
gfp_mask, unsigned int order)
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
 
-   if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
-   /*
-* Free memory by calling shrink zone with increasing
-* priorities until we have enough memory freed.
-*/
-   priority = ZONE_RECLAIM_PRIORITY;
-   do {
-   shrink_zone(priority, zone, &sc);
-   priority--;
-   } while (priority >= 0 && sc.nr_reclaimed < nr_pages);
-   }
+   if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages)
+   zone_reclaim_pages(zone, &sc, nr_pages);
 
nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
if (nr_slab_pages0 > zone->min_slab_pages) {

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] Provide control over unmapped pages (v3)

2010-12-23 Thread Balbir Singh
Changelog v2
1. Use a config option to enable the code (Andrew Morton)
2. Explain the magic tunables in the code or at least attempt
   to explain them (General comment)
3. Hint uses of the boot parameter with unlikely (Andrew Morton)
4. Use better names (balanced is not a good naming convention)

Provide control using zone_reclaim() and a boot parameter. The
code reuses functionality from zone_reclaim() to isolate unmapped
pages and reclaim them as a priority, ahead of other mapped pages.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 Documentation/kernel-parameters.txt |8 +++
 include/linux/swap.h|   21 ++--
 init/Kconfig|   12 
 kernel/sysctl.c |2 +
 mm/page_alloc.c |9 +++
 mm/vmscan.c |   97 +++
 6 files changed, 142 insertions(+), 7 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index dd8fe2b..f52b0bd 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2515,6 +2515,14 @@ and is between 256 and 4096 characters. It is defined in 
the file
[X86]
Set unknown_nmi_panic=1 early on boot.
 
+   unmapped_page_control
+   [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
+   is enabled. It controls the amount of unmapped memory
+   that is present in the system. This boot option plus
+   vm.min_unmapped_ratio (sysctl) provide granular control
+   over how much unmapped page cache can exist in the 
system
+   before kswapd starts reclaiming unmapped page cache 
pages.
+
usbcore.autosuspend=
[USB] The autosuspend time delay (in seconds) used
for newly-detected USB devices (default 2).  This
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ac5c06e..773d7e5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -253,19 +253,32 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
 extern int sysctl_min_unmapped_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
-#ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
-extern int sysctl_min_slab_ratio;
 #else
-#define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
 {
return 0;
 }
 #endif
 
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+extern bool should_reclaim_unmapped_pages(struct zone *zone);
+#else
+static inline bool should_reclaim_unmapped_pages(struct zone *zone)
+{
+   return false;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+extern int zone_reclaim_mode;
+extern int sysctl_min_slab_ratio;
+#else
+#define zone_reclaim_mode 0
+#endif
+
 extern int page_evictable(struct page *page, struct vm_area_struct *vma);
 extern void scan_mapping_unevictable_pages(struct address_space *);
 
diff --git a/init/Kconfig b/init/Kconfig
index 3eb22ad..78c9169 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -782,6 +782,18 @@ endif # NAMESPACES
 config MM_OWNER
bool
 
+config UNMAPPED_PAGECACHE_CONTROL
+   bool Provide control over unmapped page cache
+   default n
+   help
+ This option adds support for controlling unmapped page cache
+ via a boot parameter (unmapped_page_control). The boot parameter
+ with sysctl (vm.min_unmapped_ratio) control the total number
+ of unmapped pages in the system. This feature is useful if
+ you want to limit the amount of unmapped page cache or want
+ to reduce page cache duplication in a virtualized environment.
+ If unsure say 'N'
+
 config SYSFS_DEPRECATED
bool enable deprecated sysfs features to support old userspace tools
depends on SYSFS
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e40040e..ab2c60a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1211,6 +1211,7 @@ static struct ctl_table vm_table[] = {
.extra1 = zero,
},
 #endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
{
.procname   = min_unmapped_ratio,
.data   = sysctl_min_unmapped_ratio,
@@ -1220,6 +1221,7 @@ static struct ctl_table vm_table[] = {
.extra1 = zero,
.extra2 = one_hundred,
},
+#endif
 #ifdef CONFIG_NUMA
{
.procname   = zone_reclaim_mode,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1845a97..1c9fbab 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1662,6 +1662,9 @@ zonelist_scan:
unsigned long mark

Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting

2010-12-14 Thread Balbir Singh
* Rik van Riel r...@redhat.com [2010-12-13 12:02:51]:

 On 12/11/2010 08:57 AM, Balbir Singh wrote:
 
  If the vcpu holding the lock runs more and is capped, the timeslice
  transfer is a heuristic that will not help.
 
 That indicates you really need the cap to be per guest, and
 not per VCPU.


Yes, I personally think so too, but I suspect there needs to be a
larger agreement on the semantics. The VCPU semantics in terms of
power apply to each VCPU as opposed to the entire system (per guest).
 
 Having one VCPU spin on a lock (and achieve nothing), because
 the other one cannot give up the lock due to hitting its CPU
 cap could lead to showstoppingly bad performance.

Yes, that seems right!

-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] Refactor zone_reclaim (v2)

2010-12-14 Thread Balbir Singh
* MinChan Kim minchan@gmail.com [2010-12-14 19:01:26]:

 Hi Balbir,
 
 On Fri, Dec 10, 2010 at 11:31 PM, Balbir Singh
 bal...@linux.vnet.ibm.com wrote:
  Move reusable functionality outside of zone_reclaim.
  Make zone_reclaim_unmapped_pages modular
 
  Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
  ---
   mm/vmscan.c |   35 +++
   1 files changed, 23 insertions(+), 12 deletions(-)
 
  diff --git a/mm/vmscan.c b/mm/vmscan.c
  index e841cae..4e2ad05 100644
  --- a/mm/vmscan.c
  +++ b/mm/vmscan.c
  @@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone 
  *zone)
   }
 
   /*
  + * Helper function to reclaim unmapped pages, we might add something
  + * similar to this for slab cache as well. Currently this function
  + * is shared with __zone_reclaim()
  + */
  +static inline void
  +zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc,
  +                               unsigned long nr_pages)
  +{
  +       int priority;
  +       /*
  +        * Free memory by calling shrink zone with increasing
  +        * priorities until we have enough memory freed.
  +        */
  +       priority = ZONE_RECLAIM_PRIORITY;
  +       do {
  +               shrink_zone(priority, zone, sc);
  +               priority--;
  +               } while (priority >= 0 && sc->nr_reclaimed < nr_pages);
  +}
 
 As I said previous version, zone_reclaim_unmapped_pages doesn't have
 any functions related to reclaim unmapped pages.
 The function name is rather strange.
 It would be better to add scan_control setup in function inner to
 reclaim only unmapped pages.

OK, that is an idea worth looking at, I'll revisit this function.

Thanks for the review!

-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting

2010-12-13 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-12-11 09:31:24]:

 On 12/10/2010 07:03 AM, Balbir Singh wrote:
 
   Scheduler people, please flame me with anything I may have done
   wrong, so I can do it right for a next version :)
 
 
 This is a good problem statement, there are other things to consider
 as well
 
 1. If a hard limit feature is enabled underneath, donating the
 timeslice would probably not make too much sense in that case
 
 What's the alternative?
 
 Consider a two vcpu guest with a 50% hard cap.  Suppose the workload
 involves ping-ponging within the guest.  If the scheduler decides to
 schedule the vcpus without any overlap, then the throughput will be
 dictated by the time slice.  If we allow donation, throughput is
 limited by context switch latency.


If the vcpu holding the lock runs more and is capped, the timeslice
transfer is a heuristic that will not help. 

-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting

2010-12-13 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-12-13 13:57:37]:

 On 12/11/2010 03:57 PM, Balbir Singh wrote:
 * Avi Kivitya...@redhat.com  [2010-12-11 09:31:24]:
 
   On 12/10/2010 07:03 AM, Balbir Singh wrote:
   
  Scheduler people, please flame me with anything I may have done
  wrong, so I can do it right for a next version :)
   
   
   This is a good problem statement, there are other things to consider
   as well
   
   1. If a hard limit feature is enabled underneath, donating the
   timeslice would probably not make too much sense in that case
 
   What's the alternative?
 
   Consider a two vcpu guest with a 50% hard cap.  Suppose the workload
   involves ping-ponging within the guest.  If the scheduler decides to
   schedule the vcpus without any overlap, then the throughput will be
   dictated by the time slice.  If we allow donation, throughput is
   limited by context switch latency.
 
 
 If the vcpu holding the lock runs more and is capped, the timeslice
 transfer is a heuristic that will not help.
 
 Why not?  as long as we shift the cap as well.


Shifting the cap would break it, no? Anyway, that is something for us
to keep track of as we add additional heuristics, not a show stopper. 

-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] Provide unmapped page cache control (v2)

2010-12-10 Thread Balbir Singh

The following series implements page cache control.
It is a split-out version of patch 1 of version 3 of the
page cache optimization patches posted earlier.
Previous posting: https://lkml.org/lkml/2010/11/30/79

The previous revision received a lot of comments; I've tried to
address as many of them as possible in this revision. The
last series was reviewed by Christoph Lameter.

There were comments about overlap with Nick's changes. I don't
feel these changes impact Nick's work; integration can/will be
considered as the patches evolve, if need be.

Detailed Description

This patch implements unmapped page cache control via preferred
page cache reclaim. The current patch hooks into kswapd and reclaims
page cache if the user has requested for unmapped page control.
This is useful in the following scenario
- In a virtualized environment with cache=writethrough, we see
  double caching - (one in the host and one in the guest). As
  we try to scale guests, cache usage across the system grows.
  The goal of this patch is to reclaim page cache when Linux is running
  as a guest and get the host to hold the page cache and manage it.
  There might be temporary duplication, but in the long run, memory
  in the guests would be used for mapped pages.
- The option is controlled via a boot option and the administrator
  can selectively turn it on, on a need to use basis.

A lot of the code is borrowed from the zone_reclaim_mode logic in
__zone_reclaim(). One might argue that with ballooning and
KSM this feature is not very useful, but even with ballooning,
we need extra logic to balloon multiple VMs and it is hard
to figure out the correct amount of memory to balloon. With these
patches applied, each guest has a sufficient amount of free memory
available, that can be easily seen and reclaimed by the balloon driver.
The additional memory in the guest can be reused for additional
applications or used to start additional guests/balance memory in
the host.

KSM currently does not de-duplicate host and guest page cache. The goal
of this patch is to help automatically balance unmapped page cache when
instructed to do so.

There are some magic numbers in use in the code: UNMAPPED_PAGE_RATIO
and the number of pages to reclaim when the unmapped_page_control argument
is supplied. These numbers were chosen to avoid reaping page cache too
aggressively or too frequently, while at the same time providing control.

The sysctl for min_unmapped_ratio provides further control from
within the guest on the amount of unmapped pages to reclaim.
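
To make the sysctl concrete, here is a worked example of the per-zone
threshold it produces; the formula is the one used in free_area_init_core()
in patch 3/3, and the zone size is just an illustrative assumption:

    /* zone->min_unmapped_pages = (realsize * sysctl_min_unmapped_ratio) / 100 */
    unsigned long realsize = 1048576;       /* assume a 4GB zone of 4KB pages */
    int sysctl_min_unmapped_ratio = 1;      /* the default vm.min_unmapped_ratio */
    unsigned long min_unmapped_pages = (realsize * sysctl_min_unmapped_ratio) / 100;
    /* = 10485 pages, i.e. roughly 41MB of unmapped page cache is always tolerated */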

Data from the previous patchsets can be found at
https://lkml.org/lkml/2010/11/30/79

Size measurement

CONFIG_UNMAPPED_PAGECACHE_CONTROL and CONFIG_NUMA enabled
# size mm/built-in.o 
   text    data     bss     dec     hex filename
 419431 1883047  140888 2443366  254866 mm/built-in.o

CONFIG_UNMAPPED_PAGECACHE_CONTROL disabled, CONFIG_NUMA enabled
# size mm/built-in.o 
   text    data     bss     dec     hex filename
 418908 1883023  140888 2442819  254643 mm/built-in.o


---

Balbir Singh (3):
  Move zone_reclaim() outside of CONFIG_NUMA
  Refactor zone_reclaim, move reusable functionality outside
  Provide control over unmapped pages


 Documentation/kernel-parameters.txt |8 ++
 include/linux/mmzone.h  |4 +
 include/linux/swap.h|   21 +-
 init/Kconfig|   12 +++
 kernel/sysctl.c |   20 +++--
 mm/page_alloc.c |9 ++
 mm/vmscan.c |  132 +++
 7 files changed, 175 insertions(+), 31 deletions(-)

-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v2)

2010-12-10 Thread Balbir Singh
Changelog v2
Moved sysctl for min_unmapped_ratio as well

This patch moves zone_reclaim and associated helpers
outside CONFIG_NUMA. This infrastructure is reused
in the patches for page cache control that follow.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 include/linux/mmzone.h |4 ++--
 include/linux/swap.h   |4 ++--
 kernel/sysctl.c|   18 +-
 mm/vmscan.c|2 --
 4 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4890662..aeede91 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -302,12 +302,12 @@ struct zone {
 */
unsigned long   lowmem_reserve[MAX_NR_ZONES];
 
-#ifdef CONFIG_NUMA
-   int node;
/*
 * zone reclaim becomes active if more unmapped pages exist.
 */
unsigned long   min_unmapped_pages;
+#ifdef CONFIG_NUMA
+   int node;
unsigned long   min_slab_pages;
 #endif
struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 84375e4..ac5c06e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -253,11 +253,11 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
+extern int sysctl_min_unmapped_ratio;
+extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
 #define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index a00fdef..e40040e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1211,15 +1211,6 @@ static struct ctl_table vm_table[] = {
.extra1 = zero,
},
 #endif
-#ifdef CONFIG_NUMA
-   {
-   .procname   = zone_reclaim_mode,
-   .data   = zone_reclaim_mode,
-   .maxlen = sizeof(zone_reclaim_mode),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec,
-   .extra1 = zero,
-   },
{
.procname   = min_unmapped_ratio,
.data   = sysctl_min_unmapped_ratio,
@@ -1229,6 +1220,15 @@ static struct ctl_table vm_table[] = {
.extra1 = zero,
.extra2 = one_hundred,
},
+#ifdef CONFIG_NUMA
+   {
+   .procname   = zone_reclaim_mode,
+   .data   = zone_reclaim_mode,
+   .maxlen = sizeof(zone_reclaim_mode),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   .extra1 = zero,
+   },
{
.procname   = min_slab_ratio,
.data   = sysctl_min_slab_ratio,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 42a4859..e841cae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2740,7 +2740,6 @@ static int __init kswapd_init(void)
 
 module_init(kswapd_init)
 
-#ifdef CONFIG_NUMA
 /*
  * Zone reclaim mode
  *
@@ -2950,7 +2949,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, 
unsigned int order)
 
return ret;
 }
-#endif
 
 /*
  * page_evictable - test whether a page is evictable

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] Refactor zone_reclaim (v2)

2010-12-10 Thread Balbir Singh
Move reusable functionality outside of zone_reclaim.
Make zone_reclaim_unmapped_pages modular

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 mm/vmscan.c |   35 +++
 1 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e841cae..4e2ad05 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
 }
 
 /*
+ * Helper function to reclaim unmapped pages, we might add something
+ * similar to this for slab cache as well. Currently this function
+ * is shared with __zone_reclaim()
+ */
+static inline void
+zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc,
+   unsigned long nr_pages)
+{
+   int priority;
+   /*
+* Free memory by calling shrink zone with increasing
+* priorities until we have enough memory freed.
+*/
+   priority = ZONE_RECLAIM_PRIORITY;
+   do {
+   shrink_zone(priority, zone, sc);
+   priority--;
+   } while (priority >= 0 && sc->nr_reclaimed < nr_pages);
+}
+
+/*
  * Try to free up some pages from this zone through reclaim.
  */
 static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int 
order)
@@ -2823,7 +2844,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t 
gfp_mask, unsigned int order)
const unsigned long nr_pages = 1 << order;
struct task_struct *p = current;
struct reclaim_state reclaim_state;
-   int priority;
struct scan_control sc = {
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2847,17 +2867,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t 
gfp_mask, unsigned int order)
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
 
-   if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
-   /*
-* Free memory by calling shrink zone with increasing
-* priorities until we have enough memory freed.
-*/
-   priority = ZONE_RECLAIM_PRIORITY;
-   do {
-   shrink_zone(priority, zone, &sc);
-   priority--;
-   } while (priority >= 0 && sc.nr_reclaimed < nr_pages);
-   }
+   if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages)
+   zone_reclaim_unmapped_pages(zone, &sc, nr_pages);
 
nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
if (nr_slab_pages0 > zone->min_slab_pages) {

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] Provide control over unmapped pages (v2)

2010-12-10 Thread Balbir Singh
Changelog v2
1. Use a config option to enable the code (Andrew Morton)
2. Explain the magic tunables in the code or at least attempt
   to explain them (General comment)
3. Hint uses of the boot parameter with unlikely (Andrew Morton)
4. Use better names (balanced is not a good naming convention)
5. Updated Documentation/kernel-parameters.txt (Andrew Morton)

Provide control using zone_reclaim() and a boot parameter. The
code reuses functionality from zone_reclaim() to isolate unmapped
pages and reclaim them as a priority, ahead of other mapped pages.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 Documentation/kernel-parameters.txt |8 +++
 include/linux/swap.h|   21 ++--
 init/Kconfig|   12 
 kernel/sysctl.c |2 +
 mm/page_alloc.c |9 +++
 mm/vmscan.c |   97 +++
 6 files changed, 142 insertions(+), 7 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index dd8fe2b..f52b0bd 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2515,6 +2515,14 @@ and is between 256 and 4096 characters. It is defined in 
the file
[X86]
Set unknown_nmi_panic=1 early on boot.
 
+   unmapped_page_control
+   [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
+   is enabled. It controls the amount of unmapped memory
+   that is present in the system. This boot option plus
+   vm.min_unmapped_ratio (sysctl) provide granular control
+   over how much unmapped page cache can exist in the 
system
+   before kswapd starts reclaiming unmapped page cache 
pages.
+
usbcore.autosuspend=
[USB] The autosuspend time delay (in seconds) used
for newly-detected USB devices (default 2).  This
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ac5c06e..773d7e5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -253,19 +253,32 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
 extern int sysctl_min_unmapped_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
-#ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
-extern int sysctl_min_slab_ratio;
 #else
-#define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
 {
return 0;
 }
 #endif
 
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+extern bool should_reclaim_unmapped_pages(struct zone *zone);
+#else
+static inline bool should_reclaim_unmapped_pages(struct zone *zone)
+{
+   return false;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+extern int zone_reclaim_mode;
+extern int sysctl_min_slab_ratio;
+#else
+#define zone_reclaim_mode 0
+#endif
+
 extern int page_evictable(struct page *page, struct vm_area_struct *vma);
 extern void scan_mapping_unevictable_pages(struct address_space *);
 
diff --git a/init/Kconfig b/init/Kconfig
index 3eb22ad..78c9169 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -782,6 +782,18 @@ endif # NAMESPACES
 config MM_OWNER
bool
 
+config UNMAPPED_PAGECACHE_CONTROL
+   bool Provide control over unmapped page cache
+   default n
+   help
+ This option adds support for controlling unmapped page cache
+ via a boot parameter (unmapped_page_control). The boot parameter
+ with sysctl (vm.min_unmapped_ratio) control the total number
+ of unmapped pages in the system. This feature is useful if
+ you want to limit the amount of unmapped page cache or want
+ to reduce page cache duplication in a virtualized environment.
+ If unsure say 'N'
+
 config SYSFS_DEPRECATED
bool enable deprecated sysfs features to support old userspace tools
depends on SYSFS
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e40040e..ab2c60a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1211,6 +1211,7 @@ static struct ctl_table vm_table[] = {
.extra1 = zero,
},
 #endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
{
.procname   = min_unmapped_ratio,
.data   = sysctl_min_unmapped_ratio,
@@ -1220,6 +1221,7 @@ static struct ctl_table vm_table[] = {
.extra1 = zero,
.extra2 = one_hundred,
},
+#endif
 #ifdef CONFIG_NUMA
{
.procname   = zone_reclaim_mode,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1845a97..1c9fbab 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1662,6 +1662,9

Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting

2010-12-09 Thread Balbir Singh
* Rik van Riel r...@redhat.com [2010-12-02 14:41:29]:

 When running SMP virtual machines, it is possible for one VCPU to be
 spinning on a spinlock, while the VCPU that holds the spinlock is not
 currently running, because the host scheduler preempted it to run
 something else.
 
 Both Intel and AMD CPUs have a feature that detects when a virtual
 CPU is spinning on a lock and will trap to the host.
 
 The current KVM code sleeps for a bit whenever that happens, which
 results in eg. a 64 VCPU Windows guest taking forever and a bit to
 boot up.  This is because the VCPU holding the lock is actually
 running and not sleeping, so the pause is counter-productive.
 
 In other workloads a pause can also be counter-productive, with
 spinlock detection resulting in one guest giving up its CPU time
 to the others.  Instead of spinning, it ends up simply not running
 much at all.
 
 This patch series aims to fix that, by having a VCPU that spins
 give the remainder of its timeslice to another VCPU in the same
 guest before yielding the CPU - one that is runnable but got 
 preempted, hopefully the lock holder.
 
 Scheduler people, please flame me with anything I may have done
 wrong, so I can do it right for a next version :)


This is a good problem statement, there are other things to consider
as well

1. If a hard limit feature is enabled underneath, donating the
timeslice would probably not make too much sense in that case
2. The implicit assumption is that spinning is bad, but for locks
held for short durations the assumption is not true. I presume
from the problem statement above that the h/w does the detection of
when to pause, but that is not always correct, as you suggest above.
3. With respect to donating timeslices, don't scheduler cgroups
and job isolation address that problem today?
 
-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] Provide control over unmapped pages

2010-12-04 Thread Balbir Singh
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-12-02 11:50:36]:

 On Thu,  2 Dec 2010 10:22:16 +0900 (JST)
 KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com wrote:
 
   On Tue, 30 Nov 2010, Andrew Morton wrote:
   
 +#define UNMAPPED_PAGE_RATIO 16
   
Well.  Giving 16 a name didn't really clarify anything.  Attentive
readers will want to know what this does, why 16 was chosen and what
the effects of changing it will be.
   
   The meaning is analoguous to the other zone reclaim ratio. But yes it
   should be justified and defined.
   
 Reviewed-by: Christoph Lameter c...@linux.com
   
So you're OK with shoving all this flotsam into 100,000,000 cellphones?
This was a pretty outrageous patchset!
   
   This is a feature that has been requested over and over for years. Using
   /proc/vm/drop_caches for fixing situations where one simply has too many
   page cache pages is not so much fun in the long run.
  
  I'm not against page cache limitation feature at all. But, this is
  too ugly and too destructive fast path. I hope this patch reduce negative
  impact more.
  
 
 And I think min_mapped_unmapped_pages is ugly. It should be
 unmapped_pagecache_limit or some because it's for limitation feature.


The feature will now be enabled with a CONFIG and boot parameter, I
find changing the naming convention now - it is already in use and
well known is not a good idea. THe name of the boot parameter can be
changed of-course. 

-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] Series short description

2010-11-30 Thread Balbir Singh
The following series implements page cache control.
It is a split-out version of patch 1 of version 3 of the
page cache optimization patches posted earlier at
http://www.mail-archive.com/kvm@vger.kernel.org/msg43654.html

Christoph Lameter recommended splitting out patch 1, which
is what this series does

Detailed Description

This patch implements unmapped page cache control via preferred
page cache reclaim. The current patch hooks into kswapd and reclaims
page cache if the user has requested for unmapped page control.
This is useful in the following scenario

- In a virtualized environment with cache=writethrough, we see
  double caching - (one in the host and one in the guest). As
  we try to scale guests, cache usage across the system grows.
  The goal of this patch is to reclaim page cache when Linux is running
  as a guest and get the host to hold the page cache and manage it.
  There might be temporary duplication, but in the long run, memory
  in the guests would be used for mapped pages.
- The option is controlled via a boot option and the administrator
  can selectively turn it on, on a need to use basis.

A lot of the code is borrowed from the zone_reclaim_mode logic in
__zone_reclaim(). One might argue that with ballooning and
KSM this feature is not very useful, but even with ballooning,
we need extra logic to balloon multiple VMs and it is hard
to figure out the correct amount of memory to balloon. With these
patches applied, each guest has a sufficient amount of free memory
available, that can be easily seen and reclaimed by the balloon driver.
The additional memory in the guest can be reused for additional
applications or used to start additional guests/balance memory in
the host.

KSM currently does not de-duplicate host and guest page cache. The goal
of this patch is to help automatically balance unmapped page cache when
instructed to do so.

There are some magic numbers in use in the code: UNMAPPED_PAGE_RATIO
and the number of pages to reclaim when the unmapped_page_control argument
is supplied. These numbers were chosen to avoid reaping page cache too
aggressively or too frequently, while at the same time providing control.

The sysctl for min_unmapped_ratio provides further control from
within the guest on the amount of unmapped pages to reclaim.



For a single VM - running kernbench

Enabled

Optimal load -j 8 run number 1...
Optimal load -j 8 run number 2...
Optimal load -j 8 run number 3...
Optimal load -j 8 run number 4...
Optimal load -j 8 run number 5...
Average Optimal load -j 8 Run (std deviation):
Elapsed Time 273.726 (1.2683)
User Time 190.014 (0.589941)
System Time 298.758 (1.72574)
Percent CPU 178 (0)
Context Switches 119953 (865.74)
Sleeps 38758 (795.074)

Disabled

Optimal load -j 8 run number 1...
Optimal load -j 8 run number 2...
Optimal load -j 8 run number 3...
Optimal load -j 8 run number 4...
Optimal load -j 8 run number 5...
Average Optimal load -j 8 Run (std deviation):
Elapsed Time 272.672 (0.453178)
User Time 189.7 (0.718157)
System Time 296.77 (0.845606)
Percent CPU 178 (0)
Context Switches 118822 (277.434)
Sleeps 37542.8 (545.922)

More data on the test results with the earlier patch is
at http://www.mail-archive.com/kvm@vger.kernel.org/msg43655.html

---

Balbir Singh (3):
  Move zone_reclaim() outside of CONFIG_NUMA
  Refactor zone_reclaim, move reusable functionality outside
  Provide control over unmapped pages


 include/linux/mmzone.h |4 +-
 include/linux/swap.h   |5 +-
 mm/page_alloc.c|7 ++-
 mm/vmscan.c|  109 +---
 4 files changed, 104 insertions(+), 21 deletions(-)

-- 
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA

2010-11-30 Thread Balbir Singh
This patch moves zone_reclaim and associated helpers
outside CONFIG_NUMA. This infrastructure is reused
in the patches for page cache control that follow.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 include/linux/mmzone.h |4 ++--
 mm/vmscan.c|2 --
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4890662..aeede91 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -302,12 +302,12 @@ struct zone {
 */
unsigned long   lowmem_reserve[MAX_NR_ZONES];
 
-#ifdef CONFIG_NUMA
-   int node;
/*
 * zone reclaim becomes active if more unmapped pages exist.
 */
unsigned long   min_unmapped_pages;
+#ifdef CONFIG_NUMA
+   int node;
unsigned long   min_slab_pages;
 #endif
struct per_cpu_pageset __percpu *pageset;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8cc90d5..325443a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2644,7 +2644,6 @@ static int __init kswapd_init(void)
 
 module_init(kswapd_init)
 
-#ifdef CONFIG_NUMA
 /*
  * Zone reclaim mode
  *
@@ -2854,7 +2853,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, 
unsigned int order)
 
return ret;
 }
-#endif
 
 /*
  * page_evictable - test whether a page is evictable

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] Refactor zone_reclaim

2010-11-30 Thread Balbir Singh
Refactor zone_reclaim, move reusable functionality outside
of zone_reclaim. Make zone_reclaim_unmapped_pages modular

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 mm/vmscan.c |   35 +++
 1 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 325443a..0ac444f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2719,6 +2719,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
 }
 
 /*
+ * Helper function to reclaim unmapped pages, we might add something
+ * similar to this for slab cache as well. Currently this function
+ * is shared with __zone_reclaim()
+ */
+static inline void
+zone_reclaim_unmapped_pages(struct zone *zone, struct scan_control *sc,
+   unsigned long nr_pages)
+{
+   int priority;
+   /*
+* Free memory by calling shrink zone with increasing
+* priorities until we have enough memory freed.
+*/
+   priority = ZONE_RECLAIM_PRIORITY;
+   do {
+   shrink_zone(priority, zone, sc);
+   priority--;
+   } while (priority >= 0 && sc->nr_reclaimed < nr_pages);
+}
+
+/*
  * Try to free up some pages from this zone through reclaim.
  */
 static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int 
order)
@@ -2727,7 +2748,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t 
gfp_mask, unsigned int order)
const unsigned long nr_pages = 1 << order;
struct task_struct *p = current;
struct reclaim_state reclaim_state;
-   int priority;
struct scan_control sc = {
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2751,17 +2771,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t 
gfp_mask, unsigned int order)
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
 
-   if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
-   /*
-* Free memory by calling shrink zone with increasing
-* priorities until we have enough memory freed.
-*/
-   priority = ZONE_RECLAIM_PRIORITY;
-   do {
-   shrink_zone(priority, zone, &sc);
-   priority--;
-   } while (priority >= 0 && sc.nr_reclaimed < nr_pages);
-   }
+   if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages)
+   zone_reclaim_unmapped_pages(zone, &sc, nr_pages);
 
nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
if (nr_slab_pages0 > zone->min_slab_pages) {

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] Provide control over unmapped pages

2010-11-30 Thread Balbir Singh
Provide control using zone_reclaim() and a boot parameter. The
code reuses functionality from zone_reclaim() to isolate unmapped
pages and reclaim them as a priority, ahead of other mapped pages.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 include/linux/swap.h |5 ++-
 mm/page_alloc.c  |7 +++--
 mm/vmscan.c  |   72 +-
 3 files changed, 79 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index eba53e7..78b0830 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -252,11 +252,12 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
-#ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
 extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
+extern bool should_balance_unmapped_pages(struct zone *zone);
+#ifdef CONFIG_NUMA
+extern int zone_reclaim_mode;
 #else
 #define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 62b7280..4228da3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1662,6 +1662,9 @@ zonelist_scan:
unsigned long mark;
int ret;
 
+   if (should_balance_unmapped_pages(zone))
+   wakeup_kswapd(zone, order);
+
mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
if (zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags))
@@ -4136,10 +4139,10 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat,
 
zone-spanned_pages = size;
zone-present_pages = realsize;
-#ifdef CONFIG_NUMA
-   zone-node = nid;
zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
/ 100;
+#ifdef CONFIG_NUMA
+   zone->node = nid;
zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
 #endif
zone-name = zone_names[j];
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0ac444f..98950f4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -145,6 +145,21 @@ static DECLARE_RWSEM(shrinker_rwsem);
 #define scanning_global_lru(sc)(1)
 #endif
 
+static unsigned long balance_unmapped_pages(int priority, struct zone *zone,
+   struct scan_control *sc);
+static int unmapped_page_control __read_mostly;
+
+static int __init unmapped_page_control_parm(char *str)
+{
+   unmapped_page_control = 1;
+   /*
+* XXX: Should we tweak swappiness here?
+*/
+   return 1;
+}
+__setup("unmapped_page_control", unmapped_page_control_parm);
+
+
 static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
  struct scan_control *sc)
 {
@@ -2223,6 +2238,12 @@ loop_again:
shrink_active_list(SWAP_CLUSTER_MAX, zone,
&sc, priority, 0);
 
+   /*
+* We do unmapped page balancing once here and once
+* below, so that we don't lose out
+*/
+   balance_unmapped_pages(priority, zone, &sc);
+
if (!zone_watermark_ok_safe(zone, order,
high_wmark_pages(zone), 0, 0)) {
end_zone = i;
@@ -2258,6 +2279,11 @@ loop_again:
continue;
 
sc.nr_scanned = 0;
+   /*
+* Balance unmapped pages upfront, this should be
+* really cheap
+*/
+   balance_unmapped_pages(priority, zone, &sc);
 
/*
 * Call soft limit reclaim before calling shrink_zone.
@@ -2491,7 +2517,8 @@ void wakeup_kswapd(struct zone *zone, int order)
pgdat->kswapd_max_order = order;
if (!waitqueue_active(&pgdat->kswapd_wait))
return;
-   if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
+   if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0) &&
+   !should_balance_unmapped_pages(zone))
return;

trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
@@ -2740,6 +2767,49 @@ zone_reclaim_unmapped_pages(struct zone *zone, struct 
scan_control *sc,
 }
 
 /*
+ * Routine to balance unmapped pages, inspired from the code under
+ * CONFIG_NUMA that does unmapped page and slab page control by keeping
+ * min_unmapped_pages

Re: [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA

2010-11-30 Thread Balbir Singh
* Balbir Singh bal...@linux.vnet.ibm.com [2010-12-01 10:04:08]:

 * Andrew Morton a...@linux-foundation.org [2010-11-30 14:23:38]:
 
  On Tue, 30 Nov 2010 15:45:12 +0530
  Balbir Singh bal...@linux.vnet.ibm.com wrote:
  
   This patch moves zone_reclaim and associated helpers
   outside CONFIG_NUMA. This infrastructure is reused
   in the patches for page cache control that follow.
   
  
  Thereby adding a nice dollop of bloat to everyone's kernel.  I don't
  think that is justifiable given that the audience for this feature is
  about eight people :(
  
  How's about CONFIG_UNMAPPED_PAGECACHE_CONTROL?
 
 
 OK, I'll add the config, but this code is enabled under CONFIG_NUMA
 today, so the bloat I agree is more for non NUMA users. I'll make
 CONFIG_UNMAPPED_PAGECACHE_CONTROL default if CONFIG_NUMA is set.
  
  Also this patch instantiates sysctl_min_unmapped_ratio and
  sysctl_min_slab_ratio on non-NUMA builds but fails to make those
  tunables actually tunable in procfs.  Changes to sysctl.c are
  needed.
  
 
 Oh! yeah.. I missed it while refactoring, my fault.
 
   Reviewed-by: Christoph Lameter c...@linux.com
  

My local MTA failed to deliver the message, trying again.

-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] Refactor zone_reclaim

2010-11-30 Thread Balbir Singh
* Balbir Singh bal...@linux.vnet.ibm.com [2010-12-01 10:16:34]:

 * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-12-01 10:23:29]:
 
  On Tue, 30 Nov 2010 15:45:55 +0530
  Balbir Singh bal...@linux.vnet.ibm.com wrote:
  
   Refactor zone_reclaim, move reusable functionality outside
   of zone_reclaim. Make zone_reclaim_unmapped_pages modular
   
   Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
  
  Why is this min_mapped_pages based on zone (IOW, per-zone) ?
 
 
 Kamezawa-San, this has been the design since before the refactoring (it
 is based on zone_reclaim_mode, with reclaim built on top of that).  I am
 reusing bits of existing technology. The advantage of it being
 per-zone is that it integrates well with kswapd. 


My local MTA failed to deliver the message, trying again. 

-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] Provide control over unmapped pages

2010-11-30 Thread Balbir Singh
* Balbir Singh bal...@linux.vnet.ibm.com [2010-12-01 10:24:21]:

 * Andrew Morton a...@linux-foundation.org [2010-11-30 14:25:09]:
 
  On Tue, 30 Nov 2010 15:46:31 +0530
  Balbir Singh bal...@linux.vnet.ibm.com wrote:
  
   Provide control using zone_reclaim() and a boot parameter. The
   code reuses functionality from zone_reclaim() to isolate unmapped
   pages and reclaim them as a priority, ahead of other mapped pages.
   
   Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
   ---
include/linux/swap.h |5 ++-
mm/page_alloc.c  |7 +++--
mm/vmscan.c  |   72 
   +-
3 files changed, 79 insertions(+), 5 deletions(-)
   
   diff --git a/include/linux/swap.h b/include/linux/swap.h
   index eba53e7..78b0830 100644
   --- a/include/linux/swap.h
   +++ b/include/linux/swap.h
   @@ -252,11 +252,12 @@ extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page 
   *page);
extern long vm_total_pages;

   -#ifdef CONFIG_NUMA
   -extern int zone_reclaim_mode;
extern int sysctl_min_unmapped_ratio;
extern int sysctl_min_slab_ratio;
  
  This change will need to be moved into the first patch.
  
 
 OK, will do, thanks for pointing it out
 
extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
   +extern bool should_balance_unmapped_pages(struct zone *zone);
   +#ifdef CONFIG_NUMA
   +extern int zone_reclaim_mode;
#else
#define zone_reclaim_mode 0
static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int 
   order)
   diff --git a/mm/page_alloc.c b/mm/page_alloc.c
   index 62b7280..4228da3 100644
   --- a/mm/page_alloc.c
   +++ b/mm/page_alloc.c
   @@ -1662,6 +1662,9 @@ zonelist_scan:
 unsigned long mark;
 int ret;

   + if (should_balance_unmapped_pages(zone))
   + wakeup_kswapd(zone, order);
  
  gack, this is on the page allocator fastpath, isn't it?  So
  99.% of the world's machines end up doing a pointless call to a
  pointless function which pointlessly tests a pointless global and
  pointlessly returns?  All because of some whacky KSM thing?
  
  The speed and space overhead of this code should be *zero* if
  !CONFIG_UNMAPPED_PAGECACHE_CONTROL and should be minimal if
  CONFIG_UNMAPPED_PAGECACHE_CONTROL=y.  The way to do the latter is to
  inline the test of unmapped_page_control into callers and only if it is
  true (and use unlikely(), please) do we call into the KSM gunk.
 
 
 Will do, should_balance_unmapped_pages() will be made a no-op in the
 absence of CONFIG_UNMAPPED_PAGECACHE_CONTROL
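 
 For what it's worth, a minimal sketch of the zero-overhead variant (this is
 essentially what later revisions of the series add to include/linux/swap.h;
 shown here only to illustrate the point, not as a final patch):
 
     #if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
     extern bool should_balance_unmapped_pages(struct zone *zone);
     #else
     static inline bool should_balance_unmapped_pages(struct zone *zone)
     {
             /* compiles away, so the allocator fast path keeps no extra branch */
             return false;
     }
     #endif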
  
   --- a/mm/vmscan.c
   +++ b/mm/vmscan.c
   @@ -145,6 +145,21 @@ static DECLARE_RWSEM(shrinker_rwsem);
#define scanning_global_lru(sc)  (1)
#endif

   +static unsigned long balance_unmapped_pages(int priority, struct zone 
   *zone,
   + struct scan_control *sc);
   +static int unmapped_page_control __read_mostly;
   +
   +static int __init unmapped_page_control_parm(char *str)
   +{
   + unmapped_page_control = 1;
   + /*
   +  * XXX: Should we tweak swappiness here?
   +  */
   + return 1;
   +}
   +__setup("unmapped_page_control", unmapped_page_control_parm);
  
  aw c'mon guys, everybody knows that when you add a kernel parameter you
  document it in Documentation/kernel-parameters.txt.
 
 Will do - feeling silly on missing it out, that is where reviews help.
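(For reference, the Documentation/kernel-parameters.txt entry could read
roughly as follows; the wording is illustrative, not the text that was
eventually posted.)

    unmapped_page_control
                    [KNL] Reclaim unmapped page cache aggressively via
                    kswapd. Intended for virtualized guests where the
                    host caches the same data (cache=writethrough or
                    cache=writeback). Off by default.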
 
  
static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
   struct scan_control *sc)
{
   @@ -2223,6 +2238,12 @@ loop_again:
 shrink_active_list(SWAP_CLUSTER_MAX, zone,
 sc, priority, 0);

   + /*
   +  * We do unmapped page balancing once here and once
   +  * below, so that we don't lose out
   +  */
   + balance_unmapped_pages(priority, zone, sc);
   +
 if (!zone_watermark_ok_safe(zone, order,
 high_wmark_pages(zone), 0, 0)) {
 end_zone = i;
   @@ -2258,6 +2279,11 @@ loop_again:
 continue;

 sc.nr_scanned = 0;
   + /*
   +  * Balance unmapped pages upfront, this should be
   +  * really cheap
   +  */
   + balance_unmapped_pages(priority, zone, sc);
  
  More unjustifiable overhead on a commonly-executed codepath.
 
 
 Will refactor with a CONFIG suggested above.
  
 /*
  * Call soft limit reclaim before calling shrink_zone.
   @@ -2491,7 +2517,8 @@ void wakeup_kswapd(struct zone *zone, int order)
 pgdat-kswapd_max_order = order

Re: [PATCH 3/3] Provide control over unmapped pages

2010-11-30 Thread Balbir Singh
* Balbir Singh bal...@linux.vnet.ibm.com [2010-12-01 10:46:32]:

 * KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com [2010-12-01 09:14:13]:
 
   Provide control using zone_reclaim() and a boot parameter. The
   code reuses functionality from zone_reclaim() to isolate unmapped
   pages and reclaim them as a priority, ahead of other mapped pages.
   
   Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
   ---
include/linux/swap.h |5 ++-
mm/page_alloc.c  |7 +++--
mm/vmscan.c  |   72 
   +-
3 files changed, 79 insertions(+), 5 deletions(-)
   
   diff --git a/include/linux/swap.h b/include/linux/swap.h
   index eba53e7..78b0830 100644
   --- a/include/linux/swap.h
   +++ b/include/linux/swap.h
   @@ -252,11 +252,12 @@ extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page 
   *page);
extern long vm_total_pages;

   -#ifdef CONFIG_NUMA
   -extern int zone_reclaim_mode;
extern int sysctl_min_unmapped_ratio;
extern int sysctl_min_slab_ratio;
extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
   +extern bool should_balance_unmapped_pages(struct zone *zone);
   +#ifdef CONFIG_NUMA
   +extern int zone_reclaim_mode;
#else
#define zone_reclaim_mode 0
static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int 
   order)
   diff --git a/mm/page_alloc.c b/mm/page_alloc.c
   index 62b7280..4228da3 100644
   --- a/mm/page_alloc.c
   +++ b/mm/page_alloc.c
   @@ -1662,6 +1662,9 @@ zonelist_scan:
 unsigned long mark;
 int ret;

   + if (should_balance_unmapped_pages(zone))
   + wakeup_kswapd(zone, order);
   +
  
  You don't have to add extra branch into fast path.
  
  
 mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 if (zone_watermark_ok(zone, order, mark,
 classzone_idx, alloc_flags))
   @@ -4136,10 +4139,10 @@ static void __paginginit 
   free_area_init_core(struct pglist_data *pgdat,

 zone-spanned_pages = size;
 zone-present_pages = realsize;
   -#ifdef CONFIG_NUMA
   - zone-node = nid;
 zone-min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
 / 100;
   +#ifdef CONFIG_NUMA
   + zone-node = nid;
 zone-min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
#endif
 zone-name = zone_names[j];
   diff --git a/mm/vmscan.c b/mm/vmscan.c
   index 0ac444f..98950f4 100644
   --- a/mm/vmscan.c
   +++ b/mm/vmscan.c
   @@ -145,6 +145,21 @@ static DECLARE_RWSEM(shrinker_rwsem);
#define scanning_global_lru(sc)  (1)
#endif

   +static unsigned long balance_unmapped_pages(int priority, struct zone 
   *zone,
   + struct scan_control *sc);
   +static int unmapped_page_control __read_mostly;
   +
   +static int __init unmapped_page_control_parm(char *str)
   +{
   + unmapped_page_control = 1;
   + /*
   +  * XXX: Should we tweak swappiness here?
   +  */
   + return 1;
   +}
   +__setup(unmapped_page_control, unmapped_page_control_parm);
   +
   +
static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
   struct scan_control *sc)
{
   @@ -2223,6 +2238,12 @@ loop_again:
 shrink_active_list(SWAP_CLUSTER_MAX, zone,
 sc, priority, 0);

   + /*
   +  * We do unmapped page balancing once here and once
   +  * below, so that we don't lose out
   +  */
   + balance_unmapped_pages(priority, zone, sc);
  
  You can't invoke any reclaim from here. It is in zone balancing detection
  phase. It mean your code reclaim pages from zones which has lots free pages 
  too.
 
 The goal is to check not only for zone_watermark_ok, but also to see
 if unmapped pages are way higher than expected values.
 
  
   +
 if (!zone_watermark_ok_safe(zone, order,
 high_wmark_pages(zone), 0, 0)) {
 end_zone = i;
   @@ -2258,6 +2279,11 @@ loop_again:
 continue;

 sc.nr_scanned = 0;
   + /*
   +  * Balance unmapped pages upfront, this should be
   +  * really cheap
   +  */
   + balance_unmapped_pages(priority, zone, sc);
  
  
  This code break page-cache/slab balancing logic. And this is conflict
  against Nick's per-zone slab effort.
 
 
 OK, cc'ing Nick for comments.
  
  Plus, high-order + priority=5 reclaim Simon's case. (see Free memory never 
  fully used, swapping threads)
 
 
 OK, this path should

Re: [PATCH 3/3] Provide control over unmapped pages

2010-11-30 Thread Balbir Singh
* Balbir Singh bal...@linux.vnet.ibm.com [2010-12-01 10:48:16]:

 * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-12-01 10:32:54]:
 
  On Tue, 30 Nov 2010 15:46:31 +0530
  Balbir Singh bal...@linux.vnet.ibm.com wrote:
  
   Provide control using zone_reclaim() and a boot parameter. The
   code reuses functionality from zone_reclaim() to isolate unmapped
   pages and reclaim them as a priority, ahead of other mapped pages.
   
   Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
   ---
include/linux/swap.h |5 ++-
mm/page_alloc.c  |7 +++--
mm/vmscan.c  |   72 
   +-
3 files changed, 79 insertions(+), 5 deletions(-)
   
   diff --git a/include/linux/swap.h b/include/linux/swap.h
   index eba53e7..78b0830 100644
   --- a/include/linux/swap.h
   +++ b/include/linux/swap.h
   @@ -252,11 +252,12 @@ extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page 
   *page);
extern long vm_total_pages;

   -#ifdef CONFIG_NUMA
   -extern int zone_reclaim_mode;
extern int sysctl_min_unmapped_ratio;
extern int sysctl_min_slab_ratio;
extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
   +extern bool should_balance_unmapped_pages(struct zone *zone);
   +#ifdef CONFIG_NUMA
   +extern int zone_reclaim_mode;
#else
#define zone_reclaim_mode 0
static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int 
   order)
   diff --git a/mm/page_alloc.c b/mm/page_alloc.c
   index 62b7280..4228da3 100644
   --- a/mm/page_alloc.c
   +++ b/mm/page_alloc.c
   @@ -1662,6 +1662,9 @@ zonelist_scan:
 unsigned long mark;
 int ret;

   + if (should_balance_unmapped_pages(zone))
   + wakeup_kswapd(zone, order);
   +
  
  Hm, I'm not sure the final vision of this feature. Does this reclaiming 
  feature
  can't be called directly via balloon driver just before alloc_page() ?
 
 
 That is a separate patch; this is a boot parameter based control
 approach.
  
  Do you need to keep page caches small even when there are free memory on 
  host ?
 
 
 The goal is to avoid duplication, as you know page cache fills itself
 to consume as much memory as possible. The host generally does not
 have a lot of free memory in a consolidated environment. 

My local MTA failed to deliver the message, trying again. 

-- 
Three Cheers,
Balbir


Re: [RFC][PATCH 1/3] Linux/Guest unmapped page cache control

2010-11-03 Thread Balbir Singh
* Christoph Lameter c...@linux.com [2010-11-03 09:35:33]:

 On Fri, 29 Oct 2010, Balbir Singh wrote:
 
  A lot of the code is borrowed from zone_reclaim_mode logic for
   __zone_reclaim(). One might argue that with ballooning and
  KSM this feature is not very useful, but even with ballooning,
 
 Interesting use of zone reclaim. I am having a difficult time reviewing
 the patch since you move and modify functions at the same time. Could you
 separate that out a bit?


Sure, I'll split it out into more readable bits and repost the mm
versions first.
 
  +#define UNMAPPED_PAGE_RATIO 16
 
 Maybe come up with a scheme that allows better configuration of the
 mininum? I think in some setting we may want an absolute limit and in
 other a fraction of something (total zone size or working set?)


Are you suggesting a sysctl or computation based on zone size and
limit, etc? I understand it to be the latter.
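(Either way, for illustration only, a tunable version of the check might look
roughly like this; the sysctl name and default are made up for the example,
they are not part of the posted patch.)

    int sysctl_unmapped_page_ratio __read_mostly = 16;  /* assumed knob */

    bool should_balance_unmapped_pages(struct zone *zone)
    {
            if (!unmapped_page_control)
                    return false;
            /* absolute floor from min_unmapped_pages, ratio from the sysctl */
            return zone_unmapped_file_pages(zone) >
                    sysctl_unmapped_page_ratio * zone->min_unmapped_pages;
    }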
 
 
  +bool should_balance_unmapped_pages(struct zone *zone)
  +{
   +   if (unmapped_page_control &&
   +   (zone_unmapped_file_pages(zone) >
   +   UNMAPPED_PAGE_RATIO * zone->min_unmapped_pages))
  +   return true;
  +   return false;
  +}
 

Thanks for your review.

-- 
Three Cheers,
Balbir


[RFC][PATCH 0/3] KVM page cache optimization (v3)

2010-10-28 Thread Balbir Singh
This is version 3 of the page cache control patches

From: Balbir Singh bal...@linux.vnet.ibm.com

This series has three patches, the first controls
the amount of unmapped page cache usage via a boot
parameter and sysctl. The second patch controls page
and slab cache via the balloon driver. Both the patches
make heavy use of the zone_reclaim() functionality
already present in the kernel.

The last patch in the series is against QEmu to make
the ballooning hint optional.

V2 was posted a long time back (see http://lwn.net/Articles/391293/)
One of the review suggestions was to make the hint optional
(discussed in the community call as well).

I'd appreciate any test results with the patches.

TODO

1. libvirt exploits for optional hint

page-cache-control
balloon-page-cache
provide-memory-hint-during-ballooning

---
 b/balloon.c   |   18 +++-
 b/balloon.h   |4
 b/drivers/virtio/virtio_balloon.c |   17 +++
 b/hmp-commands.hx |7 +
 b/hw/virtio-balloon.c |   14 ++-
 b/hw/virtio-balloon.h |3
 b/include/linux/gfp.h |8 +
 b/include/linux/mmzone.h  |2
 b/include/linux/swap.h|3
 b/include/linux/virtio_balloon.h  |3
 b/mm/page_alloc.c |9 +-
 b/mm/vmscan.c |  162 --
 b/qmp-commands.hx |7 -
 include/linux/swap.h  |9 --
 mm/page_alloc.c   |3
 mm/vmscan.c   |2
 16 files changed, 202 insertions(+), 69 deletions(-)


-- 
Three Cheers,
Balbir


[RFC][PATCH 1/3] Linux/Guest unmapped page cache control

2010-10-28 Thread Balbir Singh
Selectively control Unmapped Page Cache (nospam version)

From: Balbir Singh bal...@linux.vnet.ibm.com

This patch implements unmapped page cache control via preferred
page cache reclaim. The current patch hooks into kswapd and reclaims
page cache if the user has requested for unmapped page control.
This is useful in the following scenario

- In a virtualized environment with cache=writethrough, we see
  double caching - (one in the host and one in the guest). As
  we try to scale guests, cache usage across the system grows.
  The goal of this patch is to reclaim page cache when Linux is running
  as a guest and get the host to hold the page cache and manage it.
  There might be temporary duplication, but in the long run, memory
  in the guests would be used for mapped pages.
- The option is controlled via a boot option and the administrator
  can selectively turn it on, on a need to use basis.

A lot of the code is borrowed from zone_reclaim_mode logic for
__zone_reclaim(). One might argue that with ballooning and
KSM this feature is not very useful, but even with ballooning,
we need extra logic to balloon multiple VMs and it is hard
to figure out the correct amount of memory to balloon. With these
patches applied, each guest has a sufficient amount of free memory
available, that can be easily seen and reclaimed by the balloon driver.
The additional memory in the guest can be reused for additional
applications or used to start additional guests/balance memory in
the host.

KSM currently does not de-duplicate host and guest page cache. The goal
of this patch is to help automatically balance unmapped page cache when
instructed to do so.

There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO
and the number of pages to reclaim when unmapped_page_control argument
is supplied. These numbers were chosen to avoid reaping page cache too
aggressively or too frequently, while still providing control.

The sysctl for min_unmapped_ratio provides further control from
within the guest on the amount of unmapped pages to reclaim.
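The core of the mechanism, stripped of the kswapd plumbing, amounts to
something like the sketch below. This is a simplification for readability and
not the exact code in the patch; shrink_unmapped_only() is a stand-in name for
the zone_reclaim-style scan that isolates only clean, unmapped page cache.

    static unsigned long balance_unmapped_pages(int priority, struct zone *zone,
                                                struct scan_control *sc)
    {
            if (!should_balance_unmapped_pages(zone))
                    return 0;

            /* reclaim only the excess above the min_unmapped_pages floor */
            sc->nr_to_reclaim = zone_unmapped_file_pages(zone) -
                                    zone->min_unmapped_pages;
            return shrink_unmapped_only(priority, zone, sc); /* stand-in helper */
    }

In practice the feature is switched on by booting the guest with the
unmapped_page_control parameter and then tuned through the existing
vm.min_unmapped_ratio sysctl.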

Host Usage without boot parameter (memory in KB)

MemFree Cached Time
19900   292912 137
17540   296196 139
17900   296124 141
19356   296660 141

Host usage:  (memory in KB)

RSS Cache   mapped  swap
2788664 781884  3780359536

Guest Usage with boot parameter (memory in KB)
-
Memfree Cached   Time
244824  74828   144
237840  81764   143
235880  83044   138
239312  80092   148

Host usage: (memory in KB)

RSS Cache   mapped  swap
2700184 958012  334848  398412

TODOS
-
1. Balance slab cache as well

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---

 include/linux/mmzone.h |2 -
 include/linux/swap.h   |3 +
 mm/page_alloc.c|9 ++-
 mm/vmscan.c|  162 
 4 files changed, 132 insertions(+), 44 deletions(-)


diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3984c4e..a591a7a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -300,12 +300,12 @@ struct zone {
 */
unsigned long   lowmem_reserve[MAX_NR_ZONES];
 
+   unsigned long   min_unmapped_pages;
 #ifdef CONFIG_NUMA
int node;
/*
 * zone reclaim becomes active if more unmapped pages exist.
 */
-   unsigned long   min_unmapped_pages;
unsigned long   min_slab_pages;
 #endif
struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7cdd633..5d29097 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -251,10 +251,11 @@ extern unsigned long shrink_all_memory(unsigned long 
nr_pages);
 extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
+extern bool should_balance_unmapped_pages(struct zone *zone);
 
+extern int sysctl_min_unmapped_ratio;
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f12ad18..d8fe29f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1642,6 +1642,9 @@ zonelist_scan:
unsigned long mark;
int ret;
 
+   if (should_balance_unmapped_pages(zone))
+   wakeup_kswapd(zone, order);
+
	mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
if (zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags))
@@ -4101,10 +4104,10 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat,
 
zone-spanned_pages = size;
zone-present_pages = realsize;
-#ifdef CONFIG_NUMA

[RFC][PATCH 2/3] Linux/Guest cooperative unmapped page cache control

2010-10-28 Thread Balbir Singh
Balloon unmapped page cache pages first

From: Balbir Singh bal...@linux.vnet.ibm.com

This patch builds on the ballooning infrastructure by ballooning unmapped
page cache pages first. It looks for low hanging fruit first and tries
to reclaim clean unmapped pages first.

This patch brings zone_reclaim() and other dependencies out of CONFIG_NUMA
and then reuses the zone_reclaim_mode logic if __GFP_FREE_CACHE is passed
in the gfp_mask. The virtio balloon driver has been changed to use
__GFP_FREE_CACHE. During fill_balloon(), the driver looks for hints
provided by the hypervisor to reclaim cached memory. By default the hint
is off and can be turned on by passing an argument that specifies that
we intend to reclaim cached memory.

Tests:

Test 1
--
I ran a simple filter function that kept ballooning a single VM
running kernbench. The VM was configured with 2GB of memory and 2 VCPUs.
The filter was a triangular wave function that continuously ballooned
the VM under study between 500MB and 1500MB. The run times of the VM
with and without the changes are shown below and showed no significant
impact from the changes.

Withchanges

Elapsed Time 223.86 (1.52822)
User Time 191.01 (0.65395)
System Time 199.468 (2.43616)
Percent CPU 174 (1)
Context Switches 103182 (595.05)
Sleeps 39107.6 (1505.67)

Without changes

Elapsed Time 225.526 (2.93102)
User Time 193.53 (3.53626)
System Time 199.832 (3.26281)
Percent CPU 173.6 (1.14018)
Context Switches 103744 (1311.53)
Sleeps 39383.2 (831.865)

The key advantage was that it resulted in lower RSS usage in the host and
more cached usage, indicating that the caching had been pushed towards
the host. The guest cached memory usage was lower and free memory in
the guest was also higher.

Test 2
--
I ran kernbench under the memory overcommit manager (6 VMs with 2 vCPUs, 2GB)
with KSM and ksmtuned enabled. Memory overcommit manager details are at
http://github.com/aglitke/mom/wiki. The command line for kernbench was
kernbench -M.

The tests showed the following

Withchanges

Elapsed Time 842.936 (12.2247)
Elapsed Time 844.266 (25.8047)
Elapsed Time 844.696 (11.2433)
Elapsed Time 846.08 (14.0249)
Elapsed Time 838.58 (7.44609)
Elapsed Time 842.362 (4.37463)

Withoutchanges

Elapsed Time 837.604 (14.1311)
Elapsed Time 839.322 (17.1772)
Elapsed Time 843.744 (9.21541)
Elapsed Time 842.592 (7.48622)
Elapsed Time 844.272 (25.486)
Elapsed Time 838.858 (7.5044)

General observations

1. Free memory in each of the guests was higher with the changes.
   The additional free memory was of the order of 120MB per VM
2. Cached memory in each guest was lower with the changes
3. Host free memory was almost constant (independent of
   changes)
4. Host anonymous memory usage was lower with the changes

The goal of this patch is to free up memory locked up in
duplicated cache contents and (1) above shows that we are
able to successfully free it up.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---

 drivers/virtio/virtio_balloon.c |   17 +++--
 include/linux/gfp.h |8 +++-
 include/linux/swap.h|9 +++--
 include/linux/virtio_balloon.h  |3 +++
 mm/page_alloc.c |3 ++-
 mm/vmscan.c |2 +-
 6 files changed, 31 insertions(+), 11 deletions(-)


diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 0f1da45..70f97ea 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -99,12 +99,24 @@ static void tell_host(struct virtio_balloon *vb, struct 
virtqueue *vq)
 
 static void fill_balloon(struct virtio_balloon *vb, size_t num)
 {
+   u32 reclaim_cache_first;
+   int err;
+   gfp_t mask = GFP_HIGHUSER | __GFP_NORETRY | __GFP_NOMEMALLOC |
+   __GFP_NOWARN;
+
+   err = virtio_config_val(vb->vdev, VIRTIO_BALLOON_F_BALLOON_HINT,
+   offsetof(struct virtio_balloon_config,
+   reclaim_cache_first),
+   &reclaim_cache_first);
+
+   if (!err && reclaim_cache_first)
+   mask |= __GFP_FREE_CACHE;
+
/* We can only do one array worth at a time. */
num = min(num, ARRAY_SIZE(vb-pfns));
 
	for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns++) {
-   struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
-   __GFP_NOMEMALLOC | __GFP_NOWARN);
+   struct page *page = alloc_page(mask);
if (!page) {
if (printk_ratelimit())
			dev_printk(KERN_INFO, &vb->vdev->dev,
@@ -358,6 +370,7 @@ static void __devexit virtballoon_remove(struct 
virtio_device *vdev)
 static unsigned int features[] = {
VIRTIO_BALLOON_F_MUST_TELL_HOST,
VIRTIO_BALLOON_F_STATS_VQ,
+   VIRTIO_BALLOON_F_BALLOON_HINT,
 };
 
 static struct

[RFC][PATCH 3/3] QEmu changes to provide balloon hint

2010-10-28 Thread Balbir Singh
Provide memory hint during ballooning

From: Balbir Singh bal...@linux.vnet.ibm.com

This patch adds an optional hint to the qemu monitor balloon
command. The hint tells the guest operating system to consider
a class of memory during reclaim. Currently the supported
hint is cached memory. The design is generic and can be extended
to provide other hints in the future if required.
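With the patch applied, the hint is passed as an optional extra argument to
the existing monitor command, along these lines:

    (qemu) balloon 512          # resize the guest to 512 MB, as before
    (qemu) balloon 512 cache    # same target, but ask the guest to drop cached memory first

The guest may ignore the hint, so the command degrades gracefully on guest
kernels that do not understand it.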

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---

 balloon.c   |   18 ++
 balloon.h   |4 +++-
 hmp-commands.hx |7 +--
 hw/virtio-balloon.c |   15 +++
 hw/virtio-balloon.h |3 +++
 qmp-commands.hx |7 ---
 6 files changed, 40 insertions(+), 14 deletions(-)


diff --git a/balloon.c b/balloon.c
index 0021fef..b2bdda5 100644
--- a/balloon.c
+++ b/balloon.c
@@ -41,11 +41,13 @@ void qemu_add_balloon_handler(QEMUBalloonEvent *func, void 
*opaque)
 qemu_balloon_event_opaque = opaque;
 }
 
-int qemu_balloon(ram_addr_t target, MonitorCompletion cb, void *opaque)
+int qemu_balloon(ram_addr_t target, bool reclaim_cache_first,
+ MonitorCompletion cb, void *opaque)
 {
 if (qemu_balloon_event) {
 trace_balloon_event(qemu_balloon_event_opaque, target);
-qemu_balloon_event(qemu_balloon_event_opaque, target, cb, opaque);
+qemu_balloon_event(qemu_balloon_event_opaque, target,
+   reclaim_cache_first, cb, opaque);
 return 1;
 } else {
 return 0;
@@ -55,7 +57,7 @@ int qemu_balloon(ram_addr_t target, MonitorCompletion cb, 
void *opaque)
 int qemu_balloon_status(MonitorCompletion cb, void *opaque)
 {
 if (qemu_balloon_event) {
-qemu_balloon_event(qemu_balloon_event_opaque, 0, cb, opaque);
+qemu_balloon_event(qemu_balloon_event_opaque, 0, 0, cb, opaque);
 return 1;
 } else {
 return 0;
@@ -131,13 +133,21 @@ int do_balloon(Monitor *mon, const QDict *params,
   MonitorCompletion cb, void *opaque)
 {
 int ret;
+int val;
+const char *cache_hint;
+int reclaim_cache_first = 0;
 
    if (kvm_enabled() && !kvm_has_sync_mmu()) {
 qerror_report(QERR_KVM_MISSING_CAP, synchronous MMU, balloon);
 return -1;
 }
 
-ret = qemu_balloon(qdict_get_int(params, value), cb, opaque);
+val = qdict_get_int(params, value);
+cache_hint = qdict_get_try_str(params, hint);
+if (cache_hint)
+reclaim_cache_first = 1;
+
+ret = qemu_balloon(val, reclaim_cache_first, cb, opaque);
 if (ret == 0) {
 qerror_report(QERR_DEVICE_NOT_ACTIVE, balloon);
 return -1;
diff --git a/balloon.h b/balloon.h
index d478e28..65d68c1 100644
--- a/balloon.h
+++ b/balloon.h
@@ -17,11 +17,13 @@
 #include monitor.h
 
 typedef void (QEMUBalloonEvent)(void *opaque, ram_addr_t target,
+bool reclaim_cache_first,
 MonitorCompletion cb, void *cb_data);
 
 void qemu_add_balloon_handler(QEMUBalloonEvent *func, void *opaque);
 
-int qemu_balloon(ram_addr_t target, MonitorCompletion cb, void *opaque);
+int qemu_balloon(ram_addr_t target, bool reclaim_cache_first,
+ MonitorCompletion cb, void *opaque);
 
 int qemu_balloon_status(MonitorCompletion cb, void *opaque);
 
diff --git a/hmp-commands.hx b/hmp-commands.hx
index 81999aa..80e42aa 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -925,8 +925,8 @@ ETEXI
 
 {
 .name   = balloon,
-.args_type  = value:M,
-.params = target,
+.args_type  = value:M,hint:s?,
+.params = target [cache],
 .help   = request VM to change its memory allocation (in MB),
 .user_print = monitor_user_noop,
 .mhandler.cmd_async = do_balloon,
@@ -937,6 +937,9 @@ STEXI
 @item balloon @var{value}
 @findex balloon
 Request VM to change its memory allocation to @var{value} (in MB).
+An optional @var{hint} can be specified to indicate if the guest
+should reclaim from the cached memory in the guest first. The
+...@var{hint} may be ignored by the guest.
 ETEXI
 
 {
diff --git a/hw/virtio-balloon.c b/hw/virtio-balloon.c
index 8adddea..e363507 100644
--- a/hw/virtio-balloon.c
+++ b/hw/virtio-balloon.c
@@ -44,6 +44,7 @@ typedef struct VirtIOBalloon
 size_t stats_vq_offset;
 MonitorCompletion *stats_callback;
 void *stats_opaque_callback_data;
+uint32_t reclaim_cache_first;
 } VirtIOBalloon;
 
 static VirtIOBalloon *to_virtio_balloon(VirtIODevice *vdev)
@@ -181,8 +182,11 @@ static void virtio_balloon_get_config(VirtIODevice *vdev, 
uint8_t *config_data)
 
    config.num_pages = cpu_to_le32(dev->num_pages);
    config.actual = cpu_to_le32(dev->actual);
-
-memcpy(config_data, &config, 8);
+if (vdev->guest_features & (1 << VIRTIO_BALLOON_F_BALLOON_HINT)) {
+config.reclaim_cache_first = cpu_to_le32(dev->reclaim_cache_first);
+memcpy(config_data, &config, 12);
+} else
+memcpy(config_data, config, 8

Re: [PATCH] kvm: add oom notifier for virtio balloon

2010-10-08 Thread Balbir Singh
* Dave Young hidave.darks...@gmail.com [2010-10-05 20:45:21]:

 Balloon could cause guest memory oom killing and panic.
 
 Add oom notify to leak some memory and retry fill balloon after 5 minutes.
 
 At the same time add a mutex to protect balloon operations
 because we need leak balloon in oom notifier and give back freed value. 
 
 Thanks Anthony Liguori for his suggestion about inflate retrying.
 Sometimes it will cause endless inflate/oom/delay loop,
 so I think next step is to add an option to do noretry-when-oom balloon.
 
 Signed-off-by: Dave Young hidave.darks...@gmail.com

Won't __GFP_NORETRY prevent OOM? Could you please describe how you
tested the patch?

-- 
Three Cheers,
Balbir


Re: [PATCH] kvm: add oom notifier for virtio balloon

2010-10-08 Thread Balbir Singh
* Dave Young hidave.darks...@gmail.com [2010-10-08 21:33:02]:

 On Fri, Oct 8, 2010 at 9:09 PM, Balbir Singh bal...@linux.vnet.ibm.com 
 wrote:
  * Dave Young hidave.darks...@gmail.com [2010-10-05 20:45:21]:
 
  Balloon could cause guest memory oom killing and panic.
 
  Add oom notify to leak some memory and retry fill balloon after 5 minutes.
 
  At the same time add a mutex to protect balloon operations
  because we need leak balloon in oom notifier and give back freed value.
 
  Thanks Anthony Liguori for his sugestion about inflate retrying.
  Sometimes it will cause endless inflate/oom/delay loop,
  so I think next step is to add an option to do noretry-when-oom balloon.
 
  Signed-off-by: Dave Young hidave.darks...@gmail.com
 
  Won't __GFP_NORETRY prevent OOM? Could you please describe how you
  tested the patch?
 
 I have not tried __GFP_NORETRY; it should work, but the balloon thread
 will keep wasting CPU on allocation attempts.
 
 
 To test the patch, just balloon down to less than the minimal memory.
 
 I use balloon 30 in qemu monitor to limit slackware guest memory
 usage. The normal memory used is ~40M.
 
 Actually we need to differentiate which process caused the OOM. If it
 is the balloon thread we should just stop ballooning; if it is another
 process we can do something like this patch, e.g. retry ballooning
 after 5 minutes.

Ideally the balloon thread should never OOM with __GFP_NORETRY (IIRC).
The other situation should be dealt with: we should free up any pages
we have. I wonder if the timeout should be a sysctl tunable.
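(To make the discussion concrete, the notifier side of such a patch would look
roughly like the sketch below. The oom_nb field and the batch size are
assumptions for illustration; this is not the actual posted patch.)

    #include <linux/oom.h>

    #define BALLOON_OOM_PAGES 256   /* assumed batch size, illustration only */

    static int virtballoon_oom_notify(struct notifier_block *nb,
                                      unsigned long dummy, void *parm)
    {
            struct virtio_balloon *vb = container_of(nb, struct virtio_balloon,
                                                     oom_nb); /* assumed field */
            unsigned long *freed = parm;

            /* give some pages back to the guest; accounting is simplified here */
            leak_balloon(vb, BALLOON_OOM_PAGES);
            *freed += BALLOON_OOM_PAGES;
            return NOTIFY_OK;
    }

    /* at probe time:
     *      vb->oom_nb.notifier_call = virtballoon_oom_notify;
     *      register_oom_notifier(&vb->oom_nb);
     */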

-- 
Three Cheers,
Balbir


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-17 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-06-16 14:39:02]:

 We're talking about an environment which we're always trying to
 optimize.  Imagine that we're always trying to consolidate guests on to
 smaller numbers of hosts.  We're effectively in a state where we
 _always_ want new guests.
 
 If this came at no cost to the guests, you'd be right.  But at some
 point guest performance will be hit by this, so the advantage gained
 from freeing memory will be balanced by the disadvantage.
 
 Also, memory is not the only resource.  At some point you become cpu
 bound; at that point freeing memory doesn't help and in fact may
 increase your cpu load.


We'll probably need control over other resources as well, but IMHO
memory is the most precious because it is non-renewable. 

-- 
Three Cheers,
Balbir


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-15 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-06-15 09:58:33]:

 On 06/14/2010 08:45 PM, Balbir Singh wrote:
 
 There are two decisions that need to be made:
 
 - how much memory a guest should be given
 - given some guest memory, what's the best use for it
 
 The first question can perhaps be answered by looking at guest I/O
 rates and giving more memory to more active guests.  The second
 question is hard, but not any different than running non-virtualized
 - except if we can detect sharing or duplication.  In this case,
 dropping a duplicated page is worthwhile, while dropping a shared
 page provides no benefit.
 I think there is another way of looking at it, give some free memory
 
 1. Can the guest run more applications or run faster
 
 That's my second question.  How to best use this memory.  More
 applications == drop the page from cache, faster == keep page in
 cache.
 
 All we need is to select the right page to drop.


Do we need to decide down to the granularity of which page to drop? I think
figuring out the class of pages, and making sure that we don't write
our own reclaim logic but work with what we have to identify that
class of pages, is a good start. 
 
 2. Can the host potentially get this memory via ballooning or some
 other means to start newer guest instances
 
 Well, we already have ballooning.  The question is can we improve
 the eviction algorithm.
 
 I think the answer to 1 and 2 is yes.
 
 How the patch helps answer either question, I'm not sure.  I don't
 think preferential dropping of unmapped page cache is the answer.
 
 Preferential dropping as selected by the host, that knows about the
 setup and if there is duplication involved. While we use the term
 preferential dropping, remember it is still via LRU and we don't
 always succeed. It is a best effort (if you can and the unmapped pages
 are not highly referenced) scenario.
 
 How can the host tell if there is duplication?  It may know it has
 some pagecache, but it has no idea whether or to what extent guest
 pagecache duplicates host pagecache.
 

Well, it is possible in host user space. I, for example, use the memory
cgroup, and through its stats I have a good idea of how much is duplicated.
I am of course making an assumption with my setup of the cached mode:
that the data in the guest page cache and the page cache in the cgroup
will be duplicated to a large extent. I did some trivial experiments,
like dropping the data from the guest and looking at the cost of bringing it
back in, versus dropping the data from both guest and host and looking at
that cost. I could see a difference.

Unfortunately, I did not save the data, so I'll need to redo the
experiment.

 Those tell you how to balance going after the different classes of
 things that we can reclaim.
 
 Again, this is useless when ballooning is being used.  But, I'm thinking
 of a more general mechanism to force the system to both have MemFree
 _and_ be acting as if it is under memory pressure.
 If there is no memory pressure on the host, there is no reason for
 the guest to pretend it is under pressure.  If there is memory
 pressure on the host, it should share the pain among its guests by
 applying the balloon.  So I don't think voluntarily dropping cache
 is a good direction.
 
 There are two situations
 
 1. Voluntarily drop cache, if it was setup to do so (the host knows
 that it caches that information anyway)
 
 It doesn't, really.  The host only has aggregate information about
 itself, and no information about the guest.
 
 Dropping duplicate pages would be good if we could identify them.
 Even then, it's better to drop the page from the host, not the
 guest, unless we know the same page is cached by multiple guests.


On the exact pages to drop, please see my comments above on the class
of pages to drop.
There are reasons for wanting to get the host to cache the data:

- Unless the guest is using cache=none, the data will still hit the
  host page cache
- The host can do a better job of optimizing the writeouts
 
 But why would the guest voluntarily drop the cache?  If there is no
 memory pressure, dropping caches increases cpu overhead and latency
 even if the data is still cached on the host.
 

So, there are basically two approaches

1. First patch, proactive - enabled by a boot option
2. When ballooned, we try to (please NOTE try to) reclaim cached pages
first. Failing which, we go after regular pages in the alloc_page()
call in the balloon driver.

 2. Drop the cache on either a special balloon option, again the host
 knows it caches that very same information, so it prefers to free that
 up first.
 
 Dropping in response to pressure is good.  I'm just not convinced
 the patch helps in selecting the correct page to drop.


That is why I've presented data on the experiments I've run and
provided more arguments to back up the approach. 

-- 
Three Cheers,
Balbir

Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-15 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-06-15 10:12:44]:

 On 06/14/2010 08:16 PM, Balbir Singh wrote:
 * Dave Hansend...@linux.vnet.ibm.com  [2010-06-14 10:09:31]:
 
 On Mon, 2010-06-14 at 22:28 +0530, Balbir Singh wrote:
 If you've got duplicate pages and you know
 that they are duplicated and can be retrieved at a lower cost, why
 wouldn't we go after them first?
 I agree with this in theory.  But, the guest lacks the information about
 what is truly duplicated and what the costs are for itself and/or the
 host to recreate it.  Unmapped page cache may be the best proxy that
 we have at the moment for easy to recreate, but I think it's still too
 poor a match to make these patches useful.
 
 That is why the policy (in the next set) will come from the host. As
 to whether the data is truly duplicated, my experiments show up to 60%
 of the page cache is duplicated.
 
 Isn't that incredibly workload dependent?
 
 We can't expect the host admin to know whether duplication will
 occur or not.


I was referring to the cache=<policy> setting we use based on the setup. I don't
think the duplication is too workload specific. Moreover, we could use
aggressive policies and restrict page cache usage or do it selectively
on ballooning. We could also add other options to make the ballooning
option truly optional, so that the system management software decides. 

-- 
Three Cheers,
Balbir


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-15 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-06-15 12:44:31]:

 On 06/15/2010 10:49 AM, Balbir Singh wrote:
 
 All we need is to select the right page to drop.
 
 Do we need to drop to the granularity of the page to drop? I think
 figuring out the class of pages and making sure that we don't write
 our own reclaim logic, but work with what we have to identify the
 class of pages is a good start.
 
 Well, the class of pages are 'pages that are duplicated on the
 host'.  Unmapped page cache pages are 'pages that might be
 duplicated on the host'.  IMO, that's not close enough.


Agreed, but what happens in reality with the code is that it drops
not-so-frequently-used cache (still reusing the reclaim mechanism),
but prioritizing cached memory.
 
 How can the host tell if there is duplication?  It may know it has
 some pagecache, but it has no idea whether or to what extent guest
 pagecache duplicates host pagecache.
 
 Well it is possible in host user space, I for example use memory
 cgroup and through the stats I have a good idea of how much is duplicated.
 I am ofcourse making an assumption with my setup of the cached mode,
 that the data in the guest page cache and page cache in the cgroup
 will be duplicated to a large extent. I did some trivial experiments
 like drop the data from the guest and look at the cost of bringing it
 in and dropping the data from both guest and host and look at the
 cost. I could see a difference.
 
 Unfortunately, I did not save the data, so I'll need to redo the
 experiment.
 
 I'm sure we can detect it experimentally, but how do we do it
 programatically at run time (without dropping all the pages).
 Situations change, and I don't think we can infer from a few
 experiments that we'll have a similar amount of sharing.  The cost
 of an incorrect decision is too high IMO (not that I think the
 kernel always chooses the right pages now, but I'd like to avoid
 regressions from the unvirtualized state).
 
 btw, when running with a disk controller that has a very large
 cache, we might also see duplication between guest and host.  So,
 if this is a good idea, it shouldn't be enabled just for
 virtualization, but for any situation where we have a sizeable cache
 behind us.
 

It depends, once the disk controller has the cache and the pages in
the guest are not-so-frequently-used we can drop them. Please remember
we still use the LRU to identify these pages.

 It doesn't, really.  The host only has aggregate information about
 itself, and no information about the guest.
 
 Dropping duplicate pages would be good if we could identify them.
 Even then, it's better to drop the page from the host, not the
 guest, unless we know the same page is cached by multiple guests.
 
 On the exact pages to drop, please see my comments above on the class
 of pages to drop.
 
 Well, we disagree about that.  There is some value in dropping
 duplicated pages (not always), but that's not what the patch does.
 It drops unmapped pagecache pages, which may or may not be
 duplicated.
 
 There are reasons for wanting to get the host to cache the data
 
 There are also reasons to get the guest to cache the data - it's
 more efficient to access it in the guest.
 
 Unless the guest is using cache = none, the data will still hit the
 host page cache
 The host can do a better job of optimizing the writeouts
 
 True, especially for non-raw storage.  But even there we have to
 fsync all the time to keep the metadata right.
 
 But why would the guest voluntarily drop the cache?  If there is no
 memory pressure, dropping caches increases cpu overhead and latency
 even if the data is still cached on the host.
 
 So, there are basically two approaches
 
 1. First patch, proactive - enabled by a boot option
 2. When ballooned, we try to (please NOTE try to) reclaim cached pages
 first. Failing which, we go after regular pages in the alloc_page()
 call in the balloon driver.
 
 Doesn't that mean you may evict a RU mapped page ahead of an LRU
 unmapped page, just in the hope that it is double-cached?
 
 Maybe we need the guest and host to talk to each other about which
 pages to keep.
 

Yeah.. I guess that falls into the domain of CMM.

 2. Drop the cache on either a special balloon option, again the host
 knows it caches that very same information, so it prefers to free that
 up first.
 Dropping in response to pressure is good.  I'm just not convinced
 the patch helps in selecting the correct page to drop.
 
 That is why I've presented data on the experiments I've run and
 provided more arguments to backup the approach.
 
 I'm still unconvinced, sorry.
 

The reason for making this optional is to let the administrators
decide how they want to use the memory in the system. In some
situations it might be a big no-no to waste memory, in some cases it
might be acceptable. 

-- 
Three Cheers,
Balbir

Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-15 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-06-15 12:54:31]:

 On 06/15/2010 10:52 AM, Balbir Singh wrote:
 
 That is why the policy (in the next set) will come from the host. As
 to whether the data is truly duplicated, my experiments show up to 60%
 of the page cache is duplicated.
 Isn't that incredibly workload dependent?
 
 We can't expect the host admin to know whether duplication will
 occur or not.
 
 I was referring to cache = (policy) we use based on the setup. I don't
 think the duplication is too workload specific. Moreover, we could use
 aggressive policies and restrict page cache usage or do it selectively
 on ballooning. We could also add other options to make the ballooning
 option truly optional, so that the system management software decides.
 
 Consider a read-only workload that exactly fits in guest cache.
 Without trimming, the guest will keep hitting its own cache, and the
 host will see no access to the cache at all.  So the host (assuming
 it is under even low pressure) will evict those pages, and the guest
 will happily use its own cache.  If we start to trim, the guest will
 have to go to disk.  That's the best case.

 Now for the worst case.  A random access workload that misses the
 cache on both guest and host.  Now every page is duplicated, and
 trimming guest pages allows the host to increase its cache, and
 potentially reduce misses.  In this case trimming duplicated pages
 works.
 
 Real life will see a mix of this.  Often used pages won't be
 duplicated, and less often used pages may see some duplication,
 especially if the host cache portion dedicated to the guest is
 bigger than the guest cache.
 
 I can see that trimming duplicate pages helps, but (a) I'd like to
 be sure they are duplicates and (b) often trimming them from the
 host is better than trimming them from the guest.


Let's see the behaviour with these patches.

The first patch is a proactive approach to keep more memory around.
Enabling the parameter implies we are OK paying the cost of some
overhead. My data shows that it leaves a significant amount of free
memory with a small (5% in my case) overhead. This brings us back to
what you can do with free memory.

The second patch shows no overhead and, on memory pressure (as indicated
by the balloon driver), selectively tries to give back free cache.
We've discussed the reasons for doing this:

1. In the situations where cache is duplicated this should benefit
us. Your contention is that we need to be specific about the
duplication. That falls under the realm of CMM.
2. In the case of slab cache, duplication does not matter, it is a
free page, that should be reclaimed ahead of mapped pages ideally.
If the slab grows, it will get another new page.

What is the cost of (1)?

In the worst case, we select a non-duplicated page; but for us to
select it, it should be inactive, and in that case we do I/O to bring
the page back.

 Trimming from the guest is worthwhile if the pages are not used very
 often (but enough that caching them in the host is worth it) and if
 the host cache can serve more than one guest.  If we can identify
 those pages, we don't risk degrading best-case workloads (as defined
 above).
 
 (note ksm to some extent identifies those pages, though it is a bit
 expensive, and doesn't share with the host pagecache).


I see that you are hinting towards finding exact duplicates, I don't
know if the cost and complexity justify it. I hope more users can try
the patches with and without the boot parameter and provide additional
feedback.

-- 
Three Cheers,
Balbir


Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control

2010-06-14 Thread Balbir Singh
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-06-14 09:28:19]:

 On Mon, 14 Jun 2010 00:01:45 +0530
 Balbir Singh bal...@linux.vnet.ibm.com wrote:
 
  * Balbir Singh bal...@linux.vnet.ibm.com [2010-06-08 21:21:46]:
  
   Selectively control Unmapped Page Cache (nospam version)
   
   From: Balbir Singh bal...@linux.vnet.ibm.com
   
   This patch implements unmapped page cache control via preferred
   page cache reclaim. The current patch hooks into kswapd and reclaims
   page cache if the user has requested for unmapped page control.
   This is useful in the following scenario
   
   - In a virtualized environment with cache=writethrough, we see
 double caching - (one in the host and one in the guest). As
 we try to scale guests, cache usage across the system grows.
 The goal of this patch is to reclaim page cache when Linux is running
 as a guest and get the host to hold the page cache and manage it.
 There might be temporary duplication, but in the long run, memory
 in the guests would be used for mapped pages.
   - The option is controlled via a boot option and the administrator
 can selectively turn it on, on a need to use basis.
   
   A lot of the code is borrowed from zone_reclaim_mode logic for
   __zone_reclaim(). One might argue that the with ballooning and
   KSM this feature is not very useful, but even with ballooning,
   we need extra logic to balloon multiple VM machines and it is hard
   to figure out the correct amount of memory to balloon. With these
   patches applied, each guest has a sufficient amount of free memory
   available, that can be easily seen and reclaimed by the balloon driver.
   The additional memory in the guest can be reused for additional
   applications or used to start additional guests/balance memory in
   the host.
   
   KSM currently does not de-duplicate host and guest page cache. The goal
   of this patch is to help automatically balance unmapped page cache when
   instructed to do so.
   
   There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO
   and the number of pages to reclaim when unmapped_page_control argument
   is supplied. These numbers were chosen to avoid aggressiveness in
   reaping page cache ever so frequently, at the same time providing control.
   
   The sysctl for min_unmapped_ratio provides further control from
   within the guest on the amount of unmapped pages to reclaim.
  
  
  Are there any major objections to this patch?
   
 
 This kind of patch needs how it works well measurement.
 
 - How did you measure the effect of the patch ? kernbench is not enough, of 
 course.

I can run other benchmarks as well, I will do so

 - Why don't you believe LRU ? And if LRU doesn't work well, should it be
   fixed by a knob rather than generic approach ?
 - No side effects ?

I believe in LRU; it is just that the problem I am trying to solve is
using double the memory for caching the same data (consider kvm
running in cache=writethrough or writeback mode, where both the hypervisor
and the guest OS maintain a page cache of the same data). As the VMs
grow the overhead is substantial. In my runs I found up to 60%
duplication in some cases.

 
 - Linux vm guys tend to say, free memory is bad memory. ok, for what
   free memory created by your patch is used ? IOW, I can't see the benefit.
   If free memory that your patch created will be used for another page-cache,
   it will be dropped soon by your patch itself.
 

Free memory is good for cases when you want to do more in the same
system. I agree that in a bare metal environment that might be
partially true. I don't have a problem with frequently used data being
cached, but I am targeting a consolidated environment at the moment.
Moreover, the administrator has control via a boot option, so it is
non-intrusive in many ways.

   If your patch just drops duplicated, but no more necessary for other kvm,
   I agree your patch may increase available size of page-caches. But you just
   drops unmapped pages.


Unmapped and unused pages are the best targets; I plan to add slab cache
control later. 

-- 
Three Cheers,
Balbir


Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control

2010-06-14 Thread Balbir Singh
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-06-14 16:00:21]:

 On Mon, 14 Jun 2010 12:19:55 +0530
 Balbir Singh bal...@linux.vnet.ibm.com wrote:
   - Why don't you believe LRU ? And if LRU doesn't work well, should it be
 fixed by a knob rather than generic approach ?
   - No side effects ?
  
  I believe in LRU, just that the problem I am trying to solve is of
  using double the memory for caching the same data (consider kvm
  running in cache=writethrough or writeback mode, both the hypervisor
  and the guest OS maintain a page cache of the same data). As the VM's
  grow the overhead is substantial. In my runs I found upto 60%
  duplication in some cases.
  
  
  - Linux vm guys tend to say, free memory is bad memory. ok, for what
free memory created by your patch is used ? IOW, I can't see the benefit.
If free memory that your patch created will be used for another 
  page-cache,
it will be dropped soon by your patch itself.
  
  Free memory is good for cases when you want to do more in the same
  system. I agree that in a bare metail environment that might be
  partially true. I don't have a problem with frequently used data being
  cached, but I am targetting a consolidated environment at the moment.
  Moreover, the administrator has control via a boot option, so it is
  non-instrusive in many ways.
 
 It sounds that what you want is to improve performance etc. but to make it
 easy sizing the system and to help admins. Right ?


Right: to free up the memory wasted on caching the same data twice.
 
 From performance perspective, I don't see any advantage to drop caches
 which can be dropped easily. I just use cpus for the purpose it may no
 be necessary.
 

It is not that easy; in a virtualized environment you do not reclaim
directly, but use a mechanism like ballooning, and that too requires
smart software to decide where to balloon from. This patch (optionally,
if enabled) improves on that by

1. Reducing double caching
2. Not requiring new smarts or management software to monitor and
balloon
3. Allowing better estimation of free memory by avoiding double caching
4. Allowing immediate use of free memory for other applications or
startup of new guest instances.

-- 
Three Cheers,
Balbir


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-06-14 11:09:44]:

 On 06/11/2010 07:56 AM, Balbir Singh wrote:
 
 Just to be clear, let's say we have a mapped page (say of /sbin/init)
 that's been unreferenced since _just_ after the system booted.  We also
 have an unmapped page cache page of a file often used at runtime, say
 one from /etc/resolv.conf or /etc/passwd.
 
 Which page will be preferred for eviction with this patch set?
 
 In this case the order is as follows
 
 1. First we pick free pages if any
 2. If we don't have free pages, we go after unmapped page cache and
 slab cache
 3. If that fails as well, we go after regularly memory
 
 In the scenario that you describe, we'll not be able to easily free up
 the frequently referenced page from /etc/*. The code will move on to
 step 3 and do its regular reclaim.
 
 Still it seems to me you are subverting the normal order of reclaim.
 I don't see why an unmapped page cache or slab cache item should be
 evicted before a mapped page.  Certainly the cost of rebuilding a
 dentry compared to the gain from evicting it, is much higher than
 that of reestablishing a mapped page.


Subverting to avoid memory duplication; the word subverting is
overloaded, so let me try to reason a bit. First let me explain the
problem.

Memory is a precious resource in a consolidated environment.
We don't want to waste memory via page cache duplication
(cache=writethrough and cache=writeback mode).

Now here is what we are trying to do

1. A slab page will not be freed until the entire page is free (all
slabs have been kfree'd so to speak). Normal reclaim will definitely
free this page, but a lot of it depends on how frequently we are
scanning the LRU list and when this page got added.
2. In the case of page cache (specifically unmapped page cache), there
is duplication already, so why not go after unmapped page caches when
the system is under memory pressure?

In the case of 1, we don't force a dentry to be freed, but rather a
freed page in the slab cache to be reclaimed ahead of forcing reclaim
of mapped pages.

Does the problem statement make sense? If so, do you agree with 1 and
2? Is there major concern about subverting regular reclaim? Does
subverting it make sense in the duplicated scenario?

-- 
Three Cheers,
Balbir


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-06-14 15:40:28]:

 On 06/14/2010 11:48 AM, Balbir Singh wrote:
 
 In this case the order is as follows
 
 1. First we pick free pages if any
 2. If we don't have free pages, we go after unmapped page cache and
 slab cache
 3. If that fails as well, we go after regularly memory
 
 In the scenario that you describe, we'll not be able to easily free up
 the frequently referenced page from /etc/*. The code will move on to
 step 3 and do its regular reclaim.
 Still it seems to me you are subverting the normal order of reclaim.
 I don't see why an unmapped page cache or slab cache item should be
 evicted before a mapped page.  Certainly the cost of rebuilding a
 dentry compared to the gain from evicting it, is much higher than
 that of reestablishing a mapped page.
 
 Subverting to aviod memory duplication, the word subverting is
 overloaded,
 
 Right, should have used a different one.
 
 let me try to reason a bit. First let me explain the
 problem
 
 Memory is a precious resource in a consolidated environment.
 We don't want to waste memory via page cache duplication
 (cache=writethrough and cache=writeback mode).
 
 Now here is what we are trying to do
 
 1. A slab page will not be freed until the entire page is free (all
 slabs have been kfree'd so to speak). Normal reclaim will definitely
 free this page, but a lot of it depends on how frequently we are
 scanning the LRU list and when this page got added.
 2. In the case of page cache (specifically unmapped page cache), there
 is duplication already, so why not go after unmapped page caches when
 the system is under memory pressure?
 
 In the case of 1, we don't force a dentry to be freed, but rather a
 freed page in the slab cache to be reclaimed ahead of forcing reclaim
 of mapped pages.
 
 Sounds like this should be done unconditionally, then.  An empty
 slab page is worth less than an unmapped pagecache page at all
 times, no?


In a consolidated environment, even at the cost of some CPU to run
shrinkers, I think potentially yes.
 
 Does the problem statement make sense? If so, do you agree with 1 and
 2? Is there major concern about subverting regular reclaim? Does
 subverting it make sense in the duplicated scenario?
 
 
 In the case of 2, how do you know there is duplication?  You know
 the guest caches the page, but you have no information about the
 host.  Since the page is cached in the guest, the host doesn't see
 it referenced, and is likely to drop it.

True, that is why the first patch is controlled via a boot parameter
that the host can pass. For the second patch, I think we'll need
something like a balloon <size> cache command, with the cache argument being
optional. 

 
 If there is no duplication, then you may have dropped a
 recently-used page and will likely cause a major fault soon.


Yes, agreed. 

-- 
Three Cheers,
Balbir


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Balbir Singh
* Dave Hansen d...@linux.vnet.ibm.com [2010-06-14 08:12:56]:

 On Mon, 2010-06-14 at 14:18 +0530, Balbir Singh wrote:
  1. A slab page will not be freed until the entire page is free (all
  slabs have been kfree'd so to speak). Normal reclaim will definitely
  free this page, but a lot of it depends on how frequently we are
  scanning the LRU list and when this page got added.
 
 You don't have to be freeing entire slab pages for the reclaim to have
 been useful.  You could just be making space so that _future_
 allocations fill in the slab holes you just created.  You may not be
 freeing pages, but you're reducing future system pressure.
 
 If unmapped page cache is the easiest thing to evict, then it should be
 the first thing that goes when a balloon request comes in, which is the
 case this patch is trying to handle.  If it isn't the easiest thing to
 evict, then we _shouldn't_ evict it.


Like I said earlier, a lot of that works correctly as you said, but it
is also an idealization. If you've got duplicate pages and you know
that they are duplicated and can be retrieved at a lower cost, why
wouldn't we go after them first?

-- 
Three Cheers,
Balbir


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Balbir Singh
* Dave Hansen d...@linux.vnet.ibm.com [2010-06-14 10:09:31]:

 On Mon, 2010-06-14 at 22:28 +0530, Balbir Singh wrote:
  If you've got duplicate pages and you know
  that they are duplicated and can be retrieved at a lower cost, why
  wouldn't we go after them first?
 
 I agree with this in theory.  But, the guest lacks the information about
 what is truly duplicated and what the costs are for itself and/or the
 host to recreate it.  Unmapped page cache may be the best proxy that
 we have at the moment for easy to recreate, but I think it's still too
 poor a match to make these patches useful.


That is why the policy (in the next set) will come from the host. As
to whether the data is truly duplicated, my experiments show up to 60%
of the page cache is duplicated. The first patch today is again
enabled by the host. Both of them are expected to be useful in the
cache != none case.

The data I have shows more details including the performance and
overhead.

-- 
Three Cheers,
Balbir


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-06-14 18:34:58]:

 On 06/14/2010 06:12 PM, Dave Hansen wrote:
 On Mon, 2010-06-14 at 14:18 +0530, Balbir Singh wrote:
 1. A slab page will not be freed until the entire page is free (all
 slabs have been kfree'd so to speak). Normal reclaim will definitely
 free this page, but a lot of it depends on how frequently we are
 scanning the LRU list and when this page got added.
 You don't have to be freeing entire slab pages for the reclaim to have
 been useful.  You could just be making space so that _future_
 allocations fill in the slab holes you just created.  You may not be
 freeing pages, but you're reducing future system pressure.
 
 Depends.  If you've evicted something that will be referenced soon,
 you're increasing system pressure.


I don't think slab pages care about being referenced soon; they are
either allocated or freed. A page is just a storage unit for the data
structure, and a new one can be allocated on demand.
 
 

-- 
Three Cheers,
Balbir


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-06-14 19:34:00]:

 On 06/14/2010 06:55 PM, Dave Hansen wrote:
 On Mon, 2010-06-14 at 18:44 +0300, Avi Kivity wrote:
 On 06/14/2010 06:33 PM, Dave Hansen wrote:
 At the same time, I see what you're trying to do with this.  It really
 can be an alternative to ballooning if we do it right, since ballooning
 would probably evict similar pages.  Although it would only work in idle
 guests, what about a knob that the host can turn to just get the guest
 to start running reclaim?
 Isn't the knob in this proposal the balloon?  AFAICT, the idea here is
 to change how the guest reacts to being ballooned, but the trigger
 itself would not change.
 I think the patch was made on the following assumptions:
 1. Guests will keep filling their memory with relatively worthless page
 cache that they don't really need.
 2. When they do this, it hurts the overall system with no real gain for
 anyone.
 
 In the case of a ballooned guest, they _won't_ keep filling memory.  The
 balloon will prevent them.  So, I guess I was just going down the path
 of considering if this would be useful without ballooning in place.  To
 me, it's really hard to justify _with_ ballooning in place.
 
 There are two decisions that need to be made:
 
 - how much memory a guest should be given
 - given some guest memory, what's the best use for it
 
 The first question can perhaps be answered by looking at guest I/O
 rates and giving more memory to more active guests.  The second
 question is hard, but not any different than running non-virtualized
 - except if we can detect sharing or duplication.  In this case,
 dropping a duplicated page is worthwhile, while dropping a shared
 page provides no benefit.

I think there is another way of looking at it: given some free memory,

1. Can the guest run more applications or run faster?
2. Can the host potentially get this memory via ballooning or some
other means to start newer guest instances?

I think the answer to 1 and 2 is yes.

 
 How the patch helps answer either question, I'm not sure.  I don't
 think preferential dropping of unmapped page cache is the answer.


Preferential dropping is selected by the host, which knows about the
setup and whether duplication is involved. While we use the term
preferential dropping, remember it is still done via the LRU and we
don't always succeed. It is a best-effort scenario: we reclaim only if
we can and the unmapped pages are not highly referenced. A rough sketch
of that check follows.
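
Roughly, that per-page test amounts to something like the following (an
illustrative sketch only, not code from the patch; the helper name is
made up, while page_mapped(), PageReferenced() and PageDirty() are the
standard kernel predicates):

#include <linux/mm.h>		/* page_mapped() */
#include <linux/page-flags.h>	/* PageReferenced(), PageDirty() */

/* Illustrative sketch, not the patch code: best-effort per-page test. */
static bool can_drop_unmapped_page(struct page *page)
{
	if (page_mapped(page))			/* only unmapped page cache */
		return false;
	if (PageReferenced(page) || PageDirty(page))
		return false;			/* recently used or needs writeback */
	return true;
}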
 
 My issue is that changing the type of object being preferentially
 reclaimed just changes the type of workload that would prematurely
 suffer from reclaim.  In this case, workloads that use a lot of unmapped
 pagecache would suffer.
 
 btw, aren't /proc/sys/vm/swappiness and vfs_cache_pressure similar knobs?
 Those tell you how to balance going after the different classes of
 things that we can reclaim.
 
 Again, this is useless when ballooning is being used.  But, I'm thinking
 of a more general mechanism to force the system to both have MemFree
 _and_ be acting as if it is under memory pressure.
 
 If there is no memory pressure on the host, there is no reason for
 the guest to pretend it is under pressure.  If there is memory
 pressure on the host, it should share the pain among its guests by
 applying the balloon.  So I don't think voluntarily dropping cache
 is a good direction.


There are two situations:

1. Voluntarily drop cache, if the guest was set up to do so (the host
knows that it caches that information anyway).
2. Drop the cache via a special balloon option; again, the host knows
it caches that very same information, so it prefers to free that up
first.

-- 
Three Cheers,
Balbir


Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control

2010-06-13 Thread Balbir Singh
* Balbir Singh bal...@linux.vnet.ibm.com [2010-06-08 21:21:46]:

 Selectively control Unmapped Page Cache (nospam version)
 
 From: Balbir Singh bal...@linux.vnet.ibm.com
 
 This patch implements unmapped page cache control via preferred
 page cache reclaim. The current patch hooks into kswapd and reclaims
 page cache if the user has requested for unmapped page control.
 This is useful in the following scenario
 
 - In a virtualized environment with cache=writethrough, we see
   double caching - (one in the host and one in the guest). As
   we try to scale guests, cache usage across the system grows.
   The goal of this patch is to reclaim page cache when Linux is running
   as a guest and get the host to hold the page cache and manage it.
   There might be temporary duplication, but in the long run, memory
   in the guests would be used for mapped pages.
 - The option is controlled via a boot option and the administrator
   can selectively turn it on, on a need to use basis.
 
 A lot of the code is borrowed from zone_reclaim_mode logic for
 __zone_reclaim(). One might argue that with ballooning and
 KSM this feature is not very useful, but even with ballooning,
 we need extra logic to balloon multiple VM machines and it is hard
 to figure out the correct amount of memory to balloon. With these
 patches applied, each guest has a sufficient amount of free memory
 available, that can be easily seen and reclaimed by the balloon driver.
 The additional memory in the guest can be reused for additional
 applications or used to start additional guests/balance memory in
 the host.
 
 KSM currently does not de-duplicate host and guest page cache. The goal
 of this patch is to help automatically balance unmapped page cache when
 instructed to do so.
 
 There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO
 and the number of pages to reclaim when unmapped_page_control argument
 is supplied. These numbers were chosen to avoid reaping page cache too
 aggressively or too frequently, while still providing control.
 
 The sysctl for min_unmapped_ratio provides further control from
 within the guest on the amount of unmapped pages to reclaim.


Are there any major objections to this patch?
 
-- 
Three Cheers,
Balbir


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-11 Thread Balbir Singh
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-06-11 14:05:53]:

 On Fri, 11 Jun 2010 10:16:32 +0530
 Balbir Singh bal...@linux.vnet.ibm.com wrote:
 
  * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-06-11 10:54:41]:
  
   On Thu, 10 Jun 2010 17:07:32 -0700
   Dave Hansen d...@linux.vnet.ibm.com wrote:
   
On Thu, 2010-06-10 at 19:55 +0530, Balbir Singh wrote:
  I'm not sure victimizing unmapped cache pages is a good idea.
  Shouldn't page selection use the LRU for recency information instead
  of the cost of guest reclaim?  Dropping a frequently used unmapped
  cache page can be more expensive than dropping an unused text page
  that was loaded as part of some executable's initialization and
  forgotten.
 
 We victimize the unmapped cache only if it is unused (in LRU order).
 We don't force the issue too much. We also have free slab cache to go
 after.

Just to be clear, let's say we have a mapped page (say of /sbin/init)
that's been unreferenced since _just_ after the system booted.  We also
have an unmapped page cache page of a file often used at runtime, say
one from /etc/resolv.conf or /etc/passwd.

   
   Hmm. I'm not a fan of estimating working set size by calculation
   based on some numbers without considering history or feedback.
   
   Can't we use some kind of feedback algorithm such as hi-low-watermark,
   random walk or GA (or something more smart) to detect the size?
  
  
  Could you please clarify at what level you are suggesting size
  detection? I assume it is outside the OS, right? 
  
 OS includes kernel and system programs ;)
 
 I can think of both in-kernel and in-userspace approaches, and they should
 complement each other.
 
 An example of kernel-based approach is.
  1. add a shrinker callback(A) for balloon-driver-for-guest as guest kswapd.
  2. add a shrinker callback(B) for balloon-driver-for-host as host kswapd.
 (I guess current balloon driver is only for host. Please imagine.)
 
 (A) increases free memory in Guest.
 (B) increases free memory in Host.
 
 This is an example of feedback based memory resizing between host and guest.
 
 I think (B) is necessary at least before considering complecated things.

B is left to the hypervisor and the memory policy running on it. My
patches address Linux running as a guest, with a Linux hypervisor at
the moment, but that can be extended to other balloon drivers as well.

 
 To implement something clever, (A) and (B) should take into account
 how frequently memory reclaim in the guest (which requires some I/O) happens.
 

Yes, I think the policy in the hypervisor needs to look at those
details as well.

 If doing it outside the kernel, I think using memcg is better than depending
 on the balloon driver. But co-operative balloon and memcg may show us something
 good.
 

Yes, agreed. Co-operative is better; if there is no co-operation, then
memcg might be used for enforcement.

-- 
Three Cheers,
Balbir


Re: [PATCH RFC] KVM: busy-spin detector

2010-06-11 Thread Balbir Singh
* Marcelo Tosatti mtosa...@redhat.com [2010-06-10 23:25:51]:

 
 The following patch implements a simple busy-spin detector. It considers
 a vcpu as busy-spinning if there are two consecutive exits due to
 external interrupt on the same RIP, and sleeps for 100us in that case.
 
 It is very likely that if the vcpu is making progress it will either
 exit for other reasons or change RIP.
 
 The percentage numbers below represent improvement in kernel build
 time in comparison with mainline (RHEL 5.4 guest).


Interesting approach, is there a reason to tie it in with pause loop
exits? Can't we do something more generic or even para-virtish.
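
(For reference, the heuristic described above boils down to roughly the
following; this is only an illustrative sketch, not Marcelo's patch, and
the struct and function names are made up.)

#include <linux/delay.h>	/* usleep_range() */

struct spin_detect {
	unsigned long last_rip;	/* RIP at the previous external-interrupt exit */
};

/* Called on each VM exit caused by an external interrupt. */
static void busy_spin_check(struct spin_detect *sd, unsigned long rip)
{
	if (rip == sd->last_rip)
		usleep_range(100, 100);	/* same RIP twice in a row: back off ~100us */
	sd->last_rip = rip;
}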

-- 
Three Cheers,
Balbir


Re: [PATCH RFC] KVM: busy-spin detector

2010-06-11 Thread Balbir Singh
* Huang, Zhiteng zhiteng.hu...@intel.com [2010-06-11 23:03:25]:

 PLE-like design may be more generic than para-virtish when it comes to 
 Windows guest.


Hmm.. sounds reasonable
 
 Is this busy-spin actually a Lock Holder Preemption problem?


Yep, I was hinting towards solving that problem. 

-- 
Three Cheers,
Balbir


Re: [PATCH RFC] KVM: busy-spin detector

2010-06-11 Thread Balbir Singh
* Marcelo Tosatti mtosa...@redhat.com [2010-06-11 14:46:27]:

  Interesting approach, is there a reason to tie it in with pause loop
  exits? 
 
 Hum, I don't see any. PLE exits provide the same detection, but more
 accurately.

  Can't we do something more generic or even para-virtish.
 
 This is pretty generic already? Or what do you mean?
 
 The advantage is it does not require paravirt modifications in the  
 guest (at the expense of guessing what the guest is doing).


Agreed, but one needs to depend on newer hardware to get this feature
to work. 

-- 
Three Cheers,
Balbir


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-10 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-06-10 12:43:11]:

 On 06/08/2010 06:51 PM, Balbir Singh wrote:
 Balloon unmapped page cache pages first
 
 From: Balbir Singhbal...@linux.vnet.ibm.com
 
 This patch builds on the ballooning infrastructure by ballooning unmapped
 page cache pages first. It looks for low hanging fruit first and tries
 to reclaim clean unmapped pages first.
 
 I'm not sure victimizing unmapped cache pages is a good idea.
 Shouldn't page selection use the LRU for recency information instead
 of the cost of guest reclaim?  Dropping a frequently used unmapped
 cache page can be more expensive than dropping an unused text page
 that was loaded as part of some executable's initialization and
 forgotten.


We victimize the unmapped cache only if it is unused (in LRU order).
We don't force the issue too much. We also have free slab cache to go
after.

 Many workloads have many unmapped cache pages, for example static
 web serving and the all-important kernel build.
 

I've tested kernbench; you can see the results in the original posting,
and there is no observable overhead from the patch in my run.

 The key advantage was that it resulted in lesser RSS usage in the host and
 more cached usage, indicating that the caching had been pushed towards
 the host. The guest cached memory usage was lower and free memory in
 the guest was also higher.
 
 Caching in the host is only helpful if the cache can be shared,
 otherwise it's better to cache in the guest.


Hmm.. so we would need a balloon cache hint from the monitor, so that
it is not unconditional? Overall my results show the following:

1. No drastic reduction of guest unmapped cache, just sufficient to
show lesser RSS in the host. More freeable memory (as in cached
memory + free memory) visible on the host.
2. No significant impact on the benchmark (numbers) running in the
guest.

-- 
Three Cheers,
Balbir


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-10 Thread Balbir Singh
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-06-11 10:54:41]:

 On Thu, 10 Jun 2010 17:07:32 -0700
 Dave Hansen d...@linux.vnet.ibm.com wrote:
 
  On Thu, 2010-06-10 at 19:55 +0530, Balbir Singh wrote:
I'm not sure victimizing unmapped cache pages is a good idea.
Shouldn't page selection use the LRU for recency information instead
of the cost of guest reclaim?  Dropping a frequently used unmapped
cache page can be more expensive than dropping an unused text page
that was loaded as part of some executable's initialization and
forgotten.
   
   We victimize the unmapped cache only if it is unused (in LRU order).
   We don't force the issue too much. We also have free slab cache to go
   after.
  
  Just to be clear, let's say we have a mapped page (say of /sbin/init)
  that's been unreferenced since _just_ after the system booted.  We also
  have an unmapped page cache page of a file often used at runtime, say
  one from /etc/resolv.conf or /etc/passwd.
  
 
 Hmm. I'm not a fan of estimating working set size by calculation
 based on some numbers without considering history or feedback.
 
 Can't we use some kind of feedback algorithm such as hi-low-watermark,
 random walk or GA (or something more smart) to detect the size?


Could you please clarify at what level you are suggesting size
detection? I assume it is outside the OS, right? 

-- 
Three Cheers,
Balbir


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-10 Thread Balbir Singh
* Dave Hansen d...@linux.vnet.ibm.com [2010-06-10 17:07:32]:

 On Thu, 2010-06-10 at 19:55 +0530, Balbir Singh wrote:
   I'm not sure victimizing unmapped cache pages is a good idea.
   Shouldn't page selection use the LRU for recency information instead
   of the cost of guest reclaim?  Dropping a frequently used unmapped
   cache page can be more expensive than dropping an unused text page
   that was loaded as part of some executable's initialization and
   forgotten.
  
  We victimize the unmapped cache only if it is unused (in LRU order).
  We don't force the issue too much. We also have free slab cache to go
  after.
 
 Just to be clear, let's say we have a mapped page (say of /sbin/init)
 that's been unreferenced since _just_ after the system booted.  We also
 have an unmapped page cache page of a file often used at runtime, say
 one from /etc/resolv.conf or /etc/passwd.
 
 Which page will be preferred for eviction with this patch set?


In this case the order is as follows

1. First we pick free pages if any
2. If we don't have free pages, we go after unmapped page cache and
slab cache
3. If that fails as well, we go after regular memory

In the scenario that you describe, we'll not be able to easily free up
the frequently referenced page from /etc/*. The code will move on to
step 3 and do its regular reclaim. 
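
A rough sketch of that ordering (illustrative only; the three helpers
below are hypothetical stand-ins, not functions from the patch):

struct zone;
struct page;

extern struct page *grab_free_page(struct zone *zone);			/* hypothetical */
extern struct page *reclaim_unmapped_cache_or_slab(struct zone *zone);	/* hypothetical */
extern struct page *reclaim_regular_lru(struct zone *zone);		/* hypothetical */

static struct page *balloon_pick_page(struct zone *zone)
{
	struct page *page;

	page = grab_free_page(zone);				/* step 1: free pages */
	if (!page)
		page = reclaim_unmapped_cache_or_slab(zone);	/* step 2: unmapped cache, slab */
	if (!page)
		page = reclaim_regular_lru(zone);		/* step 3: regular reclaim */
	return page;
}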

-- 
Three Cheers,
Balbir


[RFC/T/D][PATCH 0/2] KVM page cache optimization (v2)

2010-06-08 Thread Balbir Singh
This is version 2 of the page cache control patches for
KVM. This series has two patches: the first controls
the amount of unmapped page cache usage via a boot
parameter and sysctl. The second patch controls page
and slab cache via the balloon driver. Both the patches
make heavy use of the zone_reclaim() functionality
already present in the kernel.

page-cache-control
balloon-page-cache

-- 
Three Cheers,
Balbir


[RFC][PATCH 1/2] Linux/Guest unmapped page cache control

2010-06-08 Thread Balbir Singh
Selectively control Unmapped Page Cache (nospam version)

From: Balbir Singh bal...@linux.vnet.ibm.com

This patch implements unmapped page cache control via preferred
page cache reclaim. The current patch hooks into kswapd and reclaims
page cache if the user has requested for unmapped page control.
This is useful in the following scenario

- In a virtualized environment with cache=writethrough, we see
  double caching - (one in the host and one in the guest). As
  we try to scale guests, cache usage across the system grows.
  The goal of this patch is to reclaim page cache when Linux is running
  as a guest and get the host to hold the page cache and manage it.
  There might be temporary duplication, but in the long run, memory
  in the guests would be used for mapped pages.
- The option is controlled via a boot option and the administrator
  can selectively turn it on, on a need to use basis.

A lot of the code is borrowed from zone_reclaim_mode logic for
__zone_reclaim(). One might argue that with ballooning and
KSM this feature is not very useful, but even with ballooning,
we need extra logic to balloon multiple VM machines and it is hard
to figure out the correct amount of memory to balloon. With these
patches applied, each guest has a sufficient amount of free memory
available, that can be easily seen and reclaimed by the balloon driver.
The additional memory in the guest can be reused for additional
applications or used to start additional guests/balance memory in
the host.

KSM currently does not de-duplicate host and guest page cache. The goal
of this patch is to help automatically balance unmapped page cache when
instructed to do so.

There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO
and the number of pages to reclaim when unmapped_page_control argument
is supplied. These numbers were chosen to avoid reaping page cache too
aggressively or too frequently, while still providing control.

The sysctl for min_unmapped_ratio provides further control from
within the guest on the amount of unmapped pages to reclaim.
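
Roughly, the decision of whether a zone has unmapped page cache worth
reclaiming looks like the sketch below. This is illustrative only (the
helper name is made up); it mirrors the zone_reclaim-style accounting
described above, with min_unmapped_pages derived from the
min_unmapped_ratio sysctl.

#include <linux/mmzone.h>	/* struct zone */
#include <linux/vmstat.h>	/* zone_page_state() */

/* Illustrative sketch only, not the exact patch code. */
static bool zone_has_reclaimable_unmapped_cache(struct zone *zone)
{
	unsigned long file_pages  = zone_page_state(zone, NR_FILE_PAGES);
	unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
	unsigned long unmapped = (file_pages > file_mapped) ?
				 file_pages - file_mapped : 0;

	/*
	 * Reclaim only when unmapped page cache exceeds the threshold
	 * derived from the min_unmapped_ratio sysctl.
	 */
	return unmapped > zone->min_unmapped_pages;
}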

The patch is applied against mmotm feb-11-2010.

Guest Usage without boot parameter (memory in KB)

MemFree Cached Time
19900   292912 137
17540   296196 139
17900   296124 141
19356   296660 141

Host usage:  (memory in KB)

RSS Cache   mapped  swap
2788664 781884  3780359536

Guest Usage with boot parameter (memory in KB)
-
Memfree Cached   Time
244824  74828   144
237840  81764   143
235880  83044   138
239312  80092   148

Host usage: (memory in KB)

RSS Cache   mapped  swap
2700184 958012  334848  398412
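
For reference, comparing the host usage rows from the two runs: RSS
drops by 2788664 - 2700184 = 88480 KB (about 86 MB), while host cache
grows by 958012 - 781884 = 176128 KB (about 172 MB). On the guest side,
Cached falls from roughly 296000 KB to about 80000 KB and MemFree rises
from under 20000 KB to about 240000 KB, consistent with page cache
being pushed from the guest to the host.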

TODOS
-
1. Balance slab cache as well
2. Invoke the balance routines from the balloon driver

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---

 include/linux/mmzone.h |2 -
 include/linux/swap.h   |3 +
 mm/page_alloc.c|9 ++-
 mm/vmscan.c|  165 
 4 files changed, 134 insertions(+), 45 deletions(-)


diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..9f96b6d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -293,12 +293,12 @@ struct zone {
 */
unsigned long   lowmem_reserve[MAX_NR_ZONES];
 
+   unsigned long   min_unmapped_pages;
 #ifdef CONFIG_NUMA
int node;
/*
 * zone reclaim becomes active if more unmapped pages exist.
 */
-   unsigned long   min_unmapped_pages;
unsigned long   min_slab_pages;
 #endif
struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ff4acea..f92f1ee 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -251,10 +251,11 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
+extern bool should_balance_unmapped_pages(struct zone *zone);
 
+extern int sysctl_min_unmapped_ratio;
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 431214b..fee9420 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1641,6 +1641,9 @@ zonelist_scan:
unsigned long mark;
int ret;
 
+   if (should_balance_unmapped_pages(zone))
+   wakeup_kswapd(zone, order);
+
mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
if (zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags))
@@ -4069,10 +4072,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat

[RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-08 Thread Balbir Singh
Balloon unmapped page cache pages first

From: Balbir Singh bal...@linux.vnet.ibm.com

This patch builds on the ballooning infrastructure by ballooning unmapped
page cache pages first. It looks for low hanging fruit first and tries
to reclaim clean unmapped pages first.

This patch brings zone_reclaim() and other dependencies out of CONFIG_NUMA
and then reuses the zone_reclaim_mode logic if __GFP_FREE_CACHE is passed
in the gfp_mask. The virtio balloon driver has been changed to use
__GFP_FREE_CACHE.

Tests:

I ran a simple filter function that kept frequently ballooning a single VM
running kernbench. The VM was configured with 2GB of memory and 2 VCPUs.
The filter continuously ballooned the VM under study between 500MB and
1500MB following a triangular wave (a rough sketch of the wave is shown
below). The run times of the VM with and without the changes are shown
below; they showed no significant impact from the changes.
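
An illustrative sketch of the triangular-wave target (only the
500MB-1500MB range comes from the test; the function name and the
period parameter are made up, and an even, non-zero period is assumed):

/* Illustrative sketch of the triangular-wave balloon target, in MB. */
static unsigned long balloon_target_mb(unsigned long t, unsigned long period)
{
	unsigned long phase = t % period;
	unsigned long half = period / 2;

	if (phase < half)				/* ramp up: 500MB -> 1500MB */
		return 500 + (1000 * phase) / half;
	return 1500 - (1000 * (phase - half)) / half;	/* ramp back down */
}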

With changes

Elapsed Time 223.86 (1.52822)
User Time 191.01 (0.65395)
System Time 199.468 (2.43616)
Percent CPU 174 (1)
Context Switches 103182 (595.05)
Sleeps 39107.6 (1505.67)

Without changes

Elapsed Time 225.526 (2.93102)
User Time 193.53 (3.53626)
System Time 199.832 (3.26281)
Percent CPU 173.6 (1.14018)
Context Switches 103744 (1311.53)
Sleeps 39383.2 (831.865)
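
For reference, the difference in mean elapsed time is 225.526 - 223.86 =
1.666 seconds, about 0.7%, which is within the run-to-run variation
(standard deviations of 1.53 and 2.93) and consistent with the claim of
no significant impact.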

The key advantage was that it resulted in lesser RSS usage in the host and
more cached usage, indicating that the caching had been pushed towards
the host. The guest cached memory usage was lower and free memory in
the guest was also higher.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---

 drivers/virtio/virtio_balloon.c |3 ++-
 include/linux/gfp.h |8 +++-
 include/linux/swap.h|9 +++--
 mm/page_alloc.c |3 ++-
 mm/vmscan.c |2 +-
 5 files changed, 15 insertions(+), 10 deletions(-)


diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 0f1da45..609a9c2 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -104,7 +104,8 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
 
for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns++) {
struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
-   __GFP_NOMEMALLOC | __GFP_NOWARN);
+   __GFP_NOMEMALLOC | __GFP_NOWARN |
+   __GFP_FREE_CACHE);
if (!page) {
if (printk_ratelimit())
dev_printk(KERN_INFO, &vb->vdev->dev,
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 975609c..9048259 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -61,12 +61,18 @@ struct vm_area_struct;
 #endif
 
 /*
+ * While allocating pages, try to free cache pages first. Note the
+ * heavy dependency on zone_reclaim_mode logic
+ */
+#define __GFP_FREE_CACHE ((__force gfp_t)0x400000u) /* Free cache first */
+
+/*
  * This may seem redundant, but it's a way of annotating false positives vs.
  * allocations that simply cannot be supported (e.g. page tables).
  */
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
-#define __GFP_BITS_SHIFT 22/* Room for 22 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 23/* Room for 22 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /* This equals 0, but use constants in case they ever change */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f92f1ee..f77c603 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -254,16 +254,13 @@ extern long vm_total_pages;
 extern bool should_balance_unmapped_pages(struct zone *zone);
 
 extern int sysctl_min_unmapped_ratio;
-#ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
 extern int sysctl_min_slab_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
+
+#ifdef CONFIG_NUMA
+extern int zone_reclaim_mode;
 #else
 #define zone_reclaim_mode 0
-static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
-{
-   return 0;
-}
 #endif
 
 extern int page_evictable(struct page *page, struct vm_area_struct *vma);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fee9420..d977b36 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1649,7 +1649,8 @@ zonelist_scan:
classzone_idx, alloc_flags))
goto try_this_zone;
 
-   if (zone_reclaim_mode == 0)
+   if (zone_reclaim_mode == 0 &&
+   !(gfp_mask & __GFP_FREE_CACHE))
goto this_zone_full;
 
ret = zone_reclaim(zone, gfp_mask, order);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 27bc536..393bee5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2624,6 +2624,7 @@ module_init(kswapd_init)
  * the watermarks

Re: KVM and the OOM-Killer

2010-05-14 Thread Balbir Singh
* Athanasius k...@miggy.org [2010-05-14 08:33:34]:

 On Thu, May 13, 2010 at 01:20:31PM +0100, James Stevens wrote:
  We have a KVM host with 48Gb of RAM and run about 20 KVM clients on it.  
  After some time - different time depending on the kernel version - the  
  VM host kernel will start OOM-Killing the VM clients, even when there is  
  lots of free RAM (10Gb) and free SWAP (34Gb).
 
   It seems going to a 64 bit kernel is what you want, but I thought it
 worth mentioning the available method to say try not to OOM-kill *this*
 process:
 
   echo -16 > /proc/pid/oom_adj
 

A lot of this is being changed, but not yet committed. There are
patches out there to deal with the lowmem issue. Meanwhile, do follow
the suggestions on oom_adj and moving to 64 bit.


-- 
Three Cheers,
Balbir


Re: KVM and the OOM-Killer

2010-05-14 Thread Balbir Singh
* James Stevens james.stev...@jrcs.co.uk [2010-05-14 09:10:19]:

  echo -16 > /proc/pid/oom_adj
 
 Thanks for that - yes, I know about oom_adj, but it doesn't
 (totally) work. udevd has a default of -17 and it got killed
 anyway.
 
 Also, the only thing this server runs is VMs so if they can't be
 killed oom-killer will just run through the everything else
 (syslogd, sshd, klogd, udevd, hald, agetty etc) - so on balance its
 a case of which is worse?  Without those daemons the system can
 become inaccessible and could become unstable, so on balance it may
 be better to let it kill the VMs.
 
 My current work-around is :-
 
 sync; echo 3 > /proc/sys/vm/drop_caches


Have you looked at memory cgroups and using them with limits for the VMs?

-- 
Three Cheers,
Balbir


Re: KVM and the OOM-Killer

2010-05-14 Thread Balbir Singh
* James Stevens james.stev...@jrcs.co.uk [2010-05-14 09:43:04]:

 Have you looked at memory cgroups and using that with limits with VMs?
 
 The problem was *NOT* that my VMs exhausted all memory. I know that
 is what normally triggers oom-killer, but you have to understand
 this mine was a very different scenario, hence I wanted to bring it
 to people's attention. I had about 10Gb of *FREE* HIGH and 34GB of
 *FREE* SWAP when oom-killer was activated - yep, didn't make sense
 to me either. If you want to study the logs :-


I understand. You could potentially encapsulate everything else except your
VMs in a small cgroup and frequently reclaim from there using the
memory cgroup. If drop_caches works for you, that is good too. I am
surprised that cache allocations are causing lowmem exhaustion.
 


-- 
Three Cheers,
Balbir


[PATCH][RESEND]Fix GFP flags passed from the virtio balloon driver

2010-04-21 Thread Balbir Singh
Fix GFP flags passed from the virtio balloon driver

From: Balbir Singh bal...@linux.vnet.ibm.com

The virtio balloon driver can dig into the reservation pools
of the OS to satisfy a balloon request. This is not advisable
and other balloon drivers (drivers/xen/balloon.c) avoid this
as well. The patch also avoids printing a warning if allocation
fails.
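
For context, __GFP_NOMEMALLOC prevents the allocation from dipping into
the emergency reserves, and __GFP_NOWARN suppresses the page allocation
failure warning; together they address the two issues described above.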

Comments?

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---

 drivers/virtio/virtio_balloon.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)


diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 369f2ee..f8ffe8c 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -102,7 +102,8 @@ static void fill_balloon(struct virtio_balloon *vb, size_t num)
num = min(num, ARRAY_SIZE(vb->pfns));
 
for (vb->num_pfns = 0; vb->num_pfns < num; vb->num_pfns++) {
-   struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY);
+   struct page *page = alloc_page(GFP_HIGHUSER | __GFP_NORETRY |
+   __GFP_NOMEMALLOC | __GFP_NOWARN);
if (!page) {
if (printk_ratelimit())
dev_printk(KERN_INFO, &vb->vdev->dev,

-- 
Three Cheers,
Balbir

