Re: [patch 9/9] mm: workingset: keep shadow entries in check

2013-08-22 Thread Johannes Weiner
On Tue, Aug 20, 2013 at 01:59:24PM -0700, Andrew Morton wrote:
> On Sat, 17 Aug 2013 15:31:23 -0400 Johannes Weiner <han...@cmpxchg.org> wrote:
> 
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> 
> erk.  This whole patch is overhead :(
> 
> > To get this under control, a list of inodes that contain shadow
> > entries is maintained.  If the global number of shadows exceeds a
> > certain threshold, a shrinker is activated that reclaims old entries
> > from the mappings.  This is heavy-handed but it should not be a hot
> > path and is mainly there to protect from accidentally/maliciously
> > induced OOM kills.  The global list is also not a problem because the
> > modifications are very rare: inodes are added once in their lifetime
> > when the first shadow entry is stored (i.e. the first page reclaimed)
> > and lazily removed when the inode exits.  Or if the shrinker removes
> > all shadow entries.
> > 
> > ...
> >
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -417,6 +417,7 @@ struct address_space {
> > /* Protected by tree_lock together with the radix tree */
> > unsigned long   nrpages;/* number of total pages */
> > unsigned long   nrshadows;  /* number of shadow entries */
> > +   struct list_head  shadow_list;  /* list of mappings with shadows */
> > pgoff_t writeback_index;/* writeback starts here */
> > const struct address_space_operations *a_ops;   /* methods */
> > unsigned long   flags;  /* error bits/gfp mask */
> 
> There's another 16 bytes into the inode.  Bad.

Yeah :(

An obvious alternative to storing page eviction information in the
page cache radix tree would be to have a separate data structure that
scales with the number of physical pages available.

It really depends on the workload which one's cheaper in terms of both
memory and cpu.

Workloads with many inodes relative to the amount of data cached in
them suffer from the increased inode size, but when they have decent
access locality within the inodes, the inodes should be evicted along
with their pages.  So in this case there is little to no memory
overhead from the eviction information compared to the fixed size
separate data structure.

And refault information lookup is cheaper of course when storing
eviction information inside the cache slots.

Workloads for which this model sucks are those with inode locality but
no data locality.  The inodes stay alive and the lack of data locality
produces many shadow entries that only the costly shrinker can get rid
of.  Numbers aside, it was a judgement call to improve workloads with
high data locality at the cost of those without.

But I'll include a more concrete cost analysis in the next submission,
along with more details on the benefits of the series ;)
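
For a rough sense of the numbers (back-of-envelope only, assuming
64-bit pointers, 4k pages, and that the separate structure would keep
one word of eviction data per page frame):

	separate structure:  8 bytes per 4096-byte page frame,
	                     ~0.2% of RAM, paid unconditionally
	in-tree shadows:     16 bytes per struct inode, plus radix tree
	                     slots that were already allocated for the
	                     evicted pages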

> > +void workingset_shadows_inc(struct address_space *mapping)
> > +{
> > +   might_lock(&shadow_lock);
> > +
> > +   if (mapping->nrshadows == 0 && list_empty(&mapping->shadow_list)) {
> > +   spin_lock(&shadow_lock);
> 
> I can't work out whether or not shadow_lock is supposed to be irq-save.
> Some places it is, others are unobvious.

It is.  The caller holds the irq-safe tree_lock, though, so no need to
disable IRQs a second time.  I'll add documentation.
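
Something like this, perhaps (rough sketch of the comment, not final
wording):

	/*
	 * shadow_lock nests inside the irq-safe mapping->tree_lock:
	 * workingset_shadows_inc() and workingset_shadows_dec() are
	 * only called with tree_lock held and interrupts already
	 * disabled, so a plain spin_lock() is sufficient here.  Paths
	 * that take shadow_lock without tree_lock held (the shrinker
	 * walking the shadow_mappings list) have to use the
	 * irq-disabling variants instead.
	 */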

> > +static unsigned long get_nr_old_shadows(void)
> > +{
> > +   unsigned long nr_max;
> > +   unsigned long nr;
> > +   long sum = 0;
> > +   int cpu;
> > +
> > +   for_each_possible_cpu(cpu)
> > +   sum += per_cpu(nr_shadows, cpu);
> 
> Ouch, slow.  shrink_slab() will call this repeatedly and scan_shadows()
> calls it from a loop.  Can we use something non-deathly-slow here? 
> Like percpu_counter_read_positive()?

Finally, a usecase for percpu_counter!!  Sounds reasonable, I'll
convert this stuff over.
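
Roughly like this, I think (untested sketch; the counter would be set
up with percpu_counter_init() during boot):

	static struct percpu_counter nr_shadows;

	void workingset_shadows_inc(struct address_space *mapping)
	{
		/* shadow_list / nrshadows bookkeeping as before */
		percpu_counter_inc(&nr_shadows);
	}

	void workingset_shadows_dec(struct address_space *mapping)
	{
		/* shadow_list / nrshadows bookkeeping as before */
		percpu_counter_dec(&nr_shadows);
	}

	static unsigned long get_nr_old_shadows(void)
	{
		unsigned long nr_max;
		unsigned long nr;

		/* cheap, approximate, and never negative */
		nr = percpu_counter_read_positive(&nr_shadows);

		nr_max = global_dirtyable_memory() * 4;
		if (nr <= nr_max)
			return 0;
		return nr - nr_max;
	}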

> > +   nr = max(sum, 0L);
> > +
> > +   /*
> > +* Every shadow entry with a refault distance bigger than the
> > +* active list is ignored and so NR_ACTIVE_FILE would be a
> > +* reasonable ceiling.  But scanning and shrinking shadow
> > +* entries is quite expensive, so be generous.
> > +*/
> > +   nr_max = global_dirtyable_memory() * 4;
> > +
> > +   if (nr <= nr_max)
> > +   return 0;
> > +   return nr - nr_max;
> > +}
> > +
> > +static unsigned long scan_mapping(struct address_space *mapping,
> > + unsigned long nr_to_scan)
> 
> Some methodological description would be useful.

Fair enough, I'll write something.  Probably a comment along these
lines (rough sketch, wording to be refined):
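
	/*
	 * scan_mapping - reclaim stale shadow entries from one mapping
	 *
	 * Walk the mapping's radix tree under RCU and inspect every
	 * exceptional (shadow) entry.  A shadow whose refault distance
	 * is bigger than the zone's active file list can never lead to
	 * an activation on refault, so it carries no useful information
	 * and is deleted.  tree_lock is only taken around the actual
	 * deletion; the walk itself is lockless.  The scan stops once
	 * nr_to_scan entries have been processed or the mapping has no
	 * shadow entries left.
	 */

Thanks!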

Re: [patch 9/9] mm: workingset: keep shadow entries in check

2013-08-20 Thread Andrew Morton
On Sat, 17 Aug 2013 15:31:23 -0400 Johannes Weiner <han...@cmpxchg.org> wrote:

> Previously, page cache radix tree nodes were freed after reclaim
> emptied out their page pointers.  But now reclaim stores shadow
> entries in their place, which are only reclaimed when the inodes
> themselves are reclaimed.  This is problematic for bigger files that
> are still in use after they have a significant amount of their cache
> reclaimed, without any of those pages actually refaulting.  The shadow
> entries will just sit there and waste memory.  In the worst case, the
> shadow entries will accumulate until the machine runs out of memory.

erk.  This whole patch is overhead :(

> To get this under control, a list of inodes that contain shadow
> entries is maintained.  If the global number of shadows exceeds a
> certain threshold, a shrinker is activated that reclaims old entries
> from the mappings.  This is heavy-handed but it should not be a hot
> path and is mainly there to protect from accidentally/maliciously
> induced OOM kills.  The global list is also not a problem because the
> modifications are very rare: inodes are added once in their lifetime
> when the first shadow entry is stored (i.e. the first page reclaimed)
> and lazily removed when the inode exits.  Or if the shrinker removes
> all shadow entries.
> 
> ...
>
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -417,6 +417,7 @@ struct address_space {
>   /* Protected by tree_lock together with the radix tree */
>   unsigned long   nrpages;/* number of total pages */
>   unsigned long   nrshadows;  /* number of shadow entries */
> + struct list_head  shadow_list;  /* list of mappings with shadows */
>   pgoff_t writeback_index;/* writeback starts here */
>   const struct address_space_operations *a_ops;   /* methods */
>   unsigned long   flags;  /* error bits/gfp mask */

There's another 16 bytes into the inode.  Bad.

>
> ...
>
> +void workingset_shadows_inc(struct address_space *mapping)
> +{
> + might_lock(&shadow_lock);
> +
> + if (mapping->nrshadows == 0 && list_empty(&mapping->shadow_list)) {
> + spin_lock(&shadow_lock);

I can't work out whether or not shadow_lock is supposed to be irq-save.
Some places it is, others are unobvious.

> + list_add(&mapping->shadow_list, &shadow_mappings);
> + spin_unlock(&shadow_lock);
> + }
> +
> + mapping->nrshadows++;
> + this_cpu_inc(nr_shadows);
> +}
> +
>
> ...
>
> +static unsigned long get_nr_old_shadows(void)
> +{
> + unsigned long nr_max;
> + unsigned long nr;
> + long sum = 0;
> + int cpu;
> +
> + for_each_possible_cpu(cpu)
> + sum += per_cpu(nr_shadows, cpu);

Ouch, slow.  shrink_slab() will call this repeatedly and scan_shadows()
calls it from a loop.  Can we use something non-deathly-slow here? 
Like percpu_counter_read_positive()?

> + nr = max(sum, 0L);
> +
> + /*
> +  * Every shadow entry with a refault distance bigger than the
> +  * active list is ignored and so NR_ACTIVE_FILE would be a
> +  * reasonable ceiling.  But scanning and shrinking shadow
> +  * entries is quite expensive, so be generous.
> +  */
> + nr_max = global_dirtyable_memory() * 4;
> +
> + if (nr <= nr_max)
> + return 0;
> + return nr - nr_max;
> +}
> +
> +static unsigned long scan_mapping(struct address_space *mapping,
> +   unsigned long nr_to_scan)

Some methodological description would be useful.

> +{
> + unsigned long nr_scanned = 0;
> + struct radix_tree_iter iter;
> + void **slot;
> +
> + rcu_read_lock();
> +restart:
> + radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, 0) {
> + unsigned long nrshadows;
> + unsigned long distance;
> + struct zone *zone;
> + struct page *page;
> +
> + page = radix_tree_deref_slot(slot);
> + if (unlikely(!page))
> + continue;
> + if (!radix_tree_exception(page))
> + continue;
> + if (radix_tree_deref_retry(page))
> + goto restart;
> +
> + unpack_shadow(page, &zone, &distance);
> +
> + if (distance <= zone_page_state(zone, NR_ACTIVE_FILE))
> + continue;
> +
> + spin_lock_irq(&mapping->tree_lock);
> + if (radix_tree_delete_item(&mapping->page_tree,
> +iter.index, page)) {
> + inc_zone_state(zone, WORKINGSET_SHADOWS_RECLAIMED);
> + workingset_shadows_dec(mapping);
> + nr_scanned++;
> + }
> + nrshadows = mapping->nrshadows;
> + spin_unlock_irq(&mapping->tree_lock);
> +
> + if (nrshadows == 0)
> + break;
> +
> + if (--nr_to_scan == 0)
> + break;
> + }
> + rcu_read_unlock();
> +
> + return nr_scanned;
> +}
> +
> 
> ...

[patch 9/9] mm: workingset: keep shadow entries in check

2013-08-17 Thread Johannes Weiner
Previously, page cache radix tree nodes were freed after reclaim
emptied out their page pointers.  But now reclaim stores shadow
entries in their place, which are only reclaimed when the inodes
themselves are reclaimed.  This is problematic for bigger files that
are still in use after they have a significant amount of their cache
reclaimed, without any of those pages actually refaulting.  The shadow
entries will just sit there and waste memory.  In the worst case, the
shadow entries will accumulate until the machine runs out of memory.

To get this under control, a list of inodes that contain shadow
entries is maintained.  If the global number of shadows exceeds a
certain threshold, a shrinker is activated that reclaims old entries
from the mappings.  This is heavy-handed but it should not be a hot
path and is mainly there to protect from accidentally/maliciously
induced OOM kills.  The global list is also not a problem because the
modifications are very rare: inodes are added once in their lifetime
when the first shadow entry is stored (i.e. the first page reclaimed)
and lazily removed when the inode exits.  Or if the shrinker removes
all shadow entries.
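
For illustration, the shrinker glue amounts to roughly the following
(sketch only, assuming the count_objects/scan_objects shrinker API;
the shadow_shrink_* names are illustrative, and the exact signature of
scan_shadows(), which walks shadow_mappings and prunes each mapping,
is not shown here):

	static unsigned long shadow_shrink_count(struct shrinker *shrinker,
						 struct shrink_control *sc)
	{
		/* how many shadow entries exceed the global ceiling */
		return get_nr_old_shadows();
	}

	static unsigned long shadow_shrink_scan(struct shrinker *shrinker,
						struct shrink_control *sc)
	{
		/* walk shadow_mappings and prune old entries from each */
		return scan_shadows(sc->nr_to_scan);
	}

	static struct shrinker shadow_shrinker = {
		.count_objects	= shadow_shrink_count,
		.scan_objects	= shadow_shrink_scan,
		.seeks		= DEFAULT_SEEKS,
	};

	/* registered once at init with register_shrinker(&shadow_shrinker) */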

Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
---
 fs/inode.c |   5 +-
 include/linux/fs.h |   1 +
 include/linux/mmzone.h |   1 +
 include/linux/swap.h   |   4 +
 mm/filemap.c   |   4 +-
 mm/truncate.c  |   2 +-
 mm/vmstat.c|   1 +
 mm/workingset.c| 196 +
 8 files changed, 208 insertions(+), 6 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 8862b1b..b23b141 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -169,6 +169,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
mapping->private_data = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
mapping->writeback_index = 0;
+   workingset_init_mapping(mapping);
 
/*
 * If the block_device provides a backing_dev_info for client
@@ -546,9 +547,7 @@ static void evict(struct inode *inode)
 */
inode_wait_for_writeback(inode);
 
-   spin_lock_irq(&inode->i_data.tree_lock);
-   mapping_set_exiting(&inode->i_data);
-   spin_unlock_irq(&inode->i_data.tree_lock);
+   workingset_exit_mapping(&inode->i_data);
 
if (op->evict_inode) {
op->evict_inode(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ac5d84e..140ae2d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -417,6 +417,7 @@ struct address_space {
/* Protected by tree_lock together with the radix tree */
unsigned long   nrpages;/* number of total pages */
unsigned long   nrshadows;  /* number of shadow entries */
+   struct list_head  shadow_list;  /* list of mappings with shadows */
pgoff_t writeback_index;/* writeback starts here */
const struct address_space_operations *a_ops;   /* methods */
unsigned long   flags;  /* error bits/gfp mask */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 56f540e..747adae 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -146,6 +146,7 @@ enum zone_stat_item {
WORKINGSET_STALE,
WORKINGSET_BALANCE,
WORKINGSET_BALANCE_FORCE,
+   WORKINGSET_SHADOWS_RECLAIMED,
NR_ANON_TRANSPARENT_HUGEPAGES,
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 441845d..f18dd33eb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -261,9 +261,13 @@ struct swap_list_t {
 };
 
 /* linux/mm/workingset.c */
+void workingset_init_mapping(struct address_space *mapping);
+void workingset_exit_mapping(struct address_space *mapping);
 void *workingset_eviction(struct address_space *mapping, struct page *page);
 void workingset_refault(void *shadow);
 void workingset_activation(struct page *page);
+void workingset_shadows_inc(struct address_space *mapping);
+void workingset_shadows_dec(struct address_space *mapping);
 
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
diff --git a/mm/filemap.c b/mm/filemap.c
index ab4351e..d47d83e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -132,7 +132,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
 
slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
radix_tree_replace_slot(slot, shadow);
-   mapping->nrshadows++;
+   workingset_shadows_inc(mapping);
} else
radix_tree_delete(&mapping->page_tree, page->index);
page->mapping = NULL;
@@ -466,7 +466,7 @@ static int page_cache_insert(struct address_space *mapping, pgoff_t offset,
if (!radix_tree_exceptional_entry(p))
return -EEXIST;
radix_tree_replace_slot(slot, page);
-   mapping->nrshadows--;
+  

Re: [patch 9/9] mm: workingset: keep shadow entries in check

2013-08-14 Thread Johannes Weiner
Hey Andi!

On Mon, Aug 12, 2013 at 01:56:31AM +0200, Andi Kleen wrote:
> 
> I really like the idea of using the spare slots in the radix tree
> for something useful. It's amazing we haven't used that before.
> 
> I wonder if with some clever encoding even more information could be fit?

What do you have in mind?
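
For reference, the shadow currently packs the zone and the eviction
information into the unused bits of an exceptional radix tree entry,
roughly along these lines (simplified sketch; node bits and the exact
layout are omitted, and pack_shadow here is only illustrative):

	static void *pack_shadow(struct zone *zone, unsigned long eviction)
	{
		unsigned long val;

		val = (eviction << ZONES_SHIFT) | zone_idx(zone);
		return (void *)((val << RADIX_TREE_EXCEPTIONAL_SHIFT) |
				RADIX_TREE_EXCEPTIONAL_ENTRY);
	}

So yes, chopping low-order bits off the eviction information would
free up room for something else, at the cost of refault distance
granularity.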

> e.g. I assume you don't really need all bits of the refault distance,
> just a good enough approximation.
> 
> -Andi
> -- 
> a...@linux.intel.com -- Speaking for myself only.


Re: [patch 9/9] mm: workingset: keep shadow entries in check

2013-08-11 Thread Andi Kleen

I really like the idea of using the spare slots in the radix tree
for something useful. It's amazing we haven't used that before.

I wonder if with some clever encoding even more information could be fit?

e.g. I assume you don't really need all bits of the refault distance,
just a good enough approximation.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


[patch 9/9] mm: workingset: keep shadow entries in check

2013-08-06 Thread Johannes Weiner
Previously, page cache radix tree nodes were freed after reclaim
emptied out their page pointers.  But now reclaim stores shadow
entries in their place, which are only reclaimed when the inodes
themselves are reclaimed.  This is problematic for bigger files that
are still in use after they have a significant amount of their cache
reclaimed, without any of those pages actually refaulting.  The shadow
entries will just sit there and waste memory.  In the worst case, the
shadow entries will accumulate until the machine runs out of memory.

To get this under control, two mechanisms are used:

1. a refault balance counter is maintained per inode that grows with
   each shadow entry planted and shrinks with each refault.  Once the
   counter grows beyond a certain threshold, planting new shadows in
   that file is throttled.  It's per file so that a single file can
   not disable thrashing detection globally.  However, this still
   allows shadow entries to grow excessively when many files show this
   usage pattern, and so:

2. a list of inodes that contain shadow entries is maintained.  If the
   global number of shadows exceeds a certain threshold, a shrinker is
   activated that reclaims old entries from the mappings.  This is
   heavy-handed but it should not be a common case and is only there
   to protect from accidentally/maliciously induced OOM kills.  The
   global list is also not a problem because the modifications are
   very rare: inodes are added once in their lifetime when the first
   shadow entry is stored (i.e. the first page reclaimed) and lazily
   removed when the inode exits.  Or if the shrinker removes all
   shadow entries.
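
In code, the balance counter from point 1 boils down to roughly this
(sketch only; SHADOW_DEBT_MAX and the first helper name are made up,
only workingset_count_refault() exists in the patch):

	/* eviction path: decide whether to plant another shadow entry */
	static bool workingset_may_plant_shadow(struct address_space *mapping)
	{
		if (mapping->shadow_debt >= SHADOW_DEBT_MAX)
			return false;	/* too many unmatched shadows, throttle */
		mapping->shadow_debt++;
		return true;
	}

	/* refault path: a refault pays back one planted shadow */
	void workingset_count_refault(struct address_space *mapping)
	{
		if (mapping->shadow_debt)
			mapping->shadow_debt--;
	}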

Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
---
 fs/inode.c |   5 +-
 include/linux/fs.h |   2 +
 include/linux/mmzone.h |   1 +
 include/linux/swap.h   |   5 +
 mm/filemap.c   |   5 +-
 mm/truncate.c  |   2 +-
 mm/vmstat.c|   1 +
 mm/workingset.c| 248 +
 8 files changed, 263 insertions(+), 6 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 8862b1b..b23b141 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -169,6 +169,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
mapping->private_data = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
mapping->writeback_index = 0;
+   workingset_init_mapping(mapping);
 
/*
 * If the block_device provides a backing_dev_info for client
@@ -546,9 +547,7 @@ static void evict(struct inode *inode)
 */
inode_wait_for_writeback(inode);
 
-   spin_lock_irq(&inode->i_data.tree_lock);
-   mapping_set_exiting(&inode->i_data);
-   spin_unlock_irq(&inode->i_data.tree_lock);
+   workingset_exit_mapping(&inode->i_data);
 
if (op->evict_inode) {
op->evict_inode(inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ac5d84e..ea3c25b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -417,6 +417,8 @@ struct address_space {
/* Protected by tree_lock together with the radix tree */
unsigned long   nrpages;/* number of total pages */
unsigned long   nrshadows;  /* number of shadow entries */
+   struct list_head  shadow_list;  /* list of mappings with shadows */
+   unsigned long   shadow_debt;  /* shadow entries with unmatched refaults */
pgoff_t writeback_index;/* writeback starts here */
const struct address_space_operations *a_ops;   /* methods */
unsigned long   flags;  /* error bits/gfp mask */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e75fc92..6e74ac5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -144,6 +144,7 @@ enum zone_stat_item {
WORKINGSET_STALE,
WORKINGSET_BALANCE,
WORKINGSET_BALANCE_FORCE,
+   WORKINGSET_SHADOWS_RECLAIMED,
NR_ANON_TRANSPARENT_HUGEPAGES,
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 441845d..4816c50 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -261,9 +261,14 @@ struct swap_list_t {
 };
 
 /* linux/mm/workingset.c */
+void workingset_init_mapping(struct address_space *mapping);
+void workingset_exit_mapping(struct address_space *mapping);
 void *workingset_eviction(struct address_space *mapping, struct page *page);
 void workingset_refault(void *shadow);
+void workingset_count_refault(struct address_space *mapping);
 void workingset_activation(struct page *page);
+void workingset_shadows_inc(struct address_space *mapping);
+void workingset_shadows_dec(struct address_space *mapping);
 
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
diff --git a/mm/filemap.c b/mm/filemap.c
index ab4351e..bd4121b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -132,7 +132,7 @@ void 
