Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2014-01-26 Thread Minchan Kim
On Thu, Jan 23, 2014 at 02:22:12PM -0500, Johannes Weiner wrote:
> On Thu, Jan 23, 2014 at 02:20:14PM +0900, Minchan Kim wrote:
> > On Wed, Jan 22, 2014 at 01:42:17PM -0500, Johannes Weiner wrote:
> > > On Mon, Jan 13, 2014 at 04:39:47PM +0900, Minchan Kim wrote:
> > > > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > > > @@ -123,9 +129,39 @@ static void page_cache_tree_delete(struct 
> > > > > address_space *mapping,
> > > > >* same time and miss a shadow entry.
> > > > >*/
> > > > >   smp_wmb();
> > > > > - } else
> > > > > - radix_tree_delete(&mapping->page_tree, page->index);
> > > > > + }
> > > > >   mapping->nrpages--;
> > > > > +
> > > > > + if (!node) {
> > > > > + /* Clear direct pointer tags in root node */
> > > > > + mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
> > > > > + radix_tree_replace_slot(slot, shadow);
> > > > > + return;
> > > > > + }
> > > > > +
> > > > > + /* Clear tree tags for the removed page */
> > > > > + index = page->index;
> > > > > + offset = index & RADIX_TREE_MAP_MASK;
> > > > > + for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
> > > > > + if (test_bit(offset, node->tags[tag]))
> > > > > + radix_tree_tag_clear(&mapping->page_tree, index, tag);
> > > > > + }
> > > > > +
> > > > > + /* Delete page, swap shadow entry */
> > > > > + radix_tree_replace_slot(slot, shadow);
> > > > > + node->count--;
> > > > > + if (shadow)
> > > > > + node->count += 1U << RADIX_TREE_COUNT_SHIFT;
> > > > 
> > > > Nitpick 2:
> > > > Shouldn't this be a function in workingset.c rather than exposing
> > > > RADIX_TREE_COUNT_SHIFT?
> > > > 
> > > > IMO, it would be better to provide some accessor functions here, too.
> > > 
> > > The shadow maintenance and node lifetime management are pretty
> > > interwoven to share branches and reduce instructions as these are
> > > common paths.  I don't see how this could result in cleaner code while
> > > keeping these advantages.
> > 
> > What I want is just to put an inline accessor somewhere like workingset.h:
> > 
> > static inline void inc_shadow_entry(struct radix_tree_node *node)
> > {
> > node->count += 1U << RADIX_TREE_COUNT_MASK;
> > }
> > 
> > So nobody needs to know that the upper bits of node->count hold the
> > count of shadow entries.
> 
> Okay, but then you have to cover the lower bits as well; without
> explicit accessors for the higher bits it would be confusing to use the
> mask for the lower bits.
> 
> Something like the following?

LGTM.

-- 
Kind regards,
Minchan Kim


Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2014-01-23 Thread Johannes Weiner
On Thu, Jan 23, 2014 at 02:20:14PM +0900, Minchan Kim wrote:
> On Wed, Jan 22, 2014 at 01:42:17PM -0500, Johannes Weiner wrote:
> > On Mon, Jan 13, 2014 at 04:39:47PM +0900, Minchan Kim wrote:
> > > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > > @@ -123,9 +129,39 @@ static void page_cache_tree_delete(struct 
> > > > address_space *mapping,
> > > >  * same time and miss a shadow entry.
> > > >  */
> > > > smp_wmb();
> > > > -   } else
> > > > -   radix_tree_delete(&mapping->page_tree, page->index);
> > > > +   }
> > > > mapping->nrpages--;
> > > > +
> > > > +   if (!node) {
> > > > +   /* Clear direct pointer tags in root node */
> > > > +   mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
> > > > +   radix_tree_replace_slot(slot, shadow);
> > > > +   return;
> > > > +   }
> > > > +
> > > > +   /* Clear tree tags for the removed page */
> > > > +   index = page->index;
> > > > +   offset = index & RADIX_TREE_MAP_MASK;
> > > > +   for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
> > > > +   if (test_bit(offset, node->tags[tag]))
> > > > +   radix_tree_tag_clear(&mapping->page_tree, index, tag);
> > > > +   }
> > > > +
> > > > +   /* Delete page, swap shadow entry */
> > > > +   radix_tree_replace_slot(slot, shadow);
> > > > +   node->count--;
> > > > +   if (shadow)
> > > > +   node->count += 1U << RADIX_TREE_COUNT_SHIFT;
> > > 
> > > Nitpick 2:
> > > Shouldn't this be a function in workingset.c rather than exposing
> > > RADIX_TREE_COUNT_SHIFT?
> > > 
> > > IMO, it would be better to provide some accessor functions here, too.
> > 
> > The shadow maintenance and node lifetime management are pretty
> > interwoven to share branches and reduce instructions as these are
> > common paths.  I don't see how this could result in cleaner code while
> > keeping these advantages.
> 
> What I want is just to put an inline accessor somewhere like workingset.h:
> 
> static inline void inc_shadow_entry(struct radix_tree_node *node)
> {
> node->count += 1U << RADIX_TREE_COUNT_MASK;
> }
> 
> So nobody needs to know that the upper bits of node->count hold the
> count of shadow entries.

Okay, but then you have to cover the lower bits as well; without
explicit accessors for the higher bits it would be confusing to use the
mask for the lower bits.

Something like the following?

---

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 102e37bc82d5..b33171a3673c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -266,6 +266,36 @@ bool workingset_refault(void *shadow);
 void workingset_activation(struct page *page);
 extern struct list_lru workingset_shadow_nodes;
 
+static inline unsigned int workingset_node_pages(struct radix_tree_node *node)
+{
+   return node->count & RADIX_TREE_COUNT_MASK;
+}
+
+static inline void workingset_node_pages_inc(struct radix_tree_node *node)
+{
+   node->count++;
+}
+
+static inline void workingset_node_pages_dec(struct radix_tree_node *node)
+{
+   node->count--;
+}
+
+static inline unsigned int workingset_node_shadows(struct radix_tree_node *node)
+{
+   return node->count >> RADIX_TREE_COUNT_SHIFT;
+}
+
+static inline void workingset_node_shadows_inc(struct radix_tree_node *node)
+{
+   node->count += 1U << RADIX_TREE_COUNT_SHIFT;
+}
+
+static inline void workingset_node_shadows_dec(struct radix_tree_node *node)
+{
+   node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
+}
+
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
 extern unsigned long totalreserve_pages;
diff --git a/mm/filemap.c b/mm/filemap.c
index a63e89484d18..ac7f62db9ccd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -149,9 +149,9 @@ static void page_cache_tree_delete(struct address_space 
*mapping,
 
/* Delete page, swap shadow entry */
radix_tree_replace_slot(slot, shadow);
-   node->count--;
+   workingset_node_pages_dec(node);
if (shadow)
-   node->count += 1U << RADIX_TREE_COUNT_SHIFT;
+   workingset_node_shadows_inc(node);
else
		if (__radix_tree_delete_node(&mapping->page_tree, node))
return;
@@ -163,7 +163,7 @@ static void page_cache_tree_delete(struct address_space 
*mapping,
 * list_empty() test is safe as node->private_list is
 * protected by mapping->tree_lock.
 */
-   if (!(node->count & RADIX_TREE_COUNT_MASK) &&
+   if (!workingset_node_pages(node) &&
	    list_empty(&node->private_list)) {
node->private_data = mapping;
		list_lru_add(&workingset_shadow_nodes, &node->private_list);
@@ -531,12 +531,12 @@ static int page_cache_tree_insert(struct address_space 
*mapping,
*shadowp = p;
mapping->nrshadows--;
if (node)
-   node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
+   workingset_node_shadows_dec(node);
}
radix_tree_replace_slot(slot, page);
mapping->nrpages++;
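For illustration, a minimal stand-alone sketch (not part of the patch) of what
these accessors hide: node->count keeps the page count in its low bits and the
shadow-entry count above them.  The concrete numbers assume
RADIX_TREE_MAP_SHIFT == 6 (64 slots per node), so RADIX_TREE_COUNT_SHIFT is 7,
matching the definitions quoted elsewhere in the thread.

#include <assert.h>

#define RADIX_TREE_MAP_SHIFT	6				/* assumed: 64 slots per node */
#define RADIX_TREE_COUNT_SHIFT	(RADIX_TREE_MAP_SHIFT + 1)	/* 7 */
#define RADIX_TREE_COUNT_MASK	((1U << RADIX_TREE_COUNT_SHIFT) - 1)

int main(void)
{
	unsigned int count = 0;			/* stands in for node->count */

	count++;				/* one resident page installed */
	count += 1U << RADIX_TREE_COUNT_SHIFT;	/* one shadow entry stored */
	count += 1U << RADIX_TREE_COUNT_SHIFT;	/* and another one */

	/* what workingset_node_pages() / workingset_node_shadows() would return */
	assert((count & RADIX_TREE_COUNT_MASK) == 1);
	assert((count >> RADIX_TREE_COUNT_SHIFT) == 2);
	return 0;
}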

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2014-01-22 Thread Minchan Kim
On Mon, Jan 20, 2014 at 06:17:37PM -0500, Johannes Weiner wrote:
> On Fri, Jan 17, 2014 at 11:05:17AM +1100, Dave Chinner wrote:
> > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > Previously, page cache radix tree nodes were freed after reclaim
> > > emptied out their page pointers.  But now reclaim stores shadow
> > > entries in their place, which are only reclaimed when the inodes
> > > themselves are reclaimed.  This is problematic for bigger files that
> > > are still in use after they have a significant amount of their cache
> > > reclaimed, without any of those pages actually refaulting.  The shadow
> > > entries will just sit there and waste memory.  In the worst case, the
> > > shadow entries will accumulate until the machine runs out of memory.
> > > 
> > > To get this under control, the VM will track radix tree nodes
> > > exclusively containing shadow entries on a per-NUMA node list.
> > > Per-NUMA rather than global because we expect the radix tree nodes
> > > themselves to be allocated node-locally and we want to reduce
> > > cross-node references of otherwise independent cache workloads.  A
> > > simple shrinker will then reclaim these nodes on memory pressure.
> > > 
> > > A few things need to be stored in the radix tree node to implement the
> > > shadow node LRU and allow tree deletions coming from the list:
> > 
> > Just a couple of things with the list_lru interfaces.
> > 
> > 
> > > @@ -123,9 +129,39 @@ static void page_cache_tree_delete(struct 
> > > address_space *mapping,
> > >* same time and miss a shadow entry.
> > >*/
> > >   smp_wmb();
> > > - } else
> > > - radix_tree_delete(&mapping->page_tree, page->index);
> > > + }
> > >   mapping->nrpages--;
> > > +
> > > + if (!node) {
> > > + /* Clear direct pointer tags in root node */
> > > + mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
> > > + radix_tree_replace_slot(slot, shadow);
> > > + return;
> > > + }
> > > +
> > > + /* Clear tree tags for the removed page */
> > > + index = page->index;
> > > + offset = index & RADIX_TREE_MAP_MASK;
> > > + for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
> > > + if (test_bit(offset, node->tags[tag]))
> > > + radix_tree_tag_clear(&mapping->page_tree, index, tag);
> > > + }
> > > +
> > > + /* Delete page, swap shadow entry */
> > > + radix_tree_replace_slot(slot, shadow);
> > > + node->count--;
> > > + if (shadow)
> > > + node->count += 1U << RADIX_TREE_COUNT_SHIFT;
> > > + else
> > > + if (__radix_tree_delete_node(&mapping->page_tree, node))
> > > + return;
> > > +
> > > + /* Only shadow entries in there, keep track of this node */
> > > + if (!(node->count & RADIX_TREE_COUNT_MASK) &&
> > > + list_empty(&node->private_list)) {
> > > + node->private_data = mapping;
> > > + list_lru_add(&workingset_shadow_nodes, &node->private_list);
> > > + }
> > 
> > You can't do this list_empty(&node->private_list) check safely
> > externally to the list_lru code - the only time that entry can be
> > checked safely is under the LRU list locks. This is the reason that
> > list_lru_add/list_lru_del return a boolean to indicate whether the object
> > was added/removed from the list - they do this list_empty() check
> > internally. i.e. the correct, safe way to conditionally update
> > state iff the object was added to the LRU is:
> > 
> > if (!(node->count & RADIX_TREE_COUNT_MASK)) {
> > if (list_lru_add(&workingset_shadow_nodes, &node->private_list))
> > node->private_data = mapping;
> > }
> > 
> > > + radix_tree_replace_slot(slot, page);
> > > + mapping->nrpages++;
> > > + if (node) {
> > > + node->count++;
> > > + /* Installed page, can't be shadow-only anymore */
> > > + if (!list_empty(&node->private_list))
> > > + list_lru_del(&workingset_shadow_nodes,
> > > +  &node->private_list);
> > > + }
> > 
> > Same issue here:
> > 
> > if (node) {
> > node->count++;
> > list_lru_del(&workingset_shadow_nodes, &node->private_list);
> > }
> 
> All modifications to node->private_list happen under
> mapping->tree_lock, and modifications of a neighboring link should not
> affect the outcome of the list_empty(), so I don't think the lru lock
> is necessary.
> 
> It would be cleaner to take it of course, but that would mean adding
> an unconditional NUMA-node-wide lock to every page cache population.
> 
> > >  static int __add_to_page_cache_locked(struct page *page,
> > > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > > index 72f9decb0104..47a9faf4070b 100644
> > > --- a/mm/list_lru.c
> > > +++ b/mm/list_lru.c
> > > @@ -88,10 +88,18 @@ restart:
> > >   ret = isolate(item, &nlru->lock, cb_arg);
> > >   switch (ret) {
> > >   case LRU_REMOVED:
> > > + case LRU_REMOVED_RETRY:
> > >   if (--nlru->nr_items == 0)
> > >   node_clear(nid, lru->active_nodes);
> > >   WARN_ON_ONCE(nlru->nr_items < 0);
> > >   isolated++;
> > > + /*
> > > +  * If the lru lock has been dropped, our list
> > > +  * traversal is now invalid and so we have to

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2014-01-22 Thread Minchan Kim
On Wed, Jan 22, 2014 at 01:42:17PM -0500, Johannes Weiner wrote:
> On Mon, Jan 13, 2014 at 04:39:47PM +0900, Minchan Kim wrote:
> > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > Previously, page cache radix tree nodes were freed after reclaim
> > > emptied out their page pointers.  But now reclaim stores shadow
> > > entries in their place, which are only reclaimed when the inodes
> > > themselves are reclaimed.  This is problematic for bigger files that
> > > are still in use after they have a significant amount of their cache
> > > reclaimed, without any of those pages actually refaulting.  The shadow
> > > entries will just sit there and waste memory.  In the worst case, the
> > > shadow entries will accumulate until the machine runs out of memory.
> > > 
> > > To get this under control, the VM will track radix tree nodes
> > > exclusively containing shadow entries on a per-NUMA node list.
> > > Per-NUMA rather than global because we expect the radix tree nodes
> > > themselves to be allocated node-locally and we want to reduce
> > > cross-node references of otherwise independent cache workloads.  A
> > > simple shrinker will then reclaim these nodes on memory pressure.
> > > 
> > > A few things need to be stored in the radix tree node to implement the
> > > shadow node LRU and allow tree deletions coming from the list:
> > > 
> > > 1. There is no index available that would describe the reverse path
> > >from the node up to the tree root, which is needed to perform a
> > >deletion.  To solve this, encode in each node its offset inside the
> > >parent.  This can be stored in the unused upper bits of the same
> > >member that stores the node's height at no extra space cost.
> > > 
> > > 2. The number of shadow entries needs to be counted in addition to the
> > >regular entries, to quickly detect when the node is ready to go to
> > >the shadow node LRU list.  The current entry count is an unsigned
> > >int but the maximum number of entries is 64, so a shadow counter
> > >can easily be stored in the unused upper bits.
> > > 
> > > 3. Tree modification needs tree lock and tree root, which are located
> > >in the address space, so store an address_space backpointer in the
> > >node.  The parent pointer of the node is in a union with the 2-word
> > >rcu_head, so the backpointer comes at no extra cost as well.
> > > 
> > > 4. The node needs to be linked to an LRU list, which requires a list
> > >head inside the node.  This does increase the size of the node, but
> > >it does not change the number of objects that fit into a slab page.
> > > 
> > > Signed-off-by: Johannes Weiner 
> > > ---
> > >  include/linux/list_lru.h   |   2 +
> > >  include/linux/mmzone.h |   1 +
> > >  include/linux/radix-tree.h |  32 +---
> > >  include/linux/swap.h   |   1 +
> > >  lib/radix-tree.c   |  36 --
> > >  mm/filemap.c   |  77 +++--
> > >  mm/list_lru.c  |   8 +++
> > >  mm/truncate.c  |  20 +++-
> > >  mm/vmstat.c|   1 +
> > >  mm/workingset.c| 121 
> > > +
> > >  10 files changed, 259 insertions(+), 40 deletions(-)
> > > 
> > > diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> > > index 3ce541753c88..b02fc233eadd 100644
> > > --- a/include/linux/list_lru.h
> > > +++ b/include/linux/list_lru.h
> > > @@ -13,6 +13,8 @@
> > >  /* list_lru_walk_cb has to always return one of those */
> > >  enum lru_status {
> > >   LRU_REMOVED,/* item removed from list */
> > > + LRU_REMOVED_RETRY,  /* item removed, but lock has been
> > > +dropped and reacquired */
> > >   LRU_ROTATE, /* item referenced, give another pass */
> > >   LRU_SKIP,   /* item cannot be locked, skip */
> > >   LRU_RETRY,  /* item not freeable. May drop the lock
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 118ba9f51e86..8cac5a7ef7a7 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -144,6 +144,7 @@ enum zone_stat_item {
> > >  #endif
> > >   WORKINGSET_REFAULT,
> > >   WORKINGSET_ACTIVATE,
> > > + WORKINGSET_NODERECLAIM,
> > >   NR_ANON_TRANSPARENT_HUGEPAGES,
> > >   NR_FREE_CMA_PAGES,
> > >   NR_VM_ZONE_STAT_ITEMS };
> > > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > > index 13636c40bc42..33170dbd9db4 100644
> > > --- a/include/linux/radix-tree.h
> > > +++ b/include/linux/radix-tree.h
> > > @@ -72,21 +72,37 @@ static inline int radix_tree_is_indirect_ptr(void 
> > > *ptr)
> > >  #define RADIX_TREE_TAG_LONGS \
> > >   ((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
> > >  
> > > +#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
> > > +#define RADIX_TREE_MAX_PATH 

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2014-01-22 Thread Johannes Weiner
On Wed, Jan 22, 2014 at 01:57:14AM -0500, Johannes Weiner wrote:
> Not at this time, I'll try to look into that.  For now, I am updating
> the patch to revert the shrinker back to DEFAULT_SEEKS and change the
> object count to only include objects above a certain threshold, which
> assumes a worst-case population of 4 in 64 slots.  It's not perfect,
> but neither was the seeks magic, and it's easier to reason about what
> it's actually doing.

Ah, the quality of 2am submissions...  8 out of 64 of course.

> @@ -266,14 +269,38 @@ struct list_lru workingset_shadow_nodes;
>  static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>   struct shrink_control *sc)
>  {
> - return list_lru_count_node(&workingset_shadow_nodes, sc->nid);
> + unsigned long shadow_nodes;
> + unsigned long max_nodes;
> + unsigned long pages;
> +
> + shadow_nodes = list_lru_count_node(&workingset_shadow_nodes, sc->nid);
> + pages = node_present_pages(sc->nid);
> + /*
> +  * Active cache pages are limited to 50% of memory, and shadow
> +  * entries that represent a refault distance bigger than that
> +  * do not have any effect.  Limit the number of shadow nodes
> +  * such that shadow entries do not exceed the number of active
> +  * cache pages, assuming a worst-case node population density
> +  * of 1/16th on average.

1/8th.  The actual code is consistent:

> +  * On 64-bit with 7 radix_tree_nodes per page and 64 slots
> +  * each, this will reclaim shadow entries when they consume
> +  * ~2% of available memory:
> +  *
> +  * PAGE_SIZE / radix_tree_nodes / node_entries / PAGE_SIZE
> +  */
> + max_nodes = pages >> (1 + RADIX_TREE_MAP_SHIFT - 3);
> +
> + if (shadow_nodes <= max_nodes)
> + return 0;
> +
> + return shadow_nodes - max_nodes;
>  }
>  
>  static enum lru_status shadow_lru_isolate(struct list_head *item,
> spinlock_t *lru_lock,
> void *arg)
>  {
> - unsigned long *nr_reclaimed = arg;
>   struct address_space *mapping;
>   struct radix_tree_node *node;
>   unsigned int i;
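
To make the cutoff concrete, a small stand-alone sketch (not from the patch) of
the same arithmetic, assuming RADIX_TREE_MAP_SHIFT == 6 as in the comment: at
most half of memory as active cache, a worst-case density of 8 shadow entries
per node, hence pages/2 entries at 8 per node gives pages/16 nodes, which is
the pages >> (1 + RADIX_TREE_MAP_SHIFT - 3) shift quoted above.

#include <stdio.h>

#define RADIX_TREE_MAP_SHIFT	6	/* assumed: 64 slots per radix tree node */

/* Same cutoff as the count_shadow_nodes() hunk above: pages/2 worth of
 * useful shadow entries, 8 entries per node (1/8th density), so pages/16
 * nodes, i.e. a shift of (1 + RADIX_TREE_MAP_SHIFT - 3). */
static unsigned long max_shadow_nodes(unsigned long present_pages)
{
	return present_pages >> (1 + RADIX_TREE_MAP_SHIFT - 3);
}

int main(void)
{
	unsigned long pages = 4UL << 20;	/* hypothetical NUMA node: 4M pages (16G at 4K) */

	printf("present pages:    %lu\n", pages);
	printf("max shadow nodes: %lu\n", max_shadow_nodes(pages));
	/* only the excess over this number is reported to the shrinker */
	return 0;
}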


Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2014-01-22 Thread Johannes Weiner
On Mon, Jan 13, 2014 at 04:39:47PM +0900, Minchan Kim wrote:
> On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> > 
> > To get this under control, the VM will track radix tree nodes
> > exclusively containing shadow entries on a per-NUMA node list.
> > Per-NUMA rather than global because we expect the radix tree nodes
> > themselves to be allocated node-locally and we want to reduce
> > cross-node references of otherwise independent cache workloads.  A
> > simple shrinker will then reclaim these nodes on memory pressure.
> > 
> > A few things need to be stored in the radix tree node to implement the
> > shadow node LRU and allow tree deletions coming from the list:
> > 
> > 1. There is no index available that would describe the reverse path
> >from the node up to the tree root, which is needed to perform a
> >deletion.  To solve this, encode in each node its offset inside the
> >parent.  This can be stored in the unused upper bits of the same
> >member that stores the node's height at no extra space cost.
> > 
> > 2. The number of shadow entries needs to be counted in addition to the
> >regular entries, to quickly detect when the node is ready to go to
> >the shadow node LRU list.  The current entry count is an unsigned
> >int but the maximum number of entries is 64, so a shadow counter
> >can easily be stored in the unused upper bits.
> > 
> > 3. Tree modification needs tree lock and tree root, which are located
> >in the address space, so store an address_space backpointer in the
> >node.  The parent pointer of the node is in a union with the 2-word
> >rcu_head, so the backpointer comes at no extra cost as well.
> > 
> > 4. The node needs to be linked to an LRU list, which requires a list
> >head inside the node.  This does increase the size of the node, but
> >it does not change the number of objects that fit into a slab page.
> > 
> > Signed-off-by: Johannes Weiner 
> > ---
> >  include/linux/list_lru.h   |   2 +
> >  include/linux/mmzone.h |   1 +
> >  include/linux/radix-tree.h |  32 +---
> >  include/linux/swap.h   |   1 +
> >  lib/radix-tree.c   |  36 --
> >  mm/filemap.c   |  77 +++--
> >  mm/list_lru.c  |   8 +++
> >  mm/truncate.c  |  20 +++-
> >  mm/vmstat.c|   1 +
> >  mm/workingset.c| 121 
> > +
> >  10 files changed, 259 insertions(+), 40 deletions(-)
> > 
> > diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> > index 3ce541753c88..b02fc233eadd 100644
> > --- a/include/linux/list_lru.h
> > +++ b/include/linux/list_lru.h
> > @@ -13,6 +13,8 @@
> >  /* list_lru_walk_cb has to always return one of those */
> >  enum lru_status {
> > LRU_REMOVED,/* item removed from list */
> > +   LRU_REMOVED_RETRY,  /* item removed, but lock has been
> > +  dropped and reacquired */
> > LRU_ROTATE, /* item referenced, give another pass */
> > LRU_SKIP,   /* item cannot be locked, skip */
> > LRU_RETRY,  /* item not freeable. May drop the lock
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 118ba9f51e86..8cac5a7ef7a7 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -144,6 +144,7 @@ enum zone_stat_item {
> >  #endif
> > WORKINGSET_REFAULT,
> > WORKINGSET_ACTIVATE,
> > +   WORKINGSET_NODERECLAIM,
> > NR_ANON_TRANSPARENT_HUGEPAGES,
> > NR_FREE_CMA_PAGES,
> > NR_VM_ZONE_STAT_ITEMS };
> > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > index 13636c40bc42..33170dbd9db4 100644
> > --- a/include/linux/radix-tree.h
> > +++ b/include/linux/radix-tree.h
> > @@ -72,21 +72,37 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
> >  #define RADIX_TREE_TAG_LONGS   \
> > ((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
> >  
> > +#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
> > +#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
> > + RADIX_TREE_MAP_SHIFT))
> > +
> > +/* Height component in node->path */
> > +#define RADIX_TREE_HEIGHT_SHIFT	(RADIX_TREE_MAX_PATH + 1)
> > +#define RADIX_TREE_HEIGHT_MASK	((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1)
> > +
> > +/* Internally used bits of node->count */
> > +#define RADIX_TREE_COUNT_SHIFT	(RADIX_TREE_MAP_SHIFT + 1)
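
For illustration, a stand-alone sketch (not from the patch) of the packing that
point 1 of the changelog describes: the node's offset inside its parent shares
one word with its height, height in the low bits and offset in the unused upper
bits, consistent with the RADIX_TREE_HEIGHT_* macros above.  The concrete
numbers assume 64-bit longs and RADIX_TREE_MAP_SHIFT == 6, which would give
RADIX_TREE_MAX_PATH == 11; the exact layout here is an assumption for the sketch.

#include <assert.h>

#define RADIX_TREE_MAX_PATH	11	/* assumed: DIV_ROUND_UP(64, 6) */
#define RADIX_TREE_HEIGHT_SHIFT	(RADIX_TREE_MAX_PATH + 1)
#define RADIX_TREE_HEIGHT_MASK	((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1)

int main(void)
{
	unsigned int height = 3;	/* hypothetical node height */
	unsigned int offset = 17;	/* hypothetical slot index in the parent node */

	/* pack both into the member that used to hold only the height */
	unsigned int path = height | (offset << RADIX_TREE_HEIGHT_SHIFT);

	assert((path & RADIX_TREE_HEIGHT_MASK) == height);	/* height in the low bits */
	assert((path >> RADIX_TREE_HEIGHT_SHIFT) == offset);	/* offset in the upper bits */
	return 0;
}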

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2014-01-21 Thread Johannes Weiner
On Wed, Jan 22, 2014 at 02:06:07PM +1100, Dave Chinner wrote:
> On Tue, Jan 21, 2014 at 12:50:17AM -0500, Johannes Weiner wrote:
> > On Tue, Jan 21, 2014 at 02:03:58PM +1100, Dave Chinner wrote:
> > > On Mon, Jan 20, 2014 at 06:17:37PM -0500, Johannes Weiner wrote:
> > > > On Fri, Jan 17, 2014 at 11:05:17AM +1100, Dave Chinner wrote:
> > > > > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > > > > +static struct shrinker workingset_shadow_shrinker = {
> > > > > > +   .count_objects = count_shadow_nodes,
> > > > > > +   .scan_objects = scan_shadow_nodes,
> > > > > > +   .seeks = DEFAULT_SEEKS * 4,
> > > > > > +   .flags = SHRINKER_NUMA_AWARE,
> > > > > > +};
> > > > > 
> > > > > Can you add a comment explaining how you calculated the .seeks
> > > > > value? It's important to document the weighings/importance
> > > > > we give to slab reclaim so we can determine if it's actually
> > > > > acheiving the desired balance under different loads...
> > > > 
> > > > This is not an exact science, to say the least.
> > > 
> > > I know, that's why I asked it be documented rather than be something
> > > kept in your head.
> > > 
> > > > The shadow entries are mostly self-regulated, so I don't want the
> > > > shrinker to interfere while the machine is just regularly trimming
> > > > caches during normal operation.
> > > > 
> > > > It should only kick in when either a) reclaim is picking up and the
> > > > scan-to-reclaim ratio increases due to mapped pages, dirty cache,
> > > > swapping etc. or b) the number of objects compared to LRU pages
> > > > becomes excessive.
> > > > 
> > > > I think that is what most shrinkers with an elevated seeks value want,
> > > > but this translates very awkwardly (and not completely) to the current
> > > > cost model, and we should probably rework that interface.
> > > > 
> > > > "Seeks" currently encodes 3 ratios:
> > > > 
> > > >   1. the cost of creating an object vs. a page
> > > > 
> > > >   2. the expected number of objects vs. pages
> > > 
> > > It doesn't encode that at all. If it did, then the default value
> > > wouldn't be "2".
> > >
> > > >   3. the cost of reclaiming an object vs. a page
> > > 
> > > Which, when you consider #3 in conjunction with #1, the actual
> > > intended meaning of .seeks is "the cost of replacing this object in
> > > the cache compared to the cost of replacing a page cache page."
> > 
> > But what it actually seems to do is translate scan rate from LRU pages
> > to scan rate in another object pool.  The actual replacement cost
> > varies based on hotness of each set, an in-use object is more
> > expensive to replace than a cold page and vice versa, the dentry and
> > inode shrinkers reflect this by rotating hot objects and refusing to
> > actually reclaim items while they are in active use.
> 
> Right, but so does the page cache when the page referenced bit is
> seen by the LRU scanner. That's a scanned page, so what is passed to
> shrink_slab is a ratio of pages scanned vs pages eligible for
> reclaim. IOWs, the fact that the slab caches rotate rather than
> reclaim is irrelevant - what matters is the same proportional
> pressure is applied to the slab cache that was applied to the page
> cache

Oh, but it does.  You apply the same pressure to both, but the actual
reclaim outcome depends on object valuation measures specific to each
pool (e.g. recently referenced or not), whereas my shrinker takes
sc->nr_to_scan objects and reclaims them without looking at their
individual value, which varies just like the value of slab objects
varies.

I thought I could compensate for the lack of object valuation in the
shadow shrinker by tweaking that fixed pressure factor between page
cache and shadow entries, but I'm no longer convinced this can work.

One thing that does affect the value of shadow entries is the overall
health of the system, memory-wise, so reclaim efficiency would be one
factor that affects individual object value, albeit a secondary one.

The most obvious value factor is whether the shadow entries in a node
are expired or not, but there are potentially 64 of them, potentially
from different zones with different "inactive ages" atomic_t's, so
that is fairly expensive to assess.

> > So I am having a hard time deriving a meaningful value out of this
> > definition for my usecase because I want to push back objects based on
> > reclaim efficiency (scan rate vs. reclaim rate).  The other shrinkers
> > with non-standard seek settings reek of magic number as well, which
> > suggests I am not alone with this.
> 
> Right, which is exactly why I'm asking you to document it. I've got
> no idea how other subsystems have come up with their magic numbers
> because they are not documented, and so it's just about impossible
> to determine what the author of the code really needed and hence the
> best way to improve the interface is difficult to determine.
> 
> > I wonder if we can come up with a better interface that allows both
> 

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2014-01-21 Thread Dave Chinner
On Tue, Jan 21, 2014 at 12:50:17AM -0500, Johannes Weiner wrote:
> On Tue, Jan 21, 2014 at 02:03:58PM +1100, Dave Chinner wrote:
> > On Mon, Jan 20, 2014 at 06:17:37PM -0500, Johannes Weiner wrote:
> > > On Fri, Jan 17, 2014 at 11:05:17AM +1100, Dave Chinner wrote:
> > > > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > > > +static struct shrinker workingset_shadow_shrinker = {
> > > > > + .count_objects = count_shadow_nodes,
> > > > > + .scan_objects = scan_shadow_nodes,
> > > > > + .seeks = DEFAULT_SEEKS * 4,
> > > > > + .flags = SHRINKER_NUMA_AWARE,
> > > > > +};
> > > > 
> > > > Can you add a comment explaining how you calculated the .seeks
> > > > value? It's important to document the weighings/importance
> > > > we give to slab reclaim so we can determine if it's actually
> > > > acheiving the desired balance under different loads...
> > > 
> > > This is not an exact science, to say the least.
> > 
> > I know, that's why I asked it be documented rather than be something
> > kept in your head.
> > 
> > > The shadow entries are mostly self-regulated, so I don't want the
> > > shrinker to interfere while the machine is just regularly trimming
> > > caches during normal operation.
> > > 
> > > It should only kick in when either a) reclaim is picking up and the
> > > scan-to-reclaim ratio increases due to mapped pages, dirty cache,
> > > swapping etc. or b) the number of objects compared to LRU pages
> > > becomes excessive.
> > > 
> > > I think that is what most shrinkers with an elevated seeks value want,
> > > but this translates very awkwardly (and not completely) to the current
> > > cost model, and we should probably rework that interface.
> > > 
> > > "Seeks" currently encodes 3 ratios:
> > > 
> > >   1. the cost of creating an object vs. a page
> > > 
> > >   2. the expected number of objects vs. pages
> > 
> > It doesn't encode that at all. If it did, then the default value
> > wouldn't be "2".
> >
> > >   3. the cost of reclaiming an object vs. a page
> > 
> > Which, when you consider #3 in conjunction with #1, the actual
> > intended meaning of .seeks is "the cost of replacing this object in
> > the cache compared to the cost of replacing a page cache page."
> 
> But what it actually seems to do is translate scan rate from LRU pages
> to scan rate in another object pool.  The actual replacement cost
> varies based on hotness of each set, an in-use object is more
> expensive to replace than a cold page and vice versa, the dentry and
> inode shrinkers reflect this by rotating hot objects and refusing to
> actually reclaim items while they are in active use.

Right, but so does the page cache when the page referenced bit is
seen by the LRU scanner. That's a scanned page, so what is passed to
shrink_slab is a ratio of pages scanned vs pages eligible for
reclaim. IOWs, the fact that the slab caches rotate rather than
reclaim is irrelevant - what matters is the same proportional
pressure is applied to the slab cache that was applied to the page
cache

> So I am having a hard time deriving a meaningful value out of this
> definition for my usecase because I want to push back objects based on
> reclaim efficiency (scan rate vs. reclaim rate).  The other shrinkers
> with non-standard seek settings reek of magic number as well, which
> suggests I am not alone with this.

Right, which is exactly why I'm asking you to document it. I've got
no idea how other subsystems have come up with their magic numbers
because they are not documented, and so it's just about impossible
to determine what the author of the code really needed and hence the
best way to improve the interface is difficult to determine.

> I wonder if we can come up with a better interface that allows both
> traditional cache shrinkers with their own aging, as well as object
> pools that want to push back based on reclaim efficiency.

We probably can, though I'd prefer we don't end up with some
alternative algorithm that is specific to a single shrinker.

So, how do we measure page cache reclaim efficiency? How can that be
communicated to a shrinker? How can we tell a shrinker what measure
to use? How do we tell shrinker authors what measure to use?  How do
we translate that new method into useful scan count information?

> > > but they are not necessarily correlated.  How I would like to
> > > configure the shadow shrinker instead is:
> > > 
> > >   o scan objects when reclaim efficiency is down to 75%, because they
> > > are more valuable than use-once cache but less than workingset
> > > 
> > >   o scan objects when the ratio between them and the number of pages
> > > exceeds 1/32 (one shadow entry for each resident page, up to 64
> > > entries per shrinkable object, assume 50% packing for robustness)
> > > 
> > >   o as the expected balance between objects and lru pages is 1:32,
> > > reclaim one object for every 32 reclaimed LRU pages, instead of
> > > assuming that number of 
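
Taking the quoted 1:32 object-to-page target literally, the bookkeeping
it implies is tiny -- purely illustrative, the helper names below are
invented:

/* one shadow entry per resident page, 64 entries per node, 50% packed */
static unsigned long shadow_node_target(unsigned long lru_pages)
{
        return lru_pages / 32;
}

static unsigned long shadow_node_excess(unsigned long nr_shadow_nodes,
                                        unsigned long lru_pages)
{
        unsigned long target = shadow_node_target(lru_pages);

        return nr_shadow_nodes > target ? nr_shadow_nodes - target : 0;
}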

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2014-01-21 Thread Johannes Weiner
On Wed, Jan 22, 2014 at 02:06:07PM +1100, Dave Chinner wrote:
> On Tue, Jan 21, 2014 at 12:50:17AM -0500, Johannes Weiner wrote:
> > On Tue, Jan 21, 2014 at 02:03:58PM +1100, Dave Chinner wrote:
> > > On Mon, Jan 20, 2014 at 06:17:37PM -0500, Johannes Weiner wrote:
> > > > On Fri, Jan 17, 2014 at 11:05:17AM +1100, Dave Chinner wrote:
> > > > > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > > > > +static struct shrinker workingset_shadow_shrinker = {
> > > > > > +   .count_objects = count_shadow_nodes,
> > > > > > +   .scan_objects = scan_shadow_nodes,
> > > > > > +   .seeks = DEFAULT_SEEKS * 4,
> > > > > > +   .flags = SHRINKER_NUMA_AWARE,
> > > > > > +};
> > > > > 
> > > > > Can you add a comment explaining how you calculated the .seeks
> > > > > value? It's important to document the weighings/importance
> > > > > we give to slab reclaim so we can determine if it's actually
> > > > > acheiving the desired balance under different loads...
> > > > 
> > > > This is not an exact science, to say the least.
> > > 
> > > I know, that's why I asked it be documented rather than be something
> > > kept in your head.
> > > 
> > > > The shadow entries are mostly self-regulated, so I don't want the
> > > > shrinker to interfere while the machine is just regularly trimming
> > > > caches during normal operation.
> > > > 
> > > > It should only kick in when either a) reclaim is picking up and the
> > > > scan-to-reclaim ratio increases due to mapped pages, dirty cache,
> > > > swapping etc. or b) the number of objects compared to LRU pages
> > > > becomes excessive.
> > > > 
> > > > I think that is what most shrinkers with an elevated seeks value want,
> > > > but this translates very awkwardly (and not completely) to the current
> > > > cost model, and we should probably rework that interface.
> > > > 
> > > > "Seeks" currently encodes 3 ratios:
> > > > 
> > > >   1. the cost of creating an object vs. a page
> > > > 
> > > >   2. the expected number of objects vs. pages
> > > 
> > > It doesn't encode that at all. If it did, then the default value
> > > wouldn't be "2".
> > > 
> > > >   3. the cost of reclaiming an object vs. a page
> > > 
> > > Which, when you consider #3 in conjunction with #1, the actual
> > > intended meaning of .seeks is "the cost of replacing this object in
> > > the cache compared to the cost of replacing a page cache page."
> > 
> > But what it actually seems to do is translate scan rate from LRU pages
> > to scan rate in another object pool.  The actual replacement cost
> > varies based on hotness of each set, an in-use object is more
> > expensive to replace than a cold page and vice versa, the dentry and
> > inode shrinkers reflect this by rotating hot objects and refusing to
> > actually reclaim items while they are in active use.
> 
> Right, but so does the page cache when the page referenced bit is
> seen by the LRU scanner. That's a scanned page, so what is passed to
> shrink_slab is a ratio of pages scanned vs pages eligible for
> reclaim. IOWs, the fact that the slab caches rotate rather than
> reclaim is irrelevant - what matters is the same proportional
> pressure is applied to the slab cache that was applied to the page
> cache

Oh, but it does.  You apply the same pressure to both, but the actual
reclaim outcome depends on object valuation measures specific to each
pool (e.g. recently referenced or not), whereas my shrinker takes
sc->nr_to_scan objects and reclaims them without looking at their
individual value, which varies just like the value of slab objects
varies.

I thought I could compensate for the lack of object valuation in the
shadow shrinker by tweaking that fixed pressure factor between page
cache and shadow entries, but I'm no longer convinced this can work.

One thing that does affect the value of shadow entries is the overall
health of the system, memory-wise, so reclaim efficiency would be one
factor that affects individual object value, albeit a secondary one.

The most obvious value factor is whether the shadow entries in a node
are expired or not, but there are potentially 64 of them, potentially
from different zones with different inactive ages atomic_t's, so
that is fairly expensive to assess.

> > So I am having a hard time deriving a meaningful value out of this
> > definition for my usecase because I want to push back objects based on
> > reclaim efficiency (scan rate vs. reclaim rate).  The other shrinkers
> > with non-standard seek settings reek of magic number as well, which
> > suggests I am not alone with this.
> 
> Right, which is exactly why I'm asking you to document it. I've got
> no idea how other subsystems have come up with their magic numbers
> because they are not documented, and so it's just about impossible
> to determine what the author of the code really needed and hence the
> best way to improve the interface is difficult to determine.
> 
> > I wonder if we can come up with a better interface that allows both
> > traditional cache shrinkers with their own aging, as well as object
> > pools that want to push back based on reclaim efficiency.
> 
> We probably can, though I'd prefer we don't end up with some
> alternative algorithm that is specific to a single 

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2014-01-20 Thread Johannes Weiner
On Tue, Jan 21, 2014 at 02:03:58PM +1100, Dave Chinner wrote:
> On Mon, Jan 20, 2014 at 06:17:37PM -0500, Johannes Weiner wrote:
> > On Fri, Jan 17, 2014 at 11:05:17AM +1100, Dave Chinner wrote:
> > > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > > +   /* Only shadow entries in there, keep track of this node */
> > > > +   if (!(node->count & RADIX_TREE_COUNT_MASK) &&
> > > > +   list_empty(&node->private_list)) {
> > > > +   node->private_data = mapping;
> > > > +   list_lru_add(&workingset_shadow_nodes,
> > > > +                &node->private_list);
> > > > +   }
> > > 
> > > You can't do this list_empty(&node->private_list) check safely
> > > externally to the list_lru code - only time that entry can be
> > > checked safely is under the LRU list locks. This is the reason that
> > > list_lru_add/list_lru_del return a boolean to indicate is the object
> > > was added/removed from the list - they do this list_empty() check
> > > internally. i.e. the correct, safe way to do conditionally update
> > > state iff the object was added to the LRU is:
> > > 
> > >   if (!(node->count & RADIX_TREE_COUNT_MASK)) {
> > >   if (list_lru_add(&workingset_shadow_nodes, &node->private_list))
> > >   node->private_data = mapping;
> > >   }
> > > 
> > > > +   radix_tree_replace_slot(slot, page);
> > > > +   mapping->nrpages++;
> > > > +   if (node) {
> > > > +   node->count++;
> > > > +   /* Installed page, can't be shadow-only anymore */
> > > > +   if (!list_empty(&node->private_list))
> > > > +   list_lru_del(&workingset_shadow_nodes,
> > > > +                &node->private_list);
> > > > +   }
> > > 
> > > Same issue here:
> > > 
> > >   if (node) {
> > >   node->count++;
> > >   list_lru_del(&workingset_shadow_nodes, &node->private_list);
> > >   }
> > 
> > All modifications to node->private_list happen under
> > mapping->tree_lock, and modifications of a neighboring link should not
> > affect the outcome of the list_empty(), so I don't think the lru lock
> > is necessary.
> 
> Can you please add that as a comment somewhere explaining why it is
> safe to do this?

Absolutely.

> > > > +   case LRU_REMOVED_RETRY:
> > > > if (--nlru->nr_items == 0)
> > > > node_clear(nid, lru->active_nodes);
> > > > WARN_ON_ONCE(nlru->nr_items < 0);
> > > > isolated++;
> > > > +   /*
> > > > +* If the lru lock has been dropped, our list
> > > > +* traversal is now invalid and so we have to
> > > > +* restart from scratch.
> > > > +*/
> > > > +   if (ret == LRU_REMOVED_RETRY)
> > > > +   goto restart;
> > > > break;
> > > > case LRU_ROTATE:
> > > >   list_move_tail(item, &nlru->list);
> > > 
> > > I think that we need to assert that the list lru lock is correctly
> > > held here on return with LRU_REMOVED_RETRY. i.e.
> > > 
> > >   case LRU_REMOVED_RETRY:
> > >   assert_spin_locked(&nlru->lock);
> > >   case LRU_REMOVED:
> > 
> > Ah, good idea.  How about adding it to LRU_RETRY as well?
> 
> Yup, good idea.

Ok, will do.
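
Putting the two agreed-on assertions together, the isolate() return
handling would look roughly like this (sketch only -- the LRU_SKIP and
default cases are omitted, and the final code may differ):

	switch (ret) {
	case LRU_REMOVED_RETRY:
		assert_spin_locked(&nlru->lock);
		/* fall through */
	case LRU_REMOVED:
		if (--nlru->nr_items == 0)
			node_clear(nid, lru->active_nodes);
		WARN_ON_ONCE(nlru->nr_items < 0);
		isolated++;
		/*
		 * If the lru lock has been dropped, our list traversal
		 * is now invalid and so we have to restart from scratch.
		 */
		if (ret == LRU_REMOVED_RETRY)
			goto restart;
		break;
	case LRU_ROTATE:
		list_move_tail(item, &nlru->list);
		break;
	case LRU_RETRY:
		assert_spin_locked(&nlru->lock);
		goto restart;
	}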

> > > > +static struct shrinker workingset_shadow_shrinker = {
> > > > +   .count_objects = count_shadow_nodes,
> > > > +   .scan_objects = scan_shadow_nodes,
> > > > +   .seeks = DEFAULT_SEEKS * 4,
> > > > +   .flags = SHRINKER_NUMA_AWARE,
> > > > +};
> > > 
> > > Can you add a comment explaining how you calculated the .seeks
> > > value? It's important to document the weighings/importance
> > > we give to slab reclaim so we can determine if it's actually
> > > acheiving the desired balance under different loads...
> > 
> > This is not an exact science, to say the least.
> 
> I know, that's why I asked it be documented rather than be something
> kept in your head.
> 
> > The shadow entries are mostly self-regulated, so I don't want the
> > shrinker to interfere while the machine is just regularly trimming
> > caches during normal operation.
> > 
> > It should only kick in when either a) reclaim is picking up and the
> > scan-to-reclaim ratio increases due to mapped pages, dirty cache,
> > swapping etc. or b) the number of objects compared to LRU pages
> > becomes excessive.
> > 
> > I think that is what most shrinkers with an elevated seeks value want,
> > but this translates very awkwardly (and not completely) to the current
> > cost model, and we should probably rework that interface.
> > 
> > "Seeks" currently encodes 3 ratios:
> > 
> >   1. the cost of creating an object vs. a page
> > 
> >   2. the expected number of objects vs. pages
> 
> It doesn't encode that at all. If it did, then the default value
> wouldn't be "2".
>
> >   3. the cost of 

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2014-01-20 Thread Dave Chinner
On Mon, Jan 20, 2014 at 06:17:37PM -0500, Johannes Weiner wrote:
> On Fri, Jan 17, 2014 at 11:05:17AM +1100, Dave Chinner wrote:
> > On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > > + /* Only shadow entries in there, keep track of this node */
> > > + if (!(node->count & RADIX_TREE_COUNT_MASK) &&
> > > + list_empty(&node->private_list)) {
> > > + node->private_data = mapping;
> > > + list_lru_add(&workingset_shadow_nodes, &node->private_list);
> > > + }
> > 
> > You can't do this list_empty(&node->private_list) check safely
> > externally to the list_lru code - only time that entry can be
> > checked safely is under the LRU list locks. This is the reason that
> > list_lru_add/list_lru_del return a boolean to indicate is the object
> > was added/removed from the list - they do this list_empty() check
> > internally. i.e. the correct, safe way to do conditionally update
> > state iff the object was added to the LRU is:
> > 
> > if (!(node->count & RADIX_TREE_COUNT_MASK)) {
> > if (list_lru_add(&workingset_shadow_nodes, &node->private_list))
> > node->private_data = mapping;
> > }
> > 
> > > + radix_tree_replace_slot(slot, page);
> > > + mapping->nrpages++;
> > > + if (node) {
> > > + node->count++;
> > > + /* Installed page, can't be shadow-only anymore */
> > > + if (!list_empty(&node->private_list))
> > > + list_lru_del(&workingset_shadow_nodes,
> > > +  &node->private_list);
> > > + }
> > 
> > Same issue here:
> > 
> > if (node) {
> > node->count++;
> > list_lru_del(&workingset_shadow_nodes, &node->private_list);
> > }
> 
> All modifications to node->private_list happen under
> mapping->tree_lock, and modifications of a neighboring link should not
> affect the outcome of the list_empty(), so I don't think the lru lock
> is necessary.

Can you please add that as a comment somewhere explaining why it is
safe to do this?

> > > + case LRU_REMOVED_RETRY:
> > >   if (--nlru->nr_items == 0)
> > >   node_clear(nid, lru->active_nodes);
> > >   WARN_ON_ONCE(nlru->nr_items < 0);
> > >   isolated++;
> > > + /*
> > > +  * If the lru lock has been dropped, our list
> > > +  * traversal is now invalid and so we have to
> > > +  * restart from scratch.
> > > +  */
> > > + if (ret == LRU_REMOVED_RETRY)
> > > + goto restart;
> > >   break;
> > >   case LRU_ROTATE:
> > >   list_move_tail(item, &nlru->list);
> > 
> > I think that we need to assert that the list lru lock is correctly
> > held here on return with LRU_REMOVED_RETRY. i.e.
> > 
> > case LRU_REMOVED_RETRY:
> > assert_spin_locked(&nlru->lock);
> > case LRU_REMOVED:
> 
> Ah, good idea.  How about adding it to LRU_RETRY as well?

Yup, good idea.

> > > +static struct shrinker workingset_shadow_shrinker = {
> > > + .count_objects = count_shadow_nodes,
> > > + .scan_objects = scan_shadow_nodes,
> > > + .seeks = DEFAULT_SEEKS * 4,
> > > + .flags = SHRINKER_NUMA_AWARE,
> > > +};
> > 
> > Can you add a comment explaining how you calculated the .seeks
> > value? It's important to document the weighings/importance
> > we give to slab reclaim so we can determine if it's actually
> > acheiving the desired balance under different loads...
> 
> This is not an exact science, to say the least.

I know, that's why I asked it be documented rather than be something
kept in your head.

> The shadow entries are mostly self-regulated, so I don't want the
> shrinker to interfere while the machine is just regularly trimming
> caches during normal operation.
> 
> It should only kick in when either a) reclaim is picking up and the
> scan-to-reclaim ratio increases due to mapped pages, dirty cache,
> swapping etc. or b) the number of objects compared to LRU pages
> becomes excessive.
> 
> I think that is what most shrinkers with an elevated seeks value want,
> but this translates very awkwardly (and not completely) to the current
> cost model, and we should probably rework that interface.
> 
> "Seeks" currently encodes 3 ratios:
> 
>   1. the cost of creating an object vs. a page
> 
>   2. the expected number of objects vs. pages

It doesn't encode that at all. If it did, then the default value
wouldn't be "2".

>   3. the cost of reclaiming an object vs. a page

Which, when you consider #3 in conjunction with #1, the actual
intended meaning of .seeks is "the cost of replacing this object in
the cache compared to the cost of replacing a page cache page."

> but they are not necessarily correlated.  How I would like to
> configure the shadow shrinker instead is:
> 
>   o scan objects when reclaim efficiency is down to 75%, because they
> are more valuable than use-once cache but less than workingset
> 
>   o scan objects 
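
For what it's worth, the kind of comment being asked for could look
something like the following -- the wording is invented here to match
the rationale quoted above, it is not taken from the patch:

/*
 * Shadow nodes are mostly kept in check by refaults and inode reclaim;
 * the shrinker is only a backstop for when that self-regulation fails.
 * Weigh them at four times the default seek cost so they are scanned
 * at a quarter of the default rate and the shrinker only bites once
 * LRU scanning ramps up or shadow nodes grow excessive.
 */
static struct shrinker workingset_shadow_shrinker = {
	.count_objects = count_shadow_nodes,
	.scan_objects = scan_shadow_nodes,
	.seeks = DEFAULT_SEEKS * 4,
	.flags = SHRINKER_NUMA_AWARE,
};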

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2014-01-20 Thread Johannes Weiner
On Fri, Jan 17, 2014 at 11:05:17AM +1100, Dave Chinner wrote:
> On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> > 
> > To get this under control, the VM will track radix tree nodes
> > exclusively containing shadow entries on a per-NUMA node list.
> > Per-NUMA rather than global because we expect the radix tree nodes
> > themselves to be allocated node-locally and we want to reduce
> > cross-node references of otherwise independent cache workloads.  A
> > simple shrinker will then reclaim these nodes on memory pressure.
> > 
> > A few things need to be stored in the radix tree node to implement the
> > shadow node LRU and allow tree deletions coming from the list:
> 
> Just a couple of things with the list_lru interfaces.
> 
> 
> > @@ -123,9 +129,39 @@ static void page_cache_tree_delete(struct 
> > address_space *mapping,
> >  * same time and miss a shadow entry.
> >  */
> > smp_wmb();
> > -   } else
> > -   radix_tree_delete(&mapping->page_tree, page->index);
> > +   }
> > mapping->nrpages--;
> > +
> > +   if (!node) {
> > +   /* Clear direct pointer tags in root node */
> > +   mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
> > +   radix_tree_replace_slot(slot, shadow);
> > +   return;
> > +   }
> > +
> > +   /* Clear tree tags for the removed page */
> > +   index = page->index;
> > +   offset = index & RADIX_TREE_MAP_MASK;
> > +   for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
> > +   if (test_bit(offset, node->tags[tag]))
> > +   radix_tree_tag_clear(&mapping->page_tree, index, tag);
> > +   }
> > +
> > +   /* Delete page, swap shadow entry */
> > +   radix_tree_replace_slot(slot, shadow);
> > +   node->count--;
> > +   if (shadow)
> > +   node->count += 1U << RADIX_TREE_COUNT_SHIFT;
> > +   else
> > +   if (__radix_tree_delete_node(&mapping->page_tree, node))
> > +   return;
> > +
> > +   /* Only shadow entries in there, keep track of this node */
> > +   if (!(node->count & RADIX_TREE_COUNT_MASK) &&
> > +   list_empty(&node->private_list)) {
> > +   node->private_data = mapping;
> > +   list_lru_add(&workingset_shadow_nodes, &node->private_list);
> > +   }
> 
> You can't do this list_empty(&node->private_list) check safely
> externally to the list_lru code - only time that entry can be
> checked safely is under the LRU list locks. This is the reason that
> list_lru_add/list_lru_del return a boolean to indicate is the object
> was added/removed from the list - they do this list_empty() check
> internally. i.e. the correct, safe way to do conditionally update
> state iff the object was added to the LRU is:
> 
>   if (!(node->count & RADIX_TREE_COUNT_MASK)) {
>   if (list_lru_add(&workingset_shadow_nodes, &node->private_list))
>   node->private_data = mapping;
>   }
> 
> > +   radix_tree_replace_slot(slot, page);
> > +   mapping->nrpages++;
> > +   if (node) {
> > +   node->count++;
> > +   /* Installed page, can't be shadow-only anymore */
> > +   if (!list_empty(&node->private_list))
> > +   list_lru_del(&workingset_shadow_nodes,
> > +                &node->private_list);
> > +   }
> 
> Same issue here:
> 
>   if (node) {
>   node->count++;
>   list_lru_del(&workingset_shadow_nodes, &node->private_list);
>   }

All modifications to node->private_list happen under
mapping->tree_lock, and modifications of a neighboring link should not
affect the outcome of the list_empty(), so I don't think the lru lock
is necessary.

It would be cleaner to take it of course, but that would mean adding
an unconditional NUMAnode-wide lock to every page cache population.
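
(The safety argument boils down to list_empty() only dereferencing the
entry's own ->next pointer -- this is the include/linux/list.h
definition of the time -- and that pointer only ever being modified for
this node under mapping->tree_lock:)

struct list_head {
	struct list_head *next, *prev;
};

static inline int list_empty(const struct list_head *head)
{
	/* looks only at this entry's own link, not at its neighbours */
	return head->next == head;
}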

> >  static int __add_to_page_cache_locked(struct page *page,
> > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > index 72f9decb0104..47a9faf4070b 100644
> > --- a/mm/list_lru.c
> > +++ b/mm/list_lru.c
> > @@ -88,10 +88,18 @@ restart:
> > ret = isolate(item, &nlru->lock, cb_arg);
> > switch (ret) {
> > case LRU_REMOVED:
> > +   case LRU_REMOVED_RETRY:
> > if (--nlru->nr_items == 0)
> > node_clear(nid, lru->active_nodes);
> > WARN_ON_ONCE(nlru->nr_items < 0);
> > isolated++;
> > +   /*
> > +* If the lru lock has been dropped, our list
> > 

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2014-01-16 Thread Dave Chinner
On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> Previously, page cache radix tree nodes were freed after reclaim
> emptied out their page pointers.  But now reclaim stores shadow
> entries in their place, which are only reclaimed when the inodes
> themselves are reclaimed.  This is problematic for bigger files that
> are still in use after they have a significant amount of their cache
> reclaimed, without any of those pages actually refaulting.  The shadow
> entries will just sit there and waste memory.  In the worst case, the
> shadow entries will accumulate until the machine runs out of memory.
> 
> To get this under control, the VM will track radix tree nodes
> exclusively containing shadow entries on a per-NUMA node list.
> Per-NUMA rather than global because we expect the radix tree nodes
> themselves to be allocated node-locally and we want to reduce
> cross-node references of otherwise independent cache workloads.  A
> simple shrinker will then reclaim these nodes on memory pressure.
> 
> A few things need to be stored in the radix tree node to implement the
> shadow node LRU and allow tree deletions coming from the list:

Just a couple of things with the list_lru interfaces.


> @@ -123,9 +129,39 @@ static void page_cache_tree_delete(struct address_space 
> *mapping,
>* same time and miss a shadow entry.
>*/
>   smp_wmb();
> - } else
> - radix_tree_delete(&mapping->page_tree, page->index);
> + }
>   mapping->nrpages--;
> +
> + if (!node) {
> + /* Clear direct pointer tags in root node */
> + mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
> + radix_tree_replace_slot(slot, shadow);
> + return;
> + }
> +
> + /* Clear tree tags for the removed page */
> + index = page->index;
> + offset = index & RADIX_TREE_MAP_MASK;
> + for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
> + if (test_bit(offset, node->tags[tag]))
> + radix_tree_tag_clear(&mapping->page_tree, index, tag);
> + }
> +
> + /* Delete page, swap shadow entry */
> + radix_tree_replace_slot(slot, shadow);
> + node->count--;
> + if (shadow)
> + node->count += 1U << RADIX_TREE_COUNT_SHIFT;
> + else
> + if (__radix_tree_delete_node(&mapping->page_tree, node))
> + return;
> +
> + /* Only shadow entries in there, keep track of this node */
> + if (!(node->count & RADIX_TREE_COUNT_MASK) &&
> + list_empty(&node->private_list)) {
> + node->private_data = mapping;
> + list_lru_add(&workingset_shadow_nodes, &node->private_list);
> + }

You can't do this list_empty(&node->private_list) check safely
externally to the list_lru code - only time that entry can be
checked safely is under the LRU list locks. This is the reason that
list_lru_add/list_lru_del return a boolean to indicate is the object
was added/removed from the list - they do this list_empty() check
internally. i.e. the correct, safe way to do conditionally update
state iff the object was added to the LRU is:

if (!(node->count & RADIX_TREE_COUNT_MASK)) {
if (list_lru_add(&workingset_shadow_nodes, &node->private_list))
node->private_data = mapping;
}

> + radix_tree_replace_slot(slot, page);
> + mapping->nrpages++;
> + if (node) {
> + node->count++;
> + /* Installed page, can't be shadow-only anymore */
> + if (!list_empty(&node->private_list))
> + list_lru_del(&workingset_shadow_nodes,
> +  &node->private_list);
> + }

Same issue here:

if (node) {
node->count++;
list_lru_del(&workingset_shadow_nodes, &node->private_list);
}


> + return 0;
>  }
>  
>  static int __add_to_page_cache_locked(struct page *page,
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index 72f9decb0104..47a9faf4070b 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -88,10 +88,18 @@ restart:
>   ret = isolate(item, &nlru->lock, cb_arg);
>   switch (ret) {
>   case LRU_REMOVED:
> + case LRU_REMOVED_RETRY:
>   if (--nlru->nr_items == 0)
>   node_clear(nid, lru->active_nodes);
>   WARN_ON_ONCE(nlru->nr_items < 0);
>   isolated++;
> + /*
> +  * If the lru lock has been dropped, our list
> +  * traversal is now invalid and so we have to
> +  * restart from scratch.
> +  */
> + if (ret == LRU_REMOVED_RETRY)
> + goto restart;
>   break;
>   case LRU_ROTATE:
>   list_move_tail(item, &nlru->list);

I think that we need to assert that the list lru lock is correctly
held here on return with LRU_REMOVED_RETRY. i.e.

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2014-01-16 Thread Johannes Weiner
On Wed, Jan 15, 2014 at 01:55:01PM +0800, Bob Liu wrote:
> Hi Johannes,
> 
> On 01/11/2014 02:10 AM, Johannes Weiner wrote:
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> > 
> 
> I have one more question. It seems that other algorithm only remember
> history information of a limit number of evicted pages where the number
> is usually the same as the total cache or memory size.
> But in your patch, I didn't see a preferred value that how many evicted
> pages' history information should be recorded. It all depends on the
> workingset_shadow_shrinker?

That "same as total cache" number is a fairly arbitrary cut-off that
defines how far we record eviction history.  For this patch set, we
technically do not need more shadow entries than active pages, but
strict enforcement would be very expensive.  So we leave it mostly to
refaults and inode reclaim to keep the number of shadow entries low,
with the shadow shrinker as an emergency backup.  Keep in mind that
the shadow entries represent that part of the working set that exceeds
available memory.  So the only way the number of shadow entries
exceeds the number of RAM pages in the system is if your workingset is
more than twice that of memory, otherwise the shadow entries refault
before they can accumulate.  And because of inode reclaim, that huge
working set would have to be backed by a very small number of files,
otherwise the shadow entries are reclaimed along with the inodes.  But
this theoretical workload would be entirely IO bound and a few extra
MB wasted on shadow entries should make no difference.
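
The "emergency backup" role also shows in how little the counting side
of the shrinker has to do.  A sketch along those lines -- the real
count_shadow_nodes() in the patch may differ in detail:

static unsigned long count_shadow_nodes(struct shrinker *shrinker,
					struct shrink_control *sc)
{
	/* every node on the per-NUMA list holds only shadow entries */
	return list_lru_count_node(&workingset_shadow_nodes, sc->nid);
}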


Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2014-01-14 Thread Bob Liu
Hi Johannes,

On 01/11/2014 02:10 AM, Johannes Weiner wrote:
> Previously, page cache radix tree nodes were freed after reclaim
> emptied out their page pointers.  But now reclaim stores shadow
> entries in their place, which are only reclaimed when the inodes
> themselves are reclaimed.  This is problematic for bigger files that
> are still in use after they have a significant amount of their cache
> reclaimed, without any of those pages actually refaulting.  The shadow
> entries will just sit there and waste memory.  In the worst case, the
> shadow entries will accumulate until the machine runs out of memory.
> 

I have one more question. It seems that other algorithm only remember
history information of a limit number of evicted pages where the number
is usually the same as the total cache or memory size.
But in your patch, I didn't see a preferred value that how many evicted
pages' history information should be recorded. It all depends on the
workingset_shadow_shrinker?

Thanks,
-Bob


Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2014-01-13 Thread Minchan Kim
On Mon, Jan 13, 2014 at 04:39:47PM +0900, Minchan Kim wrote:
> On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> > 
> > To get this under control, the VM will track radix tree nodes
> > exclusively containing shadow entries on a per-NUMA node list.
> > Per-NUMA rather than global because we expect the radix tree nodes
> > themselves to be allocated node-locally and we want to reduce
> > cross-node references of otherwise independent cache workloads.  A
> > simple shrinker will then reclaim these nodes on memory pressure.
> > 
> > A few things need to be stored in the radix tree node to implement the
> > shadow node LRU and allow tree deletions coming from the list:
> > 
> > 1. There is no index available that would describe the reverse path
> >from the node up to the tree root, which is needed to perform a
> >deletion.  To solve this, encode in each node its offset inside the
> >parent.  This can be stored in the unused upper bits of the same
> >member that stores the node's height at no extra space cost.
> > 
> > 2. The number of shadow entries needs to be counted in addition to the
> >regular entries, to quickly detect when the node is ready to go to
> >the shadow node LRU list.  The current entry count is an unsigned
> >int but the maximum number of entries is 64, so a shadow counter
> >can easily be stored in the unused upper bits.
> > 
> > 3. Tree modification needs tree lock and tree root, which are located
> >in the address space, so store an address_space backpointer in the
> >node.  The parent pointer of the node is in a union with the 2-word
> >rcu_head, so the backpointer comes at no extra cost as well.
> > 
> > 4. The node needs to be linked to an LRU list, which requires a list
> >head inside the node.  This does increase the size of the node, but
> >it does not change the number of objects that fit into a slab page.
> > 
> > Signed-off-by: Johannes Weiner 
> > ---
> >  include/linux/list_lru.h   |   2 +
> >  include/linux/mmzone.h |   1 +
> >  include/linux/radix-tree.h |  32 +---
> >  include/linux/swap.h   |   1 +
> >  lib/radix-tree.c   |  36 --
> >  mm/filemap.c   |  77 +++--
> >  mm/list_lru.c  |   8 +++
> >  mm/truncate.c  |  20 +++-
> >  mm/vmstat.c|   1 +
> >  mm/workingset.c| 121 +
> >  10 files changed, 259 insertions(+), 40 deletions(-)
> > 
> > diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> > index 3ce541753c88..b02fc233eadd 100644
> > --- a/include/linux/list_lru.h
> > +++ b/include/linux/list_lru.h
> > @@ -13,6 +13,8 @@
> >  /* list_lru_walk_cb has to always return one of those */
> >  enum lru_status {
> > LRU_REMOVED,/* item removed from list */
> > +   LRU_REMOVED_RETRY,  /* item removed, but lock has been
> > +  dropped and reacquired */
> > LRU_ROTATE, /* item referenced, give another pass */
> > LRU_SKIP,   /* item cannot be locked, skip */
> > LRU_RETRY,  /* item not freeable. May drop the lock
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 118ba9f51e86..8cac5a7ef7a7 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -144,6 +144,7 @@ enum zone_stat_item {
> >  #endif
> > WORKINGSET_REFAULT,
> > WORKINGSET_ACTIVATE,
> > +   WORKINGSET_NODERECLAIM,
> > NR_ANON_TRANSPARENT_HUGEPAGES,
> > NR_FREE_CMA_PAGES,
> > NR_VM_ZONE_STAT_ITEMS };
> > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > index 13636c40bc42..33170dbd9db4 100644
> > --- a/include/linux/radix-tree.h
> > +++ b/include/linux/radix-tree.h
> > @@ -72,21 +72,37 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
> >  #define RADIX_TREE_TAG_LONGS   \
> > ((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
> >  
> > +#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
> > +#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
> > + RADIX_TREE_MAP_SHIFT))
> > +
> > +/* Height component in node->path */
> > +#define RADIX_TREE_HEIGHT_SHIFT(RADIX_TREE_MAX_PATH + 1)
> > +#define RADIX_TREE_HEIGHT_MASK 

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2014-01-12 Thread Minchan Kim
On Fri, Jan 10, 2014 at 01:10:43PM -0500, Johannes Weiner wrote:
> Previously, page cache radix tree nodes were freed after reclaim
> emptied out their page pointers.  But now reclaim stores shadow
> entries in their place, which are only reclaimed when the inodes
> themselves are reclaimed.  This is problematic for bigger files that
> are still in use after they have a significant amount of their cache
> reclaimed, without any of those pages actually refaulting.  The shadow
> entries will just sit there and waste memory.  In the worst case, the
> shadow entries will accumulate until the machine runs out of memory.
> 
> To get this under control, the VM will track radix tree nodes
> exclusively containing shadow entries on a per-NUMA node list.
> Per-NUMA rather than global because we expect the radix tree nodes
> themselves to be allocated node-locally and we want to reduce
> cross-node references of otherwise independent cache workloads.  A
> simple shrinker will then reclaim these nodes on memory pressure.
> 
> A few things need to be stored in the radix tree node to implement the
> shadow node LRU and allow tree deletions coming from the list:
> 
> 1. There is no index available that would describe the reverse path
>from the node up to the tree root, which is needed to perform a
>deletion.  To solve this, encode in each node its offset inside the
>parent.  This can be stored in the unused upper bits of the same
>member that stores the node's height at no extra space cost.
> 
> 2. The number of shadow entries needs to be counted in addition to the
>regular entries, to quickly detect when the node is ready to go to
>the shadow node LRU list.  The current entry count is an unsigned
>int but the maximum number of entries is 64, so a shadow counter
>can easily be stored in the unused upper bits.
> 
> 3. Tree modification needs tree lock and tree root, which are located
>in the address space, so store an address_space backpointer in the
>node.  The parent pointer of the node is in a union with the 2-word
>rcu_head, so the backpointer comes at no extra cost as well.
> 
> 4. The node needs to be linked to an LRU list, which requires a list
>head inside the node.  This does increase the size of the node, but
>it does not change the number of objects that fit into a slab page.
> 
> Signed-off-by: Johannes Weiner 
> ---
>  include/linux/list_lru.h   |   2 +
>  include/linux/mmzone.h |   1 +
>  include/linux/radix-tree.h |  32 +---
>  include/linux/swap.h   |   1 +
>  lib/radix-tree.c   |  36 --
>  mm/filemap.c   |  77 +++--
>  mm/list_lru.c  |   8 +++
>  mm/truncate.c  |  20 +++-
>  mm/vmstat.c|   1 +
>  mm/workingset.c| 121 
> +
>  10 files changed, 259 insertions(+), 40 deletions(-)
> 
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index 3ce541753c88..b02fc233eadd 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -13,6 +13,8 @@
>  /* list_lru_walk_cb has to always return one of those */
>  enum lru_status {
>   LRU_REMOVED,/* item removed from list */
> + LRU_REMOVED_RETRY,  /* item removed, but lock has been
> +dropped and reacquired */
>   LRU_ROTATE, /* item referenced, give another pass */
>   LRU_SKIP,   /* item cannot be locked, skip */
>   LRU_RETRY,  /* item not freeable. May drop the lock
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 118ba9f51e86..8cac5a7ef7a7 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -144,6 +144,7 @@ enum zone_stat_item {
>  #endif
>   WORKINGSET_REFAULT,
>   WORKINGSET_ACTIVATE,
> + WORKINGSET_NODERECLAIM,
>   NR_ANON_TRANSPARENT_HUGEPAGES,
>   NR_FREE_CMA_PAGES,
>   NR_VM_ZONE_STAT_ITEMS };
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index 13636c40bc42..33170dbd9db4 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -72,21 +72,37 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
>  #define RADIX_TREE_TAG_LONGS \
>   ((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
>  
> +#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
> +#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
> +   RADIX_TREE_MAP_SHIFT))
> +
> +/* Height component in node->path */
> +#define RADIX_TREE_HEIGHT_SHIFT  (RADIX_TREE_MAX_PATH + 1)
> +#define RADIX_TREE_HEIGHT_MASK   ((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1)
> +
> +/* Internally used bits of node->count */
> +#define RADIX_TREE_COUNT_SHIFT   (RADIX_TREE_MAP_SHIFT + 1)
> +#define RADIX_TREE_COUNT_MASK((1UL << RADIX_TREE_COUNT_SHIFT) - 

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2014-01-10 Thread Rik van Riel
On 01/10/2014 01:10 PM, Johannes Weiner wrote:
> Previously, page cache radix tree nodes were freed after reclaim
> emptied out their page pointers.  But now reclaim stores shadow
> entries in their place, which are only reclaimed when the inodes
> themselves are reclaimed.  This is problematic for bigger files that
> are still in use after they have a significant amount of their cache
> reclaimed, without any of those pages actually refaulting.  The shadow
> entries will just sit there and waste memory.  In the worst case, the
> shadow entries will accumulate until the machine runs out of memory.
> 
> To get this under control, the VM will track radix tree nodes
> exclusively containing shadow entries on a per-NUMA node list.
> Per-NUMA rather than global because we expect the radix tree nodes
> themselves to be allocated node-locally and we want to reduce
> cross-node references of otherwise independent cache workloads.  A
> simple shrinker will then reclaim these nodes on memory pressure.

> Signed-off-by: Johannes Weiner 

Reviewed-by: Rik van Riel 

-- 
All rights reversed


[patch 9/9] mm: keep page cache radix tree nodes in check

2014-01-10 Thread Johannes Weiner
Previously, page cache radix tree nodes were freed after reclaim
emptied out their page pointers.  But now reclaim stores shadow
entries in their place, which are only reclaimed when the inodes
themselves are reclaimed.  This is problematic for bigger files that
are still in use after they have a significant amount of their cache
reclaimed, without any of those pages actually refaulting.  The shadow
entries will just sit there and waste memory.  In the worst case, the
shadow entries will accumulate until the machine runs out of memory.

To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list.
Per-NUMA rather than global because we expect the radix tree nodes
themselves to be allocated node-locally and we want to reduce
cross-node references of otherwise independent cache workloads.  A
simple shrinker will then reclaim these nodes on memory pressure.

A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:

1. There is no index available that would describe the reverse path
   from the node up to the tree root, which is needed to perform a
   deletion.  To solve this, encode in each node its offset inside the
   parent.  This can be stored in the unused upper bits of the same
   member that stores the node's height at no extra space cost.

2. The number of shadow entries needs to be counted in addition to the
   regular entries, to quickly detect when the node is ready to go to
   the shadow node LRU list.  The current entry count is an unsigned
   int but the maximum number of entries is 64, so a shadow counter
   can easily be stored in the unused upper bits.

3. Tree modification needs tree lock and tree root, which are located
   in the address space, so store an address_space backpointer in the
   node.  The parent pointer of the node is in a union with the 2-word
   rcu_head, so the backpointer comes at no extra cost as well.

4. The node needs to be linked to an LRU list, which requires a list
   head inside the node.  This does increase the size of the node, but
   it does not change the number of objects that fit into a slab page.

Signed-off-by: Johannes Weiner 
---
 include/linux/list_lru.h   |   2 +
 include/linux/mmzone.h |   1 +
 include/linux/radix-tree.h |  32 +---
 include/linux/swap.h   |   1 +
 lib/radix-tree.c   |  36 --
 mm/filemap.c   |  77 +++--
 mm/list_lru.c  |   8 +++
 mm/truncate.c  |  20 +++-
 mm/vmstat.c|   1 +
 mm/workingset.c| 121 +
 10 files changed, 259 insertions(+), 40 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 3ce541753c88..b02fc233eadd 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -13,6 +13,8 @@
 /* list_lru_walk_cb has to always return one of those */
 enum lru_status {
LRU_REMOVED,/* item removed from list */
+   LRU_REMOVED_RETRY,  /* item removed, but lock has been
+  dropped and reacquired */
LRU_ROTATE, /* item referenced, give another pass */
LRU_SKIP,   /* item cannot be locked, skip */
LRU_RETRY,  /* item not freeable. May drop the lock
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 118ba9f51e86..8cac5a7ef7a7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -144,6 +144,7 @@ enum zone_stat_item {
 #endif
WORKINGSET_REFAULT,
WORKINGSET_ACTIVATE,
+   WORKINGSET_NODERECLAIM,
NR_ANON_TRANSPARENT_HUGEPAGES,
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 13636c40bc42..33170dbd9db4 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -72,21 +72,37 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
 #define RADIX_TREE_TAG_LONGS   \
((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
 
+#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
+#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
+ RADIX_TREE_MAP_SHIFT))
+
+/* Height component in node->path */
+#define RADIX_TREE_HEIGHT_SHIFT(RADIX_TREE_MAX_PATH + 1)
+#define RADIX_TREE_HEIGHT_MASK ((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1)
+
+/* Internally used bits of node->count */
+#define RADIX_TREE_COUNT_SHIFT (RADIX_TREE_MAP_SHIFT + 1)
+#define RADIX_TREE_COUNT_MASK  ((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
+
 struct radix_tree_node {
-   unsigned intheight; /* Height from the bottom */
+   unsigned intpath;   /* Offset in parent & height from the bottom */
unsigned intcount;
union {
-   struct 
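
A standalone sketch, not part of the patch or this thread: how the
packed node->path and node->count fields described in points 1 and 2
above decode, using the macro definitions from the patch.  It assumes
RADIX_TREE_MAP_SHIFT is 6 (64-slot nodes), as in a default
configuration, and expands DIV_ROUND_UP by hand so it builds in user
space.

/* sketch.c - illustration only */
#include <stdio.h>

#define RADIX_TREE_MAP_SHIFT	6	/* assumed: 64-slot nodes */
#define RADIX_TREE_INDEX_BITS	(8 /* CHAR_BIT */ * sizeof(unsigned long))
#define RADIX_TREE_MAX_PATH	((RADIX_TREE_INDEX_BITS + \
				  RADIX_TREE_MAP_SHIFT - 1) / \
				 RADIX_TREE_MAP_SHIFT)

/* Height component in the low bits of node->path, offset above it */
#define RADIX_TREE_HEIGHT_SHIFT	(RADIX_TREE_MAX_PATH + 1)
#define RADIX_TREE_HEIGHT_MASK	((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1)

/* Page entries in the low bits of node->count, shadow entries above */
#define RADIX_TREE_COUNT_SHIFT	(RADIX_TREE_MAP_SHIFT + 1)
#define RADIX_TREE_COUNT_MASK	((1UL << RADIX_TREE_COUNT_SHIFT) - 1)

int main(void)
{
	/* e.g. a height-2 node sitting in slot 13 of its parent */
	unsigned int path = (13U << RADIX_TREE_HEIGHT_SHIFT) | 2;
	/* e.g. 42 page pointers plus 3 shadow entries */
	unsigned int count = (3U << RADIX_TREE_COUNT_SHIFT) | 42;

	printf("offset=%u height=%u\n",
	       path >> RADIX_TREE_HEIGHT_SHIFT,
	       (unsigned int)(path & RADIX_TREE_HEIGHT_MASK));
	printf("pages=%u shadows=%u\n",
	       (unsigned int)(count & RADIX_TREE_COUNT_MASK),
	       count >> RADIX_TREE_COUNT_SHIFT);
	return 0;
}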

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2013-12-02 Thread Johannes Weiner
On Tue, Dec 03, 2013 at 09:10:52AM +1100, Dave Chinner wrote:
> On Mon, Dec 02, 2013 at 02:21:48PM -0500, Johannes Weiner wrote:
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> > 
> > To get this under control, the VM will track radix tree nodes
> > exclusively containing shadow entries on a per-NUMA node list.
> > Per-NUMA rather than global because we expect the radix tree nodes
> > themselves to be allocated node-locally and we want to reduce
> > cross-node references of otherwise independent cache workloads.  A
> > simple shrinker will then reclaim these nodes on memory pressure.
> > 
> > A few things need to be stored in the radix tree node to implement the
> > shadow node LRU and allow tree deletions coming from the list:
> > 
> > 1. There is no index available that would describe the reverse path
> >from the node up to the tree root, which is needed to perform a
> >deletion.  To solve this, encode in each node its offset inside the
> >parent.  This can be stored in the unused upper bits of the same
> >member that stores the node's height at no extra space cost.
> > 
> > 2. The number of shadow entries needs to be counted in addition to the
> >regular entries, to quickly detect when the node is ready to go to
> >the shadow node LRU list.  The current entry count is an unsigned
> >int but the maximum number of entries is 64, so a shadow counter
> >can easily be stored in the unused upper bits.
> > 
> > 3. Tree modification needs tree lock and tree root, which are located
> >in the address space, so store an address_space backpointer in the
> >node.  The parent pointer of the node is in a union with the 2-word
> >rcu_head, so the backpointer comes at no extra cost as well.
> > 
> > 4. The node needs to be linked to an LRU list, which requires a list
> >head inside the node.  This does increase the size of the node, but
> >it does not change the number of objects that fit into a slab page.
> > 
> > Signed-off-by: Johannes Weiner 
> 
> Mostly looks ok, though there is no need to expose the internals of
> list_lru_add/del. The reason for the different return values was so
> that the isolate callback could simply use list_del_init() and not
> have to worry about all the internal accounting stuff. We can drop
> the lock and then do the accounting after regaining it because it
> won't result in the count of objects going negative and triggering
> warnings.
> 
> Hence I think that all we need to do is add a new isolate return
> value "LRU_REMOVED_RETRY" and add it to list_lru_walk_node() like
> so:
> 
>   switch (ret) {
> + case LRU_REMOVED_RETRY:
> + /*
> +  * object was removed from the list so we need to
> +  * account for it just like LRU_REMOVED hence the
> +  * fallthrough.  However, the list lock was also
> +  * dropped so we need to restart the list walk.
> +  */
>   case LRU_REMOVED:
>   if (--nlru->nr_items == 0)
>   node_clear(nid, lru->active_nodes);
>   WARN_ON_ONCE(nlru->nr_items < 0);
>   isolated++;
> + if (ret == LRU_REMOVED_RETRY)
> + goto restart;
>   break;

Ha, that is actually exactly what I did when I first implemented it
but decided to change it towards giving the walker callback a bit more
flexibility rather than hardcoding a certain behavior like this (they
might want to put it back if the item can't be shrunk, would that mean
LRU_ROTATE_RETRY?).  That and I wasn't too thrilled with taking the
item off the list without accounting for it before dropping the lock;
just didn't seem right.

But I don't feel strongly about it and we might not want to make the
interface more flexible than we have to at this point.  I'll change
this in the next revision or send a delta to akpm, depending on how
things go from here.

> > +static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
> > +  struct shrink_control *sc)
> > +{
> > +   unsigned long nr_reclaimed = 0;
> > +
> > +   list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
> > +  shadow_lru_isolate, &nr_reclaimed, &sc->nr_to_scan);
> > +
> > +   return nr_reclaimed;
> > +}
> 
> Do we need check against GFP_NOFS here? I don't think so, but I just
> wanted to check...

No, that should be okay.  We do the same radix tree 

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2013-12-02 Thread Dave Chinner
On Mon, Dec 02, 2013 at 02:21:48PM -0500, Johannes Weiner wrote:
> Previously, page cache radix tree nodes were freed after reclaim
> emptied out their page pointers.  But now reclaim stores shadow
> entries in their place, which are only reclaimed when the inodes
> themselves are reclaimed.  This is problematic for bigger files that
> are still in use after they have a significant amount of their cache
> reclaimed, without any of those pages actually refaulting.  The shadow
> entries will just sit there and waste memory.  In the worst case, the
> shadow entries will accumulate until the machine runs out of memory.
> 
> To get this under control, the VM will track radix tree nodes
> exclusively containing shadow entries on a per-NUMA node list.
> Per-NUMA rather than global because we expect the radix tree nodes
> themselves to be allocated node-locally and we want to reduce
> cross-node references of otherwise independent cache workloads.  A
> simple shrinker will then reclaim these nodes on memory pressure.
> 
> A few things need to be stored in the radix tree node to implement the
> shadow node LRU and allow tree deletions coming from the list:
> 
> 1. There is no index available that would describe the reverse path
>from the node up to the tree root, which is needed to perform a
>deletion.  To solve this, encode in each node its offset inside the
>parent.  This can be stored in the unused upper bits of the same
>member that stores the node's height at no extra space cost.
> 
> 2. The number of shadow entries needs to be counted in addition to the
>regular entries, to quickly detect when the node is ready to go to
>the shadow node LRU list.  The current entry count is an unsigned
>int but the maximum number of entries is 64, so a shadow counter
>can easily be stored in the unused upper bits.
> 
> 3. Tree modification needs tree lock and tree root, which are located
>in the address space, so store an address_space backpointer in the
>node.  The parent pointer of the node is in a union with the 2-word
>rcu_head, so the backpointer comes at no extra cost as well.
> 
> 4. The node needs to be linked to an LRU list, which requires a list
>head inside the node.  This does increase the size of the node, but
>it does not change the number of objects that fit into a slab page.
> 
> Signed-off-by: Johannes Weiner 

Mostly looks ok, though there is no need to expose the internals of
list_lru_add/del. The reason for the different return values was so
that the isolate callback could simply use list_del_init() and not
have to worry about all the internal accounting stuff. We can drop
the lock and then do the accounting after regaining it because it
won't result in the count of objects going negative and triggering
warnings.

Hence I think that all we need to do is add a new isolate return
value "LRU_REMOVED_RETRY" and add it to list_lru_walk_node() like
so:

switch (ret) {
+   case LRU_REMOVED_RETRY:
+   /*
+* object was removed from the list so we need to
+* account for it just like LRU_REMOVED hence the
+* fallthrough.  However, the list lock was also
+* dropped so we need to restart the list walk.
+*/
case LRU_REMOVED:
if (--nlru->nr_items == 0)
node_clear(nid, lru->active_nodes);
WARN_ON_ONCE(nlru->nr_items < 0);
isolated++;
+   if (ret == LRU_REMOVED_RETRY)
+   goto restart;
break;

> +static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
> +struct shrink_control *sc)
> +{
> + unsigned long nr_reclaimed = 0;
> +
> + list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
> +shadow_lru_isolate, &nr_reclaimed, &sc->nr_to_scan);
> +
> + return nr_reclaimed;
> +}

Do we need check against GFP_NOFS here? I don't think so, but I just
wanted to check...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
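
For reference, a rough sketch (not code from the patch) of the
isolate-callback shape suggested above: take the node off the list
with list_del_init(), drop the lru lock around the expensive
teardown, and return LRU_REMOVED_RETRY so list_lru_walk_node() does
the accounting and restarts the walk.  It assumes the list_lru_walk_cb
signature of the time (item, lru lock, callback argument); locking of
the mapping's tree lock and the actual shadow-entry teardown are
elided.

static enum lru_status shadow_lru_isolate(struct list_head *item,
					  spinlock_t *lru_lock,
					  void *arg)
{
	unsigned long *nr_reclaimed = arg;

	/* Take the node off the shadow LRU; the walker does the accounting */
	list_del_init(item);
	spin_unlock(lru_lock);

	/*
	 * ... container_of(item, struct radix_tree_node, lru) is the
	 * shadow-only node: take its mapping's tree lock, clear the
	 * remaining shadow entries, delete the node from the tree and
	 * free it ...
	 */
	(*nr_reclaimed)++;

	spin_lock(lru_lock);
	return LRU_REMOVED_RETRY;
}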


[patch 9/9] mm: keep page cache radix tree nodes in check

2013-12-02 Thread Johannes Weiner
Previously, page cache radix tree nodes were freed after reclaim
emptied out their page pointers.  But now reclaim stores shadow
entries in their place, which are only reclaimed when the inodes
themselves are reclaimed.  This is problematic for bigger files that
are still in use after they have a significant amount of their cache
reclaimed, without any of those pages actually refaulting.  The shadow
entries will just sit there and waste memory.  In the worst case, the
shadow entries will accumulate until the machine runs out of memory.

To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list.
Per-NUMA rather than global because we expect the radix tree nodes
themselves to be allocated node-locally and we want to reduce
cross-node references of otherwise independent cache workloads.  A
simple shrinker will then reclaim these nodes on memory pressure.

A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:

1. There is no index available that would describe the reverse path
   from the node up to the tree root, which is needed to perform a
   deletion.  To solve this, encode in each node its offset inside the
   parent.  This can be stored in the unused upper bits of the same
   member that stores the node's height at no extra space cost.

2. The number of shadow entries needs to be counted in addition to the
   regular entries, to quickly detect when the node is ready to go to
   the shadow node LRU list.  The current entry count is an unsigned
   int but the maximum number of entries is 64, so a shadow counter
   can easily be stored in the unused upper bits.

3. Tree modification needs tree lock and tree root, which are located
   in the address space, so store an address_space backpointer in the
   node.  The parent pointer of the node is in a union with the 2-word
   rcu_head, so the backpointer comes at no extra cost as well.

4. The node needs to be linked to an LRU list, which requires a list
   head inside the node.  This does increase the size of the node, but
   it does not change the number of objects that fit into a slab page.

Signed-off-by: Johannes Weiner 
---
 include/linux/list_lru.h  |  21 
 include/linux/mmzone.h|   1 +
 include/linux/radix-tree.h|  32 +---
 include/linux/swap.h  |   1 +
 include/linux/vm_event_item.h |   1 +
 lib/radix-tree.c  |  36 -
 mm/filemap.c  |  77 +--
 mm/list_lru.c |  26 ++
 mm/truncate.c |  20 ++-
 mm/vmstat.c   |   1 +
 mm/workingset.c   | 118 ++
 11 files changed, 294 insertions(+), 40 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 3ce5417..9a7ad61 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -128,4 +128,25 @@ list_lru_walk(struct list_lru *lru, list_lru_walk_cb 
isolate,
}
return isolated;
 }
+
+/**
+ * __list_lru_add: add an element to the lru list
+ * @list_lru: the lru pointer
+ * @item: the item to be added
+ *
+ * This is a version of list_lru_add() for use from within the list
+ * walker.  The lock must be held and the item can't be on the list.
+ */
+void __list_lru_add(struct list_lru *lru, struct list_head *item);
+
+/**
+ * __list_lru_del: delete an element from the lru list
+ * @list_lru: the lru pointer
+ * @item: the item to be deleted
+ *
+ * This is a version of list_lru_del() for use from within the list
+ * walker.  The lock must be held and the item must be on the list.
+ */
+void __list_lru_del(struct list_lru *lru, struct list_head *item);
+
 #endif /* _LRU_LIST_H */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 118ba9f..8cac5a7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -144,6 +144,7 @@ enum zone_stat_item {
 #endif
WORKINGSET_REFAULT,
WORKINGSET_ACTIVATE,
+   WORKINGSET_NODERECLAIM,
NR_ANON_TRANSPARENT_HUGEPAGES,
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 13636c4..33170db 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -72,21 +72,37 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
 #define RADIX_TREE_TAG_LONGS   \
((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
 
+#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
+#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
+ RADIX_TREE_MAP_SHIFT))
+
+/* Height component in node->path */
+#define RADIX_TREE_HEIGHT_SHIFT(RADIX_TREE_MAX_PATH + 1)
+#define RADIX_TREE_HEIGHT_MASK ((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1)
+
+/* Internally used bits of 
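
To make the "simple shrinker" part concrete, here is a rough sketch
(not taken from the patch; the actual mm/workingset.c may differ) of
how such a NUMA-aware shrinker can be wired to the shadow-node
list_lru: the count callback backed by list_lru_count_node() and the
scan callback by list_lru_walk_node(), as in the scan_shadow_nodes()
quoted elsewhere in this thread.  shadow_lru_isolate() is the isolate
callback discussed above; count_shadow_nodes() and
workingset_shadow_shrinker are illustrative names.

static struct list_lru workingset_shadow_nodes;

static unsigned long count_shadow_nodes(struct shrinker *shrinker,
					struct shrink_control *sc)
{
	/* Shadow-only nodes currently queued on this NUMA node */
	return list_lru_count_node(&workingset_shadow_nodes, sc->nid);
}

static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
				       struct shrink_control *sc)
{
	unsigned long nr_reclaimed = 0;

	list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
			   shadow_lru_isolate, &nr_reclaimed,
			   &sc->nr_to_scan);
	return nr_reclaimed;
}

static struct shrinker workingset_shadow_shrinker = {
	.count_objects	= count_shadow_nodes,
	.scan_objects	= scan_shadow_nodes,
	.seeks		= DEFAULT_SEEKS,
	.flags		= SHRINKER_NUMA_AWARE,
};

static int __init workingset_init(void)
{
	int ret;

	ret = list_lru_init(&workingset_shadow_nodes);
	if (ret)
		return ret;
	register_shrinker(&workingset_shadow_shrinker);
	return 0;
}
module_init(workingset_init);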

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2013-11-26 Thread Dave Chinner
On Tue, Nov 26, 2013 at 06:00:10PM -0500, Johannes Weiner wrote:
> On Wed, Nov 27, 2013 at 09:29:37AM +1100, Dave Chinner wrote:
> > On Tue, Nov 26, 2013 at 04:27:25PM -0500, Johannes Weiner wrote:
> > > On Tue, Nov 26, 2013 at 10:49:21AM +1100, Dave Chinner wrote:
> > > > On Sun, Nov 24, 2013 at 06:38:28PM -0500, Johannes Weiner wrote:
> > > > > Previously, page cache radix tree nodes were freed after reclaim
> > > > > emptied out their page pointers.  But now reclaim stores shadow
> > > > > entries in their place, which are only reclaimed when the inodes
> > > > > themselves are reclaimed.  This is problematic for bigger files that
> > > > > are still in use after they have a significant amount of their cache
> > > > > reclaimed, without any of those pages actually refaulting.  The shadow
> > > > > entries will just sit there and waste memory.  In the worst case, the
> > > > > shadow entries will accumulate until the machine runs out of memory.
> > 
> > > > 
> > > > > + radix_tree_replace_slot(slot, page);
> > > > > + if (node) {
> > > > > + node->count++;
> > > > > + /* Installed page, can't be shadow-only anymore */
> > > > > + if (!list_empty(&node->lru))
> > > > > + list_lru_del(&workingset_shadow_nodes, 
> > > > > &node->lru);
> > > > > + }
> > > > > + return 0;
> > > > 
> > > > Hm - what's the overhead of direct management of LRU removal
> > > > here? Most list_lru code uses lazy removal (i.e. via the shrinker)
> > > > to avoid having to touch the LRU when adding new references to an
> > > > object.
> > > 
> > > It's measurable in microbenchmarks, but not when any real IO is
> > > involved.  The difference was in the noise even on SSD drives.
> > 
> > Well, it's not an SSD or two I'm worried about - it's devices that
> > can do millions of IOPS where this is likely to be noticable...
> > 
> > > The other list_lru users see items only once they become unused and
> > > subsequent references are expected to be few and temporary, right?
> > 
> > They go onto the list when the refcount falls to zero, but reuse can
> > be frequent when being referenced repeatedly by a single user. That
> > avoids every reuse from removing the object from the LRU then
> > putting it back on the LRU for every reference cycle...
> 
> That's true, but it's less of a concern in the radix_tree_node case
> because it takes a full inactive list cycle after a refault before the
> node is put back on the LRU.  Or a really unlikely placed partial node
> truncation/invalidation (full truncation would just delete the whole
> node anyway).

OK, fair enough. We can deal with the problem if we see it being a
limitation.

> > > We expect pages to refault in spades on certain loads, at which point
> > > we may have thousands of those nodes on the list that are no longer
> > > reclaimable (10k nodes for about 2.5G of cache).
> > 
> > Sure, look at the way the inode and dentry caches work - entire
> > caches of millions of inodes and dentries often sit on the LRUs. A
> > quick look at my workstations dentry cache shows:
> > 
> > $ cat /proc/sys/fs/dentry-state 
> > 180108  170596  45  0   0   0
> > 
> > 180k allocated dentries, 170k sitting on the LRU...
> 
> Hm, and a significant amount of those 170k could rotate on the next
> shrinker scan due to recent references or do you generally have
> smaller spikes?

I see very little dentry/inode reclaim because the shrinker tends to
skip most inodes and dentries because they have the referenced bit
set on them whenever the shrinker runs. i.e. that's the working set,
and it gets maintained pretty well...

> But as per above I think the case for lazily removing shadow nodes is
> less convincing than for inodes and dentries.

Agreed.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2013-11-26 Thread Johannes Weiner
On Wed, Nov 27, 2013 at 09:29:37AM +1100, Dave Chinner wrote:
> On Tue, Nov 26, 2013 at 04:27:25PM -0500, Johannes Weiner wrote:
> > On Tue, Nov 26, 2013 at 10:49:21AM +1100, Dave Chinner wrote:
> > > On Sun, Nov 24, 2013 at 06:38:28PM -0500, Johannes Weiner wrote:
> > > > Previously, page cache radix tree nodes were freed after reclaim
> > > > emptied out their page pointers.  But now reclaim stores shadow
> > > > entries in their place, which are only reclaimed when the inodes
> > > > themselves are reclaimed.  This is problematic for bigger files that
> > > > are still in use after they have a significant amount of their cache
> > > > reclaimed, without any of those pages actually refaulting.  The shadow
> > > > entries will just sit there and waste memory.  In the worst case, the
> > > > shadow entries will accumulate until the machine runs out of memory.
> 
> > > 
> > > > +   radix_tree_replace_slot(slot, page);
> > > > +   if (node) {
> > > > +   node->count++;
> > > > +   /* Installed page, can't be shadow-only anymore */
> > > > +   if (!list_empty(&node->lru))
> > > > +   list_lru_del(&workingset_shadow_nodes, 
> > > > &node->lru);
> > > > +   }
> > > > +   return 0;
> > > 
> > > Hm - what's the overhead of direct management of LRU removal
> > > here? Most list_lru code uses lazy removal (i.e. via the shrinker)
> > > to avoid having to touch the LRU when adding new references to an
> > > object.
> > 
> > It's measurable in microbenchmarks, but not when any real IO is
> > involved.  The difference was in the noise even on SSD drives.
> 
> Well, it's not an SSD or two I'm worried about - it's devices that
> can do millions of IOPS where this is likely to be noticeable...
> 
> > The other list_lru users see items only once they become unused and
> > subsequent references are expected to be few and temporary, right?
> 
> They go onto the list when the refcount falls to zero, but reuse can
> be frequent when being referenced repeatedly by a single user. That
> avoids every reuse from removing the object from the LRU then
> putting it back on the LRU for every reference cycle...

That's true, but it's less of a concern in the radix_tree_node case
because it takes a full inactive list cycle after a refault before the
node is put back on the LRU.  Or a really unlikely placed partial node
truncation/invalidation (full truncation would just delete the whole
node anyway).

> > We expect pages to refault in spades on certain loads, at which point
> > we may have thousands of those nodes on the list that are no longer
> > reclaimable (10k nodes for about 2.5G of cache).
> 
> Sure, look at the way the inode and dentry caches work - entire
> caches of millions of inodes and dentries often sit on the LRUs. A
> quick look at my workstations dentry cache shows:
> 
> $ cat /proc/sys/fs/dentry-state 
> 180108  170596  45  0   0   0
> 
> 180k allocated dentries, 170k sitting on the LRU...

Hm, and a significant amount of those 170k could rotate on the next
shrinker scan due to recent references or do you generally have
smaller spikes?

But as per above I think the case for lazily removing shadow nodes is
less convincing than for inodes and dentries.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2013-11-26 Thread Dave Chinner
On Tue, Nov 26, 2013 at 04:27:25PM -0500, Johannes Weiner wrote:
> On Tue, Nov 26, 2013 at 10:49:21AM +1100, Dave Chinner wrote:
> > On Sun, Nov 24, 2013 at 06:38:28PM -0500, Johannes Weiner wrote:
> > > Previously, page cache radix tree nodes were freed after reclaim
> > > emptied out their page pointers.  But now reclaim stores shadow
> > > entries in their place, which are only reclaimed when the inodes
> > > themselves are reclaimed.  This is problematic for bigger files that
> > > are still in use after they have a significant amount of their cache
> > > reclaimed, without any of those pages actually refaulting.  The shadow
> > > entries will just sit there and waste memory.  In the worst case, the
> > > shadow entries will accumulate until the machine runs out of memory.

> > 
> > > + radix_tree_replace_slot(slot, page);
> > > + if (node) {
> > > + node->count++;
> > > + /* Installed page, can't be shadow-only anymore */
> > > + if (!list_empty(&node->lru))
> > > + list_lru_del(&workingset_shadow_nodes, &node->lru);
> > > + }
> > > + return 0;
> > 
> > Hm - what's the overhead of direct management of LRU removal
> > here? Most list_lru code uses lazy removal (i.e. via the shrinker)
> > to avoid having to touch the LRU when adding new references to an
> > object.
> 
> It's measurable in microbenchmarks, but not when any real IO is
> involved.  The difference was in the noise even on SSD drives.

Well, it's not an SSD or two I'm worried about - it's devices that
can do millions of IOPS where this is likely to be noticeable...

> The other list_lru users see items only once they become unused and
> subsequent references are expected to be few and temporary, right?

They go onto the list when the refcount falls to zero, but reuse can
be frequent when being referenced repeatedly by a single user. That
avoids every reuse from removing the object from the LRU then
putting it back on the LRU for every reference cycle...

> We expect pages to refault in spades on certain loads, at which point
> we may have thousands of those nodes on the list that are no longer
> reclaimable (10k nodes for about 2.5G of cache).

Sure, look at the way the inode and dentry caches work - entire
caches of millions of inodes and dentries often sit on the LRUs. A
quick look at my workstations dentry cache shows:

> $ cat /proc/sys/fs/dentry-state 
180108  170596  45  0   0   0

180k allocated dentries, 170k sitting on the LRU...

> > > + * Page cache radix tree nodes containing only shadow entries can grow
> > > + * excessively on certain workloads.  That's why they are tracked on
> > > + * per-(NUMA)node lists and pushed back by a shrinker, but with a
> > > + * slightly higher threshold than regular shrinkers so we don't
> > > + * discard the entries too eagerly - after all, during light memory
> > > + * pressure is exactly when we need them.
> > > + *
> > > + * The list_lru lock nests inside the IRQ-safe mapping->tree_lock, so
> > > + * we have to disable IRQs for any list_lru operation as well.
> > > + */
> > > +
> > > +struct list_lru workingset_shadow_nodes;
> > > +
> > > +static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> > > + struct shrink_control *sc)
> > > +{
> > > + unsigned long count;
> > > +
> > > + local_irq_disable();
> > > + count = list_lru_count_node(&workingset_shadow_nodes, sc->nid);
> > > + local_irq_enable();
> > 
> > The count returned is not perfectly accurate, and the use of it in
> > the shrinker will be concurrent with other modifications, so
> > disabling IRQs here doesn't add any anything but unnecessary
> > overhead.
> 
> Lockdep complains when taking an IRQ-unsafe lock (lru_lock) inside an
> IRQ-safe lock (mapping->tree_lock).

Bah - sometimes I hate lockdep because it makes people do silly
things just to shut it up. IMO, the right fix is this patch:

https://lkml.org/lkml/2013/7/31/7

> > > +#define NOIRQ_BATCH 32
> > > +
> > > +static enum lru_status shadow_lru_isolate(struct list_head *item,
> > > +   spinlock_t *lru_lock,
> > > +   void *arg)
> > > +{
> > > + struct address_space *mapping;
> > > + struct radix_tree_node *node;
> > > + unsigned long *batch = arg;
> > > + unsigned int i;
> > > +
> > > + node = container_of(item, struct radix_tree_node, lru);
> > > + mapping = node->private;
> > > +
> > > + /* Don't disable IRQs for too long */
> > > + if (--(*batch) == 0) {
> > > + spin_unlock_irq(lru_lock);
> > > + *batch = NOIRQ_BATCH;
> > > + spin_lock_irq(lru_lock);
> > > + return LRU_RETRY;
> > > + }
> > 
> > Ugh.
> > 
> > > + /* Coming from the list, inverse the lock order */
> > > + if (!spin_trylock(&mapping->tree_lock))
> > > + return LRU_SKIP;
> > 
> > Why not spin_trylock_irq(&mapping->tree_lock) and get rid of the
> > nasty irq batching stuff? The LRU list is internally consistent,
> > so I don't see why irqs need to be disabled to walk across the
> > objects in the list - we only need that to avoid taking an interrupt
> > while holding the mapping->tree_lock() and the interrupt running
> > I/O completion which may try to 
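
A sketch of the shape being suggested here, assuming the walker then holds the
lru_lock with interrupts enabled and that the node fields added by this series
(->lru, ->private) are in scope; the teardown step is elided and the function
name is made up:

static enum lru_status shadow_isolate_trylock_irq(struct list_head *item,
                                                  spinlock_t *lru_lock,
                                                  void *arg)
{
        struct radix_tree_node *node;
        struct address_space *mapping;

        node = container_of(item, struct radix_tree_node, lru);
        mapping = node->private;

        /*
         * tree_lock normally nests outside the lru_lock, so only try it,
         * and disable IRQs just for the tree_lock critical section.
         */
        if (!spin_trylock_irq(&mapping->tree_lock))
                return LRU_SKIP;

        /* Off the LRU first; the walker still holds the lru_lock here. */
        list_del_init(item);

        /* ... clear the shadow entries and free the node under tree_lock ... */

        spin_unlock_irq(&mapping->tree_lock);
        return LRU_REMOVED;
}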

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2013-11-26 Thread Johannes Weiner
On Mon, Nov 25, 2013 at 04:13:32PM -0800, Andrew Morton wrote:
> On Sun, 24 Nov 2013 18:38:28 -0500 Johannes Weiner  wrote:
> 
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> > 
> > To get this under control, the VM will track radix tree nodes
> > exclusively containing shadow entries on a per-NUMA node list.
> 
> Why per-node rather than a global list?

The radix tree nodes should always be allocated node-local, so it made
sense to string them up locally as well and prevent otherwise
independent workloads on separate nodes from contending for the same
global lock and list.
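
(For reference, list_lru already keeps one lock and one list per NUMA node,
roughly like the simplified sketch below, so a per-node shadow list falls
naturally out of the existing structure; this is a paraphrase for illustration,
not the exact header:)

#include <linux/cache.h>
#include <linux/list.h>
#include <linux/nodemask.h>
#include <linux/spinlock.h>

struct list_lru_node {
        spinlock_t              lock;
        struct list_head        list;
        long                    nr_items;       /* protected by lock */
} ____cacheline_aligned_in_smp;

struct list_lru {
        struct list_lru_node    *node;          /* one entry per NUMA node */
        nodemask_t              active_nodes;   /* nodes with items queued */
};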

> >  A simple shrinker will reclaim these nodes on memory pressure.
> 
> Truncate needs to go off and massacre these things as well - some
> description of how that happens would be useful.

What do you mean?  Truncate operates on a range in the tree and
deletes these items the same way page entries are deleted.  It's a
regular tree deletion.  Only the shrinker is special because it
reaches into the tree coming from a tree node, not from the root and
an index.

> > A few things need to be stored in the radix tree node to implement the
> > shadow node LRU and allow tree deletions coming from the list:
> > 
> > 1. There is no index available that would describe the reverse path
> >from the node up to the tree root, which is needed to perform a
> >deletion.  To solve this, encode in each node its offset inside the
> >parent.  This can be stored in the unused upper bits of the same
> >member that stores the node's height at no extra space cost.
> > 
> > 2. The number of shadow entries needs to be counted in addition to the
> >regular entries, to quickly detect when the node is ready to go to
> >the shadow node LRU list.  The current entry count is an unsigned
> >int but the maximum number of entries is 64, so a shadow counter
> >can easily be stored in the unused upper bits.
> > 
> > 3. Tree modification needs the lock, which is located in the address
> >space,
> 
> Presumably "the lock" == tree_lock.

Yes, will clarify.

> >so store a backpointer to it.
> 
> <looks at the code>
> 
> "it" is the address_space, not tree_lock, yes?

Yes, the address space so we can get to both the lock and the tree
root.  Will clarify.

> >   The parent pointer is in a
> >union with the 2-word rcu_head, so the backpointer comes at no
> >extra cost as well.
> 
> So we have a shrinker walking backwards from radix-tree nodes and
> reaching up into address_spaces.  We need to take steps to prevent
> those address_spaces from getting shot down (reclaim, umount, truncate,
> etc) while we're doing this.  What's happening here?

Ah, now the question about truncate above makes more sense as well.
Teardown makes sure the node is unlinked from the LRU, so holding the
lru_lock while the node is on the LRU pins the whole address space and
keeps teardown from finishing.

Dave already said that the lru_lock time is too long, though, so I'll
have to change the reclaimer to use RCU.  radix_tree_node is already
RCU-freed.  Teardown can mark the node dead under the tree lock, while
the shrinker can optimistically take the tree lock of a node under
RCU, then verify the node is still alive.

I'll rework this and document the lifetime management properly.
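
A rough sketch of that lifetime rule, with the helper names, the choice of
->private as the "dead" marker, and the locking details all assumed for
illustration (and assuming the node fields added by this series are visible);
the actual rework may differ:

#include <linux/fs.h>
#include <linux/list_lru.h>
#include <linux/radix-tree.h>
#include <linux/rcupdate.h>

extern struct list_lru workingset_shadow_nodes; /* from this series */

/* Teardown side, called under mapping->tree_lock. */
static void example_node_dead(struct radix_tree_node *node)
{
        if (!list_empty(&node->lru))
                list_lru_del(&workingset_shadow_nodes, &node->lru);
        node->private = NULL;   /* hypothetical "dead" marker */
        /* the node memory itself is still freed through RCU, as today */
}

/* Shrinker side: optimistic locking under RCU, then recheck. */
static struct address_space *example_lock_node(struct radix_tree_node *node)
{
        struct address_space *mapping;

        rcu_read_lock();        /* keeps the node valid; the address_space
                                 * is assumed stable too (inodes are RCU-freed) */
        mapping = node->private;
        if (!mapping || !spin_trylock_irq(&mapping->tree_lock)) {
                rcu_read_unlock();
                return NULL;
        }
        if (node->private != mapping) { /* teardown won the race */
                spin_unlock_irq(&mapping->tree_lock);
                rcu_read_unlock();
                return NULL;
        }
        return mapping; /* caller drops tree_lock, then rcu_read_unlock() */
}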

> > 4. The node needs to be linked to an LRU list, which requires a list
> >head inside the node.  This does increase the size of the node, but
> >it does not change the number of objects that fit into a slab page.
> >
> > ...
> >
> > --- a/include/linux/list_lru.h
> > +++ b/include/linux/list_lru.h
> > @@ -32,7 +32,7 @@ struct list_lru {
> >  };
> >  
> >  void list_lru_destroy(struct list_lru *lru);
> > -int list_lru_init(struct list_lru *lru);
> > +int list_lru_init(struct list_lru *lru, struct lock_class_key *key);
> 
> It's a bit of a shame to be adding overhead to non-lockdep kernels.  A
> few ifdefs could fix this.
> 
> Presumably this is being done to squish some lockdep warning you hit. 
> A comment at the list_lru_init() implementation site would be useful. 
> One which describes the warning and why it's OK to squish it.

Yes, the other users of list_lru have an IRQ-unsafe lru_lock, so I
added a separate class for the IRQ-safe version.
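
Concretely, a sketch of what that amounts to; the key name is made up and the
body mirrors the existing list_lru_init() with just the lockdep_set_class()
call added (details in the real patch may differ):

#include <linux/list_lru.h>
#include <linux/lockdep.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

int list_lru_init(struct list_lru *lru, struct lock_class_key *key)
{
        int i;

        lru->node = kzalloc(nr_node_ids * sizeof(*lru->node), GFP_KERNEL);
        if (!lru->node)
                return -ENOMEM;

        nodes_clear(lru->active_nodes);
        for_each_node(i) {
                spin_lock_init(&lru->node[i].lock);
                if (key)        /* IRQ-safe users get their own lockdep class */
                        lockdep_set_class(&lru->node[i].lock, key);
                INIT_LIST_HEAD(&lru->node[i].list);
                lru->node[i].nr_items = 0;
        }
        return 0;
}

/* Caller side, e.g. for the shadow node list: */
static struct lock_class_key shadow_nodes_key;  /* name is illustrative */
/* list_lru_init(&workingset_shadow_nodes, &shadow_nodes_key); */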

> >  struct radix_tree_node {
> > -   unsigned int height;    /* Height from the bottom */
> > +   unsigned int path;      /* Offset in parent & height from the bottom */
> > unsigned int count;
> > union {
> > - 

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2013-11-26 Thread Johannes Weiner
On Tue, Nov 26, 2013 at 10:49:21AM +1100, Dave Chinner wrote:
> On Sun, Nov 24, 2013 at 06:38:28PM -0500, Johannes Weiner wrote:
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> > 
> > To get this under control, the VM will track radix tree nodes
> > exclusively containing shadow entries on a per-NUMA node list.  A
> > simple shrinker will reclaim these nodes on memory pressure.
> > 
> > A few things need to be stored in the radix tree node to implement the
> > shadow node LRU and allow tree deletions coming from the list:
> > 
> > 1. There is no index available that would describe the reverse path
> >from the node up to the tree root, which is needed to perform a
> >deletion.  To solve this, encode in each node its offset inside the
> >parent.  This can be stored in the unused upper bits of the same
> >member that stores the node's height at no extra space cost.
> > 
> > 2. The number of shadow entries needs to be counted in addition to the
> >regular entries, to quickly detect when the node is ready to go to
> >the shadow node LRU list.  The current entry count is an unsigned
> >int but the maximum number of entries is 64, so a shadow counter
> >can easily be stored in the unused upper bits.
> > 
> > 3. Tree modification needs the lock, which is located in the address
> >space, so store a backpointer to it.  The parent pointer is in a
> >union with the 2-word rcu_head, so the backpointer comes at no
> >extra cost as well.
> > 
> > 4. The node needs to be linked to an LRU list, which requires a list
> >head inside the node.  This does increase the size of the node, but
> >it does not change the number of objects that fit into a slab page.
> > 
> > Signed-off-by: Johannes Weiner 
> > ---
> >  fs/super.c|   4 +-
> >  fs/xfs/xfs_buf.c  |   2 +-
> >  fs/xfs/xfs_qm.c   |   2 +-
> >  include/linux/list_lru.h  |   2 +-
> >  include/linux/radix-tree.h|  30 +++---
> >  include/linux/swap.h  |   1 +
> >  include/linux/vm_event_item.h |   1 +
> >  lib/radix-tree.c  |  36 +++-
> >  mm/filemap.c  |  70 
> >  mm/list_lru.c |   4 +-
> >  mm/truncate.c |  19 ++-
> >  mm/vmstat.c   |   2 +
> >  mm/workingset.c   | 124 
> > ++
> >  13 files changed, 255 insertions(+), 42 deletions(-)
> > 
> > diff --git a/fs/super.c b/fs/super.c
> > index 0225c20..a958d52 100644
> > --- a/fs/super.c
> > +++ b/fs/super.c
> > @@ -196,9 +196,9 @@ static struct super_block *alloc_super(struct 
> > file_system_type *type, int flags)
> > INIT_HLIST_BL_HEAD(&s->s_anon);
> > INIT_LIST_HEAD(&s->s_inodes);
> >  
> > -   if (list_lru_init(&s->s_dentry_lru))
> > +   if (list_lru_init(&s->s_dentry_lru, NULL))
> > goto err_out;
> > -   if (list_lru_init(&s->s_inode_lru))
> > +   if (list_lru_init(&s->s_inode_lru, NULL))
> > goto err_out_dentry_lru;
> 
> rather than modifying all the callers of list_lru_init(), can you
> add a new function list_lru_init_key() and implement list_lru_init()
> as a wrapper around it?

Ok, sure.
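
i.e. something along these lines, sketched from the suggestion rather than the
final interface:

int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key);

/* Existing callers keep the old one-argument form unchanged: */
static inline int list_lru_init(struct list_lru *lru)
{
        return list_lru_init_key(lru, NULL);
}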

> >  static int page_cache_tree_insert(struct address_space *mapping,
> >   struct page *page, void **shadowp)
> >  {
> 
> > +   radix_tree_replace_slot(slot, page);
> > +   if (node) {
> > +   node->count++;
> > +   /* Installed page, can't be shadow-only anymore */
> > +   if (!list_empty(&node->lru))
> > +   list_lru_del(&workingset_shadow_nodes, &node->lru);
> > +   }
> > +   return 0;
> 
> Hm - what's the overhead of direct management of LRU removal
> here? Most list_lru code uses lazy removal (i.e. via the shrinker)
> to avoid having to touch the LRU when adding new references to an
> object.

It's measurable in microbenchmarks, but not when any real IO is
involved.  The difference was in the noise even on SSD drives.

The other list_lru users see items only once they become unused and
subsequent references are expected to be few and temporary, right?

We expect pages to refault in spades on certain loads, at which point
we may have thousands of those nodes on the list that are no longer
reclaimable (10k nodes for about 2.5G of cache).
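
For scale, assuming 4K pages and the usual 64-slot radix tree nodes:

        64 slots/node * 4 KB/page = 256 KB of page cache per full node
        2.5 GB / 256 KB          ~= 10240 nodes

which lines up with the 10k figure above.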

> > + * Page cache radix tree nodes containing only shadow entries 

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2013-11-25 Thread Andrew Morton
On Sun, 24 Nov 2013 18:38:28 -0500 Johannes Weiner  wrote:

> Previously, page cache radix tree nodes were freed after reclaim
> emptied out their page pointers.  But now reclaim stores shadow
> entries in their place, which are only reclaimed when the inodes
> themselves are reclaimed.  This is problematic for bigger files that
> are still in use after they have a significant amount of their cache
> reclaimed, without any of those pages actually refaulting.  The shadow
> entries will just sit there and waste memory.  In the worst case, the
> shadow entries will accumulate until the machine runs out of memory.
> 
> To get this under control, the VM will track radix tree nodes
> exclusively containing shadow entries on a per-NUMA node list.

Why per-node rather than a global list?

>  A simple shrinker will reclaim these nodes on memory pressure.

Truncate needs to go off and massacre these things as well - some
description of how that happens would be useful.

> A few things need to be stored in the radix tree node to implement the
> shadow node LRU and allow tree deletions coming from the list:
> 
> 1. There is no index available that would describe the reverse path
>from the node up to the tree root, which is needed to perform a
>deletion.  To solve this, encode in each node its offset inside the
>parent.  This can be stored in the unused upper bits of the same
>member that stores the node's height at no extra space cost.
> 
> 2. The number of shadow entries needs to be counted in addition to the
>regular entries, to quickly detect when the node is ready to go to
>the shadow node LRU list.  The current entry count is an unsigned
>int but the maximum number of entries is 64, so a shadow counter
>can easily be stored in the unused upper bits.
> 
> 3. Tree modification needs the lock, which is located in the address
>space,

Presumably "the lock" == tree_lock.

>so store a backpointer to it.

<looks at the code>

"it" is the address_space, not tree_lock, yes?

>   The parent pointer is in a
>union with the 2-word rcu_head, so the backpointer comes at no
>extra cost as well.

So we have a shrinker walking backwards from radix-tree nodes and
reaching up into address_spaces.  We need to take steps to prevent
those address_spaces from getting shot down (reclaim, umount, truncate,
etc) while we're doing this.  What's happening here?

> 4. The node needs to be linked to an LRU list, which requires a list
>head inside the node.  This does increase the size of the node, but
>it does not change the number of objects that fit into a slab page.
>
> ...
>
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -32,7 +32,7 @@ struct list_lru {
>  };
>  
>  void list_lru_destroy(struct list_lru *lru);
> -int list_lru_init(struct list_lru *lru);
> +int list_lru_init(struct list_lru *lru, struct lock_class_key *key);

It's a bit of a shame to be adding overhead to non-lockdep kernels.  A
few ifdefs could fix this.

Presumably this is being done to squish some lockdep warning you hit. 
A comment at the list_lru_init() implementation site would be useful. 
One which describes the warning and why it's OK to squish it.

>
> ...
>
>  struct radix_tree_node {
> - unsigned int height; /* Height from the bottom */
> + unsigned int path;   /* Offset in parent & height from the bottom */
>   unsigned int count;
>   union {
> - struct radix_tree_node *parent; /* Used when ascending tree */
> - struct rcu_head rcu_head;   /* Used when freeing node */
> + /* Used when ascending tree */
> + struct {
> + struct radix_tree_node *parent;
> + void *private;

Private to whom?  The radix-tree implementation?  The radix-tree caller?

> + };
> + /* Used when freeing node */
> + struct rcu_head rcu_head;
>   };
> + struct list_head lru;

Locking for this list?

>   void __rcu  *slots[RADIX_TREE_MAP_SIZE];
>   unsigned long   tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
>  };
>  
>
> ...
>
> +static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> + struct shrink_control *sc)
> +{
> + unsigned long count;
> +
> + local_irq_disable();
> + count = list_lru_count_node(&workingset_shadow_nodes, sc->nid);
> + local_irq_enable();

I'm struggling with the local_irq_disable() here.  Presumably it's
there to quash a lockdep warning, but page_cache_tree_delete() and
friends can get away without the local_irq_disable().  Some more
clarity here would be nice.

> + return count;
> +}
> +
> +#define NOIRQ_BATCH 32
> +
> +static enum lru_status shadow_lru_isolate(struct list_head *item,
> +   spinlock_t *lru_lock,
> +   void *arg)
> +{
> + struct address_space *mapping;
> + struct 

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2013-11-25 Thread Dave Chinner
On Sun, Nov 24, 2013 at 06:38:28PM -0500, Johannes Weiner wrote:
> Previously, page cache radix tree nodes were freed after reclaim
> emptied out their page pointers.  But now reclaim stores shadow
> entries in their place, which are only reclaimed when the inodes
> themselves are reclaimed.  This is problematic for bigger files that
> are still in use after they have a significant amount of their cache
> reclaimed, without any of those pages actually refaulting.  The shadow
> entries will just sit there and waste memory.  In the worst case, the
> shadow entries will accumulate until the machine runs out of memory.
> 
> To get this under control, the VM will track radix tree nodes
> exclusively containing shadow entries on a per-NUMA node list.  A
> simple shrinker will reclaim these nodes on memory pressure.
> 
> A few things need to be stored in the radix tree node to implement the
> shadow node LRU and allow tree deletions coming from the list:
> 
> 1. There is no index available that would describe the reverse path
>from the node up to the tree root, which is needed to perform a
>deletion.  To solve this, encode in each node its offset inside the
>parent.  This can be stored in the unused upper bits of the same
>member that stores the node's height at no extra space cost.
> 
> 2. The number of shadow entries needs to be counted in addition to the
>regular entries, to quickly detect when the node is ready to go to
>the shadow node LRU list.  The current entry count is an unsigned
>int but the maximum number of entries is 64, so a shadow counter
>can easily be stored in the unused upper bits.
> 
> 3. Tree modification needs the lock, which is located in the address
>space, so store a backpointer to it.  The parent pointer is in a
>union with the 2-word rcu_head, so the backpointer comes at no
>extra cost as well.
> 
> 4. The node needs to be linked to an LRU list, which requires a list
>head inside the node.  This does increase the size of the node, but
>it does not change the number of objects that fit into a slab page.
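
To illustrate points 1 and 2, the bit packing could look roughly like the
helpers below; the shift values and names are illustrative, not necessarily
what the patch uses:

/* ->path: height in the low bits, offset within the parent above them. */
#define EX_HEIGHT_SHIFT         6       /* wide enough for any legal height */
#define EX_HEIGHT_MASK          ((1U << EX_HEIGHT_SHIFT) - 1)

static inline unsigned int ex_height(unsigned int path)
{
        return path & EX_HEIGHT_MASK;
}

static inline unsigned int ex_parent_offset(unsigned int path)
{
        return path >> EX_HEIGHT_SHIFT;
}

/* ->count: page entries in the low bits, shadow entries counted above. */
#define EX_COUNT_SHIFT          16
#define EX_COUNT_MASK           ((1U << EX_COUNT_SHIFT) - 1)

static inline unsigned int ex_pages(unsigned int count)
{
        return count & EX_COUNT_MASK;
}

static inline unsigned int ex_shadows(unsigned int count)
{
        return count >> EX_COUNT_SHIFT;
}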
> 
> Signed-off-by: Johannes Weiner 
> ---
>  fs/super.c|   4 +-
>  fs/xfs/xfs_buf.c  |   2 +-
>  fs/xfs/xfs_qm.c   |   2 +-
>  include/linux/list_lru.h  |   2 +-
>  include/linux/radix-tree.h|  30 +++---
>  include/linux/swap.h  |   1 +
>  include/linux/vm_event_item.h |   1 +
>  lib/radix-tree.c  |  36 +++-
>  mm/filemap.c  |  70 
>  mm/list_lru.c |   4 +-
>  mm/truncate.c |  19 ++-
>  mm/vmstat.c   |   2 +
>  mm/workingset.c   | 124 
> ++
>  13 files changed, 255 insertions(+), 42 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 0225c20..a958d52 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -196,9 +196,9 @@ static struct super_block *alloc_super(struct 
> file_system_type *type, int flags)
>   INIT_HLIST_BL_HEAD(&s->s_anon);
>   INIT_LIST_HEAD(&s->s_inodes);
>  
> - if (list_lru_init(&s->s_dentry_lru))
> + if (list_lru_init(&s->s_dentry_lru, NULL))
>   goto err_out;
> - if (list_lru_init(&s->s_inode_lru))
> + if (list_lru_init(&s->s_inode_lru, NULL))
>   goto err_out_dentry_lru;

rather than modifying all the callers of list_lru_init(), can you
add a new function list_lru_init_key() and implement list_lru_init()
as a wrapper around it?

[snip radix tree modifications I didn't look at]

>  static int page_cache_tree_insert(struct address_space *mapping,
> struct page *page, void **shadowp)
>  {

> + radix_tree_replace_slot(slot, page);
> + if (node) {
> + node->count++;
> + /* Installed page, can't be shadow-only anymore */
> + if (!list_empty(&node->lru))
> + list_lru_del(&workingset_shadow_nodes, &node->lru);
> + }
> + return 0;

Hm - what's the overhead of direct management of LRU removal
here? Most list_lru code uses lazy removal (i.e. via the shrinker)
to avoid having to touch the LRU when adding new references to an
object.

> +
> +/*
> + * Page cache radix tree nodes containing only shadow entries can grow
> + * excessively on certain workloads.  That's why they are tracked on
> + * per-(NUMA)node lists and pushed back by a shrinker, but with a
> + * slightly higher threshold than regular shrinkers so we don't
> + * discard the entries too eagerly - after all, during light memory
> + * pressure is exactly when we need them.
> + *
> + * The list_lru lock nests inside the IRQ-safe mapping->tree_lock, so
> + * we have to disable IRQs for any list_lru operation as well.
> + */
> +
> +struct list_lru workingset_shadow_nodes;
> +
> +static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> +   

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2013-11-25 Thread Dave Chinner
On Sun, Nov 24, 2013 at 06:38:28PM -0500, Johannes Weiner wrote:
 Previously, page cache radix tree nodes were freed after reclaim
 emptied out their page pointers.  But now reclaim stores shadow
 entries in their place, which are only reclaimed when the inodes
 themselves are reclaimed.  This is problematic for bigger files that
 are still in use after they have a significant amount of their cache
 reclaimed, without any of those pages actually refaulting.  The shadow
 entries will just sit there and waste memory.  In the worst case, the
 shadow entries will accumulate until the machine runs out of memory.
 
 To get this under control, the VM will track radix tree nodes
 exclusively containing shadow entries on a per-NUMA node list.  A
 simple shrinker will reclaim these nodes on memory pressure.
 
 A few things need to be stored in the radix tree node to implement the
 shadow node LRU and allow tree deletions coming from the list:
 
 1. There is no index available that would describe the reverse path
from the node up to the tree root, which is needed to perform a
deletion.  To solve this, encode in each node its offset inside the
parent.  This can be stored in the unused upper bits of the same
member that stores the node's height at no extra space cost.
 
 2. The number of shadow entries needs to be counted in addition to the
regular entries, to quickly detect when the node is ready to go to
the shadow node LRU list.  The current entry count is an unsigned
int but the maximum number of entries is 64, so a shadow counter
can easily be stored in the unused upper bits.
 
 3. Tree modification needs the lock, which is located in the address
space, so store a backpointer to it.  The parent pointer is in a
union with the 2-word rcu_head, so the backpointer comes at no
extra cost as well.
 
 4. The node needs to be linked to an LRU list, which requires a list
head inside the node.  This does increase the size of the node, but
it does not change the number of objects that fit into a slab page.
 
 Signed-off-by: Johannes Weiner han...@cmpxchg.org
 ---
  fs/super.c|   4 +-
  fs/xfs/xfs_buf.c  |   2 +-
  fs/xfs/xfs_qm.c   |   2 +-
  include/linux/list_lru.h  |   2 +-
  include/linux/radix-tree.h|  30 +++---
  include/linux/swap.h  |   1 +
  include/linux/vm_event_item.h |   1 +
  lib/radix-tree.c  |  36 +++-
  mm/filemap.c  |  70 
  mm/list_lru.c |   4 +-
  mm/truncate.c |  19 ++-
  mm/vmstat.c   |   2 +
  mm/workingset.c   | 124 
 ++
  13 files changed, 255 insertions(+), 42 deletions(-)
 
 diff --git a/fs/super.c b/fs/super.c
 index 0225c20..a958d52 100644
 --- a/fs/super.c
 +++ b/fs/super.c
 @@ -196,9 +196,9 @@ static struct super_block *alloc_super(struct 
 file_system_type *type, int flags)
   INIT_HLIST_BL_HEAD(s-s_anon);
   INIT_LIST_HEAD(s-s_inodes);
  
 - if (list_lru_init(s-s_dentry_lru))
 + if (list_lru_init(s-s_dentry_lru, NULL))
   goto err_out;
 - if (list_lru_init(s-s_inode_lru))
 + if (list_lru_init(s-s_inode_lru, NULL))
   goto err_out_dentry_lru;

rather than modifying all the callers of list_lru_init(), can you
add a new function list_lru_init_key() and implement list_lru_init()
as a wrapper around it?

[snip radix tree modifications I didn't look at]

  static int page_cache_tree_insert(struct address_space *mapping,
 struct page *page, void **shadowp)
  {

 + radix_tree_replace_slot(slot, page);
 + if (node) {
 + node-count++;
 + /* Installed page, can't be shadow-only anymore */
 + if (!list_empty(node-lru))
 + list_lru_del(workingset_shadow_nodes, node-lru);
 + }
 + return 0;

Hm - what's the overhead of direct management of LRU removal
here? Most list_lru code uses lazy removal (i.e. via the shrinker)
to avoid having to touch the LRU when adding new references to an
object.

 +
 +/*
 + * Page cache radix tree nodes containing only shadow entries can grow
 + * excessively on certain workloads.  That's why they are tracked on
 + * per-(NUMA)node lists and pushed back by a shrinker, but with a
 + * slightly higher threshold than regular shrinkers so we don't
 + * discard the entries too eagerly - after all, during light memory
 + * pressure is exactly when we need them.
 + *
 + * The list_lru lock nests inside the IRQ-safe mapping-tree_lock, so
 + * we have to disable IRQs for any list_lru operation as well.
 + */
 +
 +struct list_lru workingset_shadow_nodes;
 +
 +static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 + struct shrink_control *sc)
 

Re: [patch 9/9] mm: keep page cache radix tree nodes in check

2013-11-25 Thread Andrew Morton
On Sun, 24 Nov 2013 18:38:28 -0500 Johannes Weiner han...@cmpxchg.org wrote:

 Previously, page cache radix tree nodes were freed after reclaim
 emptied out their page pointers.  But now reclaim stores shadow
 entries in their place, which are only reclaimed when the inodes
 themselves are reclaimed.  This is problematic for bigger files that
 are still in use after they have a significant amount of their cache
 reclaimed, without any of those pages actually refaulting.  The shadow
 entries will just sit there and waste memory.  In the worst case, the
 shadow entries will accumulate until the machine runs out of memory.
 
 To get this under control, the VM will track radix tree nodes
 exclusively containing shadow entries on a per-NUMA node list.

Why per-node rather than a global list?

  A simple shrinker will reclaim these nodes on memory pressure.

Truncate needs to go off and massacre these things as well - some
description of how that happens would be useful.

 A few things need to be stored in the radix tree node to implement the
 shadow node LRU and allow tree deletions coming from the list:
 
 1. There is no index available that would describe the reverse path
from the node up to the tree root, which is needed to perform a
deletion.  To solve this, encode in each node its offset inside the
parent.  This can be stored in the unused upper bits of the same
member that stores the node's height at no extra space cost.
 
 2. The number of shadow entries needs to be counted in addition to the
regular entries, to quickly detect when the node is ready to go to
the shadow node LRU list.  The current entry count is an unsigned
int but the maximum number of entries is 64, so a shadow counter
can easily be stored in the unused upper bits.
 
 3. Tree modification needs the lock, which is located in the address
space,

Presumably the lock == tree_lock.

so store a backpointer to it.

looks at the code

it is the address_space, not tree_lock, yes?

   The parent pointer is in a
union with the 2-word rcu_head, so the backpointer comes at no
extra cost as well.

So we have a shrinker walking backwards from radix-tree nodes and
reaching up into address_spaces.  We need to take steps to prevent
those address_spaces from getting shot down (reclaim, umount, truncate,
etc) while we're doing this.  What's happening here?

 4. The node needs to be linked to an LRU list, which requires a list
head inside the node.  This does increase the size of the node, but
it does not change the number of objects that fit into a slab page.
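
For the packing described in points 1 and 2, the kind of helpers this
implies might look like the following; the names and bit widths here
are illustrative, not the patch's actual macros:

/* low bits of ->path: height; bits above: offset in the parent */
#define NODE_HEIGHT_BITS	8
/* low bits of ->count: real pages; bits above: shadow entries */
#define NODE_COUNT_BITS		16

static inline unsigned int node_height(struct radix_tree_node *node)
{
	return node->path & ((1U << NODE_HEIGHT_BITS) - 1);
}

static inline unsigned int node_parent_offset(struct radix_tree_node *node)
{
	return node->path >> NODE_HEIGHT_BITS;
}

static inline unsigned int node_page_entries(struct radix_tree_node *node)
{
	return node->count & ((1U << NODE_COUNT_BITS) - 1);
}

static inline unsigned int node_shadow_entries(struct radix_tree_node *node)
{
	return node->count >> NODE_COUNT_BITS;
}

With 64 slots per node, both halves fit comfortably in an unsigned int.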

 ...

 --- a/include/linux/list_lru.h
 +++ b/include/linux/list_lru.h
 @@ -32,7 +32,7 @@ struct list_lru {
  };
  
  void list_lru_destroy(struct list_lru *lru);
 -int list_lru_init(struct list_lru *lru);
 +int list_lru_init(struct list_lru *lru, struct lock_class_key *key);

It's a bit of a shame to be adding overhead to non-lockdep kernels.  A
few ifdefs could fix this.

Presumably this is being done to squish some lockdep warning you hit. 
A comment at the list_lru_init() implementation site would be useful. 
One which describes the warning and why it's OK to squish it.
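
One reading of "a few ifdefs", purely illustrative: keep the
two-argument call sites from this patch but compile the key away when
lockdep is not configured.

#ifdef CONFIG_DEBUG_LOCK_ALLOC
int __list_lru_init(struct list_lru *lru, struct lock_class_key *key);
#define list_lru_init(lru, key)		__list_lru_init(lru, key)
#else
int __list_lru_init(struct list_lru *lru);
#define list_lru_init(lru, key)		__list_lru_init(lru)
#endif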


 ...

  struct radix_tree_node {
 - unsigned int	height;	/* Height from the bottom */
 + unsigned int	path;	/* Offset in parent & height from the bottom */
   unsigned int	count;
   union {
 - struct radix_tree_node *parent; /* Used when ascending tree */
 - struct rcu_head rcu_head;   /* Used when freeing node */
 + /* Used when ascending tree */
 + struct {
 + struct radix_tree_node *parent;
 + void *private;

Private to whom?  The radix-tree implementation?  The radix-tree caller?

 + };
 + /* Used when freeing node */
 + struct rcu_head rcu_head;
   };
 + struct list_head lru;

Locking for this list?

   void __rcu  *slots[RADIX_TREE_MAP_SIZE];
   unsigned long   tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
  };
  

 ...

 +static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 + struct shrink_control *sc)
 +{
 + unsigned long count;
 +
 + local_irq_disable();
 + count = list_lru_count_node(&workingset_shadow_nodes, sc->nid);
 + local_irq_enable();

I'm struggling with the local_irq_disable() here.  Presumably it's
there to quash a lockdep warning, but page_cache_tree_delete() and
friends can get away without the local_irq_disable().  Some more
clarity here would be nice.

 + return count;
 +}
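
One plausible reading of the asymmetry, as an assumption rather than
something stated in the patch: the page-cache delete path already runs
with IRQs off because it takes the tree_lock IRQ-safe, so its
shadow-LRU bookkeeping is covered, while the shrinker callbacks hold
no such lock and have to disable IRQs themselves:

	/* delete path: IRQs are already off around the list_lru calls */
	spin_lock_irq(&mapping->tree_lock);
	/* ... page_cache_tree_delete() and its shadow-LRU updates ... */
	spin_unlock_irq(&mapping->tree_lock);

	/* shrinker count path: no tree_lock held, so IRQs are disabled
	 * explicitly around the list_lru access, as count_shadow_nodes()
	 * does above */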
 +
 +#define NOIRQ_BATCH 32
 +
 +static enum lru_status shadow_lru_isolate(struct list_head *item,
 +   spinlock_t *lru_lock,
 +   void *arg)
 +{
 + struct address_space *mapping;
 + struct radix_tree_node *node;
 + unsigned long *batch = arg;
 + 

[patch 9/9] mm: keep page cache radix tree nodes in check

2013-11-24 Thread Johannes Weiner
Previously, page cache radix tree nodes were freed after reclaim
emptied out their page pointers.  But now reclaim stores shadow
entries in their place, which are only reclaimed when the inodes
themselves are reclaimed.  This is problematic for bigger files that
are still in use after they have a significant amount of their cache
reclaimed, without any of those pages actually refaulting.  The shadow
entries will just sit there and waste memory.  In the worst case, the
shadow entries will accumulate until the machine runs out of memory.

To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list.  A
simple shrinker will reclaim these nodes on memory pressure.

A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:

1. There is no index available that would describe the reverse path
   from the node up to the tree root, which is needed to perform a
   deletion.  To solve this, encode in each node its offset inside the
   parent.  This can be stored in the unused upper bits of the same
   member that stores the node's height at no extra space cost.

2. The number of shadow entries needs to be counted in addition to the
   regular entries, to quickly detect when the node is ready to go to
   the shadow node LRU list.  The current entry count is an unsigned
   int but the maximum number of entries is 64, so a shadow counter
   can easily be stored in the unused upper bits.

3. Tree modification needs the lock, which is located in the address
   space, so store a backpointer to it.  The parent pointer is in a
   union with the 2-word rcu_head, so the backpointer comes at no
   extra cost as well.

4. The node needs to be linked to an LRU list, which requires a list
   head inside the node.  This does increase the size of the node, but
   it does not change the number of objects that fit into a slab page.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 fs/super.c|   4 +-
 fs/xfs/xfs_buf.c  |   2 +-
 fs/xfs/xfs_qm.c   |   2 +-
 include/linux/list_lru.h  |   2 +-
 include/linux/radix-tree.h|  30 +++---
 include/linux/swap.h  |   1 +
 include/linux/vm_event_item.h |   1 +
 lib/radix-tree.c  |  36 +++-
 mm/filemap.c  |  70 
 mm/list_lru.c |   4 +-
 mm/truncate.c |  19 ++-
 mm/vmstat.c   |   2 +
 mm/workingset.c   | 124 ++
 13 files changed, 255 insertions(+), 42 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 0225c20..a958d52 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -196,9 +196,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
	INIT_HLIST_BL_HEAD(&s->s_anon);
	INIT_LIST_HEAD(&s->s_inodes);
 
-	if (list_lru_init(&s->s_dentry_lru))
+	if (list_lru_init(&s->s_dentry_lru, NULL))
		goto err_out;
-	if (list_lru_init(&s->s_inode_lru))
+	if (list_lru_init(&s->s_inode_lru, NULL))
		goto err_out_dentry_lru;
 
	INIT_LIST_HEAD(&s->s_mounts);
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 2634700..c49cbce 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1670,7 +1670,7 @@ xfs_alloc_buftarg(
if (xfs_setsize_buftarg_early(btp, bdev))
goto error;
 
-	if (list_lru_init(&btp->bt_lru))
+	if (list_lru_init(&btp->bt_lru, NULL))
goto error;
 
btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 3e6c2e6..57d6aa9 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -831,7 +831,7 @@ xfs_qm_init_quotainfo(
 
qinf = mp->m_quotainfo = kmem_zalloc(sizeof(xfs_quotainfo_t), KM_SLEEP);
 
-	if ((error = list_lru_init(&qinf->qi_lru))) {
+	if ((error = list_lru_init(&qinf->qi_lru, NULL))) {
kmem_free(qinf);
mp->m_quotainfo = NULL;
return error;
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 3ce5417..b970a45 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -32,7 +32,7 @@ struct list_lru {
 };
 
 void list_lru_destroy(struct list_lru *lru);
-int list_lru_init(struct list_lru *lru);
+int list_lru_init(struct list_lru *lru, struct lock_class_key *key);
 
 /**
  * list_lru_add: add an element to the lru list's tail
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 13636c4..29df11f 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -72,21 +72,35 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
 #define RADIX_TREE_TAG_LONGS   \
((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
 
+#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * 
