Re: [patch 10/10] mm: keep page cache radix tree nodes in check

2014-02-04 Thread Johannes Weiner
On Tue, Feb 04, 2014 at 03:07:56PM -0800, Andrew Morton wrote:
> On Mon,  3 Feb 2014 19:53:42 -0500 Johannes Weiner  wrote:
> 
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers.  But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed.  This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting.  The shadow
> > entries will just sit there and waste memory.  In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> > 
> > To get this under control, the VM will track radix tree nodes
> > exclusively containing shadow entries on a per-NUMA node list.
> > Per-NUMA rather than global because we expect the radix tree nodes
> > themselves to be allocated node-locally and we want to reduce
> > cross-node references of otherwise independent cache workloads.  A
> > simple shrinker will then reclaim these nodes on memory pressure.

^^^
> > A few things need to be stored in the radix tree node to implement the
> > shadow node LRU and allow tree deletions coming from the list:
> > 
> > 1. There is no index available that would describe the reverse path
> >    from the node up to the tree root, which is needed to perform a
> >    deletion.  To solve this, encode in each node its offset inside the
> >    parent.  This can be stored in the unused upper bits of the same
> >    member that stores the node's height at no extra space cost.
> > 
> > 2. The number of shadow entries needs to be counted in addition to the
> >    regular entries, to quickly detect when the node is ready to go to
> >    the shadow node LRU list.  The current entry count is an unsigned
> >    int but the maximum number of entries is 64, so a shadow counter
> >    can easily be stored in the unused upper bits.
> > 
> > 3. Tree modification needs tree lock and tree root, which are located
> >    in the address space, so store an address_space backpointer in the
> >    node.  The parent pointer of the node is in a union with the 2-word
> >    rcu_head, so the backpointer comes at no extra cost as well.
> > 
> > 4. The node needs to be linked to an LRU list, which requires a list
> >    head inside the node.  This does increase the size of the node, but
> >    it does not change the number of objects that fit into a slab page.
> 
> changelog forgot to mention that this reclaim is performed via a
> shrinker...

Uhm...  see above? :)

> How expensive is that list walk in scan_shadow_nodes()?  I assume in
> the best case it will bale out after nr_to_scan iterations?

Yes, it scans sc->nr_to_scan radix tree nodes, cleans their pointers,
and frees them.
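
In shape, that walk is roughly the following (a sketch against the
list_lru API of this kernel, not the patch's actual mm/workingset.c
code; shadow_lru_isolate() is the callback visible in the profiles
below, workingset_shadow_nodes is an assumed name for the per-NUMA
list, and locking details are omitted):

/*
 * Sketch: scan side of the shadow-node shrinker.  The list_lru
 * walker decrements *nr_to_walk once per item, so it bails out
 * after sc->nr_to_scan iterations even when the list is longer.
 * The isolate callback unlinks one node, clears its shadow
 * entries, and frees it.
 */
static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
				       struct shrink_control *sc)
{
	return list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
				  shadow_lru_isolate, NULL,
				  &sc->nr_to_scan);
}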

I ran a worst-case scenario on an 8G machine that creates one 8T
sparse file and faults one page per 64-page radix tree node, i.e. one
node per sparse file fault at CPU speed.  The profile:

 1   9.21% radixblow  [kernel.kallsyms]   [k] memset
 2   7.23% radixblow  [kernel.kallsyms]   [k] do_mpage_readpage
 3   4.76% radixblow  [kernel.kallsyms]   [k] copy_user_generic_string
 4   3.85% radixblow  [kernel.kallsyms]   [k] __radix_tree_lookup
 5   3.32%   kswapd0  [kernel.kallsyms]   [k] shadow_lru_isolate
 6   2.92% radixblow  [kernel.kallsyms]   [k] get_page_from_freelist
 7   2.81%   kswapd0  [kernel.kallsyms]   [k] __delete_from_page_cache
 8   2.50% radixblow  [kernel.kallsyms]   [k] radix_tree_node_ctor
 9   1.79% radixblow  [kernel.kallsyms]   [k] _raw_spin_lock_irq
10   1.70%   kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common

Same scenario with 4 pages per 64-page radix tree node:

13   1.39%   kswapd0  [kernel.kallsyms]   [k] shadow_lru_isolate

16 pages per 64-page node:

75   0.20%   kswapd0  [kernel.kallsyms]   [k] shadow_lru_isolate

So I doubt this will bother anyone, especially since most use-once
streamers should have a better population density and populate cache
at disk speed, not CPU speed.
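
For reference, a userspace sketch of that worst case (fault one page
per 64-slot leaf node of an 8T sparse file; a reconstruction for
illustration, not the original radixblow source):

#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const off_t size = 8ULL << 40;		/* 8T, fully sparse */
	const off_t stride = 64 * 4096;		/* one page per leaf node */
	char buf[4096];
	int fd = open("radixblow.dat", O_CREAT | O_RDWR, 0600);

	if (fd < 0 || ftruncate(fd, size) < 0) {
		perror("radixblow");
		return 1;
	}
	/* Each read caches one page and allocates a 64-slot node. */
	for (off_t off = 0; off < size; off += stride)
		if (pread(fd, buf, sizeof(buf), off) < 0)
			break;
	close(fd);
	return 0;
}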
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [patch 10/10] mm: keep page cache radix tree nodes in check

2014-02-04 Thread Andrew Morton
On Mon,  3 Feb 2014 19:53:42 -0500 Johannes Weiner  wrote:

> Previously, page cache radix tree nodes were freed after reclaim
> emptied out their page pointers.  But now reclaim stores shadow
> entries in their place, which are only reclaimed when the inodes
> themselves are reclaimed.  This is problematic for bigger files that
> are still in use after they have a significant amount of their cache
> reclaimed, without any of those pages actually refaulting.  The shadow
> entries will just sit there and waste memory.  In the worst case, the
> shadow entries will accumulate until the machine runs out of memory.
> 
> To get this under control, the VM will track radix tree nodes
> exclusively containing shadow entries on a per-NUMA node list.
> Per-NUMA rather than global because we expect the radix tree nodes
> themselves to be allocated node-locally and we want to reduce
> cross-node references of otherwise independent cache workloads.  A
> simple shrinker will then reclaim these nodes on memory pressure.
> 
> A few things need to be stored in the radix tree node to implement the
> shadow node LRU and allow tree deletions coming from the list:
> 
> 1. There is no index available that would describe the reverse path
>    from the node up to the tree root, which is needed to perform a
>    deletion.  To solve this, encode in each node its offset inside the
>    parent.  This can be stored in the unused upper bits of the same
>    member that stores the node's height at no extra space cost.
> 
> 2. The number of shadow entries needs to be counted in addition to the
>    regular entries, to quickly detect when the node is ready to go to
>    the shadow node LRU list.  The current entry count is an unsigned
>    int but the maximum number of entries is 64, so a shadow counter
>    can easily be stored in the unused upper bits.
> 
> 3. Tree modification needs tree lock and tree root, which are located
>    in the address space, so store an address_space backpointer in the
>    node.  The parent pointer of the node is in a union with the 2-word
>    rcu_head, so the backpointer comes at no extra cost as well.
> 
> 4. The node needs to be linked to an LRU list, which requires a list
>    head inside the node.  This does increase the size of the node, but
>    it does not change the number of objects that fit into a slab page.

changelog forgot to mention that this reclaim is performed via a
shrinker...

How expensive is that list walk in scan_shadow_nodes()?  I assume in
the best case it will bale out after nr_to_scan iterations?



[patch 10/10] mm: keep page cache radix tree nodes in check

2014-02-03 Thread Johannes Weiner
Previously, page cache radix tree nodes were freed after reclaim
emptied out their page pointers.  But now reclaim stores shadow
entries in their place, which are only reclaimed when the inodes
themselves are reclaimed.  This is problematic for bigger files that
are still in use after they have a significant amount of their cache
reclaimed, without any of those pages actually refaulting.  The shadow
entries will just sit there and waste memory.  In the worst case, the
shadow entries will accumulate until the machine runs out of memory.

To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list.
Per-NUMA rather than global because we expect the radix tree nodes
themselves to be allocated node-locally and we want to reduce
cross-node references of otherwise independent cache workloads.  A
simple shrinker will then reclaim these nodes on memory pressure.
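
A minimal sketch of how such a shrinker is wired up NUMA-aware, so
that the VM hands it a node id in sc->nid (scan_shadow_nodes() is
mentioned in the discussion above; count_shadow_nodes and the
shrinker's name are assumptions for illustration):

/*
 * SHRINKER_NUMA_AWARE makes the core invoke the callbacks per
 * NUMA node under pressure, with sc->nid selecting the list.
 */
static struct shrinker workingset_shadow_shrinker = {
	.count_objects	= count_shadow_nodes,
	.scan_objects	= scan_shadow_nodes,
	.seeks		= DEFAULT_SEEKS,
	.flags		= SHRINKER_NUMA_AWARE,
};

static int __init workingset_init(void)
{
	return register_shrinker(&workingset_shadow_shrinker);
}
module_init(workingset_init);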

A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:

1. There is no index available that would describe the reverse path
   from the node up to the tree root, which is needed to perform a
   deletion.  To solve this, encode in each node its offset inside the
   parent.  This can be stored in the unused upper bits of the same
   member that stores the node's height at no extra space cost.

2. The number of shadow entries needs to be counted in addition to the
   regular entries, to quickly detect when the node is ready to go to
   the shadow node LRU list.  The current entry count is an unsigned
   int but the maximum number of entries is 64, so a shadow counter
   can easily be stored in the unused upper bits, as illustrated in
   the sketch after this list.

3. Tree modification needs tree lock and tree root, which are located
   in the address space, so store an address_space backpointer in the
   node.  The parent pointer of the node is in a union with the 2-word
   rcu_head, so the backpointer comes at no extra cost as well.

4. The node needs to be linked to an LRU list, which requires a list
   head inside the node.  This does increase the size of the node, but
   it does not change the number of objects that fit into a slab page.
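
To make the packing in points 1 and 2 concrete, the accessors come
down to shifts and masks (a sketch using the macro names from the
radix-tree.h hunk below; the helper functions are mine, for
exposition only):

/* node->path: height in the low bits, offset in parent above them */
static inline unsigned int node_height(struct radix_tree_node *node)
{
	return node->path & RADIX_TREE_HEIGHT_MASK;
}

static inline unsigned int node_parent_offset(struct radix_tree_node *node)
{
	return node->path >> RADIX_TREE_HEIGHT_SHIFT;
}

/* node->count: page pointers in the low bits, shadow entries above */
static inline unsigned int node_page_count(struct radix_tree_node *node)
{
	return node->count & RADIX_TREE_COUNT_MASK;
}

static inline unsigned int node_shadow_count(struct radix_tree_node *node)
{
	return node->count >> RADIX_TREE_COUNT_SHIFT;
}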

Signed-off-by: Johannes Weiner 
Reviewed-by: Rik van Riel 
Reviewed-by: Minchan Kim 
---
 include/linux/list_lru.h   |   2 +
 include/linux/mmzone.h     |   1 +
 include/linux/radix-tree.h |  32 +++---
 include/linux/swap.h       |  31 ++
 lib/radix-tree.c           |  36 +++-
 mm/filemap.c               |  90 +++-
 mm/list_lru.c              |  10
 mm/truncate.c              |  26 -
 mm/vmstat.c                |   1 +
 mm/workingset.c            | 143 +
 10 files changed, 332 insertions(+), 40 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 3ce541753c88..b02fc233eadd 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -13,6 +13,8 @@
 /* list_lru_walk_cb has to always return one of those */
 enum lru_status {
 	LRU_REMOVED,		/* item removed from list */
+	LRU_REMOVED_RETRY,	/* item removed, but lock has been
+				   dropped and reacquired */
 	LRU_ROTATE,		/* item referenced, give another pass */
 	LRU_SKIP,		/* item cannot be locked, skip */
 	LRU_RETRY,		/* item not freeable. May drop the lock
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4bdeb411a4d..934820b3249c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -144,6 +144,7 @@ enum zone_stat_item {
 #endif
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
+	WORKINGSET_NODERECLAIM,
 	NR_ANON_TRANSPARENT_HUGEPAGES,
 	NR_FREE_CMA_PAGES,
 	NR_VM_ZONE_STAT_ITEMS };
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 13636c40bc42..33170dbd9db4 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -72,21 +72,37 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
 #define RADIX_TREE_TAG_LONGS	\
 	((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
 
+#define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
+#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
+					  RADIX_TREE_MAP_SHIFT))
+
+/* Height component in node->path */
+#define RADIX_TREE_HEIGHT_SHIFT	(RADIX_TREE_MAX_PATH + 1)
+#define RADIX_TREE_HEIGHT_MASK	((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1)
+
+/* Internally used bits of node->count */
+#define RADIX_TREE_COUNT_SHIFT	(RADIX_TREE_MAP_SHIFT + 1)
+#define RADIX_TREE_COUNT_MASK	((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
+
 struct radix_tree_node {
-	unsigned int	height;		/* Height from the bottom */
+	unsigned int	path;	/* Offset in parent & height from the bottom */
