Re: [patch 09/10] mm: thrash detection-based file cache sizing

2013-06-07 Thread Johannes Weiner
On Fri, Jun 07, 2013 at 06:16:05PM +0400, Roman Gushchin wrote:
> On 30.05.2013 22:04, Johannes Weiner wrote:
> >+/*
> >+ * Monotonic workingset clock for non-resident pages.
> >+ *
> >+ * The refault distance of a page is the number of ticks that occurred
> >+ * between that page's eviction and subsequent refault.
> >+ *
> >+ * Every page slot that is taken away from the inactive list is one
> >+ * more slot the inactive list would have to grow again in order to
> >+ * hold the current non-resident pages in memory as well.
> >+ *
> >+ * As the refault distance needs to reflect the space missing on the
> >+ * inactive list, the workingset time is advanced every time the
> >+ * inactive list is shrunk.  This means eviction, but also activation.
> >+ */
> >+static atomic_long_t workingset_time;
> 
> It seems strange to me that workingset_time is global.
> Don't you want to make it per-cgroup?

Yes, we need to get there, and the code is structured so that it can
be adapted to memcg in the future.

But we will still need to maintain a global view of the workingset
time as memory and data are not exclusive resources, or at least can't
be guaranteed to be, so refault distances always need to be applicable
to all containers in the system.  That said, in response to Peter's
feedback I changed the global workingset_time variable to a per-zone
one.  The per-zone floating proportions that I used to break down
global speed are now used in reverse, to scale the zone time up to
global time.

> Two more questions:
> 1) do you plan to take fadvise hints into account somehow?

DONTNEED is honored: shadow entries will be dropped in the fadvised
region.  Is that what you meant?
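
For reference, a minimal userspace sketch of issuing that hint.  The
helper name drop_file_cache() is made up for illustration;
posix_fadvise() and POSIX_FADV_DONTNEED are the standard POSIX
interface.  Whether shadow entries are dropped for the range depends on
the kernel carrying this series.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L	/* expose posix_fadvise() */
#endif
#include <fcntl.h>

/*
 * Hypothetical helper: tell the kernel that the cached pages of the
 * whole file behind fd are not needed anymore.
 *
 * offset 0 with len 0 means "from the start to the end of the file".
 * posix_fadvise() returns 0 on success and an error number on failure;
 * it does not set errno.
 */
static int drop_file_cache(int fd)
{
	return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

A caller would typically invoke this after it has finished streaming
through a large file it does not expect to read again.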

> 2) do you plan to use workingset information to enhance
>   the readahead mechanism?

I don't have any specific plans for this and I'm not sure if detecting
thrashing alone would be a good predicate.  It would make more sense
to adjust readahead windows if readahead pages are reclaimed before
they are used, and that may happen even in the absence of refaults.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 09/10] mm: thrash detection-based file cache sizing

2013-06-07 Thread Roman Gushchin

On 30.05.2013 22:04, Johannes Weiner wrote:

>+/*
>+ * Monotonic workingset clock for non-resident pages.
>+ *
>+ * The refault distance of a page is the number of ticks that occurred
>+ * between that page's eviction and subsequent refault.
>+ *
>+ * Every page slot that is taken away from the inactive list is one
>+ * more slot the inactive list would have to grow again in order to
>+ * hold the current non-resident pages in memory as well.
>+ *
>+ * As the refault distance needs to reflect the space missing on the
>+ * inactive list, the workingset time is advanced every time the
>+ * inactive list is shrunk.  This means eviction, but also activation.
>+ */
>+static atomic_long_t workingset_time;


It seems strange to me that workingset_time is global.
Don't you want to make it per-cgroup?

Two more questions:
1) do you plan to take fadvise hints into account somehow?
2) do you plan to use workingset information to enhance
   the readahead mechanism?

Thanks!

Regards,
Roman


[patch 09/10] mm: thrash detection-based file cache sizing

2013-05-30 Thread Johannes Weiner
The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have
been shown to benefit from caching in the past.  We call the recently
used list the "inactive list" and the frequently used list the "active
list".

The tricky part of this model is finding the right balance between
them.  A big inactive list may not leave enough room for the active
list to protect all the frequently used pages.  A big active list may
not leave enough room for the inactive list for a new set of
frequently used pages, a "working set", to establish itself because the
young pages get pushed out of memory before having a chance to get
promoted.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more grace time in the face of temporary use-once streams,
but was not satisfactory when use-once streaming persisted over longer
periods of time and the established working set was temporarily
suspended, like a nightly backup evicting all the interactive user
program data.

Subsequently, the rules were changed to only age active pages when
they exceeded the number of inactive pages, i.e. leave the working set
alone as long as the other half of memory consists of easy-to-reclaim
use-once pages.  This works well until working set transitions exceed
the size of half of memory and the average access distance between the
pages of the new working set is bigger than the inactive list.  The VM
will mistake the thrashing new working set for use-once streaming,
while the unused old working set pages are stuck on the active list.

This patch solves this problem by maintaining a history of recently
evicted file pages, which in turn allows the VM to tell used-once page
streams from thrashing file cache.

To accomplish this, a global counter is increased every time a page is
evicted and a snapshot of that counter is stored as a shadow entry in
the page's now empty page cache radix tree slot.  Upon refault of that
page, the difference between the current value of that counter and the
shadow entry value is called the refault distance.  It tells how many
pages have been evicted since this page's eviction, which is how many
page slots are missing from the inactive list for this page to get
accessed twice while in memory.  If the number of missing slots is
less than or equal to the number of active pages, increasing the
inactive list at the cost of the active list would give this thrashing
set a chance to establish itself:

          eviction counter = 4

                        evicted      inactive           active
 Page cache data:       [ a b c d ]  [ e f g h i j k ]  [ l m n ]
  Shadow entries:         0 1 2 3
Refault distance:         4 3 2 1
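
As an illustration only, the bookkeeping in the diagram can be modelled
in a few lines of userspace C.  This is a sketch, not the code in
mm/workingset.c; struct cache_model and all function names below are
hypothetical.

```c
/* Hypothetical userspace model of the eviction counter. */
struct cache_model {
	unsigned long eviction_counter;	/* advanced on every eviction */
	unsigned long nr_active;	/* pages on the active list */
};

/*
 * Evict one page.  The returned snapshot is what would be stored as
 * the shadow entry in the page's now empty slot.
 */
static unsigned long model_evict(struct cache_model *m)
{
	return m->eviction_counter++;
}

/*
 * Number of evictions since the page carrying this shadow entry was
 * evicted.  Unsigned subtraction stays correct if the counter wraps.
 */
static unsigned long refault_distance(const struct cache_model *m,
				      unsigned long shadow)
{
	return m->eviction_counter - shadow;
}

/*
 * The active list is challenged only if giving up active pages could
 * have made enough room to keep the refaulting page in memory.
 */
static int should_challenge_active(const struct cache_model *m,
				   unsigned long shadow)
{
	return refault_distance(m, shadow) <= m->nr_active;
}
```

Evicting a, b, c and d in that order from a model with three active
pages stores the shadow entries 0 to 3.  The refault distance of c is
then 2, which is within the active list size, while the distance of a
is 4, which is not.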

When c is faulted back into memory, it is noted that two more page
slots on the inactive list could have prevented the refault.  Thus,
the active list needs to be challenged for those two page slots as it
is possible that c is used more frequently than l, m, n.  However, c
might also be used much less frequently than the active pages and so
a) pages cannot be directly reclaimed from the tail of the active list
and b) refaulting pages cannot be directly activated.  Instead,
active pages are moved from the tail of the active list to the head of
the inactive list and placed directly next to the refaulting pages.
This way, they both have the same time on the inactive list to prove
which page is actually used more frequently without incurring
unnecessary major faults or diluting the active page set in case the
previously active page is in fact the more frequently used one.
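
A simplified sketch of that list movement, assuming a toy doubly
linked LRU rather than the kernel's lruvec lists.  All type and
function names are hypothetical, and the deactivation is done
synchronously here only to keep the model short.

```c
#include <stddef.h>

struct page_model {
	char name;
	struct page_model *prev, *next;
};

/* Circular list with a sentinel: head.next is the head, head.prev the tail. */
struct lru_list {
	struct page_model head;
	unsigned long nr;
};

static void lru_init(struct lru_list *l)
{
	l->head.prev = l->head.next = &l->head;
	l->nr = 0;
}

static void lru_add_head(struct lru_list *l, struct page_model *p)
{
	p->next = l->head.next;
	p->prev = &l->head;
	l->head.next->prev = p;
	l->head.next = p;
	l->nr++;
}

/* Unlink and return the tail page, or NULL if the list is empty. */
static struct page_model *lru_del_tail(struct lru_list *l)
{
	struct page_model *p = l->head.prev;

	if (p == &l->head)
		return NULL;
	p->prev->next = &l->head;
	l->head.prev = p->prev;
	p->prev = p->next = NULL;
	l->nr--;
	return p;
}

/*
 * Challenge the active list for one slot: the active tail page is
 * neither reclaimed nor is the refaulting page activated.  Both end up
 * next to each other at the head of the inactive list.
 */
static void refault_challenge(struct lru_list *active,
			      struct lru_list *inactive,
			      struct page_model *refaulted)
{
	struct page_model *deactivated = lru_del_tail(active);

	if (deactivated)
		lru_add_head(inactive, deactivated);
	lru_add_head(inactive, refaulted);
}
```

Both pages then age from the same position, so whichever is referenced
again first gets promoted by the normal inactive list rules.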

On multi-node systems, workloads with different page access
frequencies may execute concurrently on separate nodes.  On refault,
the page allocator walks the list of allowed zones to allocate a page
frame for the refaulting page.  For each zone, a local refault
distance is calculated that is proportional to the zone's recent share
of global evictions.  This local distance is then compared to the
local number of active pages, so the decision to rebalance the lists
is made on an individual per-zone basis.
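
The per-zone scaling can be sketched roughly as follows.  The sketch
uses plain eviction counts as a stand-in for the zone's recent share of
global evictions, which is an assumption made for brevity, and the
function names are hypothetical.

```c
#include <stdint.h>

/*
 * Scale a global refault distance down to one zone in proportion to
 * that zone's share of all evictions.
 *
 * Assumption: the counts are small enough that the 64-bit product
 * does not overflow.  With no evictions recorded there is nothing to
 * scale, so the local distance is 0.
 */
static unsigned long zone_refault_distance(unsigned long global_distance,
					   unsigned long zone_evictions,
					   unsigned long total_evictions)
{
	if (total_evictions == 0)
		return 0;
	return (unsigned long)((uint64_t)global_distance * zone_evictions /
			       total_evictions);
}

/* Per-zone decision: compare the local distance to the local active list. */
static int zone_should_rebalance(unsigned long global_distance,
				 unsigned long zone_evictions,
				 unsigned long total_evictions,
				 unsigned long zone_nr_active)
{
	return zone_refault_distance(global_distance, zone_evictions,
				     total_evictions) <= zone_nr_active;
}
```

For example, a zone responsible for a quarter of all evictions sees a
global distance of 100 as a local distance of 25, and rebalances only
if it has at least 25 active pages.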

Signed-off-by: Johannes Weiner 
---
 include/linux/mmzone.h |   6 ++
 include/linux/swap.h   |   8 +-
 mm/Makefile            |   2 +-
 mm/memcontrol.c        |   3 +
 mm/mmzone.c            |   1 +
 mm/page_alloc.c        |   9 +-
 mm/swap.c              |   2 +
 mm/vmscan.c            |  45 +++---
 mm/vmstat.c            |   3 +
 mm/workingset.c        | 233 +
 10 files changed, 293 insertions(+), 19 deletions(-)
 create mode 100644 mm/workingset.c

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 370a35f..505bd80 100644
--- 
