Re: [patch 09/10] mm: thrash detection-based file cache sizing
On Fri, Jun 07, 2013 at 06:16:05PM +0400, Roman Gushchin wrote:
> On 30.05.2013 22:04, Johannes Weiner wrote:
> >+/*
> >+ * Monotonic workingset clock for non-resident pages.
> >+ *
> >+ * The refault distance of a page is the number of ticks that occurred
> >+ * between that page's eviction and subsequent refault.
> >+ *
> >+ * Every page slot that is taken away from the inactive list is one
> >+ * more slot the inactive list would have to grow again in order to
> >+ * hold the current non-resident pages in memory as well.
> >+ *
> >+ * As the refault distance needs to reflect the space missing on the
> >+ * inactive list, the workingset time is advanced every time the
> >+ * inactive list is shrunk.  This means eviction, but also activation.
> >+ */
> >+static atomic_long_t workingset_time;
>
> It seems strange to me, that workingset_time is global.
> Don't you want to make it per-cgroup?

Yes, we need to go there and the code is structured so that it will be
possible to adapt memcg in the future.  But we will still need to
maintain a global view of the workingset time as memory and data are
not exclusive resources, or at least can't be guaranteed to be, so
refault distances always need to be applicable to all containers in
the system.

But in response to Peter's feedback, I changed the workingset_time
global variable to a per-zone one and then use the per-zone floating
proportions that I used to break down global speed in reverse to scale
up the zone time to global time.

> Two more questions:
> 1) do you plan to take fadvise's into account somehow?

DONTNEED is honored, shadow entries will be dropped in the fadvised
region.  Is that what you meant?

> 2) do you plan to use workingset information to enhance
> the readahead mechanism?

I don't have any specific plans for this and I'm not sure if detecting
thrashing alone would be a good predicate.  It would make more sense
to adjust readahead windows if readahead pages are reclaimed before
they are used, and that may happen even in the absence of refaults.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [patch 09/10] mm: thrash detection-based file cache sizing
On 30.05.2013 22:04, Johannes Weiner wrote:
>+/*
>+ * Monotonic workingset clock for non-resident pages.
>+ *
>+ * The refault distance of a page is the number of ticks that occurred
>+ * between that page's eviction and subsequent refault.
>+ *
>+ * Every page slot that is taken away from the inactive list is one
>+ * more slot the inactive list would have to grow again in order to
>+ * hold the current non-resident pages in memory as well.
>+ *
>+ * As the refault distance needs to reflect the space missing on the
>+ * inactive list, the workingset time is advanced every time the
>+ * inactive list is shrunk.  This means eviction, but also activation.
>+ */
>+static atomic_long_t workingset_time;

It seems strange to me, that workingset_time is global.
Don't you want to make it per-cgroup?

Two more questions:
1) do you plan to take fadvise's into account somehow?
2) do you plan to use workingset information to enhance
the readahead mechanism?

Thanks!

Regards,
Roman
[patch 09/10] mm: thrash detection-based file cache sizing
The VM maintains cached filesystem pages on two types of lists.  One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have been
shown to benefit from caching in the past.  We call the recently used
list "inactive list" and the frequently used list "active list".

The tricky part of this model is finding the right balance between
them.  A big inactive list may not leave enough room for the active
list to protect all the frequently used pages.  A big active list may
not leave enough room for the inactive list for a new set of
frequently used pages (a "working set") to establish itself, because
the young pages get pushed out of memory before having a chance to get
promoted.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list.  This model gave established
working sets more grace time in the face of temporary use-once
streams, but was not satisfactory when use-once streaming persisted
over longer periods of time and the established working set was
temporarily suspended, like a nightly backup evicting all the
interactive user program data.

Subsequently, the rules were changed to only age active pages when
they exceeded the amount of inactive pages, i.e. leave the working set
alone as long as the other half of memory consists of easy-to-reclaim
use-once pages.  This works well until working set transitions exceed
the size of half of memory and the average access distance between the
pages of the new working set is bigger than the inactive list.  The VM
will mistake the thrashing new working set for use-once streaming,
while the unused old working set pages are stuck on the active list.
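The pre-patch balancing rule described above can be sketched in a few
lines of C.  This is only an illustration of the rule as stated in the
text; the helper name and its parameters are hypothetical and not taken
from the patch.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not taken from the patch: under the
 * pre-patch rule, active pages are only aged while they
 * outnumber the inactive pages.
 */
static bool should_age_active(unsigned long nr_active,
			      unsigned long nr_inactive)
{
	return nr_active > nr_inactive;
}
```

As long as the active list is no larger than the inactive list, the
check is false and the old working set is left untouched, which is the
situation the thrashing scenario above runs into.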
This patch solves this problem by maintaining a history of recently
evicted file pages, which in turn allows the VM to tell used-once page
streams from thrashing file cache.

To accomplish this, a global counter is increased every time a page is
evicted and a snapshot of that counter is stored as shadow entry in
the page's now empty page cache radix tree slot.  Upon refault of that
page, the difference between the current value of that counter and the
shadow entry value is called the refault distance.  It tells how many
pages have been evicted since this page's eviction, which is how many
page slots are missing from the inactive list for this page to get
accessed twice while in memory.  If the number of missing slots is
less than or equal to the number of active pages, increasing the
inactive list at the cost of the active list would give this thrashing
set a chance to establish itself:

  eviction counter = 4

                           evicted        inactive        active
  Page cache data:       [ a b c d ]  [ e f g h i j k ]  [ l m n ]
  Shadow entries:          0 1 2 3
  Refault distance:        4 3 2 1

When c is faulted back into memory, it is noted that two more page
slots on the inactive list could have prevented the refault.  Thus,
the active list needs to be challenged for those two page slots as it
is possible that c is used more frequently than l, m, n.  However, c
might also be used much less frequently than the active pages and so
1) pages cannot be directly reclaimed from the tail of the active list
and 2) refaulting pages cannot be directly activated.  Instead, active
pages are moved from the tail of the active list to the head of the
inactive list and placed directly next to the refaulting pages.  This
way, they both have the same time on the inactive list to prove which
page is actually used more frequently without incurring unnecessary
major faults or diluting the active page set in case the previously
active page is in fact the more frequently used one.
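The bookkeeping described above can be sketched as follows.  This is a
minimal userspace illustration under stated assumptions, not the code
from mm/workingset.c: the function names are hypothetical, the counter
is a plain variable rather than an atomic, and only eviction advances
it, as in the description.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the global eviction counter. */
static unsigned long eviction_counter;

/*
 * On eviction: return a snapshot of the counter to be stored as
 * the page's shadow entry, then advance the counter by one tick.
 */
static unsigned long evict_page(void)
{
	return eviction_counter++;
}

/*
 * On refault: number of evictions since this page was evicted.
 * Unsigned subtraction stays correct across counter wraparound.
 */
static unsigned long refault_distance(unsigned long shadow)
{
	return eviction_counter - shadow;
}

/*
 * The active list is challenged only if the missing slots could
 * be covered by shrinking it, i.e. distance <= active pages.
 */
static bool should_challenge_active(unsigned long distance,
				    unsigned long nr_active)
{
	return distance <= nr_active;
}
```

Replaying the diagram: evicting a, b, c, d in order yields shadow
entries 0, 1, 2, 3 and leaves the counter at 4.  A refault of c then
has distance 4 - 2 = 2, which is within the three active pages, so the
active list is challenged; a refault of a has distance 4 and is not.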
On multi-node systems, workloads with different page access
frequencies may execute concurrently on separate nodes.  On refault,
the page allocator walks the list of allowed zones to allocate a page
frame for the refaulting page.  For each zone, a local refault
distance is calculated that is proportional to the zone's recent share
of global evictions.  This local distance is then compared to the
local number of active pages, so the decision to rebalance the lists
is made on an individual per-zone basis.

Signed-off-by: Johannes Weiner <han...@cmpxchg.org>
---
 include/linux/mmzone.h |   6 ++
 include/linux/swap.h   |   8 +-
 mm/Makefile            |   2 +-
 mm/memcontrol.c        |   3 +
 mm/mmzone.c            |   1 +
 mm/page_alloc.c        |   9 +-
 mm/swap.c              |   2 +
 mm/vmscan.c            |  45 +++---
 mm/vmstat.c            |   3 +
 mm/workingset.c        | 233 +
 10 files changed, 293 insertions(+), 19 deletions(-)
 create mode 100644 mm/workingset.c

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 370a35f..505bd80 100644
---