Re: [patch 7/9] mm: thrash detection-based file cache sizing

2014-01-16 Thread Johannes Weiner
On Wed, Jan 15, 2014 at 10:57:21AM +0800, Bob Liu wrote:
> On 01/15/2014 03:16 AM, Johannes Weiner wrote:
> > On Tue, Jan 14, 2014 at 09:01:09AM +0800, Bob Liu wrote:
> >> Good job! This patch looks good to me, and it has nice descriptions.
> >> But it seems that this patch only fixes the issue "working set changes
> >> bigger than half of cache memory go undetected and thrash indefinitely".
> >> My concern is whether it can easily be extended to address all the
> >> other issues on top of this patch set?
> >>
> >> The other possible way is something like Peter's CART and Clock-Pro
> >> implementations, which I think may be better because they use more
> >> advanced algorithms and consider the problem as a whole from the
> >> beginning.  (Sorry, I haven't had enough time to read the source code,
> >> so I'm not 100% sure.)
> >> http://linux-mm.org/PeterZClockPro2
> > 
> > My patches are moving the VM towards something that is comparable to
> > how Peter implemented Clock-Pro.  However, the current VM has evolved
> > over time in small increments based on real life performance
> > observations.  Rewriting everything in one go would be incredibly
> > disruptive and I doubt very much we would merge any such proposal in
> > the first place.  So it's not like I don't see the big picture, it's
> > just divide and conquer:
> > 
> > Peter's Clock-Pro implementation was basically a double clock with an
> > intricate system to classify hotness, augmented by eviction
> > information to work with reuse distances independent of memory size.
> > 
> > What we have right now is a double clock with a very rudimentary
> > system to classify whether a page is hot: it has been accessed twice
> > while on the inactive clock.  My patches now add eviction information
> > to this, and improve the classification so that it can work with reuse
> > distances up to memory size and is no longer dependent on the inactive
> > clock size.
> > 
> > This is the smallest imaginable step that is still useful, and even
> > then we had a lot of discussions about scalability of the data
> > structures and confusion about how the new data point should be
> > interpreted.  It also took a long time until somebody read the series
> > and went, "Ok, this actually makes sense to me."  Now, maybe I suck at
> > documenting, but maybe this is just complicated stuff.  Either way, we
> > have to get there collectively, so that the code is maintainable in
> > the long term.
> > 
> > Once we have these new concepts established, we can further improve
> > the hotness detector so that it can classify and order pages with
> > reuse distances beyond memory size.  But this will come with its own
> > set of problems.  For example, some time ago we stopped regularly
> > scanning and rotating active pages because of scalability issues, but
> > we'll most likely need an up-to-date estimate of the reuse distances on
> > the active list in order to classify refaults properly.
> > 
> 
> Thank you for your kind explanation. It makes sense to me; please feel
> free to add my review.

Thank you!

> >>> + * Approximating inactive page access frequency - Observations:
> >>> + *
> >>> + * 1. When a page is accessed for the first time, it is added to the
> >>> + *head of the inactive list, slides every existing inactive page
> >>> + *towards the tail by one slot, and pushes the current tail page
> >>> + *out of memory.
> >>> + *
> >>> + * 2. When a page is accessed for the second time, it is promoted to
> >>> + *the active list, shrinking the inactive list by one slot.  This
> >>> + *also slides all inactive pages that were faulted into the cache
> >>> + *more recently than the activated page towards the tail of the
> >>> + *inactive list.
> >>> + *
> >>
> >> Nitpick, how about the reference bit?
> > 
> > What do you mean?
> > 
> 
> Sorry, I mean the PG_referenced flag. I thought that when a page is
> accessed for the second time, only the PG_referenced flag would be set,
> instead of the page being promoted to the active list.

It's cleared during rotation or not set on pages that came in through
readahead, but the first access sets the bit and the second access
activates it.
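The two-access rule described here can be sketched as a toy model (purely illustrative Python; `ToyClocks` and its method names are invented for the sketch and are not kernel code):

```python
from collections import OrderedDict

class ToyClocks:
    """Toy model of the inactive/active lists and PG_referenced."""
    def __init__(self):
        self.inactive = OrderedDict()  # key -> referenced bit; first entry = head
        self.active = []

    def add(self, key, accessed=True):
        # A demand fault counts as the first access and sets the bit;
        # readahead pages enter the cache with the bit clear.
        self.inactive[key] = accessed
        self.inactive.move_to_end(key, last=False)  # insert at the head

    def touch(self, key):
        # mark_page_accessed()-like behaviour: the first access sets
        # the referenced bit, the second access activates the page.
        if key in self.inactive:
            if self.inactive[key]:
                del self.inactive[key]
                self.active.append(key)
            else:
                self.inactive[key] = True

    def rotate_tail(self):
        # Rotation at the tail clears the referenced bit, so the
        # two-access rule starts over for the rotated page.
        key, _ = self.inactive.popitem(last=True)
        self.inactive[key] = False  # re-inserted at the tail, bit clear

lru = ToyClocks()
lru.add("page", accessed=False)  # readahead page, bit clear
lru.touch("page")                # first access: sets the bit
lru.touch("page")                # second access: activates
assert "page" in lru.active
```

A demand-faulted page (`add` with the default `accessed=True`) is activated by a single subsequent `touch`, since the fault itself counted as the first access.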
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 7/9] mm: thrash detection-based file cache sizing

2014-01-14 Thread Zhang Yanfei
Hello

On 01/15/2014 10:57 AM, Bob Liu wrote:
> 
> On 01/15/2014 03:16 AM, Johannes Weiner wrote:
>> On Tue, Jan 14, 2014 at 09:01:09AM +0800, Bob Liu wrote:
>>> Hi Johannes,
>>>
>>> On 01/11/2014 02:10 AM, Johannes Weiner wrote:
 The VM maintains cached filesystem pages on two types of lists.  One
 list holds the pages recently faulted into the cache, the other list
 holds pages that have been referenced repeatedly on that first list.
 The idea is to prefer reclaiming young pages over those that have
 shown to benefit from caching in the past.  We call the recently used
 list "inactive list" and the frequently used list "active list".

 Currently, the VM aims for a 1:1 ratio between the lists, which is the
 "perfect" trade-off between the ability to *protect* frequently used
 pages and the ability to *detect* frequently used pages.  This means
 that working set changes bigger than half of cache memory go
 undetected and thrash indefinitely, whereas working sets bigger than
 half of cache memory are unprotected against used-once streams that
 don't even need caching.

>>>
>>> Good job! This patch looks good to me, and it has nice descriptions.
>>> But it seems that this patch only fixes the issue "working set changes
>>> bigger than half of cache memory go undetected and thrash indefinitely".
>>> My concern is whether it can easily be extended to address all the
>>> other issues on top of this patch set?
>>>
>>> The other possible way is something like Peter's CART and Clock-Pro
>>> implementations, which I think may be better because they use more
>>> advanced algorithms and consider the problem as a whole from the
>>> beginning.  (Sorry, I haven't had enough time to read the source code,
>>> so I'm not 100% sure.)
>>> http://linux-mm.org/PeterZClockPro2
>>
>> My patches are moving the VM towards something that is comparable to
>> how Peter implemented Clock-Pro.  However, the current VM has evolved
>> over time in small increments based on real life performance
>> observations.  Rewriting everything in one go would be incredibly
>> disruptive and I doubt very much we would merge any such proposal in
>> the first place.  So it's not like I don't see the big picture, it's
>> just divide and conquer:
>>
>> Peter's Clock-Pro implementation was basically a double clock with an
>> intricate system to classify hotness, augmented by eviction
>> information to work with reuse distances independent of memory size.
>>
>> What we have right now is a double clock with a very rudimentary
>> system to classify whether a page is hot: it has been accessed twice
>> while on the inactive clock.  My patches now add eviction information
>> to this, and improve the classification so that it can work with reuse
>> distances up to memory size and is no longer dependent on the inactive
>> clock size.
>>
>> This is the smallest imaginable step that is still useful, and even
>> then we had a lot of discussions about scalability of the data
>> structures and confusion about how the new data point should be
>> interpreted.  It also took a long time until somebody read the series
>> and went, "Ok, this actually makes sense to me."  Now, maybe I suck at
>> documenting, but maybe this is just complicated stuff.  Either way, we
>> have to get there collectively, so that the code is maintainable in
>> the long term.
>>
>> Once we have these new concepts established, we can further improve
>> the hotness detector so that it can classify and order pages with
>> reuse distances beyond memory size.  But this will come with its own
>> set of problems.  For example, some time ago we stopped regularly
>> scanning and rotating active pages because of scalability issues, but
>> we'll most likely need an up-to-date estimate of the reuse distances on
>> the active list in order to classify refaults properly.
>>
> 
> Thank you for your kind explanation. It makes sense to me; please feel
> free to add my review.
> 
 + * Approximating inactive page access frequency - Observations:
 + *
 + * 1. When a page is accessed for the first time, it is added to the
 + *head of the inactive list, slides every existing inactive page
 + *towards the tail by one slot, and pushes the current tail page
 + *out of memory.
 + *
 + * 2. When a page is accessed for the second time, it is promoted to
 + *the active list, shrinking the inactive list by one slot.  This
 + *also slides all inactive pages that were faulted into the cache
 + *more recently than the activated page towards the tail of the
 + *inactive list.
 + *
>>>
>>> Nitpick, how about the reference bit?
>>
>> What do you mean?
>>
> 
> Sorry, I mean the PG_referenced flag. I thought that when a page is
> accessed for the second time, only the PG_referenced flag would be set,
> instead of the page being promoted to the active list.
> 

No. Let me try to explain a bit. For mapped file pages, if the second access
occurs on a 

Re: [patch 7/9] mm: thrash detection-based file cache sizing

2014-01-14 Thread Bob Liu

On 01/15/2014 03:16 AM, Johannes Weiner wrote:
> On Tue, Jan 14, 2014 at 09:01:09AM +0800, Bob Liu wrote:
>> Hi Johannes,
>>
>> On 01/11/2014 02:10 AM, Johannes Weiner wrote:
>>> The VM maintains cached filesystem pages on two types of lists.  One
>>> list holds the pages recently faulted into the cache, the other list
>>> holds pages that have been referenced repeatedly on that first list.
>>> The idea is to prefer reclaiming young pages over those that have
>>> shown to benefit from caching in the past.  We call the recently used
>>> list "inactive list" and the frequently used list "active list".
>>>
>>> Currently, the VM aims for a 1:1 ratio between the lists, which is the
>>> "perfect" trade-off between the ability to *protect* frequently used
>>> pages and the ability to *detect* frequently used pages.  This means
>>> that working set changes bigger than half of cache memory go
>>> undetected and thrash indefinitely, whereas working sets bigger than
>>> half of cache memory are unprotected against used-once streams that
>>> don't even need caching.
>>>
>>
>> Good job! This patch looks good to me, and it has nice descriptions.
>> But it seems that this patch only fixes the issue "working set changes
>> bigger than half of cache memory go undetected and thrash indefinitely".
>> My concern is whether it can easily be extended to address all the
>> other issues on top of this patch set?
>>
>> The other possible way is something like Peter's CART and Clock-Pro
>> implementations, which I think may be better because they use more
>> advanced algorithms and consider the problem as a whole from the
>> beginning.  (Sorry, I haven't had enough time to read the source code,
>> so I'm not 100% sure.)
>> http://linux-mm.org/PeterZClockPro2
> 
> My patches are moving the VM towards something that is comparable to
> how Peter implemented Clock-Pro.  However, the current VM has evolved
> over time in small increments based on real life performance
> observations.  Rewriting everything in one go would be incredibly
> disruptive and I doubt very much we would merge any such proposal in
> the first place.  So it's not like I don't see the big picture, it's
> just divide and conquer:
> 
> Peter's Clock-Pro implementation was basically a double clock with an
> intricate system to classify hotness, augmented by eviction
> information to work with reuse distances independent of memory size.
> 
> What we have right now is a double clock with a very rudimentary
> system to classify whether a page is hot: it has been accessed twice
> while on the inactive clock.  My patches now add eviction information
> to this, and improve the classification so that it can work with reuse
> distances up to memory size and is no longer dependent on the inactive
> clock size.
> 
> This is the smallest imaginable step that is still useful, and even
> then we had a lot of discussions about scalability of the data
> structures and confusion about how the new data point should be
> interpreted.  It also took a long time until somebody read the series
> and went, "Ok, this actually makes sense to me."  Now, maybe I suck at
> documenting, but maybe this is just complicated stuff.  Either way, we
> have to get there collectively, so that the code is maintainable in
> the long term.
> 
> Once we have these new concepts established, we can further improve
> the hotness detector so that it can classify and order pages with
> reuse distances beyond memory size.  But this will come with its own
> set of problems.  For example, some time ago we stopped regularly
> scanning and rotating active pages because of scalability issues, but
> we'll most likely need an up-to-date estimate of the reuse distances on
> the active list in order to classify refaults properly.
> 

Thank you for your kind explanation. It makes sense to me; please feel
free to add my review.

>>> + * Approximating inactive page access frequency - Observations:
>>> + *
>>> + * 1. When a page is accessed for the first time, it is added to the
>>> + *head of the inactive list, slides every existing inactive page
>>> + *towards the tail by one slot, and pushes the current tail page
>>> + *out of memory.
>>> + *
>>> + * 2. When a page is accessed for the second time, it is promoted to
>>> + *the active list, shrinking the inactive list by one slot.  This
>>> + *also slides all inactive pages that were faulted into the cache
>>> + *more recently than the activated page towards the tail of the
>>> + *inactive list.
>>> + *
>>
>> Nitpick, how about the reference bit?
> 
> What do you mean?
> 

Sorry, I mean the PG_referenced flag. I thought that when a page is
accessed for the second time, only the PG_referenced flag would be set,
instead of the page being promoted to the active list.

-- 
Regards,
-Bob

Re: [patch 7/9] mm: thrash detection-based file cache sizing

2014-01-14 Thread Johannes Weiner
On Tue, Jan 14, 2014 at 09:01:09AM +0800, Bob Liu wrote:
> Hi Johannes,
> 
> On 01/11/2014 02:10 AM, Johannes Weiner wrote:
> > The VM maintains cached filesystem pages on two types of lists.  One
> > list holds the pages recently faulted into the cache, the other list
> > holds pages that have been referenced repeatedly on that first list.
> > The idea is to prefer reclaiming young pages over those that have
> > shown to benefit from caching in the past.  We call the recently used
> > list "inactive list" and the frequently used list "active list".
> > 
> > Currently, the VM aims for a 1:1 ratio between the lists, which is the
> > "perfect" trade-off between the ability to *protect* frequently used
> > pages and the ability to *detect* frequently used pages.  This means
> > that working set changes bigger than half of cache memory go
> > undetected and thrash indefinitely, whereas working sets bigger than
> > half of cache memory are unprotected against used-once streams that
> > don't even need caching.
> > 
> 
> Good job! This patch looks good to me, and it has nice descriptions.
> But it seems that this patch only fixes the issue "working set changes
> bigger than half of cache memory go undetected and thrash indefinitely".
> My concern is whether it can easily be extended to address all the
> other issues on top of this patch set?
> 
> The other possible way is something like Peter's CART and Clock-Pro
> implementations, which I think may be better because they use more
> advanced algorithms and consider the problem as a whole from the
> beginning.  (Sorry, I haven't had enough time to read the source code,
> so I'm not 100% sure.)
> http://linux-mm.org/PeterZClockPro2

My patches are moving the VM towards something that is comparable to
how Peter implemented Clock-Pro.  However, the current VM has evolved
over time in small increments based on real life performance
observations.  Rewriting everything in one go would be incredibly
disruptive and I doubt very much we would merge any such proposal in
the first place.  So it's not like I don't see the big picture, it's
just divide and conquer:

Peter's Clock-Pro implementation was basically a double clock with an
intricate system to classify hotness, augmented by eviction
information to work with reuse distances independent of memory size.

What we have right now is a double clock with a very rudimentary
system to classify whether a page is hot: it has been accessed twice
while on the inactive clock.  My patches now add eviction information
to this, and improve the classification so that it can work with reuse
distances up to memory size and is no longer dependent on the inactive
clock size.
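The eviction information mentioned above can be sketched roughly as follows (an illustrative Python model of the refault-distance idea; the class name, counters, and the comparison against the active list size are simplifying assumptions for the sketch, not the series' exact code):

```python
class RefaultTracker:
    """Simplified model of refault-distance detection.

    On eviction, a page leaves behind a "shadow" recording the value of
    a global eviction counter.  On refault, the difference between then
    and now approximates the page's reuse distance: if the page would
    have stayed resident with a larger inactive list (distance no bigger
    than the assumed active list size), it is activated right away.
    """
    def __init__(self, nr_active):
        self.nr_active = nr_active   # assumed size of the active list
        self.evictions = 0           # global eviction counter
        self.shadows = {}            # page key -> counter value at eviction

    def evict(self, key):
        self.shadows[key] = self.evictions
        self.evictions += 1

    def refault(self, key):
        """Return True if the refaulting page should be activated."""
        if key not in self.shadows:
            return False             # no eviction information available
        distance = self.evictions - self.shadows.pop(key)
        return distance <= self.nr_active

tracker = RefaultTracker(nr_active=2)
tracker.evict("a")
tracker.evict("b")
assert tracker.refault("a")       # distance 2 <= 2: would have fit, activate
assert not tracker.refault("a")   # shadow already consumed, no information
```

This is what lets the classification work with reuse distances up to memory size instead of only up to the inactive list size: the distance is measured from eviction, not from the page's position on the inactive clock.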

This is the smallest imaginable step that is still useful, and even
then we had a lot of discussions about scalability of the data
structures and confusion about how the new data point should be
interpreted.  It also took a long time until somebody read the series
and went, "Ok, this actually makes sense to me."  Now, maybe I suck at
documenting, but maybe this is just complicated stuff.  Either way, we
have to get there collectively, so that the code is maintainable in
the long term.

Once we have these new concepts established, we can further improve
the hotness detector so that it can classify and order pages with
reuse distances beyond memory size.  But this will come with its own
set of problems.  For example, some time ago we stopped regularly
scanning and rotating active pages because of scalability issues, but
we'll most likely need an up-to-date estimate of the reuse distances on
the active list in order to classify refaults properly.

> > + * Approximating inactive page access frequency - Observations:
> > + *
> > + * 1. When a page is accessed for the first time, it is added to the
> > + *head of the inactive list, slides every existing inactive page
> > + *towards the tail by one slot, and pushes the current tail page
> > + *out of memory.
> > + *
> > + * 2. When a page is accessed for the second time, it is promoted to
> > + *the active list, shrinking the inactive list by one slot.  This
> > + *also slides all inactive pages that were faulted into the cache
> > + *more recently than the activated page towards the tail of the
> > + *inactive list.
> > + *
> 
> Nitpick, how about the reference bit?

What do you mean?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 7/9] mm: thrash detection-based file cache sizing

2014-01-14 Thread Johannes Weiner
On Tue, Jan 14, 2014 at 09:01:09AM +0800, Bob Liu wrote:
 Hi Johannes,
 
 On 01/11/2014 02:10 AM, Johannes Weiner wrote:
  The VM maintains cached filesystem pages on two types of lists.  One
  list holds the pages recently faulted into the cache, the other list
  holds pages that have been referenced repeatedly on that first list.
  The idea is to prefer reclaiming young pages over those that have
  shown to benefit from caching in the past.  We call the recently used
  list inactive list and the frequently used list active list.
  
  Currently, the VM aims for a 1:1 ratio between the lists, which is the
  perfect trade-off between the ability to *protect* frequently used
  pages and the ability to *detect* frequently used pages.  This means
  that working set changes bigger than half of cache memory go
  undetected and thrash indefinitely, whereas working sets bigger than
  half of cache memory are unprotected against used-once streams that
  don't even need caching.
  
 
 Good job! This patch looks good to me and with nice descriptions.
 But it seems that this patch only fix the issue working set changes
 bigger than half of cache memory go undetected and thrash indefinitely.
 My concern is could it be extended easily to address all other issues
 based on this patch set?
 
 The other possible way is something like Peter has implemented the CART
 and Clock-Pro which I think may be better because of using advanced
 algorithms and consider the problem as a whole from the beginning.(Sorry
 I haven't get enough time to read the source code, so I'm not 100% sure.)
 http://linux-mm.org/PeterZClockPro2

My patches are moving the VM towards something that is comparable to
how Peter implemented Clock-Pro.  However, the current VM has evolved
over time in small increments based on real life performance
observations.  Rewriting everything in one go would be incredibly
disruptive and I doubt very much we would merge any such proposal in
the first place.  So it's not like I don't see the big picture, it's
just divide and conquer:

Peter's Clock-Pro implementation was basically a double clock with an
intricate system to classify hotness, augmented by eviction
information to work with reuse distances independent of memory size.

What we have right now is a double clock with a very rudimentary
system to classify whether a page is hot: it has been accessed twice
while on the inactive clock.  My patches now add eviction information
to this, and improve the classification so that it can work with reuse
distances up to memory size and is no longer dependent on the inactive
clock size.

This is the smallest imaginable step that is still useful, and even
then we had a lot of discussions about scalability of the data
structures and confusion about how the new data point should be
interpreted.  It also took a long time until somebody read the series
and went, Ok, this actually makes sense to me.  Now, maybe I suck at
documenting, but maybe this is just complicated stuff.  Either way, we
have to get there collectively, so that the code is maintainable in
the long term.

Once we have these new concepts established, we can further improve
the hotness detector so that it can classify and order pages with
reuse distances beyond memory size.  But this will come with its own
set of problems.  For example, some time ago we stopped regularly
scanning and rotating active pages because of scalability issues, but
we'll most likely need an uptodate estimate of the reuse distances on
the active list in order to classify refaults properly.

  + * Approximating inactive page access frequency - Observations:
  + *
  + * 1. When a page is accessed for the first time, it is added to the
  + *head of the inactive list, slides every existing inactive page
  + *towards the tail by one slot, and pushes the current tail page
  + *out of memory.
  + *
  + * 2. When a page is accessed for the second time, it is promoted to
  + *the active list, shrinking the inactive list by one slot.  This
  + *also slides all inactive pages that were faulted into the cache
  + *more recently than the activated page towards the tail of the
  + *inactive list.
  + *
 
 Nitpick, how about the reference bit?

What do you mean?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 7/9] mm: thrash detection-based file cache sizing

2014-01-14 Thread Bob Liu

On 01/15/2014 03:16 AM, Johannes Weiner wrote:
 On Tue, Jan 14, 2014 at 09:01:09AM +0800, Bob Liu wrote:
 Hi Johannes,

 On 01/11/2014 02:10 AM, Johannes Weiner wrote:
 The VM maintains cached filesystem pages on two types of lists.  One
 list holds the pages recently faulted into the cache, the other list
 holds pages that have been referenced repeatedly on that first list.
 The idea is to prefer reclaiming young pages over those that have
 shown to benefit from caching in the past.  We call the recently used
 list inactive list and the frequently used list active list.

 Currently, the VM aims for a 1:1 ratio between the lists, which is the
 perfect trade-off between the ability to *protect* frequently used
 pages and the ability to *detect* frequently used pages.  This means
 that working set changes bigger than half of cache memory go
 undetected and thrash indefinitely, whereas working sets bigger than
 half of cache memory are unprotected against used-once streams that
 don't even need caching.


 Good job! This patch looks good to me and with nice descriptions.
 But it seems that this patch only fix the issue working set changes
 bigger than half of cache memory go undetected and thrash indefinitely.
 My concern is could it be extended easily to address all other issues
 based on this patch set?

 The other possible way is something like Peter has implemented the CART
 and Clock-Pro which I think may be better because of using advanced
 algorithms and consider the problem as a whole from the beginning.(Sorry
 I haven't get enough time to read the source code, so I'm not 100% sure.)
 http://linux-mm.org/PeterZClockPro2
 
 My patches are moving the VM towards something that is comparable to
 how Peter implemented Clock-Pro.  However, the current VM has evolved
 over time in small increments based on real life performance
 observations.  Rewriting everything in one go would be incredibly
 disruptive and I doubt very much we would merge any such proposal in
 the first place.  So it's not like I don't see the big picture, it's
 just divide and conquer:
 
 Peter's Clock-Pro implementation was basically a double clock with an
 intricate system to classify hotness, augmented by eviction
 information to work with reuse distances independent of memory size.
 
 What we have right now is a double clock with a very rudimentary
 system to classify whether a page is hot: it has been accessed twice
 while on the inactive clock.  My patches now add eviction information
 to this, and improve the classification so that it can work with reuse
 distances up to memory size and is no longer dependent on the inactive
 clock size.
 
 This is the smallest imaginable step that is still useful, and even
 then we had a lot of discussions about scalability of the data
 structures and confusion about how the new data point should be
 interpreted.  It also took a long time until somebody read the series
 and went, Ok, this actually makes sense to me.  Now, maybe I suck at
 documenting, but maybe this is just complicated stuff.  Either way, we
 have to get there collectively, so that the code is maintainable in
 the long term.
 
 Once we have these new concepts established, we can further improve
 the hotness detector so that it can classify and order pages with
 reuse distances beyond memory size.  But this will come with its own
 set of problems.  For example, some time ago we stopped regularly
 scanning and rotating active pages because of scalability issues, but
 we'll most likely need an uptodate estimate of the reuse distances on
 the active list in order to classify refaults properly.
 

Thank you for your kindly explanation. It make sense to me please feel
free to add my review.

 + * Approximating inactive page access frequency - Observations:
 + *
 + * 1. When a page is accessed for the first time, it is added to the
 + *    head of the inactive list, slides every existing inactive page
 + *    towards the tail by one slot, and pushes the current tail page
 + *    out of memory.
 + *
 + * 2. When a page is accessed for the second time, it is promoted to
 + *    the active list, shrinking the inactive list by one slot.  This
 + *    also slides all inactive pages that were faulted into the cache
 + *    more recently than the activated page towards the tail of the
 + *    inactive list.
 + *

 Nitpick, how about the reference bit?
 
 What do you mean?
 

Sorry, I meant the PG_referenced flag. I thought that when a page is accessed
for the second time, only the PG_referenced flag will be set instead of the
page being promoted to the active list.

-- 
Regards,
-Bob
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 7/9] mm: thrash detection-based file cache sizing

2014-01-14 Thread Zhang Yanfei
Hello

On 01/15/2014 10:57 AM, Bob Liu wrote:
 
 On 01/15/2014 03:16 AM, Johannes Weiner wrote:
 On Tue, Jan 14, 2014 at 09:01:09AM +0800, Bob Liu wrote:
 Hi Johannes,

 On 01/11/2014 02:10 AM, Johannes Weiner wrote:
 The VM maintains cached filesystem pages on two types of lists.  One
 list holds the pages recently faulted into the cache, the other list
 holds pages that have been referenced repeatedly on that first list.
 The idea is to prefer reclaiming young pages over those that have
 shown to benefit from caching in the past.  We call the recently used
 list inactive list and the frequently used list active list.

 Currently, the VM aims for a 1:1 ratio between the lists, which is the
 perfect trade-off between the ability to *protect* frequently used
 pages and the ability to *detect* frequently used pages.  This means
 that working set changes bigger than half of cache memory go
 undetected and thrash indefinitely, whereas working sets bigger than
 half of cache memory are unprotected against used-once streams that
 don't even need caching.


 Good job! This patch looks good to me, with nice descriptions.
 But it seems that this patch only fixes the issue "working set changes
 bigger than half of cache memory go undetected and thrash indefinitely".
 My concern is whether it could easily be extended to address all the
 other issues based on this patch set.

 The other possible way is something like what Peter has implemented in
 CART and Clock-Pro, which I think may be better because it uses advanced
 algorithms and considers the problem as a whole from the beginning.  (Sorry,
 I haven't had enough time to read the source code, so I'm not 100% sure.)
 http://linux-mm.org/PeterZClockPro2

 My patches are moving the VM towards something that is comparable to
 how Peter implemented Clock-Pro.  However, the current VM has evolved
 over time in small increments based on real life performance
 observations.  Rewriting everything in one go would be incredibly
 disruptive and I doubt very much we would merge any such proposal in
 the first place.  So it's not like I don't see the big picture, it's
 just divide and conquer:

 Peter's Clock-Pro implementation was basically a double clock with an
 intricate system to classify hotness, augmented by eviction
 information to work with reuse distances independent of memory size.

 What we have right now is a double clock with a very rudimentary
 system to classify whether a page is hot: it has been accessed twice
 while on the inactive clock.  My patches now add eviction information
 to this, and improve the classification so that it can work with reuse
 distances up to memory size and is no longer dependent on the inactive
 clock size.

 This is the smallest imaginable step that is still useful, and even
 then we had a lot of discussions about scalability of the data
 structures and confusion about how the new data point should be
 interpreted.  It also took a long time until somebody read the series
  and went, "Ok, this actually makes sense to me."  Now, maybe I suck at
 documenting, but maybe this is just complicated stuff.  Either way, we
 have to get there collectively, so that the code is maintainable in
 the long term.

 Once we have these new concepts established, we can further improve
 the hotness detector so that it can classify and order pages with
 reuse distances beyond memory size.  But this will come with its own
 set of problems.  For example, some time ago we stopped regularly
 scanning and rotating active pages because of scalability issues, but
  we'll most likely need an up-to-date estimate of the reuse distances on
 the active list in order to classify refaults properly.

 
 Thank you for your kind explanation. It makes sense to me; please feel
 free to add my review.
 
 + * Approximating inactive page access frequency - Observations:
 + *
 + * 1. When a page is accessed for the first time, it is added to the
 + *    head of the inactive list, slides every existing inactive page
 + *    towards the tail by one slot, and pushes the current tail page
 + *    out of memory.
 + *
 + * 2. When a page is accessed for the second time, it is promoted to
 + *    the active list, shrinking the inactive list by one slot.  This
 + *    also slides all inactive pages that were faulted into the cache
 + *    more recently than the activated page towards the tail of the
 + *    inactive list.
 + *

 Nitpick, how about the reference bit?

 What do you mean?

 
 Sorry, I meant the PG_referenced flag. I thought that when a page is accessed
 for the second time, only the PG_referenced flag will be set instead of the
 page being promoted to the active list.
 

No. Let me try to explain a bit. For mapped file pages, if the second access
occurs on a different page table entry, the page is surely promoted to the
active list. But if the page is always accessed through the same page table
entry, it was mistakenly evicted. This was already fixed by Johannes by
reusing the PG_referenced flag; for details, please refer to commit 

Re: [patch 7/9] mm: thrash detection-based file cache sizing

2014-01-13 Thread Bob Liu
Hi Johannes,

On 01/11/2014 02:10 AM, Johannes Weiner wrote:
> The VM maintains cached filesystem pages on two types of lists.  One
> list holds the pages recently faulted into the cache, the other list
> holds pages that have been referenced repeatedly on that first list.
> The idea is to prefer reclaiming young pages over those that have
> shown to benefit from caching in the past.  We call the recently used
> list "inactive list" and the frequently used list "active list".
> 
> Currently, the VM aims for a 1:1 ratio between the lists, which is the
> "perfect" trade-off between the ability to *protect* frequently used
> pages and the ability to *detect* frequently used pages.  This means
> that working set changes bigger than half of cache memory go
> undetected and thrash indefinitely, whereas working sets bigger than
> half of cache memory are unprotected against used-once streams that
> don't even need caching.
> 

Good job! This patch looks good to me, with nice descriptions.
But it seems that this patch only fixes the issue "working set changes
bigger than half of cache memory go undetected and thrash indefinitely".
My concern is whether it could easily be extended to address all the
other issues based on this patch set.

The other possible way is something like what Peter has implemented in
CART and Clock-Pro, which I think may be better because it uses advanced
algorithms and considers the problem as a whole from the beginning.  (Sorry,
I haven't had enough time to read the source code, so I'm not 100% sure.)
http://linux-mm.org/PeterZClockPro2

> Historically, every reclaim scan of the inactive list also took a
> smaller number of pages from the tail of the active list and moved
> them to the head of the inactive list.  This model gave established
> working sets more gracetime in the face of temporary use-once streams,
> but ultimately was not significantly better than a FIFO policy and
> still thrashed cache based on eviction speed, rather than actual
> demand for cache.
> 
> This patch solves one half of the problem by decoupling the ability to
> detect working set changes from the inactive list size.  By
> maintaining a history of recently evicted file pages it can detect
> frequently used pages with an arbitrarily small inactive list size,
> and subsequently apply pressure on the active list based on actual
> demand for cache, not just overall eviction speed.
> 
> Every zone maintains a counter that tracks inactive list aging speed.
> When a page is evicted, a snapshot of this counter is stored in the
> now-empty page cache radix tree slot.  On refault, the minimum access
> distance of the page can be assessed, to evaluate whether the page
> should be part of the active list or not.
> 
> This fixes the VM's blindness towards working set changes in excess of
> the inactive list.  And it's the foundation to further improve the
> protection ability and reduce the minimum inactive list size of 50%.
> 
> Signed-off-by: Johannes Weiner 
> ---
>  include/linux/mmzone.h |   5 +
>  include/linux/swap.h   |   5 +
>  mm/Makefile|   2 +-
>  mm/filemap.c   |  61 
>  mm/swap.c  |   2 +
>  mm/vmscan.c|  24 -
>  mm/vmstat.c|   2 +
>  mm/workingset.c| 253 
> +
>  8 files changed, 331 insertions(+), 23 deletions(-)
>  create mode 100644 mm/workingset.c
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index bd791e452ad7..118ba9f51e86 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -142,6 +142,8 @@ enum zone_stat_item {
>   NUMA_LOCAL, /* allocation from local node */
>   NUMA_OTHER, /* allocation from other node */
>  #endif
> + WORKINGSET_REFAULT,
> + WORKINGSET_ACTIVATE,
>   NR_ANON_TRANSPARENT_HUGEPAGES,
>   NR_FREE_CMA_PAGES,
>   NR_VM_ZONE_STAT_ITEMS };
> @@ -392,6 +394,9 @@ struct zone {
>   spinlock_t  lru_lock;
>   struct lruvec   lruvec;
>  
> + /* Evictions & activations on the inactive file list */
> + atomic_long_t   inactive_age;
> +
>   unsigned long   pages_scanned; /* since last reclaim */
>   unsigned long   flags; /* zone flags, see below */
>  
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 46ba0c6c219f..b83cf61403ed 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -260,6 +260,11 @@ struct swap_list_t {
>   int next;   /* swapfile to be used next */
>  };
>  
> +/* linux/mm/workingset.c */
> +void *workingset_eviction(struct address_space *mapping, struct page *page);
> +bool workingset_refault(void *shadow);
> +void workingset_activation(struct page *page);
> +
>  /* linux/mm/page_alloc.c */
>  extern unsigned long totalram_pages;
>  extern unsigned long totalreserve_pages;
> diff --git a/mm/Makefile b/mm/Makefile
> index 305d10acd081..b30aeb86abd6 100644
> 


Re: [patch 7/9] mm: thrash detection-based file cache sizing

2014-01-12 Thread Minchan Kim
On Fri, Jan 10, 2014 at 01:10:41PM -0500, Johannes Weiner wrote:
> The VM maintains cached filesystem pages on two types of lists.  One
> list holds the pages recently faulted into the cache, the other list
> holds pages that have been referenced repeatedly on that first list.
> The idea is to prefer reclaiming young pages over those that have
> shown to benefit from caching in the past.  We call the recently used
> list "inactive list" and the frequently used list "active list".
> 
> Currently, the VM aims for a 1:1 ratio between the lists, which is the
> "perfect" trade-off between the ability to *protect* frequently used
> pages and the ability to *detect* frequently used pages.  This means
> that working set changes bigger than half of cache memory go
> undetected and thrash indefinitely, whereas working sets bigger than
> half of cache memory are unprotected against used-once streams that
> don't even need caching.
> 
> Historically, every reclaim scan of the inactive list also took a
> smaller number of pages from the tail of the active list and moved
> them to the head of the inactive list.  This model gave established
> working sets more gracetime in the face of temporary use-once streams,
> but ultimately was not significantly better than a FIFO policy and
> still thrashed cache based on eviction speed, rather than actual
> demand for cache.
> 
> This patch solves one half of the problem by decoupling the ability to
> detect working set changes from the inactive list size.  By
> maintaining a history of recently evicted file pages it can detect
> frequently used pages with an arbitrarily small inactive list size,
> and subsequently apply pressure on the active list based on actual
> demand for cache, not just overall eviction speed.
> 
> Every zone maintains a counter that tracks inactive list aging speed.
> When a page is evicted, a snapshot of this counter is stored in the
> now-empty page cache radix tree slot.  On refault, the minimum access
> distance of the page can be assessed, to evaluate whether the page
> should be part of the active list or not.
> 
> This fixes the VM's blindness towards working set changes in excess of
> the inactive list.  And it's the foundation to further improve the
> protection ability and reduce the minimum inactive list size of 50%.
> 
> Signed-off-by: Johannes Weiner 
Reviewed-by: Minchan Kim 

Really nice description, and the code is easy to understand.
I believe we should really merge/maintain such an advanced algorithm,
which will make it easy to add more advanced concepts later.

Johannes, thanks for your effort!

> ---
>  include/linux/mmzone.h |   5 +
>  include/linux/swap.h   |   5 +
>  mm/Makefile|   2 +-
>  mm/filemap.c   |  61 
>  mm/swap.c  |   2 +
>  mm/vmscan.c|  24 -
>  mm/vmstat.c|   2 +
>  mm/workingset.c| 253 
> +
>  8 files changed, 331 insertions(+), 23 deletions(-)
>  create mode 100644 mm/workingset.c
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index bd791e452ad7..118ba9f51e86 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -142,6 +142,8 @@ enum zone_stat_item {
>   NUMA_LOCAL, /* allocation from local node */
>   NUMA_OTHER, /* allocation from other node */
>  #endif
> + WORKINGSET_REFAULT,
> + WORKINGSET_ACTIVATE,
>   NR_ANON_TRANSPARENT_HUGEPAGES,
>   NR_FREE_CMA_PAGES,
>   NR_VM_ZONE_STAT_ITEMS };
> @@ -392,6 +394,9 @@ struct zone {
>   spinlock_t  lru_lock;
>   struct lruvec   lruvec;
>  
> + /* Evictions & activations on the inactive file list */
> + atomic_long_t   inactive_age;
> +
>   unsigned long   pages_scanned; /* since last reclaim */
>   unsigned long   flags; /* zone flags, see below */
>  
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 46ba0c6c219f..b83cf61403ed 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -260,6 +260,11 @@ struct swap_list_t {
>   int next;   /* swapfile to be used next */
>  };
>  
> +/* linux/mm/workingset.c */
> +void *workingset_eviction(struct address_space *mapping, struct page *page);
> +bool workingset_refault(void *shadow);
> +void workingset_activation(struct page *page);
> +
>  /* linux/mm/page_alloc.c */
>  extern unsigned long totalram_pages;
>  extern unsigned long totalreserve_pages;
> diff --git a/mm/Makefile b/mm/Makefile
> index 305d10acd081..b30aeb86abd6 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -17,7 +17,7 @@ obj-y   := filemap.o mempool.o 
> oom_kill.o fadvise.o \
>  util.o mmzone.o vmstat.o backing-dev.o \
>  mm_init.o mmu_context.o percpu.o slab_common.o \
>  compaction.o balloon_compaction.o \
> -interval_tree.o 


Re: [patch 7/9] mm: thrash detection-based file cache sizing

2014-01-10 Thread Rik van Riel
On 01/10/2014 01:10 PM, Johannes Weiner wrote:

> This patch solves one half of the problem by decoupling the ability to
> detect working set changes from the inactive list size.  By
> maintaining a history of recently evicted file pages it can detect
> frequently used pages with an arbitrarily small inactive list size,
> and subsequently apply pressure on the active list based on actual
> demand for cache, not just overall eviction speed.
> 
> Every zone maintains a counter that tracks inactive list aging speed.
> When a page is evicted, a snapshot of this counter is stored in the
> now-empty page cache radix tree slot.  On refault, the minimum access
> distance of the page can be assessed, to evaluate whether the page
> should be part of the active list or not.
> 
> This fixes the VM's blindness towards working set changes in excess of
> the inactive list.  And it's the foundation to further improve the
> protection ability and reduce the minimum inactive list size of 50%.
> 
> Signed-off-by: Johannes Weiner 

Reviewed-by: Rik van Riel 

-- 
All rights reversed




Re: [patch 7/9] mm: thrash detection-based file cache sizing

2013-11-26 Thread Johannes Weiner
On Tue, Nov 26, 2013 at 12:56:23PM +1100, Ryan Mallon wrote:
> > + *   fault ------------------------+
> > + *                                 |
> > + *              +--------------+   |            +-------------+
> > + *   reclaim <- |   inactive   | <-+-- demotion |    active   | <--+
> > + *              +--------------+                +-------------+    |
> > + *                     |                                           |
> > + *                     +-------------- promotion ------------------+
> > + *
> > + *
> > + * Access frequency and refault distance
> > + *
> > + * A workload is trashing when its pages are frequently used but they
> 
> "thrashing".

Thanks ;)




Re: [patch 7/9] mm: thrash detection-based file cache sizing

2013-11-25 Thread Johannes Weiner
On Mon, Nov 25, 2013 at 03:50:11PM -0800, Andrew Morton wrote:
> On Sun, 24 Nov 2013 18:38:26 -0500 Johannes Weiner  wrote:
> 
> > ...
> >
> > + * Access frequency and refault distance
> > + *
> > + * A workload is trashing when its pages are frequently used but they
> > + * are evicted from the inactive list every time before another access
> > + * would have promoted them to the active list.
> > + *
> > + * In cases where the average access distance between thrashing pages
> > + * is bigger than the size of memory there is nothing that can be
> > + * done - the thrashing set could never fit into memory under any
> > + * circumstance.
> > + *
> > + * However, the average access distance could be bigger than the
> > + * inactive list, yet smaller than the size of memory.  In this case,
> > + * the set could fit into memory if it weren't for the currently
> > + * active pages - which may be used more, hopefully less frequently:
> > + *
> > + *      +-memory available to cache-+
> > + *      |                           |
> > + *      +-inactive------+-active----+
> > + *  a b | c d e f g h i | J K L M N |
> > + *      +---------------+-----------+
> 
> So making the inactive list smaller will worsen this problem?

Only if the inactive list size is a factor in detecting repeatedly
used pages.  This patch series is all about removing that dependency
and using non-residency information to cover that deficit a small
inactive list would otherwise create.

> If so, don't we have a conflict with this objective:
> 
> > Right now we have a fixed ratio (50:50) between inactive and active
> > list but we already have complaints about working sets exceeding half
> > of memory being pushed out of the cache by simple streaming in the
> > background.  Ultimately, we want to adjust this ratio and allow for a
> > much smaller inactive list.

No, this IS the objective.  The patches get us there by being able to
detect repeated references with an arbitrary inactive list size.

> > + * It is prohibitively expensive to accurately track access frequency
> > + * of pages.  But a reasonable approximation can be made to measure
> > + * thrashing on the inactive list, after which refaulting pages can be
> > + * activated optimistically to compete with the existing active pages.
> > + *
> > + * Approximating inactive page access frequency - Observations:
> > + *
> > + * 1. When a page is accesed for the first time, it is added to the
> 
> "accessed"

Whoopsa :-)  Will fix that up.


Re: [patch 7/9] mm: thrash detection-based file cache sizing

2013-11-25 Thread Ryan Mallon
On 25/11/13 10:38, Johannes Weiner wrote:
> The VM maintains cached filesystem pages on two types of lists.  One
> list holds the pages recently faulted into the cache, the other list
> holds pages that have been referenced repeatedly on that first list.
> The idea is to prefer reclaiming young pages over those that have
> shown to benefit from caching in the past.  We call the recently used
> list "inactive list" and the frequently used list "active list".
> 
> Currently, the VM aims for a 1:1 ratio between the lists, which is the
> "perfect" trade-off between the ability to *protect* frequently used
> pages and the ability to *detect* frequently used pages.  This means
> that working set changes bigger than half of cache memory go
> undetected and thrash indefinitely, whereas working sets bigger than
> half of cache memory are unprotected against used-once streams that
> don't even need caching.
> 
> Historically, every reclaim scan of the inactive list also took a
> smaller number of pages from the tail of the active list and moved
> them to the head of the inactive list.  This model gave established
> working sets more gracetime in the face of temporary use-once streams,
> but ultimately was not significantly better than a FIFO policy and
> still thrashed cache based on eviction speed, rather than actual
> demand for cache.
> 
> This patch solves one half of the problem by decoupling the ability to
> detect working set changes from the inactive list size.  By
> maintaining a history of recently evicted file pages it can detect
> frequently used pages with an arbitrarily small inactive list size,
> and subsequently apply pressure on the active list based on actual
> demand for cache, not just overall eviction speed.
> 
> Every zone maintains a counter that tracks inactive list aging speed.
> When a page is evicted, a snapshot of this counter is stored in the
> now-empty page cache radix tree slot.  On refault, the minimum access
> distance of the page can be assessed, to evaluate whether the page
> should be part of the active list or not.
> 
> This fixes the VM's blindness towards working set changes in excess of
> the inactive list.  And it's the foundation to further improve the
> protection ability and reduce the minimum inactive list size of 50%.
> 
> Signed-off-by: Johannes Weiner 
> ---



> + *   fault ------------------------+
> + *                                 |
> + *              +--------------+   |            +-------------+
> + *   reclaim <- |   inactive   | <-+-- demotion |    active   | <--+
> + *              +--------------+                +-------------+    |
> + *                     |                                           |
> + *                     +-------------- promotion ------------------+
> + *
> + *
> + *   Access frequency and refault distance
> + *
> + * A workload is trashing when its pages are frequently used but they

"thrashing".

~Ryan



Re: [patch 7/9] mm: thrash detection-based file cache sizing

2013-11-25 Thread Andrew Morton
On Sun, 24 Nov 2013 18:38:26 -0500 Johannes Weiner  wrote:

> ...
>
> + *   Access frequency and refault distance
> + *
> + * A workload is trashing when its pages are frequently used but they
> + * are evicted from the inactive list every time before another access
> + * would have promoted them to the active list.
> + *
> + * In cases where the average access distance between thrashing pages
> + * is bigger than the size of memory there is nothing that can be
> + * done - the thrashing set could never fit into memory under any
> + * circumstance.
> + *
> + * However, the average access distance could be bigger than the
> + * inactive list, yet smaller than the size of memory.  In this case,
> + * the set could fit into memory if it weren't for the currently
> + * active pages - which may be used more, hopefully less frequently:
> + *
> + *      +-memory available to cache-+
> + *      |                           |
> + *      +-inactive------+-active----+
> + *  a b | c d e f g h i | J K L M N |
> + *      +---------------+-----------+

So making the inactive list smaller will worsen this problem?

If so, don't we have a conflict with this objective:

> Right now we have a fixed ratio (50:50) between inactive and active
> list but we already have complaints about working sets exceeding half
> of memory being pushed out of the cache by simple streaming in the
> background.  Ultimately, we want to adjust this ratio and allow for a
> much smaller inactive list.

?

> + * It is prohibitively expensive to accurately track access frequency
> + * of pages.  But a reasonable approximation can be made to measure
> + * thrashing on the inactive list, after which refaulting pages can be
> + * activated optimistically to compete with the existing active pages.
> + *
> + * Approximating inactive page access frequency - Observations:
> + *
> + * 1. When a page is accesed for the first time, it is added to the

"accessed"

> + *head of the inactive list, slides every existing inactive page
> + *towards the tail by one slot, and pushes the current tail page
> + *out of memory.
> + *
>
> ...
>

