Re: [patch 7/9] mm: thrash detection-based file cache sizing
On Wed, Jan 15, 2014 at 10:57:21AM +0800, Bob Liu wrote:
> On 01/15/2014 03:16 AM, Johannes Weiner wrote:
> > On Tue, Jan 14, 2014 at 09:01:09AM +0800, Bob Liu wrote:
> >> Good job! This patch looks good to me and comes with nice descriptions.
> >> But it seems that this patch only fixes the issue "working set changes
> >> bigger than half of cache memory go undetected and thrash indefinitely".
> >> My concern is: could it be extended easily to address all the other
> >> issues based on this patch set?
> >>
> >> The other possible way is something like what Peter has implemented in
> >> CART and Clock-Pro, which I think may be better because it uses advanced
> >> algorithms and considers the problem as a whole from the beginning.
> >> (Sorry, I haven't had enough time to read the source code, so I'm not
> >> 100% sure.)
> >> http://linux-mm.org/PeterZClockPro2
> >
> > My patches are moving the VM towards something that is comparable to
> > how Peter implemented Clock-Pro. However, the current VM has evolved
> > over time in small increments based on real-life performance
> > observations. Rewriting everything in one go would be incredibly
> > disruptive, and I doubt very much we would merge any such proposal in
> > the first place. So it's not like I don't see the big picture, it's
> > just divide and conquer:
> >
> > Peter's Clock-Pro implementation was basically a double clock with an
> > intricate system to classify hotness, augmented by eviction
> > information to work with reuse distances independent of memory size.
> >
> > What we have right now is a double clock with a very rudimentary
> > system to classify whether a page is hot: it has been accessed twice
> > while on the inactive clock. My patches now add eviction information
> > to this, and improve the classification so that it can work with
> > reuse distances up to memory size and is no longer dependent on the
> > inactive clock size.
> >
> > This is the smallest imaginable step that is still useful, and even
> > then we had a lot of discussions about scalability of the data
> > structures and confusion about how the new data point should be
> > interpreted. It also took a long time until somebody read the series
> > and went, "Ok, this actually makes sense to me." Now, maybe I suck at
> > documenting, but maybe this is just complicated stuff. Either way, we
> > have to get there collectively, so that the code is maintainable in
> > the long term.
> >
> > Once we have these new concepts established, we can further improve
> > the hotness detector so that it can classify and order pages with
> > reuse distances beyond memory size. But this will come with its own
> > set of problems. For example, some time ago we stopped regularly
> > scanning and rotating active pages because of scalability issues, but
> > we'll most likely need an up-to-date estimate of the reuse distances
> > on the active list in order to classify refaults properly.
>
> Thank you for your kind explanation. It makes sense to me; please feel
> free to add my review.

Thank you!

> >>> + * Approximating inactive page access frequency - Observations:
> >>> + *
> >>> + * 1. When a page is accessed for the first time, it is added to the
> >>> + *    head of the inactive list, slides every existing inactive page
> >>> + *    towards the tail by one slot, and pushes the current tail page
> >>> + *    out of memory.
> >>> + *
> >>> + * 2. When a page is accessed for the second time, it is promoted to
> >>> + *    the active list, shrinking the inactive list by one slot. This
> >>> + *    also slides all inactive pages that were faulted into the cache
> >>> + *    more recently than the activated page towards the tail of the
> >>> + *    inactive list.
> >>> + *
> >>
> >> Nitpick, how about the reference bit?
> >
> > What do you mean?
>
> Sorry, I mean the PG_referenced flag. I thought that when a page is
> accessed for the second time, only the PG_referenced flag would be set,
> instead of the page being promoted to the active list.

It's cleared during rotation, or not set on pages that came in through
readahead, but the first access sets the bit and the second access
activates it.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 7/9] mm: thrash detection-based file cache sizing
Hello,

On 01/15/2014 10:57 AM, Bob Liu wrote:
> On 01/15/2014 03:16 AM, Johannes Weiner wrote:
>> On Tue, Jan 14, 2014 at 09:01:09AM +0800, Bob Liu wrote:
>>> Hi Johannes,
>>>
>>> On 01/11/2014 02:10 AM, Johannes Weiner wrote:
>>>> The VM maintains cached filesystem pages on two types of lists. One
>>>> list holds the pages recently faulted into the cache, the other list
>>>> holds pages that have been referenced repeatedly on that first list.
>>>> The idea is to prefer reclaiming young pages over those that have
>>>> shown to benefit from caching in the past. We call the recently used
>>>> list "inactive list" and the frequently used list "active list".
>>>>
>>>> Currently, the VM aims for a 1:1 ratio between the lists, which is
>>>> the "perfect" trade-off between the ability to *protect* frequently
>>>> used pages and the ability to *detect* frequently used pages. This
>>>> means that working set changes bigger than half of cache memory go
>>>> undetected and thrash indefinitely, whereas working sets bigger than
>>>> half of cache memory are unprotected against used-once streams that
>>>> don't even need caching.
>>>
>>> Good job! This patch looks good to me and comes with nice descriptions.
>>> But it seems that this patch only fixes the issue "working set changes
>>> bigger than half of cache memory go undetected and thrash indefinitely".
>>> My concern is: could it be extended easily to address all the other
>>> issues based on this patch set?
>>>
>>> The other possible way is something like what Peter has implemented in
>>> CART and Clock-Pro, which I think may be better because it uses advanced
>>> algorithms and considers the problem as a whole from the beginning.
>>> (Sorry, I haven't had enough time to read the source code, so I'm not
>>> 100% sure.)
>>> http://linux-mm.org/PeterZClockPro2
>>
>> My patches are moving the VM towards something that is comparable to
>> how Peter implemented Clock-Pro. However, the current VM has evolved
>> over time in small increments based on real-life performance
>> observations. Rewriting everything in one go would be incredibly
>> disruptive, and I doubt very much we would merge any such proposal in
>> the first place. So it's not like I don't see the big picture, it's
>> just divide and conquer:
>>
>> Peter's Clock-Pro implementation was basically a double clock with an
>> intricate system to classify hotness, augmented by eviction
>> information to work with reuse distances independent of memory size.
>>
>> What we have right now is a double clock with a very rudimentary
>> system to classify whether a page is hot: it has been accessed twice
>> while on the inactive clock. My patches now add eviction information
>> to this, and improve the classification so that it can work with
>> reuse distances up to memory size and is no longer dependent on the
>> inactive clock size.
>>
>> This is the smallest imaginable step that is still useful, and even
>> then we had a lot of discussions about scalability of the data
>> structures and confusion about how the new data point should be
>> interpreted. It also took a long time until somebody read the series
>> and went, "Ok, this actually makes sense to me." Now, maybe I suck at
>> documenting, but maybe this is just complicated stuff. Either way, we
>> have to get there collectively, so that the code is maintainable in
>> the long term.
>>
>> Once we have these new concepts established, we can further improve
>> the hotness detector so that it can classify and order pages with
>> reuse distances beyond memory size. But this will come with its own
>> set of problems. For example, some time ago we stopped regularly
>> scanning and rotating active pages because of scalability issues, but
>> we'll most likely need an up-to-date estimate of the reuse distances
>> on the active list in order to classify refaults properly.
>
> Thank you for your kind explanation. It makes sense to me; please feel
> free to add my review.
>
>>>> + * Approximating inactive page access frequency - Observations:
>>>> + *
>>>> + * 1. When a page is accessed for the first time, it is added to the
>>>> + *    head of the inactive list, slides every existing inactive page
>>>> + *    towards the tail by one slot, and pushes the current tail page
>>>> + *    out of memory.
>>>> + *
>>>> + * 2. When a page is accessed for the second time, it is promoted to
>>>> + *    the active list, shrinking the inactive list by one slot. This
>>>> + *    also slides all inactive pages that were faulted into the cache
>>>> + *    more recently than the activated page towards the tail of the
>>>> + *    inactive list.
>>>> + *
>>>
>>> Nitpick, how about the reference bit?
>>
>> What do you mean?
>
> Sorry, I mean the PG_referenced flag. I thought that when a page is
> accessed for the second time, only the PG_referenced flag would be set,
> instead of the page being promoted to the active list.

No. Let me try to explain a bit.

For mapped file pages, if the second access occurs on a
Re: [patch 7/9] mm: thrash detection-based file cache sizing
On 01/15/2014 03:16 AM, Johannes Weiner wrote:
> On Tue, Jan 14, 2014 at 09:01:09AM +0800, Bob Liu wrote:
>> Hi Johannes,
>>
>> On 01/11/2014 02:10 AM, Johannes Weiner wrote:
>>> The VM maintains cached filesystem pages on two types of lists. One
>>> list holds the pages recently faulted into the cache, the other list
>>> holds pages that have been referenced repeatedly on that first list.
>>> The idea is to prefer reclaiming young pages over those that have
>>> shown to benefit from caching in the past. We call the recently used
>>> list "inactive list" and the frequently used list "active list".
>>>
>>> Currently, the VM aims for a 1:1 ratio between the lists, which is
>>> the "perfect" trade-off between the ability to *protect* frequently
>>> used pages and the ability to *detect* frequently used pages. This
>>> means that working set changes bigger than half of cache memory go
>>> undetected and thrash indefinitely, whereas working sets bigger than
>>> half of cache memory are unprotected against used-once streams that
>>> don't even need caching.
>>
>> Good job! This patch looks good to me and comes with nice descriptions.
>> But it seems that this patch only fixes the issue "working set changes
>> bigger than half of cache memory go undetected and thrash indefinitely".
>> My concern is: could it be extended easily to address all the other
>> issues based on this patch set?
>>
>> The other possible way is something like what Peter has implemented in
>> CART and Clock-Pro, which I think may be better because it uses advanced
>> algorithms and considers the problem as a whole from the beginning.
>> (Sorry, I haven't had enough time to read the source code, so I'm not
>> 100% sure.)
>> http://linux-mm.org/PeterZClockPro2
>
> My patches are moving the VM towards something that is comparable to
> how Peter implemented Clock-Pro. However, the current VM has evolved
> over time in small increments based on real-life performance
> observations. Rewriting everything in one go would be incredibly
> disruptive, and I doubt very much we would merge any such proposal in
> the first place. So it's not like I don't see the big picture, it's
> just divide and conquer:
>
> Peter's Clock-Pro implementation was basically a double clock with an
> intricate system to classify hotness, augmented by eviction
> information to work with reuse distances independent of memory size.
>
> What we have right now is a double clock with a very rudimentary
> system to classify whether a page is hot: it has been accessed twice
> while on the inactive clock. My patches now add eviction information
> to this, and improve the classification so that it can work with
> reuse distances up to memory size and is no longer dependent on the
> inactive clock size.
>
> This is the smallest imaginable step that is still useful, and even
> then we had a lot of discussions about scalability of the data
> structures and confusion about how the new data point should be
> interpreted. It also took a long time until somebody read the series
> and went, "Ok, this actually makes sense to me." Now, maybe I suck at
> documenting, but maybe this is just complicated stuff. Either way, we
> have to get there collectively, so that the code is maintainable in
> the long term.
>
> Once we have these new concepts established, we can further improve
> the hotness detector so that it can classify and order pages with
> reuse distances beyond memory size. But this will come with its own
> set of problems. For example, some time ago we stopped regularly
> scanning and rotating active pages because of scalability issues, but
> we'll most likely need an up-to-date estimate of the reuse distances
> on the active list in order to classify refaults properly.

Thank you for your kind explanation. It makes sense to me; please feel
free to add my review.

>>> + * Approximating inactive page access frequency - Observations:
>>> + *
>>> + * 1. When a page is accessed for the first time, it is added to the
>>> + *    head of the inactive list, slides every existing inactive page
>>> + *    towards the tail by one slot, and pushes the current tail page
>>> + *    out of memory.
>>> + *
>>> + * 2. When a page is accessed for the second time, it is promoted to
>>> + *    the active list, shrinking the inactive list by one slot. This
>>> + *    also slides all inactive pages that were faulted into the cache
>>> + *    more recently than the activated page towards the tail of the
>>> + *    inactive list.
>>> + *
>>
>> Nitpick, how about the reference bit?
>
> What do you mean?

Sorry, I mean the PG_referenced flag. I thought that when a page is
accessed for the second time, only the PG_referenced flag would be set,
instead of the page being promoted to the active list.

--
Regards,
-Bob
Re: [patch 7/9] mm: thrash detection-based file cache sizing
On Tue, Jan 14, 2014 at 09:01:09AM +0800, Bob Liu wrote:
> Hi Johannes,
>
> On 01/11/2014 02:10 AM, Johannes Weiner wrote:
> > The VM maintains cached filesystem pages on two types of lists. One
> > list holds the pages recently faulted into the cache, the other list
> > holds pages that have been referenced repeatedly on that first list.
> > The idea is to prefer reclaiming young pages over those that have
> > shown to benefit from caching in the past. We call the recently used
> > list "inactive list" and the frequently used list "active list".
> >
> > Currently, the VM aims for a 1:1 ratio between the lists, which is
> > the "perfect" trade-off between the ability to *protect* frequently
> > used pages and the ability to *detect* frequently used pages. This
> > means that working set changes bigger than half of cache memory go
> > undetected and thrash indefinitely, whereas working sets bigger than
> > half of cache memory are unprotected against used-once streams that
> > don't even need caching.
>
> Good job! This patch looks good to me and comes with nice descriptions.
> But it seems that this patch only fixes the issue "working set changes
> bigger than half of cache memory go undetected and thrash indefinitely".
> My concern is: could it be extended easily to address all the other
> issues based on this patch set?
>
> The other possible way is something like what Peter has implemented in
> CART and Clock-Pro, which I think may be better because it uses advanced
> algorithms and considers the problem as a whole from the beginning.
> (Sorry, I haven't had enough time to read the source code, so I'm not
> 100% sure.)
> http://linux-mm.org/PeterZClockPro2

My patches are moving the VM towards something that is comparable to
how Peter implemented Clock-Pro. However, the current VM has evolved
over time in small increments based on real-life performance
observations. Rewriting everything in one go would be incredibly
disruptive, and I doubt very much we would merge any such proposal in
the first place. So it's not like I don't see the big picture, it's
just divide and conquer:

Peter's Clock-Pro implementation was basically a double clock with an
intricate system to classify hotness, augmented by eviction
information to work with reuse distances independent of memory size.

What we have right now is a double clock with a very rudimentary
system to classify whether a page is hot: it has been accessed twice
while on the inactive clock. My patches now add eviction information
to this, and improve the classification so that it can work with
reuse distances up to memory size and is no longer dependent on the
inactive clock size.

This is the smallest imaginable step that is still useful, and even
then we had a lot of discussions about scalability of the data
structures and confusion about how the new data point should be
interpreted. It also took a long time until somebody read the series
and went, "Ok, this actually makes sense to me." Now, maybe I suck at
documenting, but maybe this is just complicated stuff. Either way, we
have to get there collectively, so that the code is maintainable in
the long term.

Once we have these new concepts established, we can further improve
the hotness detector so that it can classify and order pages with
reuse distances beyond memory size. But this will come with its own
set of problems. For example, some time ago we stopped regularly
scanning and rotating active pages because of scalability issues, but
we'll most likely need an up-to-date estimate of the reuse distances
on the active list in order to classify refaults properly.

> > + * Approximating inactive page access frequency - Observations:
> > + *
> > + * 1. When a page is accessed for the first time, it is added to the
> > + *    head of the inactive list, slides every existing inactive page
> > + *    towards the tail by one slot, and pushes the current tail page
> > + *    out of memory.
> > + *
> > + * 2. When a page is accessed for the second time, it is promoted to
> > + *    the active list, shrinking the inactive list by one slot. This
> > + *    also slides all inactive pages that were faulted into the cache
> > + *    more recently than the activated page towards the tail of the
> > + *    inactive list.
> > + *
>
> Nitpick, how about the reference bit?

What do you mean?
Re: [patch 7/9] mm: thrash detection-based file cache sizing
Hello,

On 01/15/2014 10:57 AM, Bob Liu wrote:
> Sorry, I mean the PG_referenced flag. I thought when a page is
> accessed for the second time, only the PG_referenced flag would be
> set instead of the page being promoted to the active list.

No. Let me try to explain a bit. For mapped file pages, if the second
access occurs through a different page table entry, the page is indeed
promoted to the active list. But if the page is always accessed through
the same page table entry, it used to be mistakenly evicted. Johannes
already fixed this by reusing the PG_referenced flag; for details,
please refer to commit
Re: [patch 7/9] mm: thrash detection-based file cache sizing
Hi Johannes,

On 01/11/2014 02:10 AM, Johannes Weiner wrote:
> The VM maintains cached filesystem pages on two types of lists. One
> list holds the pages recently faulted into the cache, the other list
> holds pages that have been referenced repeatedly on that first list.
> The idea is to prefer reclaiming young pages over those that have
> shown to benefit from caching in the past. We call the recently used
> list "inactive list" and the frequently used list "active list".
>
> Currently, the VM aims for a 1:1 ratio between the lists, which is the
> "perfect" trade-off between the ability to *protect* frequently used
> pages and the ability to *detect* frequently used pages. This means
> that working set changes bigger than half of cache memory go
> undetected and thrash indefinitely, whereas working sets bigger than
> half of cache memory are unprotected against used-once streams that
> don't even need caching.

Good job! This patch looks good to me, and the descriptions are nice.
But it seems that this patch only fixes the issue "working set changes
bigger than half of cache memory go undetected and thrash indefinitely".
My concern is: can it be extended easily to address all the other issues
on top of this patch set?

The other possible way is something like Peter's implementation of CART
and Clock-Pro, which I think may be better because it uses more advanced
algorithms and considers the problem as a whole from the beginning.
(Sorry, I haven't had enough time to read the source code, so I'm not
100% sure.)
http://linux-mm.org/PeterZClockPro2

> Historically, every reclaim scan of the inactive list also took a
> smaller number of pages from the tail of the active list and moved
> them to the head of the inactive list. This model gave established
> working sets more gracetime in the face of temporary use-once streams,
> but ultimately was not significantly better than a FIFO policy and
> still thrashed cache based on eviction speed, rather than actual
> demand for cache.
> This patch solves one half of the problem by decoupling the ability to
> detect working set changes from the inactive list size. By
> maintaining a history of recently evicted file pages it can detect
> frequently used pages with an arbitrarily small inactive list size,
> and subsequently apply pressure on the active list based on actual
> demand for cache, not just overall eviction speed.
>
> Every zone maintains a counter that tracks inactive list aging speed.
> When a page is evicted, a snapshot of this counter is stored in the
> now-empty page cache radix tree slot. On refault, the minimum access
> distance of the page can be assessed, to evaluate whether the page
> should be part of the active list or not.
>
> This fixes the VM's blindness towards working set changes in excess of
> the inactive list. And it's the foundation to further improve the
> protection ability and reduce the minimum inactive list size of 50%.
>
> Signed-off-by: Johannes Weiner
> ---
>  include/linux/mmzone.h |   5 +
>  include/linux/swap.h   |   5 +
>  mm/Makefile            |   2 +-
>  mm/filemap.c           |  61
>  mm/swap.c              |   2 +
>  mm/vmscan.c            |  24 -
>  mm/vmstat.c            |   2 +
>  mm/workingset.c        | 253 +
>  8 files changed, 331 insertions(+), 23 deletions(-)
>  create mode 100644 mm/workingset.c
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index bd791e452ad7..118ba9f51e86 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -142,6 +142,8 @@ enum zone_stat_item {
>  	NUMA_LOCAL,	/* allocation from local node */
>  	NUMA_OTHER,	/* allocation from other node */
>  #endif
> +	WORKINGSET_REFAULT,
> +	WORKINGSET_ACTIVATE,
>  	NR_ANON_TRANSPARENT_HUGEPAGES,
>  	NR_FREE_CMA_PAGES,
>  	NR_VM_ZONE_STAT_ITEMS };
> @@ -392,6 +394,9 @@ struct zone {
>  	spinlock_t		lru_lock;
>  	struct lruvec		lruvec;
>
> +	/* Evictions & activations on the inactive file list */
> +	atomic_long_t		inactive_age;
> +
>  	unsigned long		pages_scanned;	/* since last reclaim */
>  	unsigned long		flags;		/* zone flags, see below */
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 46ba0c6c219f..b83cf61403ed 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -260,6 +260,11 @@ struct swap_list_t {
>  	int next;	/* swapfile to be used next */
>  };
>
> +/* linux/mm/workingset.c */
> +void *workingset_eviction(struct address_space *mapping, struct page *page);
> +bool workingset_refault(void *shadow);
> +void workingset_activation(struct page *page);
> +
>  /* linux/mm/page_alloc.c */
>  extern unsigned long totalram_pages;
>  extern unsigned long totalreserve_pages;
> diff --git a/mm/Makefile b/mm/Makefile
> index 305d10acd081..b30aeb86abd6 100644
>
Re: [patch 7/9] mm: thrash detection-based file cache sizing
On Fri, Jan 10, 2014 at 01:10:41PM -0500, Johannes Weiner wrote:
> The VM maintains cached filesystem pages on two types of lists. One
> list holds the pages recently faulted into the cache, the other list
> holds pages that have been referenced repeatedly on that first list.
>
> ...
>
> Signed-off-by: Johannes Weiner

Reviewed-by: Minchan Kim

Really nice description, and the code is easy to understand. I believe
we should merge and maintain such an advanced algorithm, which will make
it easy to build more advanced concepts on top of it. Johannes, thanks
for your effort!

> ...
Re: [patch 7/9] mm: thrash detection-based file cache sizing
On 01/10/2014 01:10 PM, Johannes Weiner wrote:
> This patch solves one half of the problem by decoupling the ability to
> detect working set changes from the inactive list size. By
> maintaining a history of recently evicted file pages it can detect
> frequently used pages with an arbitrarily small inactive list size,
> and subsequently apply pressure on the active list based on actual
> demand for cache, not just overall eviction speed.
>
> Every zone maintains a counter that tracks inactive list aging speed.
> When a page is evicted, a snapshot of this counter is stored in the
> now-empty page cache radix tree slot. On refault, the minimum access
> distance of the page can be assessed, to evaluate whether the page
> should be part of the active list or not.
>
> This fixes the VM's blindness towards working set changes in excess of
> the inactive list. And it's the foundation to further improve the
> protection ability and reduce the minimum inactive list size of 50%.
>
> Signed-off-by: Johannes Weiner

Reviewed-by: Rik van Riel

--
All rights reversed
Re: [patch 7/9] mm: thrash detection-based file cache sizing
On Tue, Nov 26, 2013 at 12:56:23PM +1100, Ryan Mallon wrote:
> > + *   fault ------------------------+
> > + *                                 |
> > + *              +--------------+   |            +-------------+
> > + *   reclaim <- |   inactive   | <-+-- demotion |    active   | <--+
> > + *              +--------------+                +-------------+    |
> > + *                     |                                           |
> > + *                     +-------------- promotion ------------------+
> > + *
> > + *
> > + * Access frequency and refault distance
> > + *
> > + * A workload is trashing when its pages are frequently used but they
>
> "thrashing".

Thanks ;)
Re: [patch 7/9] mm: thrash detection-based file cache sizing
On Mon, Nov 25, 2013 at 03:50:11PM -0800, Andrew Morton wrote:
> On Sun, 24 Nov 2013 18:38:26 -0500 Johannes Weiner wrote:
>
> > ...
> >
> > + * Access frequency and refault distance
> > + *
> > + * A workload is trashing when its pages are frequently used but they
> > + * are evicted from the inactive list every time before another access
> > + * would have promoted them to the active list.
> > + *
> > + * In cases where the average access distance between thrashing pages
> > + * is bigger than the size of memory there is nothing that can be
> > + * done - the thrashing set could never fit into memory under any
> > + * circumstance.
> > + *
> > + * However, the average access distance could be bigger than the
> > + * inactive list, yet smaller than the size of memory. In this case,
> > + * the set could fit into memory if it weren't for the currently
> > + * active pages - which may be used more, hopefully less frequently:
> > + *
> > + *      +-memory available to cache-+
> > + *      |                           |
> > + *      +-inactive------+-active----+
> > + *  a b | c d e f g h i | J K L M N |
> > + *      +---------------+-----------+
>
> So making the inactive list smaller will worsen this problem?

Only if the inactive list size is a factor in detecting repeatedly used
pages. This patch series is all about removing that dependency and
using non-residency information to cover the deficit a small inactive
list would otherwise create.

> If so, don't we have a conflict with this objective:
>
> > Right now we have a fixed ratio (50:50) between inactive and active
> > list but we already have complaints about working sets exceeding half
> > of memory being pushed out of the cache by simple streaming in the
> > background. Ultimately, we want to adjust this ratio and allow for a
> > much smaller inactive list.

No, this IS the objective. The patches get us there by being able to
detect repeated references with an arbitrary inactive list size.

> > + * It is prohibitively expensive to accurately track access frequency
> > + * of pages.
> > + * But a reasonable approximation can be made to measure
> > + * thrashing on the inactive list, after which refaulting pages can be
> > + * activated optimistically to compete with the existing active pages.
> > + *
> > + * Approximating inactive page access frequency - Observations:
> > + *
> > + * 1. When a page is accesed for the first time, it is added to the
>
> "accessed"

Whoopsa :-) Will fix that up.
Re: [patch 7/9] mm: thrash detection-based file cache sizing
On 25/11/13 10:38, Johannes Weiner wrote:
> The VM maintains cached filesystem pages on two types of lists. One
> list holds the pages recently faulted into the cache, the other list
> holds pages that have been referenced repeatedly on that first list.
>
> ...
>
> Signed-off-by: Johannes Weiner
> ---
> + *   fault ------------------------+
> + *                                 |
> + *              +--------------+   |            +-------------+
> + *   reclaim <- |   inactive   | <-+-- demotion |    active   | <--+
> + *              +--------------+                +-------------+    |
> + *                     |                                           |
> + *                     +-------------- promotion ------------------+
> + *
> + *
> + * Access frequency and refault distance
> + *
> + * A workload is trashing when its pages are frequently used but they

"thrashing".

~Ryan
Re: [patch 7/9] mm: thrash detection-based file cache sizing
On Sun, 24 Nov 2013 18:38:26 -0500 Johannes Weiner wrote:

> ...
>
> + * Access frequency and refault distance
> + *
> + * A workload is trashing when its pages are frequently used but they
> + * are evicted from the inactive list every time before another access
> + * would have promoted them to the active list.
> + *
> + * In cases where the average access distance between thrashing pages
> + * is bigger than the size of memory there is nothing that can be
> + * done - the thrashing set could never fit into memory under any
> + * circumstance.
> + *
> + * However, the average access distance could be bigger than the
> + * inactive list, yet smaller than the size of memory. In this case,
> + * the set could fit into memory if it weren't for the currently
> + * active pages - which may be used more, hopefully less frequently:
> + *
> + *       +-memory available to cache-+
> + *       |                           |
> + *       +-inactive------+-active----+
> + *   a b | c d e f g h i | J K L M N |
> + *       +---------------+-----------+

So making the inactive list smaller will worsen this problem?  If so, don't we have a conflict with this objective:

> Right now we have a fixed ratio (50:50) between inactive and active
> list but we already have complaints about working sets exceeding half
> of memory being pushed out of the cache by simple streaming in the
> background. Ultimately, we want to adjust this ratio and allow for a
> much smaller inactive list.

?

> + * It is prohibitively expensive to accurately track access frequency
> + * of pages. But a reasonable approximation can be made to measure
> + * thrashing on the inactive list, after which refaulting pages can be
> + * activated optimistically to compete with the existing active pages.
> + *
> + * Approximating inactive page access frequency - Observations:
> + *
> + * 1. When a page is accesed for the first time, it is added to the

"accessed"

> + *    head of the inactive list, slides every existing inactive page
> + *    towards the tail by one slot, and pushes the current tail page
> + *    out of memory.
> + *
>
> ...
Re: [patch 7/9] mm: thrash detection-based file cache sizing
On Mon, Nov 25, 2013 at 03:50:11PM -0800, Andrew Morton wrote:
> On Sun, 24 Nov 2013 18:38:26 -0500 Johannes Weiner han...@cmpxchg.org wrote:
>
> ...
>
> + * Access frequency and refault distance
> + *
> + * A workload is trashing when its pages are frequently used but they
> + * are evicted from the inactive list every time before another access
> + * would have promoted them to the active list.
> + *
> + * In cases where the average access distance between thrashing pages
> + * is bigger than the size of memory there is nothing that can be
> + * done - the thrashing set could never fit into memory under any
> + * circumstance.
> + *
> + * However, the average access distance could be bigger than the
> + * inactive list, yet smaller than the size of memory. In this case,
> + * the set could fit into memory if it weren't for the currently
> + * active pages - which may be used more, hopefully less frequently:
> + *
> + *       +-memory available to cache-+
> + *       |                           |
> + *       +-inactive------+-active----+
> + *   a b | c d e f g h i | J K L M N |
> + *       +---------------+-----------+
>
> So making the inactive list smaller will worsen this problem?

Only if the inactive list size is a factor in detecting repeatedly used pages.  This patch series is all about removing that dependency and using non-residency information to cover that deficit a small inactive list would otherwise create.

> If so, don't we have a conflict with this objective:
>
> > Right now we have a fixed ratio (50:50) between inactive and active
> > list but we already have complaints about working sets exceeding half
> > of memory being pushed out of the cache by simple streaming in the
> > background. Ultimately, we want to adjust this ratio and allow for a
> > much smaller inactive list.
>
> ?

No, this IS the objective.  The patches get us there by being able to detect repeated references with an arbitrary inactive list size.

> + * It is prohibitively expensive to accurately track access frequency
> + * of pages. But a reasonable approximation can be made to measure
> + * thrashing on the inactive list, after which refaulting pages can be
> + * activated optimistically to compete with the existing active pages.
> + *
> + * Approximating inactive page access frequency - Observations:
> + *
> + * 1. When a page is accesed for the first time, it is added to the
>
> "accessed"

Whoopsa :-)  Will fix that up.