Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
Hello Andrey,

On Fri, Mar 01, 2019 at 01:38:26PM +0300, Andrey Ryabinin wrote:
> On 2/26/19 3:50 PM, Andrey Ryabinin wrote:
> > On 2/22/19 10:15 PM, Johannes Weiner wrote:
> >> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
> >>> In a presence of more than 1 memory cgroup in the system our reclaim
> >>> logic is just suck. When we hit memory limit (global or a limit on
> >>> cgroup with subgroups) we reclaim some memory from all cgroups.
> >>> This is sucks because, the cgroup that allocates more often always wins.
> >>> E.g. job that allocates a lot of clean rarely used page cache will push
> >>> out of memory other jobs with active relatively small all in memory
> >>> working set.
> >>>
> >>> To prevent such situations we have memcg controls like low/max, etc which
> >>> are supposed to protect jobs or limit them so they to not hurt others.
> >>> But memory cgroups are very hard to configure right because it requires
> >>> precise knowledge of the workload which may vary during the execution.
> >>> E.g. setting memory limit means that job won't be able to use all memory
> >>> in the system for page cache even if the rest the system is idle.
> >>> Basically our current scheme requires to configure every single cgroup
> >>> in the system.
> >>>
> >>> I think we can do better. The idea proposed by this patch is to reclaim
> >>> only inactive pages and only from cgroups that have big
> >>> (!inactive_is_low()) inactive list. And go back to shrinking active lists
> >>> only if all inactive lists are low.
> >>
> >> Yes, you are absolutely right.
> >>
> >> We shouldn't go after active pages as long as there are plenty of
> >> inactive pages around. That's the global reclaim policy, and we
> >> currently fail to translate that well to cgrouped systems.
> >>
> >> Setting group protections or limits would work around this problem,
> >> but they're kind of a red herring. We shouldn't ever allow use-once
> >> streams to push out hot workingsets, that's a bug.
> >>
> >>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> >>>
> >>>  		scan >>= sc->priority;
> >>>
> >>> +		if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
> >>> +					file, memcg, sc, false))
> >>> +			scan = 0;
> >>> +
> >>>  		/*
> >>>  		 * If the cgroup's already been deleted, make sure to
> >>>  		 * scrape out the remaining cache.
> >>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >>>  	struct reclaim_state *reclaim_state = current->reclaim_state;
> >>>  	unsigned long nr_reclaimed, nr_scanned;
> >>>  	bool reclaimable = false;
> >>> +	bool retry;
> >>>
> >>>  	do {
> >>>  		struct mem_cgroup *root = sc->target_mem_cgroup;
> >>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >>>  		};
> >>>  		struct mem_cgroup *memcg;
> >>>
> >>> +		retry = false;
> >>> +
> >>>  		memset(&sc->nr, 0, sizeof(sc->nr));
> >>>
> >>>  		nr_reclaimed = sc->nr_reclaimed;
> >>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >>>  			}
> >>>  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
> >>>
> >>> +		if ((sc->nr_scanned - nr_scanned) == 0 &&
> >>> +		     !sc->may_shrink_active) {
> >>> +			sc->may_shrink_active = 1;
> >>> +			retry = true;
> >>> +			continue;
> >>> +		}
> >>
> >> Using !scanned as the gate could be a problem. There might be a cgroup
> >> that has inactive pages on the local level, but when viewed from the
> >> system level the total inactive pages in the system might still be low
> >> compared to active ones. In that case we should go after active pages.
> >>
> >> Basically, during global reclaim, the answer for whether active pages
> >> should be scanned or not should be the same regardless of whether the
> >> memory is all global or whether it's spread out between cgroups.
> >>
> >> The reason this isn't the case is because we're checking the ratio at
> >> the lruvec level - which is the highest level (and identical to the
> >> node counters) when memory is global, but it's at the lowest level
> >> when memory is cgrouped.
> >>
> >> So IMO what we should do is:
> >>
> >> - At the beginning of global reclaim, use node_page_state() to compare
> >>   the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim
> >>   can go after active pages or not. Regardless of what the ratio is in
> >>   individual lruvecs.
> >>
> >> - And likewise at the beginning of cgroup limit reclaim, walk the
> >>   subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE
> >>   and ACTIVE_FILE counters, and make inactive_is_low() decision on
> >>   those sums.
> >>
> >
> > Sounds reasonable.
> >
>
Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
On 2/26/19 3:50 PM, Andrey Ryabinin wrote:
>
>
> On 2/22/19 10:15 PM, Johannes Weiner wrote:
>> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
>>> In a presence of more than 1 memory cgroup in the system our reclaim
>>> logic is just suck. When we hit memory limit (global or a limit on
>>> cgroup with subgroups) we reclaim some memory from all cgroups.
>>> This is sucks because, the cgroup that allocates more often always wins.
>>> E.g. job that allocates a lot of clean rarely used page cache will push
>>> out of memory other jobs with active relatively small all in memory
>>> working set.
>>>
>>> To prevent such situations we have memcg controls like low/max, etc which
>>> are supposed to protect jobs or limit them so they to not hurt others.
>>> But memory cgroups are very hard to configure right because it requires
>>> precise knowledge of the workload which may vary during the execution.
>>> E.g. setting memory limit means that job won't be able to use all memory
>>> in the system for page cache even if the rest the system is idle.
>>> Basically our current scheme requires to configure every single cgroup
>>> in the system.
>>>
>>> I think we can do better. The idea proposed by this patch is to reclaim
>>> only inactive pages and only from cgroups that have big
>>> (!inactive_is_low()) inactive list. And go back to shrinking active lists
>>> only if all inactive lists are low.
>>
>> Yes, you are absolutely right.
>>
>> We shouldn't go after active pages as long as there are plenty of
>> inactive pages around. That's the global reclaim policy, and we
>> currently fail to translate that well to cgrouped systems.
>>
>> Setting group protections or limits would work around this problem,
>> but they're kind of a red herring. We shouldn't ever allow use-once
>> streams to push out hot workingsets, that's a bug.
>>
>>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>>>
>>>  		scan >>= sc->priority;
>>>
>>> +		if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
>>> +					file, memcg, sc, false))
>>> +			scan = 0;
>>> +
>>>  		/*
>>>  		 * If the cgroup's already been deleted, make sure to
>>>  		 * scrape out the remaining cache.
>>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>>>  	unsigned long nr_reclaimed, nr_scanned;
>>>  	bool reclaimable = false;
>>> +	bool retry;
>>>
>>>  	do {
>>>  		struct mem_cgroup *root = sc->target_mem_cgroup;
>>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>>  		};
>>>  		struct mem_cgroup *memcg;
>>>
>>> +		retry = false;
>>> +
>>>  		memset(&sc->nr, 0, sizeof(sc->nr));
>>>
>>>  		nr_reclaimed = sc->nr_reclaimed;
>>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>>  			}
>>>  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>>>
>>> +		if ((sc->nr_scanned - nr_scanned) == 0 &&
>>> +		     !sc->may_shrink_active) {
>>> +			sc->may_shrink_active = 1;
>>> +			retry = true;
>>> +			continue;
>>> +		}
>>
>> Using !scanned as the gate could be a problem. There might be a cgroup
>> that has inactive pages on the local level, but when viewed from the
>> system level the total inactive pages in the system might still be low
>> compared to active ones. In that case we should go after active pages.
>>
>> Basically, during global reclaim, the answer for whether active pages
>> should be scanned or not should be the same regardless of whether the
>> memory is all global or whether it's spread out between cgroups.
>>
>> The reason this isn't the case is because we're checking the ratio at
>> the lruvec level - which is the highest level (and identical to the
>> node counters) when memory is global, but it's at the lowest level
>> when memory is cgrouped.
>>
>> So IMO what we should do is:
>>
>> - At the beginning of global reclaim, use node_page_state() to compare
>>   the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim
>>   can go after active pages or not. Regardless of what the ratio is in
>>   individual lruvecs.
>>
>> - And likewise at the beginning of cgroup limit reclaim, walk the
>>   subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE
>>   and ACTIVE_FILE counters, and make inactive_is_low() decision on
>>   those sums.
>>
>
> Sounds reasonable.
>

On second thought, it seems better to keep the decision at the lru level.
There are a couple of reasons for this:

1) Using bare node_page_state() (or sc->target_mem_cgroup's
   total_[in]active counters) would be wrong. Because some
Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
On Tue, Feb 26, 2019 at 06:36:38PM +0300, Andrey Ryabinin wrote:
>
>
> On 2/25/19 7:03 AM, Roman Gushchin wrote:
> > On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
> >> In a presence of more than 1 memory cgroup in the system our reclaim
> >> logic is just suck. When we hit memory limit (global or a limit on
> >> cgroup with subgroups) we reclaim some memory from all cgroups.
> >> This is sucks because, the cgroup that allocates more often always wins.
> >> E.g. job that allocates a lot of clean rarely used page cache will push
> >> out of memory other jobs with active relatively small all in memory
> >> working set.
> >>
> >> To prevent such situations we have memcg controls like low/max, etc which
> >> are supposed to protect jobs or limit them so they to not hurt others.
> >> But memory cgroups are very hard to configure right because it requires
> >> precise knowledge of the workload which may vary during the execution.
> >> E.g. setting memory limit means that job won't be able to use all memory
> >> in the system for page cache even if the rest the system is idle.
> >> Basically our current scheme requires to configure every single cgroup
> >> in the system.
> >>
> >> I think we can do better. The idea proposed by this patch is to reclaim
> >> only inactive pages and only from cgroups that have big
> >> (!inactive_is_low()) inactive list. And go back to shrinking active lists
> >> only if all inactive lists are low.
> >
> > Hi Andrey!
> >
> > It's definitely an interesting idea! However, let me bring some concerns:
> > 1) What's considered active and inactive depends on memory pressure inside
> > a cgroup.
>
> There is no such dependency. High memory pressure may be generated both
> by active and inactive pages. We also can have a cgroup creating no pressure
> with almost only active (or only inactive) pages.
>
> > Actually active pages in one cgroup (e.g. just deleted) can be colder
> > than inactive pages in an other (e.g. a memory-hungry cgroup with a tight
> > memory.max).
> >
>
> Well, yes, this is a drawback of having per-memcg lrus.
>
> > Also a workload inside a cgroup can to some extend control what's going
> > to the active LRU. So it opens a way to get more memory unfairly by
> > artificially promoting more pages to the active LRU. So a cgroup
> > can get an unfair advantage over other cgroups.
> >
>
> Unfair is usually a negative term, but in this case it's very much depends on
> definition of what is "fair".
>
> If fair means to put equal reclaim pressure on all cgroups, than yes, the patch
> increases such unfairness, but such unfairness is a good thing.
> Obviously it's more valuable to keep in memory actively used page than the
> page that not used.

I think that fairness is good here.

>
> > Generally speaking, now we have a way to measure the memory pressure
> > inside a cgroup. So, in theory, it should be possible to balance
> > scanning effort based on memory pressure.
> >
>
> Simply by design, the inactive pages are the first candidates to reclaim.
> Any decision that doesn't take into account inactive pages probably would be
> wrong.
>
> E.g. cgroup A with active job loading a big and active working set which
> creates high memory pressure
> and cgroup B - idle (no memory pressure) with a huge not used cache.
> It's definitely preferable to reclaim from B rather than from A.
>

For sure, if we're reclaiming hot pages instead of cold, it's bad for the
overall performance. But active and inactive LRUs are just an approximation
of what is hot and cold. E.g. if I run "cat some_large_file" twice in a
cgroup, the whole file will reside in the active LRU and be considered hot,
even if nobody ever uses it again.

So it means that, depending on the memory usage pattern, some workloads will
benefit from your change, and some will suffer.

Btw, what will happen with protected cgroups (with memory.low set)?
Will they still affect global scanning decisions (active/inactive ratio),
but be exempted from scanning?

Thanks!
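The "cat some_large_file" concern above can be illustrated with a small userspace model. This is only a sketch, not kernel code: it assumes a simplified rule where the first access caches a page on the inactive list and the second access promotes it to the active list. The names page_state, touch_page(), read_file() and count_active() are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-page state for this model; not a kernel type. */
enum page_state { NOT_CACHED, ON_INACTIVE, ON_ACTIVE };

/*
 * One access to a page. Assumed simplification: the first touch caches
 * the page on the inactive list, the second touch promotes it.
 */
static enum page_state touch_page(enum page_state s)
{
	switch (s) {
	case NOT_CACHED:
		return ON_INACTIVE;
	case ON_INACTIVE:
	case ON_ACTIVE:
		return ON_ACTIVE;
	}
	return s;
}

/* Sequentially read a whole "file" of n pages, like one run of cat. */
static void read_file(enum page_state *pages, size_t n)
{
	size_t i;

	assert(pages != NULL || n == 0);
	for (i = 0; i < n; i++)
		pages[i] = touch_page(pages[i]);
}

/* Count how many pages of the file sit on the active list. */
static size_t count_active(const enum page_state *pages, size_t n)
{
	size_t i, active = 0;

	for (i = 0; i < n; i++)
		if (pages[i] == ON_ACTIVE)
			active++;
	return active;
}
```

In this model, calling read_file() once on a fresh array leaves every page on the inactive list, and calling it a second time moves every page to the active list, regardless of whether the file is ever read again.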
Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
On 2/25/19 7:03 AM, Roman Gushchin wrote:
> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
>> In a presence of more than 1 memory cgroup in the system our reclaim
>> logic is just suck. When we hit memory limit (global or a limit on
>> cgroup with subgroups) we reclaim some memory from all cgroups.
>> This is sucks because, the cgroup that allocates more often always wins.
>> E.g. job that allocates a lot of clean rarely used page cache will push
>> out of memory other jobs with active relatively small all in memory
>> working set.
>>
>> To prevent such situations we have memcg controls like low/max, etc which
>> are supposed to protect jobs or limit them so they to not hurt others.
>> But memory cgroups are very hard to configure right because it requires
>> precise knowledge of the workload which may vary during the execution.
>> E.g. setting memory limit means that job won't be able to use all memory
>> in the system for page cache even if the rest the system is idle.
>> Basically our current scheme requires to configure every single cgroup
>> in the system.
>>
>> I think we can do better. The idea proposed by this patch is to reclaim
>> only inactive pages and only from cgroups that have big
>> (!inactive_is_low()) inactive list. And go back to shrinking active lists
>> only if all inactive lists are low.
>
> Hi Andrey!
>
> It's definitely an interesting idea! However, let me bring some concerns:
> 1) What's considered active and inactive depends on memory pressure inside
> a cgroup.

There is no such dependency. High memory pressure may be generated by both
active and inactive pages. We can also have a cgroup creating no pressure
with almost only active (or only inactive) pages.

> Actually active pages in one cgroup (e.g. just deleted) can be colder
> than inactive pages in an other (e.g. a memory-hungry cgroup with a tight
> memory.max).
>

Well, yes, this is a drawback of having per-memcg lrus.

> Also a workload inside a cgroup can to some extend control what's going
> to the active LRU. So it opens a way to get more memory unfairly by
> artificially promoting more pages to the active LRU. So a cgroup
> can get an unfair advantage over other cgroups.
>

Unfair is usually a negative term, but in this case it very much depends on
the definition of what is "fair".

If fair means putting equal reclaim pressure on all cgroups, then yes, the
patch increases such unfairness, but such unfairness is a good thing.
Obviously it's more valuable to keep in memory an actively used page than a
page that is not used.

> Generally speaking, now we have a way to measure the memory pressure
> inside a cgroup. So, in theory, it should be possible to balance
> scanning effort based on memory pressure.
>

Simply by design, the inactive pages are the first candidates for reclaim.
Any decision that doesn't take inactive pages into account would probably be
wrong.

E.g. cgroup A has an active job loading a big, active working set, which
creates high memory pressure, and cgroup B is idle (no memory pressure) with
a huge unused cache. It's definitely preferable to reclaim from B rather
than from A.
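The cgroup A versus cgroup B case can be sketched as a userspace model of the policy the patch describes: skip groups whose inactive list is low, and only allow active lists to be shrunk if a whole pass scanned nothing. This is a rough illustration under stated assumptions, not the kernel implementation. The struct group_lru type, the fixed 1:1 inactive/active ratio, and the functions scan_target() and reclaim_round() are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one cgroup's file LRU sizes, in pages. */
struct group_lru {
	unsigned long inactive;
	unsigned long active;
};

/* Simplified check with a fixed 1:1 ratio (an assumption of this model). */
static bool inactive_is_low(const struct group_lru *g)
{
	return g->inactive < g->active;
}

/*
 * Scan target for one group. While active shrinking is not allowed, a
 * group whose inactive list is low contributes nothing.
 */
static unsigned long scan_target(const struct group_lru *g,
				 bool may_shrink_active, int priority)
{
	if (inactive_is_low(g)) {
		if (!may_shrink_active)
			return 0;
		return (g->inactive + g->active) >> priority;
	}
	return g->inactive >> priority;
}

/*
 * One reclaim round over all groups. If the first pass scans nothing,
 * allow active lists to be shrunk and retry once.
 */
static unsigned long reclaim_round(const struct group_lru *groups, size_t n,
				   int priority, bool *shrank_active)
{
	bool may_shrink_active = false;
	unsigned long scanned;
	size_t i;

	assert(shrank_active != NULL);
	for (;;) {
		scanned = 0;
		for (i = 0; i < n; i++)
			scanned += scan_target(&groups[i], may_shrink_active,
					       priority);
		if (scanned == 0 && !may_shrink_active) {
			may_shrink_active = true;
			continue;
		}
		break;
	}
	*shrank_active = may_shrink_active;
	return scanned;
}
```

With A = {inactive 100, active 900} and B = {inactive 10000, active 0}, the first pass takes all of its scan target from B and never touches A. Only when every group's inactive list is low does the round fall back to scanning active pages.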
Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
On 2/22/19 10:15 PM, Johannes Weiner wrote:
> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
>> In a presence of more than 1 memory cgroup in the system our reclaim
>> logic is just suck. When we hit memory limit (global or a limit on
>> cgroup with subgroups) we reclaim some memory from all cgroups.
>> This is sucks because, the cgroup that allocates more often always wins.
>> E.g. job that allocates a lot of clean rarely used page cache will push
>> out of memory other jobs with active relatively small all in memory
>> working set.
>>
>> To prevent such situations we have memcg controls like low/max, etc which
>> are supposed to protect jobs or limit them so they to not hurt others.
>> But memory cgroups are very hard to configure right because it requires
>> precise knowledge of the workload which may vary during the execution.
>> E.g. setting memory limit means that job won't be able to use all memory
>> in the system for page cache even if the rest the system is idle.
>> Basically our current scheme requires to configure every single cgroup
>> in the system.
>>
>> I think we can do better. The idea proposed by this patch is to reclaim
>> only inactive pages and only from cgroups that have big
>> (!inactive_is_low()) inactive list. And go back to shrinking active lists
>> only if all inactive lists are low.
>
> Yes, you are absolutely right.
>
> We shouldn't go after active pages as long as there are plenty of
> inactive pages around. That's the global reclaim policy, and we
> currently fail to translate that well to cgrouped systems.
>
> Setting group protections or limits would work around this problem,
> but they're kind of a red herring. We shouldn't ever allow use-once
> streams to push out hot workingsets, that's a bug.
>
>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>>
>>  		scan >>= sc->priority;
>>
>> +		if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
>> +					file, memcg, sc, false))
>> +			scan = 0;
>> +
>>  		/*
>>  		 * If the cgroup's already been deleted, make sure to
>>  		 * scrape out the remaining cache.
>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>>  	unsigned long nr_reclaimed, nr_scanned;
>>  	bool reclaimable = false;
>> +	bool retry;
>>
>>  	do {
>>  		struct mem_cgroup *root = sc->target_mem_cgroup;
>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>  		};
>>  		struct mem_cgroup *memcg;
>>
>> +		retry = false;
>> +
>>  		memset(&sc->nr, 0, sizeof(sc->nr));
>>
>>  		nr_reclaimed = sc->nr_reclaimed;
>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>  			}
>>  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>>
>> +		if ((sc->nr_scanned - nr_scanned) == 0 &&
>> +		     !sc->may_shrink_active) {
>> +			sc->may_shrink_active = 1;
>> +			retry = true;
>> +			continue;
>> +		}
>
> Using !scanned as the gate could be a problem. There might be a cgroup
> that has inactive pages on the local level, but when viewed from the
> system level the total inactive pages in the system might still be low
> compared to active ones. In that case we should go after active pages.
>
> Basically, during global reclaim, the answer for whether active pages
> should be scanned or not should be the same regardless of whether the
> memory is all global or whether it's spread out between cgroups.
>
> The reason this isn't the case is because we're checking the ratio at
> the lruvec level - which is the highest level (and identical to the
> node counters) when memory is global, but it's at the lowest level
> when memory is cgrouped.
>
> So IMO what we should do is:
>
> - At the beginning of global reclaim, use node_page_state() to compare
>   the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim
>   can go after active pages or not. Regardless of what the ratio is in
>   individual lruvecs.
>
> - And likewise at the beginning of cgroup limit reclaim, walk the
>   subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE
>   and ACTIVE_FILE counters, and make inactive_is_low() decision on
>   those sums.
>

Sounds reasonable.
Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
On 2/22/19 6:58 PM, Andrey Ryabinin wrote:
> In a presence of more than 1 memory cgroup in the system our reclaim
> logic is just suck. When we hit memory limit (global or a limit on
> cgroup with subgroups) we reclaim some memory from all cgroups.
> This is sucks because, the cgroup that allocates more often always wins.
> E.g. job that allocates a lot of clean rarely used page cache will push
> out of memory other jobs with active relatively small all in memory
> working set.
>
> To prevent such situations we have memcg controls like low/max, etc which
> are supposed to protect jobs or limit them so they to not hurt others.
> But memory cgroups are very hard to configure right because it requires
> precise knowledge of the workload which may vary during the execution.
> E.g. setting memory limit means that job won't be able to use all memory
> in the system for page cache even if the rest the system is idle.
> Basically our current scheme requires to configure every single cgroup
> in the system.
>
> I think we can do better. The idea proposed by this patch is to reclaim
> only inactive pages and only from cgroups that have big
> (!inactive_is_low()) inactive list. And go back to shrinking active lists
> only if all inactive lists are low.

Perhaps going this direction could also make page cache side-channel
attacks harder? Quoting [1]:

"On Linux, we are only able to evict pages efficiently because we can
trick the page replacement algorithm into believing our target page
would be the best choice for eviction. The reason for this lies in the
fact that Linux uses a global page replacement algorithm, i.e., an
algorithm which does not distinguish between different processes.
Global page replacement algorithms have been known for decades to allow
one process to perform a denial-of-service on other processes"

[1] https://arxiv.org/abs/1901.01161
Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
> In a presence of more than 1 memory cgroup in the system our reclaim
> logic is just suck. When we hit memory limit (global or a limit on
> cgroup with subgroups) we reclaim some memory from all cgroups.
> This is sucks because, the cgroup that allocates more often always wins.
> E.g. job that allocates a lot of clean rarely used page cache will push
> out of memory other jobs with active relatively small all in memory
> working set.
>
> To prevent such situations we have memcg controls like low/max, etc which
> are supposed to protect jobs or limit them so they to not hurt others.
> But memory cgroups are very hard to configure right because it requires
> precise knowledge of the workload which may vary during the execution.
> E.g. setting memory limit means that job won't be able to use all memory
> in the system for page cache even if the rest the system is idle.
> Basically our current scheme requires to configure every single cgroup
> in the system.
>
> I think we can do better. The idea proposed by this patch is to reclaim
> only inactive pages and only from cgroups that have big
> (!inactive_is_low()) inactive list. And go back to shrinking active lists
> only if all inactive lists are low.

Hi Andrey!

It's definitely an interesting idea! However, let me raise some concerns:

1) What's considered active and inactive depends on memory pressure
inside a cgroup. Pages that are nominally active in one cgroup (e.g. one
that was just deleted) can be colder than inactive pages in another
(e.g. a memory-hungry cgroup with a tight memory.max).

Also, a workload inside a cgroup can to some extent control what goes
to the active LRU. That opens a way to get more memory unfairly, by
artificially promoting more pages to the active LRU, so a cgroup can
gain an unfair advantage over other cgroups.

Generally speaking, we now have a way to measure the memory pressure
inside a cgroup. So, in theory, it should be possible to balance
scanning effort based on memory pressure.

Thanks!
Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
> In a presence of more than 1 memory cgroup in the system our reclaim
> logic is just suck. When we hit memory limit (global or a limit on
> cgroup with subgroups) we reclaim some memory from all cgroups.
> This is sucks because, the cgroup that allocates more often always wins.
> E.g. job that allocates a lot of clean rarely used page cache will push
> out of memory other jobs with active relatively small all in memory
> working set.
>
> To prevent such situations we have memcg controls like low/max, etc which
> are supposed to protect jobs or limit them so they to not hurt others.
> But memory cgroups are very hard to configure right because it requires
> precise knowledge of the workload which may vary during the execution.
> E.g. setting memory limit means that job won't be able to use all memory
> in the system for page cache even if the rest the system is idle.
> Basically our current scheme requires to configure every single cgroup
> in the system.
>
> I think we can do better. The idea proposed by this patch is to reclaim
> only inactive pages and only from cgroups that have big
> (!inactive_is_low()) inactive list. And go back to shrinking active lists
> only if all inactive lists are low.

Yes, you are absolutely right.

We shouldn't go after active pages as long as there are plenty of
inactive pages around. That's the global reclaim policy, and we
currently fail to translate that well to cgrouped systems.

Setting group protections or limits would work around this problem,
but they're kind of a red herring. We shouldn't ever allow use-once
streams to push out hot workingsets, that's a bug.

> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec,
> struct mem_cgroup *memcg,
>
>  			scan >>= sc->priority;
>
> +			if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
> +						file, memcg, sc, false))
> +				scan = 0;
> +
>  			/*
>  			 * If the cgroup's already been deleted, make sure to
>  			 * scrape out the remaining cache.
> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct
> scan_control *sc)
>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>  	unsigned long nr_reclaimed, nr_scanned;
>  	bool reclaimable = false;
> +	bool retry;
>
>  	do {
>  		struct mem_cgroup *root = sc->target_mem_cgroup;
> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct
> scan_control *sc)
>  		};
>  		struct mem_cgroup *memcg;
>
> +		retry = false;
> +
>  		memset(&sc->nr, 0, sizeof(sc->nr));
>
>  		nr_reclaimed = sc->nr_reclaimed;
> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct
> scan_control *sc)
>  			}
>  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>
> +		if ((sc->nr_scanned - nr_scanned) == 0 &&
> +		    !sc->may_shrink_active) {
> +			sc->may_shrink_active = 1;
> +			retry = true;
> +			continue;
> +		}

Using !scanned as the gate could be a problem. There might be a cgroup
that has inactive pages on the local level, but when viewed from the
system level the total inactive pages in the system might still be low
compared to active ones. In that case we should go after active pages.

Basically, during global reclaim, the answer for whether active pages
should be scanned or not should be the same regardless of whether the
memory is all global or whether it's spread out between cgroups.

The reason this isn't the case is because we're checking the ratio at
the lruvec level - which is the highest level (and identical to the
node counters) when memory is global, but it's at the lowest level
when memory is cgrouped.

So IMO what we should do is:

- At the beginning of global reclaim, use node_page_state() to compare
  the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim
  can go after active pages or not. Regardless of what the ratio is in
  individual lruvecs.

- And likewise at the beginning of cgroup limit reclaim, walk the
  subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE
  and ACTIVE_FILE counters, and make inactive_is_low() decision on
  those sums.
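
To make the two bullet points above concrete, here is a minimal userspace
sketch of "decide once on the summed counters". It is not kernel code:
struct lru_counts, isqrt() and may_scan_active() are hypothetical names
invented for illustration, the size-scaled ratio is only modelled on the
inactive_list_is_low() heuristic, and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical per-cgroup file LRU sizes, in pages. */
struct lru_counts {
	unsigned long inactive_file;
	unsigned long active_file;
};

/* Integer square root by linear search; fine for the small inputs here. */
static unsigned long isqrt(unsigned long x)
{
	unsigned long r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

/*
 * Size-scaled ratio test, modelled on the kernel heuristic (assumption):
 * the inactive list counts as "low" when inactive * ratio < active.
 */
static bool inactive_is_low(unsigned long inactive, unsigned long active)
{
	unsigned long gb = (inactive + active) >> (30 - PAGE_SHIFT);
	unsigned long ratio = gb ? isqrt(10 * gb) : 1;

	return inactive * ratio < active;
}

/*
 * Decide once, up front, whether this reclaim run may scan active pages,
 * using the sums over every group under the reclaim target rather than
 * the ratio inside any single group.
 */
static bool may_scan_active(const struct lru_counts *groups, size_t n)
{
	unsigned long inactive = 0, active = 0;
	size_t i;

	assert(groups != NULL || n == 0);

	for (i = 0; i < n; i++) {
		inactive += groups[i].inactive_file;
		active += groups[i].active_file;
	}
	return inactive_is_low(inactive, active);
}
```

With two groups holding { 1000 inactive, 100 active } and { 10 inactive,
2000 active } pages, the first group's own inactive list is not low, yet
the summed counters are (1010 < 2100), so may_scan_active() answers true.
That is the case where gating on "nothing was scanned" would keep active
pages off limits.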
Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
On Fri, 2019-02-22 at 20:58 +0300, Andrey Ryabinin wrote:
> In a presence of more than 1 memory cgroup in the system our reclaim
> logic is just suck. When we hit memory limit (global or a limit on
> cgroup with subgroups) we reclaim some memory from all cgroups.
> This is sucks because, the cgroup that allocates more often always
> wins.
> E.g. job that allocates a lot of clean rarely used page cache will
> push
> out of memory other jobs with active relatively small all in memory
> working set.
>
> To prevent such situations we have memcg controls like low/max, etc
> which
> are supposed to protect jobs or limit them so they to not hurt
> others.
> But memory cgroups are very hard to configure right because it
> requires
> precise knowledge of the workload which may vary during the
> execution.
> E.g. setting memory limit means that job won't be able to use all
> memory
> in the system for page cache even if the rest the system is idle.
> Basically our current scheme requires to configure every single
> cgroup
> in the system.
>
> I think we can do better. The idea proposed by this patch is to
> reclaim
> only inactive pages and only from cgroups that have big
> (!inactive_is_low()) inactive list. And go back to shrinking active
> lists
> only if all inactive lists are low.

Your general idea seems like a good one, but the logic
in the code seems a little convoluted to me.

I wonder if we can simplify things a little, by checking
(when we enter page reclaim) whether the pgdat has enough
inactive pages based on the node_page_state statistics,
and basing our decision whether or not to scan the active
lists off that.

As it stands, your patch seems like the kind of code that
makes perfect sense today, but which will confuse people
who look at the code two years from now.

If the code could be made a little more explicit, great.
If there are good reasons to do things in the fallback way
your current patch does it, the code could use some good
comments explaining why :)

-- 
All Rights Reversed.
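
For readers trying to follow the "fallback way" mentioned above, the
control flow of the RFC can be modelled in userspace roughly as follows.
This is a simplified sketch under stated assumptions, not the patch
itself: struct lruvec_model, scan_pass() and reclaim_two_pass() are
hypothetical names, only file pages are modelled, the scan target is a
plain size >> priority, and the ratio test is only modelled on
inactive_list_is_low() with 4 KiB pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical model of one cgroup's file LRU lists, in pages. */
struct lruvec_model {
	unsigned long inactive_file;
	unsigned long active_file;
};

/* Integer square root by linear search; fine for the small inputs here. */
static unsigned long isqrt(unsigned long x)
{
	unsigned long r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

/* Ratio test modelled on the kernel heuristic (assumption). */
static bool inactive_is_low(unsigned long inactive, unsigned long active)
{
	unsigned long gb = (inactive + active) >> (30 - PAGE_SHIFT);
	unsigned long ratio = gb ? isqrt(10 * gb) : 1;

	return inactive * ratio < active;
}

/*
 * One pass over all groups. A group whose inactive list is low is skipped
 * entirely unless may_shrink_active is set, mirroring "scan = 0".
 */
static unsigned long scan_pass(const struct lruvec_model *v, size_t n,
			       bool may_shrink_active, int priority)
{
	unsigned long scanned = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		bool low = inactive_is_low(v[i].inactive_file, v[i].active_file);

		if (low && !may_shrink_active)
			continue;	/* protect this group's working set */
		if (low)		/* fallback: both lists are fair game */
			scanned += (v[i].inactive_file + v[i].active_file) >> priority;
		else			/* plenty of inactive pages: scan only those */
			scanned += v[i].inactive_file >> priority;
	}
	return scanned;
}

/*
 * First pass with active lists protected; if nothing at all was scanned,
 * retry once with may_shrink_active set. *fell_back reports the retry.
 */
static unsigned long reclaim_two_pass(const struct lruvec_model *v, size_t n,
				      int priority, bool *fell_back)
{
	unsigned long scanned;

	assert(fell_back != NULL);

	*fell_back = false;
	scanned = scan_pass(v, n, false, priority);
	if (scanned == 0) {
		*fell_back = true;
		scanned = scan_pass(v, n, true, priority);
	}
	return scanned;
}
```

The retry only triggers when the first pass scans nothing at all, so the
decision depends on per-group ratios rather than on a single node-wide
check made when reclaim is entered.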
[PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.
In a presence of more than 1 memory cgroup in the system our reclaim
logic is just suck. When we hit memory limit (global or a limit on
cgroup with subgroups) we reclaim some memory from all cgroups.
This is sucks because, the cgroup that allocates more often always wins.
E.g. job that allocates a lot of clean rarely used page cache will push
out of memory other jobs with active relatively small all in memory
working set.

To prevent such situations we have memcg controls like low/max, etc which
are supposed to protect jobs or limit them so they to not hurt others.
But memory cgroups are very hard to configure right because it requires
precise knowledge of the workload which may vary during the execution.
E.g. setting memory limit means that job won't be able to use all memory
in the system for page cache even if the rest the system is idle.
Basically our current scheme requires to configure every single cgroup
in the system.

I think we can do better. The idea proposed by this patch is to reclaim
only inactive pages and only from cgroups that have big
(!inactive_is_low()) inactive list. And go back to shrinking active lists
only if all inactive lists are low.

Now, the simple test case to demonstrate the effect of the patch.
The job in one memcg repeatedly compresses one file:

 perf stat -n --repeat 20 gzip -ck sample > /dev/null

and just 'dd' running in parallel reading the disk in another cgroup.

Before:
 Performance counter stats for 'gzip -ck sample' (20 runs):
      17.673572290 seconds time elapsed  ( +- 5.60% )
After:
 Performance counter stats for 'gzip -ck sample' (20 runs):
      11.426193980 seconds time elapsed  ( +- 0.20% )

The more often dd cgroup allocates memory, the more gzip suffer.
With 4 parallel dd instead of one:

Before:
 Performance counter stats for 'gzip -ck sample' (20 runs):
     499.976782013 seconds time elapsed  ( +- 23.13% )
After:
 Performance counter stats for 'gzip -ck sample' (20 runs):
      11.307450516 seconds time elapsed  ( +- 0.27% )

It would be possible to achieve the similar effect by
setting the memory.low on gzip cgroup, but the best value for memory.low
depends on the size of the 'sample' file. It also possible
to limit the 'dd' job, but just imagine something more sophisticated
than just 'dd', the job that would benefit from occupying all available
memory. The best limit for such job would be something like
'total_memory' - 'sample size' which is again unknown.

Signed-off-by: Andrey Ryabinin
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Vlastimil Babka
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Roman Gushchin
Cc: Shakeel Butt
---
 mm/vmscan.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index efd10d6b9510..2f562c3358ab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -104,6 +104,8 @@ struct scan_control {
 	/* One of the zones is ready for compaction */
 	unsigned int compaction_ready:1;
 
+	unsigned int may_shrink_active:1;
+
 	/* Allocation order */
 	s8 order;
 
@@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 
 			scan >>= sc->priority;
 
+			if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
+						file, memcg, sc, false))
+				scan = 0;
+
 			/*
 			 * If the cgroup's already been deleted, make sure to
 			 * scrape out the remaining cache.
@@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_reclaimed, nr_scanned;
 	bool reclaimable = false;
+	bool retry;
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		};
 		struct mem_cgroup *memcg;
 
+		retry = false;
+
 		memset(&sc->nr, 0, sizeof(sc->nr));
 
 		nr_reclaimed = sc->nr_reclaimed;
@@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			}
 		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
 
+		if ((sc->nr_scanned - nr_scanned) == 0 &&
+		    !sc->may_shrink_active) {
+			sc->may_shrink_active = 1;
+			retry = true;
+			continue;
+		}
+
 		if (reclaim_state) {
 			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
 			reclaim_state->reclaimed_slab = 0;
@@ -2887,7 +2903,7 @@ static bool shrink_node(pg_data_t *pgdat, struct