Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.

2019-03-04 Thread Andrey Ryabinin



On 3/2/19 1:20 AM, Johannes Weiner wrote:
> On Fri, Mar 01, 2019 at 10:46:34PM +0300, Andrey Ryabinin wrote:
>> On 3/1/19 8:49 PM, Johannes Weiner wrote:
>>> On Fri, Mar 01, 2019 at 01:38:26PM +0300, Andrey Ryabinin wrote:
 On 2/26/19 3:50 PM, Andrey Ryabinin wrote:
> On 2/22/19 10:15 PM, Johannes Weiner wrote:
>> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
>>> In the presence of more than one memory cgroup in the system, our
>>> reclaim logic just sucks. When we hit a memory limit (global or a limit
>>> on a cgroup with subgroups) we reclaim some memory from all cgroups.
>>> This sucks because the cgroup that allocates more often always wins.
>>> E.g. a job that allocates a lot of clean, rarely used page cache will
>>> push other jobs, with small but active in-memory working sets, out of
>>> memory.
>>>
>>> To prevent such situations we have memcg controls like low/max, etc.,
>>> which are supposed to protect jobs or limit them so they don't hurt
>>> others. But memory cgroups are very hard to configure right, because
>>> that requires precise knowledge of the workload, which may vary during
>>> execution. E.g. setting a memory limit means the job won't be able to
>>> use all memory in the system for page cache even if the rest of the
>>> system is idle. Basically, our current scheme requires configuring
>>> every single cgroup in the system.
>>>
>>> I think we can do better. The idea proposed by this patch is to reclaim
>>> only inactive pages, and only from cgroups that have a big
>>> (!inactive_is_low()) inactive list, and to go back to shrinking active
>>> lists only if all inactive lists are low.
>>
>> Yes, you are absolutely right.
>>
>> We shouldn't go after active pages as long as there are plenty of
>> inactive pages around. That's the global reclaim policy, and we
>> currently fail to translate that well to cgrouped systems.
>>
>> Setting group protections or limits would work around this problem,
>> but they're kind of a red herring. We shouldn't ever allow use-once
>> streams to push out hot workingsets, that's a bug.
>>
>>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>>>  
>>>  		scan >>= sc->priority;
>>>  
>>> +		if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
>>> +					file, memcg, sc, false))
>>> +			scan = 0;
>>> +
>>>  		/*
>>>  		 * If the cgroup's already been deleted, make sure to
>>>  		 * scrape out the remaining cache.
>>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>>>  	unsigned long nr_reclaimed, nr_scanned;
>>>  	bool reclaimable = false;
>>> +	bool retry;
>>>  
>>>  	do {
>>>  		struct mem_cgroup *root = sc->target_mem_cgroup;
>>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>>  		};
>>>  		struct mem_cgroup *memcg;
>>>  
>>> +		retry = false;
>>> +
>>>  		memset(&sc->nr, 0, sizeof(sc->nr));
>>>  
>>>  		nr_reclaimed = sc->nr_reclaimed;
>>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>>  			}
>>>  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>>>  
>>> +		if ((sc->nr_scanned - nr_scanned) == 0 &&
>>> +		    !sc->may_shrink_active) {
>>> +			sc->may_shrink_active = 1;
>>> +			retry = true;
>>> +			continue;
>>> +		}
>>
>> Using !scanned as the gate could be a problem. There might be a cgroup
>> that has inactive pages on the local level, but when viewed from the
>> system level the total inactive pages in the system might still be low
>> compared to active ones. In that case we should go after active pages.
>>
>> Basically, during global reclaim, the answer for whether active pages
>> should be scanned or not should be the same regardless of whether the
>> memory is all global or whether it's spread out between cgroups.
>>
>> The reason this isn't the case is because we're checking the ratio at
>> the lruvec level - which is the highest level (and identical to the
>> node counters) when memory is global, but it's at the lowest level
>> when memory is cgrouped.
>>
>> So IMO what we should do is:
>>
>> - 

Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.

2019-03-01 Thread Johannes Weiner
On Fri, Mar 01, 2019 at 10:46:34PM +0300, Andrey Ryabinin wrote:
> On 3/1/19 8:49 PM, Johannes Weiner wrote:
> > On Fri, Mar 01, 2019 at 01:38:26PM +0300, Andrey Ryabinin wrote:
> >> On 2/26/19 3:50 PM, Andrey Ryabinin wrote:
> >>> On 2/22/19 10:15 PM, Johannes Weiner wrote:
>  On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
> > In the presence of more than one memory cgroup in the system, our
> > reclaim logic just sucks. When we hit a memory limit (global or a limit
> > on a cgroup with subgroups) we reclaim some memory from all cgroups.
> > This sucks because the cgroup that allocates more often always wins.
> > E.g. a job that allocates a lot of clean, rarely used page cache will
> > push other jobs, with small but active in-memory working sets, out of
> > memory.
> >
> > To prevent such situations we have memcg controls like low/max, etc.,
> > which are supposed to protect jobs or limit them so they don't hurt
> > others. But memory cgroups are very hard to configure right, because
> > that requires precise knowledge of the workload, which may vary during
> > execution. E.g. setting a memory limit means the job won't be able to
> > use all memory in the system for page cache even if the rest of the
> > system is idle. Basically, our current scheme requires configuring
> > every single cgroup in the system.
> >
> > I think we can do better. The idea proposed by this patch is to reclaim
> > only inactive pages, and only from cgroups that have a big
> > (!inactive_is_low()) inactive list, and to go back to shrinking active
> > lists only if all inactive lists are low.
> 
>  Yes, you are absolutely right.
> 
>  We shouldn't go after active pages as long as there are plenty of
>  inactive pages around. That's the global reclaim policy, and we
>  currently fail to translate that well to cgrouped systems.
> 
>  Setting group protections or limits would work around this problem,
>  but they're kind of a red herring. We shouldn't ever allow use-once
>  streams to push out hot workingsets, that's a bug.
> 
> > @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> >  
> >  		scan >>= sc->priority;
> >  
> > +		if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
> > +					file, memcg, sc, false))
> > +			scan = 0;
> > +
> >  		/*
> >  		 * If the cgroup's already been deleted, make sure to
> >  		 * scrape out the remaining cache.
> > @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  	struct reclaim_state *reclaim_state = current->reclaim_state;
> >  	unsigned long nr_reclaimed, nr_scanned;
> >  	bool reclaimable = false;
> > +	bool retry;
> >  
> >  	do {
> >  		struct mem_cgroup *root = sc->target_mem_cgroup;
> > @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  		};
> >  		struct mem_cgroup *memcg;
> >  
> > +		retry = false;
> > +
> >  		memset(&sc->nr, 0, sizeof(sc->nr));
> >  
> >  		nr_reclaimed = sc->nr_reclaimed;
> > @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  			}
> >  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
> >  
> > +		if ((sc->nr_scanned - nr_scanned) == 0 &&
> > +		    !sc->may_shrink_active) {
> > +			sc->may_shrink_active = 1;
> > +			retry = true;
> > +			continue;
> > +		}
> 
>  Using !scanned as the gate could be a problem. There might be a cgroup
>  that has inactive pages on the local level, but when viewed from the
>  system level the total inactive pages in the system might still be low
>  compared to active ones. In that case we should go after active pages.
> 
>  Basically, during global reclaim, the answer for whether active pages
>  should be scanned or not should be the same regardless of whether the
>  memory is all global or whether it's spread out between cgroups.
> 
>  The reason this isn't the case is because we're checking the ratio at
>  the lruvec level - which is the highest level (and identical to the
>  node counters) when memory is global, but it's at the lowest level
>  when memory is cgrouped.
> 
>  So IMO what we should do is:
> 
>  - At the beginning of global reclaim, use 

Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.

2019-03-01 Thread Andrey Ryabinin



On 3/1/19 8:49 PM, Johannes Weiner wrote:
> Hello Andrey,
> 
> On Fri, Mar 01, 2019 at 01:38:26PM +0300, Andrey Ryabinin wrote:
>> On 2/26/19 3:50 PM, Andrey Ryabinin wrote:
>>> On 2/22/19 10:15 PM, Johannes Weiner wrote:
 On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
> In the presence of more than one memory cgroup in the system, our
> reclaim logic just sucks. When we hit a memory limit (global or a limit
> on a cgroup with subgroups) we reclaim some memory from all cgroups.
> This sucks because the cgroup that allocates more often always wins.
> E.g. a job that allocates a lot of clean, rarely used page cache will
> push other jobs, with small but active in-memory working sets, out of
> memory.
>
> To prevent such situations we have memcg controls like low/max, etc.,
> which are supposed to protect jobs or limit them so they don't hurt
> others. But memory cgroups are very hard to configure right, because
> that requires precise knowledge of the workload, which may vary during
> execution. E.g. setting a memory limit means the job won't be able to
> use all memory in the system for page cache even if the rest of the
> system is idle. Basically, our current scheme requires configuring
> every single cgroup in the system.
>
> I think we can do better. The idea proposed by this patch is to reclaim
> only inactive pages, and only from cgroups that have a big
> (!inactive_is_low()) inactive list, and to go back to shrinking active
> lists only if all inactive lists are low.

 Yes, you are absolutely right.

 We shouldn't go after active pages as long as there are plenty of
 inactive pages around. That's the global reclaim policy, and we
 currently fail to translate that well to cgrouped systems.

 Setting group protections or limits would work around this problem,
 but they're kind of a red herring. We shouldn't ever allow use-once
 streams to push out hot workingsets, that's a bug.

> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  
>  		scan >>= sc->priority;
>  
> +		if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
> +					file, memcg, sc, false))
> +			scan = 0;
> +
>  		/*
>  		 * If the cgroup's already been deleted, make sure to
>  		 * scrape out the remaining cache.
> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>  	unsigned long nr_reclaimed, nr_scanned;
>  	bool reclaimable = false;
> +	bool retry;
>  
>  	do {
>  		struct mem_cgroup *root = sc->target_mem_cgroup;
> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  		};
>  		struct mem_cgroup *memcg;
>  
> +		retry = false;
> +
>  		memset(&sc->nr, 0, sizeof(sc->nr));
>  
>  		nr_reclaimed = sc->nr_reclaimed;
> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  			}
>  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>  
> +		if ((sc->nr_scanned - nr_scanned) == 0 &&
> +		    !sc->may_shrink_active) {
> +			sc->may_shrink_active = 1;
> +			retry = true;
> +			continue;
> +		}

 Using !scanned as the gate could be a problem. There might be a cgroup
 that has inactive pages on the local level, but when viewed from the
 system level the total inactive pages in the system might still be low
 compared to active ones. In that case we should go after active pages.

 Basically, during global reclaim, the answer for whether active pages
 should be scanned or not should be the same regardless of whether the
 memory is all global or whether it's spread out between cgroups.

 The reason this isn't the case is because we're checking the ratio at
 the lruvec level - which is the highest level (and identical to the
 node counters) when memory is global, but it's at the lowest level
 when memory is cgrouped.

 So IMO what we should do is:

 - At the beginning of global reclaim, use node_page_state() to compare
   the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim
   can go after active pages or not. Regardless of what the ratio is in
   individual lruvecs.

 - And likewise at the beginning of cgroup limit reclaim, walk the
   subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE
   and ACTIVE_FILE counters, and make inactive_is_low() decision on
   

Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.

2019-03-01 Thread Johannes Weiner
Hello Andrey,

On Fri, Mar 01, 2019 at 01:38:26PM +0300, Andrey Ryabinin wrote:
> On 2/26/19 3:50 PM, Andrey Ryabinin wrote:
> > On 2/22/19 10:15 PM, Johannes Weiner wrote:
> >> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
> >>> In the presence of more than one memory cgroup in the system, our
> >>> reclaim logic just sucks. When we hit a memory limit (global or a limit
> >>> on a cgroup with subgroups) we reclaim some memory from all cgroups.
> >>> This sucks because the cgroup that allocates more often always wins.
> >>> E.g. a job that allocates a lot of clean, rarely used page cache will
> >>> push other jobs, with small but active in-memory working sets, out of
> >>> memory.
> >>>
> >>> To prevent such situations we have memcg controls like low/max, etc.,
> >>> which are supposed to protect jobs or limit them so they don't hurt
> >>> others. But memory cgroups are very hard to configure right, because
> >>> that requires precise knowledge of the workload, which may vary during
> >>> execution. E.g. setting a memory limit means the job won't be able to
> >>> use all memory in the system for page cache even if the rest of the
> >>> system is idle. Basically, our current scheme requires configuring
> >>> every single cgroup in the system.
> >>>
> >>> I think we can do better. The idea proposed by this patch is to reclaim
> >>> only inactive pages, and only from cgroups that have a big
> >>> (!inactive_is_low()) inactive list, and to go back to shrinking active
> >>> lists only if all inactive lists are low.
> >>
> >> Yes, you are absolutely right.
> >>
> >> We shouldn't go after active pages as long as there are plenty of
> >> inactive pages around. That's the global reclaim policy, and we
> >> currently fail to translate that well to cgrouped systems.
> >>
> >> Setting group protections or limits would work around this problem,
> >> but they're kind of a red herring. We shouldn't ever allow use-once
> >> streams to push out hot workingsets, that's a bug.
> >>
> >>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> >>>  
> >>>  		scan >>= sc->priority;
> >>>  
> >>> +		if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
> >>> +					file, memcg, sc, false))
> >>> +			scan = 0;
> >>> +
> >>>  		/*
> >>>  		 * If the cgroup's already been deleted, make sure to
> >>>  		 * scrape out the remaining cache.
> >>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >>>  	struct reclaim_state *reclaim_state = current->reclaim_state;
> >>>  	unsigned long nr_reclaimed, nr_scanned;
> >>>  	bool reclaimable = false;
> >>> +	bool retry;
> >>>  
> >>>  	do {
> >>>  		struct mem_cgroup *root = sc->target_mem_cgroup;
> >>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >>>  		};
> >>>  		struct mem_cgroup *memcg;
> >>>  
> >>> +		retry = false;
> >>> +
> >>>  		memset(&sc->nr, 0, sizeof(sc->nr));
> >>>  
> >>>  		nr_reclaimed = sc->nr_reclaimed;
> >>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >>>  			}
> >>>  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
> >>>  
> >>> +		if ((sc->nr_scanned - nr_scanned) == 0 &&
> >>> +		    !sc->may_shrink_active) {
> >>> +			sc->may_shrink_active = 1;
> >>> +			retry = true;
> >>> +			continue;
> >>> +		}
> >>
> >> Using !scanned as the gate could be a problem. There might be a cgroup
> >> that has inactive pages on the local level, but when viewed from the
> >> system level the total inactive pages in the system might still be low
> >> compared to active ones. In that case we should go after active pages.
> >>
> >> Basically, during global reclaim, the answer for whether active pages
> >> should be scanned or not should be the same regardless of whether the
> >> memory is all global or whether it's spread out between cgroups.
> >>
> >> The reason this isn't the case is because we're checking the ratio at
> >> the lruvec level - which is the highest level (and identical to the
> >> node counters) when memory is global, but it's at the lowest level
> >> when memory is cgrouped.
> >>
> >> So IMO what we should do is:
> >>
> >> - At the beginning of global reclaim, use node_page_state() to compare
> >>   the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim
> >>   can go after active pages or not. Regardless of what the ratio is in
> >>   individual lruvecs.
> >>
> >> - And likewise at the beginning of cgroup limit reclaim, walk the
> >>   subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE
> >>   and ACTIVE_FILE counters, and make inactive_is_low() decision on
> >>   those sums.
> >>
> > 
> > Sounds reasonable.
> > 
> 

Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.

2019-03-01 Thread Andrey Ryabinin



On 2/26/19 3:50 PM, Andrey Ryabinin wrote:
> 
> 
> On 2/22/19 10:15 PM, Johannes Weiner wrote:
>> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
>>> In the presence of more than one memory cgroup in the system, our
>>> reclaim logic just sucks. When we hit a memory limit (global or a limit
>>> on a cgroup with subgroups) we reclaim some memory from all cgroups.
>>> This sucks because the cgroup that allocates more often always wins.
>>> E.g. a job that allocates a lot of clean, rarely used page cache will
>>> push other jobs, with small but active in-memory working sets, out of
>>> memory.
>>>
>>> To prevent such situations we have memcg controls like low/max, etc.,
>>> which are supposed to protect jobs or limit them so they don't hurt
>>> others. But memory cgroups are very hard to configure right, because
>>> that requires precise knowledge of the workload, which may vary during
>>> execution. E.g. setting a memory limit means the job won't be able to
>>> use all memory in the system for page cache even if the rest of the
>>> system is idle. Basically, our current scheme requires configuring
>>> every single cgroup in the system.
>>>
>>> I think we can do better. The idea proposed by this patch is to reclaim
>>> only inactive pages, and only from cgroups that have a big
>>> (!inactive_is_low()) inactive list, and to go back to shrinking active
>>> lists only if all inactive lists are low.
>>
>> Yes, you are absolutely right.
>>
>> We shouldn't go after active pages as long as there are plenty of
>> inactive pages around. That's the global reclaim policy, and we
>> currently fail to translate that well to cgrouped systems.
>>
>> Setting group protections or limits would work around this problem,
>> but they're kind of a red herring. We shouldn't ever allow use-once
>> streams to push out hot workingsets, that's a bug.
>>
>>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>>>  
>>>  		scan >>= sc->priority;
>>>  
>>> +		if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
>>> +					file, memcg, sc, false))
>>> +			scan = 0;
>>> +
>>>  		/*
>>>  		 * If the cgroup's already been deleted, make sure to
>>>  		 * scrape out the remaining cache.
>>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>>>  	unsigned long nr_reclaimed, nr_scanned;
>>>  	bool reclaimable = false;
>>> +	bool retry;
>>>  
>>>  	do {
>>>  		struct mem_cgroup *root = sc->target_mem_cgroup;
>>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>>  		};
>>>  		struct mem_cgroup *memcg;
>>>  
>>> +		retry = false;
>>> +
>>>  		memset(&sc->nr, 0, sizeof(sc->nr));
>>>  
>>>  		nr_reclaimed = sc->nr_reclaimed;
>>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>>  			}
>>>  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>>>  
>>> +		if ((sc->nr_scanned - nr_scanned) == 0 &&
>>> +		    !sc->may_shrink_active) {
>>> +			sc->may_shrink_active = 1;
>>> +			retry = true;
>>> +			continue;
>>> +		}
>>
>> Using !scanned as the gate could be a problem. There might be a cgroup
>> that has inactive pages on the local level, but when viewed from the
>> system level the total inactive pages in the system might still be low
>> compared to active ones. In that case we should go after active pages.
>>
>> Basically, during global reclaim, the answer for whether active pages
>> should be scanned or not should be the same regardless of whether the
>> memory is all global or whether it's spread out between cgroups.
>>
>> The reason this isn't the case is because we're checking the ratio at
>> the lruvec level - which is the highest level (and identical to the
>> node counters) when memory is global, but it's at the lowest level
>> when memory is cgrouped.
>>
>> So IMO what we should do is:
>>
>> - At the beginning of global reclaim, use node_page_state() to compare
>>   the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim
>>   can go after active pages or not. Regardless of what the ratio is in
>>   individual lruvecs.
>>
>> - And likewise at the beginning of cgroup limit reclaim, walk the
>>   subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE
>>   and ACTIVE_FILE counters, and make inactive_is_low() decision on
>>   those sums.
>>
> 
> Sounds reasonable.
> 

On second thought, it seems better to keep the decision at the lru level.
There are a couple of reasons for this:

1) Using bare node_page_state() (or sc->target_mem_cgroup's total_[in]active
counters) would be wrong.
 Because some 

Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.

2019-02-26 Thread Roman Gushchin
On Tue, Feb 26, 2019 at 06:36:38PM +0300, Andrey Ryabinin wrote:
> 
> 
> On 2/25/19 7:03 AM, Roman Gushchin wrote:
> > On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
> >> In the presence of more than one memory cgroup in the system, our
> >> reclaim logic just sucks. When we hit a memory limit (global or a limit
> >> on a cgroup with subgroups) we reclaim some memory from all cgroups.
> >> This sucks because the cgroup that allocates more often always wins.
> >> E.g. a job that allocates a lot of clean, rarely used page cache will
> >> push other jobs, with small but active in-memory working sets, out of
> >> memory.
> >>
> >> To prevent such situations we have memcg controls like low/max, etc.,
> >> which are supposed to protect jobs or limit them so they don't hurt
> >> others. But memory cgroups are very hard to configure right, because
> >> that requires precise knowledge of the workload, which may vary during
> >> execution. E.g. setting a memory limit means the job won't be able to
> >> use all memory in the system for page cache even if the rest of the
> >> system is idle. Basically, our current scheme requires configuring
> >> every single cgroup in the system.
> >>
> >> I think we can do better. The idea proposed by this patch is to reclaim
> >> only inactive pages, and only from cgroups that have a big
> >> (!inactive_is_low()) inactive list, and to go back to shrinking active
> >> lists only if all inactive lists are low.
> > 
> > Hi Andrey!
> > 
> > It's definitely an interesting idea! However, let me bring up some concerns:
> > 1) What's considered active and inactive depends on memory pressure inside
> > a cgroup.
> 
> There is no such dependency. High memory pressure may be generated by both
> active and inactive pages. We can also have a cgroup creating no pressure
> with almost only active (or only inactive) pages.
> 
> > Actually active pages in one cgroup (e.g. just deleted) can be colder
> > than inactive pages in another (e.g. a memory-hungry cgroup with a tight
> > memory.max).
> > 
> 
> Well, yes, this is a drawback of having per-memcg lrus.
> 
> > Also a workload inside a cgroup can to some extent control what goes
> > to the active LRU. So it opens a way to get more memory unfairly by
> > artificially promoting more pages to the active LRU. So a cgroup
> > can get an unfair advantage over other cgroups.
> > 
> 
> Unfair is usually a negative term, but in this case it very much depends on
> the definition of what is "fair".
> 
> If fair means putting equal reclaim pressure on all cgroups, then yes, the
> patch increases such unfairness, but such unfairness is a good thing.
> Obviously it's more valuable to keep an actively used page in memory than a
> page that is not used.

I think that fairness is good here.

> 
> > Generally speaking, now we have a way to measure the memory pressure
> > inside a cgroup. So, in theory, it should be possible to balance
> > scanning effort based on memory pressure.
> > 
> 
> Simply by design, the inactive pages are the first candidates for reclaim.
> Any decision that doesn't take inactive pages into account would probably
> be wrong.
> 
> E.g. cgroup A with an active job loading a big and active working set,
> which creates high memory pressure, and cgroup B, idle (no memory
> pressure), with a huge unused cache. It's definitely preferable to
> reclaim from B rather than from A.
>

For sure, if we're reclaiming hot pages instead of cold ones, it's bad for
overall performance. But the active and inactive LRUs are just an
approximation of what is hot and cold. E.g. say I run "cat some_large_file"
twice in a cgroup: the whole file will reside in the active LRU and be
considered hot, even if nobody ever uses it again.

So it means that, depending on the memory usage pattern, some workloads will
benefit from your change, and some will suffer.

Btw, what will happen with protected cgroups (with memory.low set)?
Those will still affect global scanning decisions (the active/inactive
ratio), but will be exempted from scanning?

Thanks!


Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.

2019-02-26 Thread Andrey Ryabinin



On 2/25/19 7:03 AM, Roman Gushchin wrote:
> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
>> In the presence of more than one memory cgroup in the system, our
>> reclaim logic just sucks. When we hit a memory limit (global or a limit
>> on a cgroup with subgroups) we reclaim some memory from all cgroups.
>> This sucks because the cgroup that allocates more often always wins.
>> E.g. a job that allocates a lot of clean, rarely used page cache will
>> push other jobs, with small but active in-memory working sets, out of
>> memory.
>>
>> To prevent such situations we have memcg controls like low/max, etc.,
>> which are supposed to protect jobs or limit them so they don't hurt
>> others. But memory cgroups are very hard to configure right, because
>> that requires precise knowledge of the workload, which may vary during
>> execution. E.g. setting a memory limit means the job won't be able to
>> use all memory in the system for page cache even if the rest of the
>> system is idle. Basically, our current scheme requires configuring
>> every single cgroup in the system.
>>
>> I think we can do better. The idea proposed by this patch is to reclaim
>> only inactive pages, and only from cgroups that have a big
>> (!inactive_is_low()) inactive list, and to go back to shrinking active
>> lists only if all inactive lists are low.
> 
> Hi Andrey!
> 
> It's definitely an interesting idea! However, let me bring up some concerns:
> 1) What's considered active and inactive depends on memory pressure inside
> a cgroup.

There is no such dependency. High memory pressure may be generated by both
active and inactive pages. We can also have a cgroup creating no pressure
with almost only active (or only inactive) pages.

> Actually active pages in one cgroup (e.g. just deleted) can be colder
> than inactive pages in another (e.g. a memory-hungry cgroup with a tight
> memory.max).
> 

Well, yes, this is a drawback of having per-memcg lrus.

> Also a workload inside a cgroup can to some extent control what goes
> to the active LRU. So it opens a way to get more memory unfairly by
> artificially promoting more pages to the active LRU. So a cgroup
> can get an unfair advantage over other cgroups.
> 

Unfair is usually a negative term, but in this case it very much depends on
the definition of what is "fair".

If fair means putting equal reclaim pressure on all cgroups, then yes, the
patch increases such unfairness, but such unfairness is a good thing.
Obviously it's more valuable to keep an actively used page in memory than a
page that is not used.

> Generally speaking, now we have a way to measure the memory pressure
> inside a cgroup. So, in theory, it should be possible to balance
> scanning effort based on memory pressure.
> 

Simply by design, the inactive pages are the first candidates for reclaim.
Any decision that doesn't take inactive pages into account would probably
be wrong.

E.g. cgroup A with an active job loading a big and active working set,
which creates high memory pressure, and cgroup B, idle (no memory
pressure), with a huge unused cache. It's definitely preferable to
reclaim from B rather than from A.


Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.

2019-02-26 Thread Andrey Ryabinin



On 2/22/19 10:15 PM, Johannes Weiner wrote:
> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
>> In the presence of more than one memory cgroup in the system, our
>> reclaim logic just sucks. When we hit a memory limit (global or a limit
>> on a cgroup with subgroups) we reclaim some memory from all cgroups.
>> This sucks because the cgroup that allocates more often always wins.
>> E.g. a job that allocates a lot of clean, rarely used page cache will
>> push other jobs, with small but active in-memory working sets, out of
>> memory.
>>
>> To prevent such situations we have memcg controls like low/max, etc.,
>> which are supposed to protect jobs or limit them so they don't hurt
>> others. But memory cgroups are very hard to configure right, because
>> that requires precise knowledge of the workload, which may vary during
>> execution. E.g. setting a memory limit means the job won't be able to
>> use all memory in the system for page cache even if the rest of the
>> system is idle. Basically, our current scheme requires configuring
>> every single cgroup in the system.
>>
>> I think we can do better. The idea proposed by this patch is to reclaim
>> only inactive pages, and only from cgroups that have a big
>> (!inactive_is_low()) inactive list, and to go back to shrinking active
>> lists only if all inactive lists are low.
> 
> Yes, you are absolutely right.
> 
> We shouldn't go after active pages as long as there are plenty of
> inactive pages around. That's the global reclaim policy, and we
> currently fail to translate that well to cgrouped systems.
> 
> Setting group protections or limits would work around this problem,
> but they're kind of a red herring. We shouldn't ever allow use-once
> streams to push out hot workingsets, that's a bug.
> 
>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>>  
>>  		scan >>= sc->priority;
>>  
>> +		if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
>> +					file, memcg, sc, false))
>> +			scan = 0;
>> +
>>  		/*
>>  		 * If the cgroup's already been deleted, make sure to
>>  		 * scrape out the remaining cache.
>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>>  	unsigned long nr_reclaimed, nr_scanned;
>>  	bool reclaimable = false;
>> +	bool retry;
>>  
>>  	do {
>>  		struct mem_cgroup *root = sc->target_mem_cgroup;
>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>  		};
>>  		struct mem_cgroup *memcg;
>>  
>> +		retry = false;
>> +
>>  		memset(&sc->nr, 0, sizeof(sc->nr));
>>  
>>  		nr_reclaimed = sc->nr_reclaimed;
>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>>  			}
>>  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>>  
>> +		if ((sc->nr_scanned - nr_scanned) == 0 &&
>> +		    !sc->may_shrink_active) {
>> +			sc->may_shrink_active = 1;
>> +			retry = true;
>> +			continue;
>> +		}
> 
> Using !scanned as the gate could be a problem. There might be a cgroup
> that has inactive pages on the local level, but when viewed from the
> system level the total inactive pages in the system might still be low
> compared to active ones. In that case we should go after active pages.
> 
> Basically, during global reclaim, the answer for whether active pages
> should be scanned or not should be the same regardless of whether the
> memory is all global or whether it's spread out between cgroups.
> 
> The reason this isn't the case is because we're checking the ratio at
> the lruvec level - which is the highest level (and identical to the
> node counters) when memory is global, but it's at the lowest level
> when memory is cgrouped.
> 
> So IMO what we should do is:
> 
> - At the beginning of global reclaim, use node_page_state() to compare
>   the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim
>   can go after active pages or not. Regardless of what the ratio is in
>   individual lruvecs.
> 
> - And likewise at the beginning of cgroup limit reclaim, walk the
>   subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE
>   and ACTIVE_FILE counters, and make inactive_is_low() decision on
>   those sums.
> 

Sounds reasonable.


Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.

2019-02-25 Thread Vlastimil Babka
On 2/22/19 6:58 PM, Andrey Ryabinin wrote:
> In the presence of more than one memory cgroup in the system, our
> reclaim logic just sucks. When we hit a memory limit (global or a limit
> on a cgroup with subgroups) we reclaim some memory from all cgroups.
> This sucks because the cgroup that allocates more often always wins.
> E.g. a job that allocates a lot of clean, rarely used page cache will
> push other jobs, with small but active in-memory working sets, out of
> memory.
> 
> To prevent such situations we have memcg controls like low/max, etc.,
> which are supposed to protect jobs or limit them so they don't hurt
> others. But memory cgroups are very hard to configure right, because
> that requires precise knowledge of the workload, which may vary during
> execution. E.g. setting a memory limit means the job won't be able to
> use all memory in the system for page cache even if the rest of the
> system is idle. Basically, our current scheme requires configuring
> every single cgroup in the system.
> 
> I think we can do better. The idea proposed by this patch is to reclaim
> only inactive pages, and only from cgroups that have a big
> (!inactive_is_low()) inactive list, and to go back to shrinking active
> lists only if all inactive lists are low.

Perhaps going this direction could also make page cache side-channel
attacks harder?
Quoting [1]:

"On Linux, we are only able
to evict pages efficiently because we can trick the page re-
placement algorithm into believing our target page would be
the best choice for eviction. The reason for this lies in the
fact that Linux uses a global page replacement algorithm,
i.e., an algorithm which does not distinguish between dif-
ferent processes. Global page replacement algorithms have
been known for decades to allow one process to perform a
denial-of-service on other processes"

[1] https://arxiv.org/abs/1901.01161



Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.

2019-02-24 Thread Roman Gushchin
On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
> In the presence of more than one memory cgroup in the system, our
> reclaim logic just sucks. When we hit a memory limit (global or a limit
> on a cgroup with subgroups) we reclaim some memory from all cgroups.
> This sucks because the cgroup that allocates more often always wins.
> E.g. a job that allocates a lot of clean, rarely used page cache will
> push other jobs, with small but active in-memory working sets, out of
> memory.
> 
> To prevent such situations we have memcg controls like low/max, etc.,
> which are supposed to protect jobs or limit them so they don't hurt
> others. But memory cgroups are very hard to configure right, because
> that requires precise knowledge of the workload, which may vary during
> execution. E.g. setting a memory limit means the job won't be able to
> use all memory in the system for page cache even if the rest of the
> system is idle. Basically, our current scheme requires configuring
> every single cgroup in the system.
> 
> I think we can do better. The idea proposed by this patch is to reclaim
> only inactive pages, and only from cgroups that have a big
> (!inactive_is_low()) inactive list, and to go back to shrinking active
> lists only if all inactive lists are low.

Hi Andrey!

It's definitely an interesting idea! However, let me bring up some concerns:
1) What's considered active and inactive depends on memory pressure inside
a cgroup. Actually active pages in one cgroup (e.g. just deleted) can be colder
than inactive pages in another (e.g. a memory-hungry cgroup with a tight
memory.max).

Also a workload inside a cgroup can to some extent control what goes
to the active LRU. So it opens a way to get more memory unfairly by
artificially promoting more pages to the active LRU. So a cgroup
can get an unfair advantage over other cgroups.

Generally speaking, now we have a way to measure the memory pressure
inside a cgroup. So, in theory, it should be possible to balance
scanning effort based on memory pressure.

Thanks!


Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.

2019-02-22 Thread Johannes Weiner
On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
> In the presence of more than one memory cgroup in the system, our
> reclaim logic just sucks. When we hit a memory limit (global or a limit
> on a cgroup with subgroups) we reclaim some memory from all cgroups.
> This sucks because the cgroup that allocates more often always wins.
> E.g. a job that allocates a lot of clean, rarely used page cache will
> push other jobs, with small but active in-memory working sets, out of
> memory.
> 
> To prevent such situations we have memcg controls like low/max, etc.,
> which are supposed to protect jobs or limit them so they don't hurt
> others. But memory cgroups are very hard to configure right, because
> that requires precise knowledge of the workload, which may vary during
> execution. E.g. setting a memory limit means the job won't be able to
> use all memory in the system for page cache even if the rest of the
> system is idle. Basically, our current scheme requires configuring
> every single cgroup in the system.
> 
> I think we can do better. The idea proposed by this patch is to reclaim
> only inactive pages, and only from cgroups that have a big
> (!inactive_is_low()) inactive list, and to go back to shrinking active
> lists only if all inactive lists are low.

Yes, you are absolutely right.

We shouldn't go after active pages as long as there are plenty of
inactive pages around. That's the global reclaim policy, and we
currently fail to translate that well to cgrouped systems.

Setting group protections or limits would work around this problem,
but they're kind of a red herring. We shouldn't ever allow use-once
streams to push out hot workingsets, that's a bug.

> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
>  
>  		scan >>= sc->priority;
>  
> +		if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
> +					file, memcg, sc, false))
> +			scan = 0;
> +
>  		/*
>  		 * If the cgroup's already been deleted, make sure to
>  		 * scrape out the remaining cache.
> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	struct reclaim_state *reclaim_state = current->reclaim_state;
>  	unsigned long nr_reclaimed, nr_scanned;
>  	bool reclaimable = false;
> +	bool retry;
>  
>  	do {
>  		struct mem_cgroup *root = sc->target_mem_cgroup;
> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  		};
>  		struct mem_cgroup *memcg;
>  
> +		retry = false;
> +
>  		memset(&sc->nr, 0, sizeof(sc->nr));
>  
>  		nr_reclaimed = sc->nr_reclaimed;
> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  			}
>  		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>  
> +		if ((sc->nr_scanned - nr_scanned) == 0 &&
> +		    !sc->may_shrink_active) {
> +			sc->may_shrink_active = 1;
> +			retry = true;
> +			continue;
> +		}

Using !scanned as the gate could be a problem. There might be a cgroup
that has inactive pages on the local level, but when viewed from the
system level the total inactive pages in the system might still be low
compared to active ones. In that case we should go after active pages.

Basically, during global reclaim, the answer for whether active pages
should be scanned or not should be the same regardless of whether the
memory is all global or whether it's spread out between cgroups.

The reason this isn't the case is because we're checking the ratio at
the lruvec level - which is the highest level (and identical to the
node counters) when memory is global, but it's at the lowest level
when memory is cgrouped.

So IMO what we should do is:

- At the beginning of global reclaim, use node_page_state() to compare
  the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim
  can go after active pages or not. Regardless of what the ratio is in
  individual lruvecs.

- And likewise at the beginning of cgroup limit reclaim, walk the
  subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE
  and ACTIVE_FILE counters, and make inactive_is_low() decision on
  those sums.
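
A minimal sketch of what that could look like (illustrative only: the
helper names are made up, and it assumes the current node_page_state(),
mem_cgroup_iter() and lruvec_lru_size() interfaces plus the same
gb-based ratio that inactive_list_is_low() computes):

/* Node-wide decision for global reclaim, regardless of per-lruvec ratios. */
static bool node_inactive_file_is_low(struct pglist_data *pgdat)
{
	unsigned long inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
	unsigned long active = node_page_state(pgdat, NR_ACTIVE_FILE);
	unsigned long gb, ratio;

	/* same shape as the per-lruvec heuristic in inactive_list_is_low() */
	gb = (inactive + active) >> (30 - PAGE_SHIFT);
	ratio = gb ? int_sqrt(10 * gb) : 1;

	return inactive * ratio < active;
}

/* Subtree-wide decision for cgroup limit reclaim. */
static bool target_inactive_file_is_low(struct mem_cgroup *target,
					struct pglist_data *pgdat)
{
	unsigned long inactive = 0, active = 0, gb, ratio;
	struct mem_cgroup *memcg;

	/* sum the file LRU sizes over the whole target subtree */
	for (memcg = mem_cgroup_iter(target, NULL, NULL); memcg;
	     memcg = mem_cgroup_iter(target, memcg, NULL)) {
		struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);

		inactive += lruvec_lru_size(lruvec, LRU_INACTIVE_FILE,
					    MAX_NR_ZONES);
		active += lruvec_lru_size(lruvec, LRU_ACTIVE_FILE,
					  MAX_NR_ZONES);
	}

	gb = (inactive + active) >> (30 - PAGE_SHIFT);
	ratio = gb ? int_sqrt(10 * gb) : 1;

	return inactive * ratio < active;
}

Either result could then be cached in scan_control for the duration of
the reclaim cycle, e.g. in a bit like the RFC's may_shrink_active.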


Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.

2019-02-22 Thread Rik van Riel
On Fri, 2019-02-22 at 20:58 +0300, Andrey Ryabinin wrote:
> In the presence of more than one memory cgroup in the system, our
> reclaim logic just sucks. When we hit a memory limit (global or a limit
> on a cgroup with subgroups) we reclaim some memory from all cgroups.
> This sucks because the cgroup that allocates more often always wins.
> E.g. a job that allocates a lot of clean, rarely used page cache will
> push other jobs, with small but active in-memory working sets, out of
> memory.
> 
> To prevent such situations we have memcg controls like low/max, etc.,
> which are supposed to protect jobs or limit them so they don't hurt
> others. But memory cgroups are very hard to configure right, because
> that requires precise knowledge of the workload, which may vary during
> execution. E.g. setting a memory limit means the job won't be able to
> use all memory in the system for page cache even if the rest of the
> system is idle. Basically, our current scheme requires configuring
> every single cgroup in the system.
> 
> I think we can do better. The idea proposed by this patch is to reclaim
> only inactive pages, and only from cgroups that have a big
> (!inactive_is_low()) inactive list, and to go back to shrinking active
> lists only if all inactive lists are low.

Your general idea seems like a good one, but
the logic in the code seems a little convoluted
to me.

I wonder if we can simplify things a little, by
checking (when we enter page reclaim) whether
the pgdat has enough inactive pages based on
the node_page_state statistics, and basing our
decision whether or not to scan the active lists
off that.
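
Something along these lines at reclaim entry, perhaps (a sketch only;
it assumes the RFC's sc->may_shrink_active bit and reuses the gb-based
ratio from inactive_list_is_low(), and where exactly to hook it up is
left open):

	unsigned long inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
	unsigned long active = node_page_state(pgdat, NR_ACTIVE_FILE);
	unsigned long gb = (inactive + active) >> (30 - PAGE_SHIFT);
	unsigned long ratio = gb ? int_sqrt(10 * gb) : 1;

	/* scan active lists only if the node as a whole is low on inactive */
	sc->may_shrink_active = inactive * ratio < active;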

As it stands, your patch seems like the kind of
code that makes perfect sense today, but which
will confuse people who look at the code two
years from now.

If the code could be made a little more explicit,
great. If there are good reasons to do things in
the fallback way your current patch does it, the
code could use some good comments explaining why :)

-- 
All Rights Reversed.




[PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.

2019-02-22 Thread Andrey Ryabinin
In the presence of more than one memory cgroup in the system, our
reclaim logic just sucks. When we hit a memory limit (global or a limit
on a cgroup with subgroups) we reclaim some memory from all cgroups.
This sucks because the cgroup that allocates more often always wins.
E.g. a job that allocates a lot of clean, rarely used page cache will
push other jobs, with small but active in-memory working sets, out of
memory.

To prevent such situations we have memcg controls like low/max, etc.,
which are supposed to protect jobs or limit them so they don't hurt
others. But memory cgroups are very hard to configure right, because
that requires precise knowledge of the workload, which may vary during
execution. E.g. setting a memory limit means the job won't be able to
use all memory in the system for page cache even if the rest of the
system is idle. Basically, our current scheme requires configuring
every single cgroup in the system.

I think we can do better. The idea proposed by this patch is to reclaim
only inactive pages, and only from cgroups that have a big
(!inactive_is_low()) inactive list, and to go back to shrinking active
lists only if all inactive lists are low.

Now, a simple test case to demonstrate the effect of the patch.
A job in one memcg repeatedly compresses one file:

 perf stat -n --repeat 20 gzip -ck sample > /dev/null

with just 'dd' running in parallel in another cgroup, reading from the disk.

Before:
 Performance counter stats for 'gzip -ck sample' (20 runs):
       17.673572290 seconds time elapsed  ( +-  5.60% )

After:
 Performance counter stats for 'gzip -ck sample' (20 runs):
       11.426193980 seconds time elapsed  ( +-  0.20% )

The more often the dd cgroup allocates memory, the more gzip suffers.
With 4 parallel dd jobs instead of one:

Before:
 Performance counter stats for 'gzip -ck sample' (20 runs):
      499.976782013 seconds time elapsed  ( +- 23.13% )

After:
 Performance counter stats for 'gzip -ck sample' (20 runs):
       11.307450516 seconds time elapsed  ( +-  0.27% )

It would be possible to achieve a similar effect by setting memory.low on
the gzip cgroup, but the best value for memory.low depends on the size of
the 'sample' file. It's also possible to limit the 'dd' job, but imagine
something more sophisticated than 'dd': a job that would benefit from
occupying all available memory. The best limit for such a job would be
something like 'total_memory' - 'sample size', which is again unknown.

Signed-off-by: Andrey Ryabinin 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Rik van Riel 
Cc: Mel Gorman 
Cc: Roman Gushchin 
Cc: Shakeel Butt 
---
 mm/vmscan.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index efd10d6b9510..2f562c3358ab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -104,6 +104,8 @@ struct scan_control {
 	/* One of the zones is ready for compaction */
 	unsigned int compaction_ready:1;
 
+	unsigned int may_shrink_active:1;
+
 	/* Allocation order */
 	s8 order;
 
@@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 
 		scan >>= sc->priority;
 
+		if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
+					file, memcg, sc, false))
+			scan = 0;
+
 		/*
 		 * If the cgroup's already been deleted, make sure to
 		 * scrape out the remaining cache.
@@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long nr_reclaimed, nr_scanned;
 	bool reclaimable = false;
+	bool retry;
 
 	do {
 		struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 		};
 		struct mem_cgroup *memcg;
 
+		retry = false;
+
 		memset(&sc->nr, 0, sizeof(sc->nr));
 
 		nr_reclaimed = sc->nr_reclaimed;
@@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			}
 		} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
 
+		if ((sc->nr_scanned - nr_scanned) == 0 &&
+		    !sc->may_shrink_active) {
+			sc->may_shrink_active = 1;
+			retry = true;
+			continue;
+		}
+
 		if (reclaim_state) {
 			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
 			reclaim_state->reclaimed_slab = 0;
@@ -2887,7 +2903,7 @@ static bool shrink_node(pg_data_t *pgdat, struct