Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c

2021-04-19 Thread Johannes Weiner
On Mon, Apr 19, 2021 at 01:26:29PM -0400, Waiman Long wrote:
> On 4/19/21 1:19 PM, Waiman Long wrote:
> > On 4/19/21 1:13 PM, Johannes Weiner wrote:
> > > The CONFIG_MEMCG_KMEM has become sort of useless now. It used to be
> > > configurable, but now it just means CONFIG_MEMCG && !CONFIG_SLOB. And
> > > even that doesn't make sense because while slob doesn't support slab
> > > object tracking, we can still do all the other stuff we do under
> > > KMEM. I have a patch in the works to remove the symbol and ifdefs.
> > > 
> > > With that in mind, it's better to group the functions based on what
> > > they do rather than based on CONFIG_MEMCG_KMEM. It's easier to just
> > > remove another ifdef later than it is to reorder the functions.
> > > 
> > OK, I will make the move then. Thanks for the explanation.

Thanks!

> BTW, have you ever thought of moving the cgroup-v1 specific functions out
> into a separate memcontrol-v1.c file just like kernel/cgroup/cgroup-v1.c?
> 
> I thought of that before, but memcontrol.c is a frequently changed file and
> so a bit hard to do.

I haven't looked too deeply at it so far, but I think it would make
sense to try.

Many of the entry paths from the MM code are indeed shared between
cgroup1 and cgroup2, with smaller branches here and there to adjust
behavior. Those would throw conflicts, but we should probably keep
them in the main memcontrol.c for readability anyway.

But there is also plenty of code that is exclusively about cgroup1
and actually hasn't changed much in a long time. Moving that
elsewhere shouldn't create difficult conflicts - maybe a few line
offset warnings or fuzz in the diff context of unrelated changes:

- the soft limit tree and soft limit reclaim

- the threshold and oom event notification stuff

- the charge moving code

- remaining v1 interface files, as well as their helper functions

From a quick scan, this adds up to ~2,500 lines of old code with no
actual dependencies from the common code or from v2, and which could
be moved out of the way without disrupting ongoing development much.


Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c

2021-04-19 Thread Johannes Weiner
On Mon, Apr 19, 2021 at 12:18:29PM -0400, Waiman Long wrote:
> On 4/19/21 11:21 AM, Waiman Long wrote:
> > On 4/19/21 11:14 AM, Johannes Weiner wrote:
> > > On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote:
> > > > The mod_objcg_state() function is moved from mm/slab.h to
> > > > mm/memcontrol.c
> > > > so that further optimization can be done to it in later patches without
> > > > exposing unnecessary details to other mm components.
> > > > 
> > > > Signed-off-by: Waiman Long 
> > > > ---
> > > >   mm/memcontrol.c | 13 +
> > > >   mm/slab.h   | 16 ++--
> > > >   2 files changed, 15 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index e064ac0d850a..dc9032f28f2e 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct
> > > > page *page, int order)
> > > >   css_put(&memcg->css);
> > > >   }
> > > >   +void mod_objcg_state(struct obj_cgroup *objcg, struct
> > > > pglist_data *pgdat,
> > > > + enum node_stat_item idx, int nr)
> > > > +{
> > > > +    struct mem_cgroup *memcg;
> > > > +    struct lruvec *lruvec = NULL;
> > > > +
> > > > +    rcu_read_lock();
> > > > +    memcg = obj_cgroup_memcg(objcg);
> > > > +    lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > > > +    mod_memcg_lruvec_state(lruvec, idx, nr);
> > > > +    rcu_read_unlock();
> > > > +}
> > > It would be more naturally placed next to the others, e.g. below
> > > __mod_lruvec_kmem_state().
> > > 
> > > But no deal breaker if there isn't another revision.
> > > 
> > > Acked-by: Johannes Weiner 
> > > 
> > Yes, there will be another revision by rebasing patch series on the
> > linux-next. I will move the function then.
> 
> OK, it turns out that mod_objcg_state() is only defined if
> CONFIG_MEMCG_KMEM. That was why I put it in the CONFIG_MEMCG_KMEM block with
> the other obj_stock functions. I think I will keep it there.

The CONFIG_MEMCG_KMEM has become sort of useless now. It used to be
configurable, but now it just means CONFIG_MEMCG && !CONFIG_SLOB. And
even that doesn't make sense because while slob doesn't support slab
object tracking, we can still do all the other stuff we do under
KMEM. I have a patch in the works to remove the symbol and ifdefs.

With that in mind, it's better to group the functions based on what
they do rather than based on CONFIG_MEMCG_KMEM. It's easier to just
remove another ifdef later than it is to reorder the functions.


Re: [PATCH v4 4/5] mm/memcg: Save both reclaimable & unreclaimable bytes in object stock

2021-04-19 Thread Johannes Weiner
On Sun, Apr 18, 2021 at 08:00:31PM -0400, Waiman Long wrote:
> Currently, the object stock structure caches either reclaimable vmstat
> bytes or unreclaimable vmstat bytes, but not both. The hit rate can be
> improved if both types of vmstat data are cached, especially on a
> single-node system.
> 
> This patch supports caching of both types of vmstat data, though at
> the expense of slightly increased complexity in the caching code. For
> large objects (>= PAGE_SIZE), the vmstat array is updated directly
> without going through the stock caching step.
> 
> On a 2-socket Cascade Lake server with instrumentation enabled, the
> miss rates are shown in the table below.
> 
>   Initial bootup:
> 
>   Kernel          __mod_objcg_state   mod_objcg_state   %age
>   ------          -----------------   ---------------   ----
>   Before patch        634400              3243830       19.6%
>   After patch         419810              3182424       13.2%
> 
>   Parallel kernel build:
> 
>   Kernel          __mod_objcg_state   mod_objcg_state   %age
>   ------          -----------------   ---------------   ----
>   Before patch      24329265           142512465        17.1%
>   After patch       24051721           142445825        16.9%
> 
> There was a decrease in the miss rate after initial system bootup. However,
> the miss rate for the parallel kernel build remained about the same, probably
> because most of the touched kmem_cache objects were reclaimable inodes
> and dentries.
> 
> Signed-off-by: Waiman Long 
> ---
>  mm/memcontrol.c | 79 +++--
>  1 file changed, 51 insertions(+), 28 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c13502eab282..a6dd18f6d8a8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2212,8 +2212,8 @@ struct obj_stock {
>   struct obj_cgroup *cached_objcg;
>   struct pglist_data *cached_pgdat;
>   unsigned int nr_bytes;
> - int vmstat_idx;
> - int vmstat_bytes;
> + int reclaimable_bytes;  /* NR_SLAB_RECLAIMABLE_B */
> + int unreclaimable_bytes;/* NR_SLAB_UNRECLAIMABLE_B */

How about

int nr_slab_reclaimable_b;
int nr_slab_unreclaimable_b;

so you don't need the comments?

>  #else
>   int dummy[0];
>  #endif
> @@ -3217,40 +3217,56 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct 
> pglist_data *pgdat,
>enum node_stat_item idx, int nr)
>  {
>   unsigned long flags;
> - struct obj_stock *stock = get_obj_stock();
> + struct obj_stock *stock;
> + int *bytes, *alt_bytes, alt_idx;
> +
> + /*
> +  * Directly update vmstat array for big object.
> +  */
> + if (unlikely(abs(nr) >= PAGE_SIZE))
> + goto update_vmstat;

This looks like an optimization independent of the vmstat item split?

> + stock = get_obj_stock();
> + if (idx == NR_SLAB_RECLAIMABLE_B) {
> + bytes = &stock->reclaimable_bytes;
> + alt_bytes = &stock->unreclaimable_bytes;
> + alt_idx = NR_SLAB_UNRECLAIMABLE_B;
> + } else {
> + bytes = &stock->unreclaimable_bytes;
> + alt_bytes = &stock->reclaimable_bytes;
> + alt_idx = NR_SLAB_RECLAIMABLE_B;
> + }
>  
>   /*
> -  * Save vmstat data in stock and skip vmstat array update unless
> -  * accumulating over a page of vmstat data or when pgdat or idx
> +  * Try to save vmstat data in stock and skip vmstat array update
> +  * unless accumulating over a page of vmstat data or when pgdat
>* changes.
>*/
>   if (stock->cached_objcg != objcg) {
>   /* Output the current data as is */
> - } else if (!stock->vmstat_bytes) {
> - /* Save the current data */
> - stock->vmstat_bytes = nr;
> - stock->vmstat_idx = idx;
> - stock->cached_pgdat = pgdat;
> - nr = 0;
> - } else if ((stock->cached_pgdat != pgdat) ||
> -(stock->vmstat_idx != idx)) {
> - /* Output the cached data & save the current data */
> - swap(nr, stock->vmstat_bytes);
> - swap(idx, stock->vmstat_idx);
> + } else if (stock->cached_pgdat != pgdat) {
> + /* Save the current data and output cached data, if any */
> + swap(nr, *bytes);
>   swap(pgdat, stock->cached_pgdat);
> + if (*alt_bytes) {
> + __mod_objcg_state(objcg, pgdat, alt_idx,
> +   *alt_bytes);
> + *alt_bytes = 0;
> + }

As per the other email, I really don't think optimizing the pgdat
switch (in a percpu cache) is worth this level of complexity.

>   } else {
> - stock->vmstat_bytes += nr;
> - if (abs(stock->vmstat_bytes) > PAGE_SIZE) {
> - nr = stock->vmstat_bytes;
> - stock->vmstat_bytes = 0;
> + *bytes += nr;
> + 

Re: [PATCH v4 2/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp

2021-04-19 Thread Johannes Weiner
On Sun, Apr 18, 2021 at 08:00:29PM -0400, Waiman Long wrote:
> Before the new slab memory controller with per object byte charging,
> charging and vmstat data update happen only when new slab pages are
> allocated or freed. Now they are done with every kmem_cache_alloc()
> and kmem_cache_free(). This causes additional overhead for workloads
> that generate a lot of alloc and free calls.
> 
> The memcg_stock_pcp is used to cache the byte charge for a specific
> obj_cgroup to reduce that overhead. To further reduce it, this patch
> caches the vmstat data in the memcg_stock_pcp structure as well,
> until it accumulates a page size worth of updates or other cached
> data change. Caching the vmstat data in the per-cpu stock replaces two
> writes to non-hot cachelines (for the memcg-specific and the
> memcg-lruvec-specific vmstat data) with one write to a hot local stock
> cacheline.
> 
> On a 2-socket Cascade Lake server with instrumentation enabled and this
> patch applied, it was found that about 20% (634400 out of 3243830)
> of the calls to mod_objcg_state() led to an actual call to
> __mod_objcg_state() after initial boot. When doing a parallel kernel
> build, the figure was about 17% (24329265 out of 142512465). So caching
> the vmstat data reduces the number of calls to __mod_objcg_state()
> by more than 80%.
> 
> Signed-off-by: Waiman Long 
> Reviewed-by: Shakeel Butt 
> ---
>  mm/memcontrol.c | 64 ++---
>  1 file changed, 61 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index dc9032f28f2e..693453f95d99 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2213,7 +2213,10 @@ struct memcg_stock_pcp {
>  
>  #ifdef CONFIG_MEMCG_KMEM
>   struct obj_cgroup *cached_objcg;
> + struct pglist_data *cached_pgdat;
>   unsigned int nr_bytes;
> + int vmstat_idx;
> + int vmstat_bytes;
>  #endif
>  
>   struct work_struct work;
> @@ -3150,8 +3153,9 @@ void __memcg_kmem_uncharge_page(struct page *page, int 
> order)
>   css_put(&memcg->css);
>  }
>  
> -void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
> -  enum node_stat_item idx, int nr)
> +static inline void __mod_objcg_state(struct obj_cgroup *objcg,
> +  struct pglist_data *pgdat,
> +  enum node_stat_item idx, int nr)

This naming is dangerous, as the __mod_foo naming scheme we use
everywhere else suggests it's the same function as mod_foo() just with
preemption/irqs disabled.
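
By that convention, readers will expect something like this (purely
illustrative sketch, foo is a made-up example):

	static unsigned long foo_stat;

	/* Caller has preemption/irqs disabled as required */
	static void __mod_foo_state(int nr)
	{
		foo_stat += nr;
	}

	void mod_foo_state(int nr)
	{
		unsigned long flags;

		local_irq_save(flags);
		__mod_foo_state(nr);
		local_irq_restore(flags);
	}

whereas here mod_objcg_state() additionally does the stock caching,
which is surprising.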

> @@ -3159,10 +3163,53 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct 
> pglist_data *pgdat,
>   rcu_read_lock();
>   memcg = obj_cgroup_memcg(objcg);
>   lruvec = mem_cgroup_lruvec(memcg, pgdat);
> - mod_memcg_lruvec_state(lruvec, idx, nr);
> + __mod_memcg_lruvec_state(lruvec, idx, nr);
>   rcu_read_unlock();
>  }
>  
> +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
> +  enum node_stat_item idx, int nr)
> +{
> + struct memcg_stock_pcp *stock;
> + unsigned long flags;
> +
> + local_irq_save(flags);
> + stock = this_cpu_ptr(&memcg_stock);
> +
> + /*
> +  * Save vmstat data in stock and skip vmstat array update unless
> +  * accumulating over a page of vmstat data or when pgdat or idx
> +  * changes.
> +  */
> + if (stock->cached_objcg != objcg) {
> + /* Output the current data as is */

When you get here with the wrong objcg and hit the cold path, it's
usually immediately followed by an uncharge -> refill_obj_stock() that
will then flush and reset cached_objcg.

Instead of doing two cold paths, why not flush the old objcg right
away and set the new so that refill_obj_stock() can use the fast path?
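
I.e. mirror what refill_obj_stock() already does on a mismatch -
something like this (untested sketch):

	if (stock->cached_objcg != objcg) {
		/* Flush the old objcg's cached bytes and stats... */
		drain_obj_stock(stock);
		/*
		 * ...and adopt the new objcg, so that the uncharge
		 * which typically follows hits the fast path.
		 */
		obj_cgroup_get(objcg);
		stock->cached_objcg = objcg;
	}
	/* the cache now matches, fall through to the cached update */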

> + } else if (!stock->vmstat_bytes) {
> + /* Save the current data */
> + stock->vmstat_bytes = nr;
> + stock->vmstat_idx = idx;
> + stock->cached_pgdat = pgdat;
> + nr = 0;
> + } else if ((stock->cached_pgdat != pgdat) ||
> +(stock->vmstat_idx != idx)) {
> + /* Output the cached data & save the current data */
> + swap(nr, stock->vmstat_bytes);
> + swap(idx, stock->vmstat_idx);
> + swap(pgdat, stock->cached_pgdat);

Is this optimization worth doing?

You later split vmstat_bytes and idx doesn't change anymore.

How often does the pgdat change? This is a per-cpu cache after all,
and the numa node a given cpu allocates from tends to not change that
often. Even with interleaving mode, which I think is pretty rare, the
interleaving happens at the slab/page level, not the object level, and
the cache isn't bigger than a page anyway.

> + } else {
> + stock->vmstat_bytes += nr;
> + if (abs(stock->vmstat_bytes) > PAGE_SIZE) {
> + nr = stock->vmstat_bytes;
> + 

Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c

2021-04-19 Thread Johannes Weiner
On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote:
> The mod_objcg_state() function is moved from mm/slab.h to mm/memcontrol.c
> so that further optimization can be done to it in later patches without
> exposing unnecessary details to other mm components.
> 
> Signed-off-by: Waiman Long 
> ---
>  mm/memcontrol.c | 13 +
>  mm/slab.h   | 16 ++--
>  2 files changed, 15 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e064ac0d850a..dc9032f28f2e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct page *page, int 
> order)
>   css_put(&memcg->css);
>  }
>  
> +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
> +  enum node_stat_item idx, int nr)
> +{
> + struct mem_cgroup *memcg;
> + struct lruvec *lruvec = NULL;
> +
> + rcu_read_lock();
> + memcg = obj_cgroup_memcg(objcg);
> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
> + mod_memcg_lruvec_state(lruvec, idx, nr);
> + rcu_read_unlock();
> +}

It would be more naturally placed next to the others, e.g. below
__mod_lruvec_kmem_state().

But no deal breaker if there isn't another revision.

Acked-by: Johannes Weiner 


Re: [PATCH v3 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()

2021-04-16 Thread Johannes Weiner
On Thu, Apr 15, 2021 at 12:59:21PM -0400, Waiman Long wrote:
> On 4/15/21 12:40 PM, Johannes Weiner wrote:
> > On Tue, Apr 13, 2021 at 09:20:23PM -0400, Waiman Long wrote:
> > > The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily
> > > available. So both of them are now passed to mod_memcg_lruvec_state()
> > > and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is
> > > updated to allow either of the two parameters to be set to null. This
> > > makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec
> > > is null.
> > > 
> > > The new __mod_memcg_lruvec_state() function will be used in the next
> > > patch as a replacement of mod_memcg_state() in mm/percpu.c for the
> > > consolidation of the memory uncharge and vmstat update functions in
> > > the kmem_cache_free() path.
> > This requires users who want both to pass a pgdat that can be derived
> > from the lruvec. This is error prone, and we just acked a patch that
> > removes this very thing from mem_cgroup_page_lruvec().
> > 
> > With the suggestion for patch 2, this shouldn't be necessary anymore,
> > though. And sort of underlines my point around that combined function
> > creating awkward code above and below it.
> > 
> The reason for passing in the pgdat is the caching of vmstat data.
> lruvec may be gone if the corresponding memory cgroup is removed, but pgdat
> should stay put. That is why I put pgdat in the obj_stock for caching. I
> could also put the node id instead of pgdat.

Internally storing the pgdat is fine.

I was talking about the interface requiring the user to pass redundant
information (pgdat, can be derived from lruvec) because the parameter
is overloaded to modify the function's behavior, which is unintuitive.
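
When the caller passes a lruvec, the node is already implied and could
be looked up inside the function if it's needed, e.g. (sketch):

	struct pglist_data *pgdat = lruvec_pgdat(lruvec);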


Re: [PATCH v2 5/8] mm: memcontrol: rename lruvec_holds_page_lru_lock to page_matches_lruvec

2021-04-16 Thread Johannes Weiner
On Fri, Apr 16, 2021 at 01:14:04PM +0800, Muchun Song wrote:
> lruvec_holds_page_lru_lock() doesn't check anything about locking and is
> used to check whether the page belongs to the lruvec. So rename it to
> page_matches_lruvec().
> 
> Signed-off-by: Muchun Song 

The rename makes sense, since the previous name was defined by a
specific use case rather than what it does. That said, it did imply a
lock context that makes the test result stable. Without that the
function could use a short comment, IMO. How about:

/* Test requires a stable page->memcg binding, see page_memcg() */

With that,
Acked-by: Johannes Weiner 


Re: [PATCH V2] sched,psi: fix the 'int' underflow for psi

2021-04-16 Thread Johannes Weiner
On Fri, Apr 16, 2021 at 08:32:16PM +0530, Charan Teja Reddy wrote:
> psi_group_cpu->tasks, represented by an unsigned int, stores the number
> of tasks that could be stalled on a psi resource (io/mem/cpu).
> Decrementing these counters at zero leads to wrapping, which in turn
> leads to psi_group_cpu->state_mask being set with the respective
> pressure state. This can result in unnecessary time sampling for the
> pressure state and thus cause spurious psi events. This can further
> lead to wrong actions being taken in userland based on these psi
> events.
> Though psi_bug is set under these conditions, that is just for debug
> purposes. Fix it by decrementing the ->tasks count only when it is
> non-zero.
> 
> Signed-off-by: Charan Teja Reddy 

The subject should be

  psi: handle potential task count underflow bugs more gracefully

since it's not fixing a known bug at this point. Otherwise,

Acked-by: Johannes Weiner 


Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

2021-04-15 Thread Johannes Weiner
On Thu, Apr 15, 2021 at 03:44:56PM -0400, Waiman Long wrote:
> On 4/15/21 3:40 PM, Johannes Weiner wrote:
> > On Thu, Apr 15, 2021 at 02:47:31PM -0400, Waiman Long wrote:
> > > On 4/15/21 2:10 PM, Johannes Weiner wrote:
> > > > On Thu, Apr 15, 2021 at 12:35:45PM -0400, Waiman Long wrote:
> > > > > On 4/15/21 12:30 PM, Johannes Weiner wrote:
> > > > > > On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:
> > > > > > > In memcg_slab_free_hook()/pcpu_memcg_free_hook(), 
> > > > > > > obj_cgroup_uncharge()
> > > > > > > is followed by mod_objcg_state()/mod_memcg_state(). Each of these
> > > > > > > function call goes through a separate irq_save/irq_restore cycle. 
> > > > > > > That
> > > > > > > is inefficient.  Introduce a new function 
> > > > > > > obj_cgroup_uncharge_mod_state()
> > > > > > > that combines them with a single irq_save/irq_restore cycle.
> > > > > > > 
> > > > > > > @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup 
> > > > > > > *objcg, size_t size)
> > > > > > >   refill_obj_stock(objcg, size);
> > > > > > > }
> > > > > > > +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, 
> > > > > > > size_t size,
> > > > > > > +struct pglist_data *pgdat, int idx)
> > > > > > The optimization makes sense.
> > > > > > 
> > > > > > But please don't combine independent operations like this into a
> > > > > > single function. It makes for an unclear parameter list, it's a pain
> > > > > > in the behind to change the constituent operations later on, and it
> > > > > > has a habit of attracting more random bools over time. E.g. what if
> > > > > > the caller already has irqs disabled? What if it KNOWS that irqs are
> > > > > > enabled and it could use local_irq_disable() instead of save?
> > > > > > 
> > > > > > Just provide an __obj_cgroup_uncharge() that assumes irqs are
> > > > > > disabled, combine with the existing __mod_memcg_lruvec_state(), and
> > > > > > bubble the irq handling up to those callsites which know better.
> > > > > > 
> > > > > That will also work. However, the reason I did that was because of 
> > > > > patch 5
> > > > > in the series. I could put the get_obj_stock() and put_obj_stock() 
> > > > > code in
> > > > > slab.h and allowed them to be used directly in various places, but 
> > > > > hiding in
> > > > > one function is easier.
> > > > Yeah it's more obvious after getting to patch 5.
> > > > 
> > > > But with the irq disabling gone entirely, is there still an incentive
> > > > to combine the atomic section at all? Disabling preemption is pretty
> > > > cheap, so it wouldn't matter to just do it twice.
> > > > 
> > > > I.e. couldn't the final sequence in slab code simply be
> > > > 
> > > > objcg_uncharge()
> > > > mod_objcg_state()
> > > > 
> > > > again and each function disables preemption (and in the rare case
> > > > irqs) as it sees fit?
> > > > 
> > > > You lose the irqsoff batching in the cold path, but as you say, hit
> > > > rates are pretty good, and it doesn't seem worth complicating the code
> > > > for the cold path.
> > > > 
> > > That does make sense, though a little bit of performance may be lost. I 
> > > will
> > > try that out to see how it works out performance-wise.
> > Thanks.
> > 
> > Even if we still end up doing it, it's great to have that cost
> > isolated, so we know how much extra code complexity corresponds to how
> > much performance gain. It seems the task/irq split could otherwise be
> > a pretty localized change with no API implications.
> > 
> I still want to move mod_objcg_state() function to memcontrol.c though as I
> don't want to put any obj_stock stuff in mm/slab.h.

No objection from me!

That's actually a nice cleanup, IMO. Not sure why it was separated
from the rest of the objcg interface implementation to begin with.


Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

2021-04-15 Thread Johannes Weiner
On Thu, Apr 15, 2021 at 02:47:31PM -0400, Waiman Long wrote:
> On 4/15/21 2:10 PM, Johannes Weiner wrote:
> > On Thu, Apr 15, 2021 at 12:35:45PM -0400, Waiman Long wrote:
> > > On 4/15/21 12:30 PM, Johannes Weiner wrote:
> > > > On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:
> > > > > In memcg_slab_free_hook()/pcpu_memcg_free_hook(), 
> > > > > obj_cgroup_uncharge()
> > > > > is followed by mod_objcg_state()/mod_memcg_state(). Each of these
> > > > > function call goes through a separate irq_save/irq_restore cycle. That
> > > > > is inefficient.  Introduce a new function 
> > > > > obj_cgroup_uncharge_mod_state()
> > > > > that combines them with a single irq_save/irq_restore cycle.
> > > > > 
> > > > > @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup 
> > > > > *objcg, size_t size)
> > > > >   refill_obj_stock(objcg, size);
> > > > >}
> > > > > +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t 
> > > > > size,
> > > > > +struct pglist_data *pgdat, int idx)
> > > > The optimization makes sense.
> > > > 
> > > > But please don't combine independent operations like this into a
> > > > single function. It makes for an unclear parameter list, it's a pain
> > > > in the behind to change the constituent operations later on, and it
> > > > has a habit of attracting more random bools over time. E.g. what if
> > > > the caller already has irqs disabled? What if it KNOWS that irqs are
> > > > enabled and it could use local_irq_disable() instead of save?
> > > > 
> > > > Just provide an __obj_cgroup_uncharge() that assumes irqs are
> > > > disabled, combine with the existing __mod_memcg_lruvec_state(), and
> > > > bubble the irq handling up to those callsites which know better.
> > > > 
> > > That will also work. However, the reason I did that was because of patch 5
> > > in the series. I could put the get_obj_stock() and put_obj_stock() code in
> > > slab.h and allowed them to be used directly in various places, but hiding 
> > > in
> > > one function is easier.
> > Yeah it's more obvious after getting to patch 5.
> > 
> > But with the irq disabling gone entirely, is there still an incentive
> > to combine the atomic section at all? Disabling preemption is pretty
> > cheap, so it wouldn't matter to just do it twice.
> > 
> > I.e. couldn't the final sequence in slab code simply be
> > 
> > objcg_uncharge()
> > mod_objcg_state()
> > 
> > again and each function disables preemption (and in the rare case
> > irqs) as it sees fit?
> > 
> > You lose the irqsoff batching in the cold path, but as you say, hit
> > rates are pretty good, and it doesn't seem worth complicating the code
> > for the cold path.
> > 
> That does make sense, though a little bit of performance may be lost. I will
> try that out to see how it works out performance-wise.

Thanks.

Even if we still end up doing it, it's great to have that cost
isolated, so we know how much extra code complexity corresponds to how
much performance gain. It seems the task/irq split could otherwise be
a pretty localized change with no API implications.


Re: [PATCH v3 5/5] mm/memcg: Optimize user context object stock access

2021-04-15 Thread Johannes Weiner
On Thu, Apr 15, 2021 at 02:16:17PM -0400, Waiman Long wrote:
> On 4/15/21 1:53 PM, Johannes Weiner wrote:
> > On Tue, Apr 13, 2021 at 09:20:27PM -0400, Waiman Long wrote:
> > > Most kmem_cache_alloc() calls are from user context. With instrumentation
> > > enabled, the measured amount of kmem_cache_alloc() calls from non-task
> > > context was about 0.01% of the total.
> > > 
> > > The irq disable/enable sequence used in this case to access content
> > > from object stock is slow.  To optimize for user context access, there
> > > are now two object stocks for task context and interrupt context access
> > > respectively.
> > > 
> > > The task context object stock can be accessed after disabling preemption
> > > which is cheap in non-preempt kernel. The interrupt context object stock
> > > can only be accessed after disabling interrupt. User context code can
> > > access interrupt object stock, but not vice versa.
> > > 
> > > The mod_objcg_state() function is also modified to make sure that memcg
> > > and lruvec stat updates are done with interrupts disabled.
> > > 
> > > The downside of this change is that there are more data stored in local
> > > object stocks and not reflected in the charge counter and the vmstat
> > > arrays.  However, this is a small price to pay for better performance.
> > > 
> > > Signed-off-by: Waiman Long 
> > > Acked-by: Roman Gushchin 
> > > Reviewed-by: Shakeel Butt 
> > This makes sense, and also explains the previous patch a bit
> > better. But please merge those two.
> The reason I broke it into two is so that the patches are individually
> easier to review. I prefer to update the commit log of patch 4 to explain
> why the obj_stock structure is introduced instead of merging the two.

Well I did not find them easier to review separately.

> > > @@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct 
> > > *dummy)
> > >   local_irq_save(flags);
>   stock = this_cpu_ptr(&memcg_stock);
> - drain_obj_stock(&stock->obj);
> + drain_obj_stock(&stock->irq_obj);
> + if (in_task())
> + drain_obj_stock(&stock->task_obj);
>   drain_stock(stock);
>   clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
> > > @@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct 
> > > obj_cgroup *objcg,
> > >   memcg = obj_cgroup_memcg(objcg);
> > >   if (pgdat)
> > >   lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > > - __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> > > + mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> > >   rcu_read_unlock();
> > This is actually a bug introduced in the earlier patch, isn't it?
> > Calling __mod_memcg_lruvec_state() without irqs disabled...
> > 
> Not really. In patch 3, mod_objcg_state() is called only in the stock update
> context where interrupts had already been disabled. But now that is no
> longer the case, which is why I need to update mod_objcg_state() to make sure
> irqs are disabled before updating the vmstat data arrays.

Oh, I see it now. Man, that's subtle. We've had several very hard to
track down preemption bugs in those stats, because they manifest as
counter imbalances and you have no idea if there is a leak somewhere.

The convention for these functions is that the __ prefix indicates
that preemption has been suitably disabled. Please always follow this
convention, even if the semantic change is temporary.

Btw, is there a reason why the stock caching isn't just part of
mod_objcg_state()? Why does the user need to choose if they want the
caching or not? It's not like we ask for this when charging, either.


Re: [PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp

2021-04-15 Thread Johannes Weiner
On Thu, Apr 15, 2021 at 01:08:29PM -0400, Waiman Long wrote:
> On 4/15/21 12:50 PM, Johannes Weiner wrote:
> > On Tue, Apr 13, 2021 at 09:20:25PM -0400, Waiman Long wrote:
> > > Before the new slab memory controller with per object byte charging,
> > > charging and vmstat data update happen only when new slab pages are
> > > allocated or freed. Now they are done with every kmem_cache_alloc()
> > > and kmem_cache_free(). This causes additional overhead for workloads
> > > that generate a lot of alloc and free calls.
> > > 
> > > The memcg_stock_pcp is used to cache byte charge for a specific
> > > obj_cgroup to reduce that overhead. To further reducing it, this patch
> > > makes the vmstat data cached in the memcg_stock_pcp structure as well
> > > until it accumulates a page size worth of update or when other cached
> > > data change.
> > > 
> > > On a 2-socket Cascade Lake server with instrumentation enabled and this
> > > patch applied, it was found that about 17% (946796 out of 5515184) of the
> > > time when __mod_obj_stock_state() is called leads to an actual call to
> > > mod_objcg_state() after initial boot. When doing parallel kernel build,
> > > the figure was about 16% (21894614 out of 139780628). So caching the
> > > vmstat data reduces the number of calls to mod_objcg_state() by more
> > > than 80%.
> > Right, but mod_objcg_state() is itself already percpu-cached. What's
> > the benefit of avoiding calls to it with another percpu cache?
> > 
> There are actually 2 sets of vmstat data that have to be updated. One is
> associated with the memcg and the other one is for each lruvec within the
> cgroup. By caching them in obj_stock, we replace 2 writes to two colder
> cachelines with one write to a hot cacheline. If you look at patch 5, I
> break obj_stock into two - one for task context and one for irq context.
> Disabling interrupts is no longer needed in task context, but that is not
> possible when writing to the actual vmstat data arrays.

Ah, thanks for the explanation. Both of these points are worth
mentioning in the changelog of this patch.


Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

2021-04-15 Thread Johannes Weiner
On Thu, Apr 15, 2021 at 12:35:45PM -0400, Waiman Long wrote:
> On 4/15/21 12:30 PM, Johannes Weiner wrote:
> > On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:
> > > In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
> > > is followed by mod_objcg_state()/mod_memcg_state(). Each of these
> > > function call goes through a separate irq_save/irq_restore cycle. That
> > > is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
> > > that combines them with a single irq_save/irq_restore cycle.
> > > 
> > > @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, 
> > > size_t size)
> > >   refill_obj_stock(objcg, size);
> > >   }
> > > +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
> > > +struct pglist_data *pgdat, int idx)
> > The optimization makes sense.
> > 
> > But please don't combine independent operations like this into a
> > single function. It makes for an unclear parameter list, it's a pain
> > in the behind to change the constituent operations later on, and it
> > has a habit of attracting more random bools over time. E.g. what if
> > the caller already has irqs disabled? What if it KNOWS that irqs are
> > enabled and it could use local_irq_disable() instead of save?
> > 
> > Just provide an __obj_cgroup_uncharge() that assumes irqs are
> > disabled, combine with the existing __mod_memcg_lruvec_state(), and
> > bubble the irq handling up to those callsites which know better.
> > 
> That will also work. However, the reason I did that was because of patch 5
> in the series. I could put the get_obj_stock() and put_obj_stock() code in
> slab.h and allow them to be used directly in various places, but hiding it in
> one function is easier.

Yeah it's more obvious after getting to patch 5.

But with the irq disabling gone entirely, is there still an incentive
to combine the atomic section at all? Disabling preemption is pretty
cheap, so it wouldn't matter to just do it twice.

I.e. couldn't the final sequence in slab code simply be

objcg_uncharge()
mod_objcg_state()

again and each function disables preemption (and in the rare case
irqs) as it sees fit?

You lose the irqsoff batching in the cold path, but as you say, hit
rates are pretty good, and it doesn't seem worth complicating the code
for the cold path.


Re: [PATCH v3 5/5] mm/memcg: Optimize user context object stock access

2021-04-15 Thread Johannes Weiner
On Tue, Apr 13, 2021 at 09:20:27PM -0400, Waiman Long wrote:
> Most kmem_cache_alloc() calls are from user context. With instrumentation
> enabled, the measured number of kmem_cache_alloc() calls from non-task
> context was about 0.01% of the total.
> 
> The irq disable/enable sequence used in this case to access content
> from object stock is slow.  To optimize for user context access, there
> are now two object stocks for task context and interrupt context access
> respectively.
> 
> The task context object stock can be accessed after disabling preemption
> which is cheap in non-preempt kernel. The interrupt context object stock
> can only be accessed after disabling interrupt. User context code can
> access interrupt object stock, but not vice versa.
> 
> The mod_objcg_state() function is also modified to make sure that memcg
> and lruvec stat updates are done with interrupts disabled.
> 
> The downside of this change is that more data is stored in the local
> object stocks without being reflected in the charge counter and the vmstat
> arrays.  However, this is a small price to pay for better performance.
> 
> Signed-off-by: Waiman Long 
> Acked-by: Roman Gushchin 
> Reviewed-by: Shakeel Butt 

This makes sense, and also explains the previous patch a bit
better. But please merge those two.

> @@ -2229,7 +2229,8 @@ struct obj_stock {
>  struct memcg_stock_pcp {
>   struct mem_cgroup *cached; /* this never be root cgroup */
>   unsigned int nr_pages;
> - struct obj_stock obj;
> + struct obj_stock task_obj;
> + struct obj_stock irq_obj;
>  
>   struct work_struct work;
>   unsigned long flags;
> @@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct 
> memcg_stock_pcp *stock,
>  }
>  #endif
>  
> +/*
> + * Most kmem_cache_alloc() calls are from user context. The irq 
> disable/enable
> + * sequence used in this case to access content from object stock is slow.
> + * To optimize for user context access, there are now two object stocks for
> + * task context and interrupt context access respectively.
> + *
> + * The task context object stock can be accessed by disabling preemption only
> + * which is cheap in non-preempt kernel. The interrupt context object stock
> + * can only be accessed after disabling interrupt. User context code can
> + * access interrupt object stock, but not vice versa.
> + */
>  static inline struct obj_stock *current_obj_stock(void)
>  {
>   struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock);
>  
> - return &stock->obj;
> + return in_task() ? &stock->task_obj : &stock->irq_obj;
> +}
> +
> +#define get_obj_stock(flags) \
> +({   \
> + struct memcg_stock_pcp *stock;  \
> + struct obj_stock *obj_stock;\
> + \
> + if (in_task()) {\
> + preempt_disable();  \
> + (flags) = -1L;  \
> + stock = this_cpu_ptr(&memcg_stock); \
> + obj_stock = &stock->task_obj;   \
> + } else {\
> + local_irq_save(flags);  \
> + stock = this_cpu_ptr(&memcg_stock); \
> + obj_stock = &stock->irq_obj;\
> + }   \
> + obj_stock;  \
> +})
> +
> +static inline void put_obj_stock(unsigned long flags)
> +{
> + if (flags == -1L)
> + preempt_enable();
> + else
> + local_irq_restore(flags);
>  }

Please make them both functions and use 'unsigned long *flags'.

Also I'm not sure doing in_task() twice would actually be more
expensive than the == -1 special case, and easier to understand.
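
Something like this, checking in_task() in both places instead of the
sentinel value (untested sketch):

	static struct obj_stock *get_obj_stock(unsigned long *flags)
	{
		struct memcg_stock_pcp *stock;

		if (in_task()) {
			preempt_disable();
			stock = this_cpu_ptr(&memcg_stock);
			return &stock->task_obj;
		}
		local_irq_save(*flags);
		stock = this_cpu_ptr(&memcg_stock);
		return &stock->irq_obj;
	}

	static void put_obj_stock(unsigned long flags)
	{
		if (in_task())
			preempt_enable();
		else
			local_irq_restore(flags);
	}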

> @@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
>   local_irq_save(flags);
>  
>   stock = this_cpu_ptr(&memcg_stock);
> - drain_obj_stock(&stock->obj);
> + drain_obj_stock(&stock->irq_obj);
> + if (in_task())
> + drain_obj_stock(&stock->task_obj);
>   drain_stock(stock);
>   clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
>  
> @@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup 
> *objcg,
>   memcg = obj_cgroup_memcg(objcg);
>   if (pgdat)
>   lruvec = mem_cgroup_lruvec(memcg, pgdat);
> - __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> + mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
>   rcu_read_unlock();

This is actually a bug introduced in the earlier patch, isn't it?
Calling __mod_memcg_lruvec_state() without irqs disabled...


Re: [PATCH v3 4/5] mm/memcg: Separate out object stock data into its own struct

2021-04-15 Thread Johannes Weiner
On Tue, Apr 13, 2021 at 09:20:26PM -0400, Waiman Long wrote:
> The object stock data stored in struct memcg_stock_pcp are independent
> of the other page-based data stored there. Separate them out into
> their own struct to highlight their independence.
> 
> Signed-off-by: Waiman Long 
> Acked-by: Roman Gushchin 
> Reviewed-by: Shakeel Butt 
> ---
>  mm/memcontrol.c | 41 ++---
>  1 file changed, 26 insertions(+), 15 deletions(-)

It's almost twice as much code, and IMO not any clearer. Plus it adds
some warts: int dummy[0], redundant naming in stock->obj.cached_objcg,
a this_cpu_ptr() wrapper that isn't really needed, etc.


Re: [PATCH v3 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp

2021-04-15 Thread Johannes Weiner
On Tue, Apr 13, 2021 at 09:20:25PM -0400, Waiman Long wrote:
> Before the new slab memory controller with per object byte charging,
> charging and vmstat data update happen only when new slab pages are
> allocated or freed. Now they are done with every kmem_cache_alloc()
> and kmem_cache_free(). This causes additional overhead for workloads
> that generate a lot of alloc and free calls.
> 
> The memcg_stock_pcp is used to cache the byte charge for a specific
> obj_cgroup to reduce that overhead. To further reduce it, this patch
> caches the vmstat data in the memcg_stock_pcp structure as well,
> until it accumulates a page size worth of updates or other cached
> data change.
> 
> On a 2-socket Cascade Lake server with instrumentation enabled and this
> patch applied, it was found that about 17% (946796 out of 5515184) of the
> calls to __mod_obj_stock_state() led to an actual call to
> mod_objcg_state() after initial boot. When doing a parallel kernel build,
> the figure was about 16% (21894614 out of 139780628). So caching the
> vmstat data reduces the number of calls to mod_objcg_state() by more
> than 80%.

Right, but mod_objcg_state() is itself already percpu-cached. What's
the benefit of avoiding calls to it with another percpu cache?


Re: [PATCH v3 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()

2021-04-15 Thread Johannes Weiner
On Tue, Apr 13, 2021 at 09:20:23PM -0400, Waiman Long wrote:
> The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily
> available. So both of them are now passed to mod_memcg_lruvec_state()
> and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is
> updated to allow either of the two parameters to be set to null. This
> makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec
> is null.
> 
> The new __mod_memcg_lruvec_state() function will be used in the next
> patch as a replacement of mod_memcg_state() in mm/percpu.c for the
> consolidation of the memory uncharge and vmstat update functions in
> the kmem_cache_free() path.

This requires users who want both to pass a pgdat that can be derived
from the lruvec. This is error prone, and we just acked a patch that
removes this very thing from mem_cgroup_page_lruvec().

With the suggestion for patch 2, this shouldn't be necessary anymore,
though. And sort of underlines my point around that combined function
creating awkward code above and below it.


Re: [PATCH v3 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

2021-04-15 Thread Johannes Weiner
On Tue, Apr 13, 2021 at 09:20:24PM -0400, Waiman Long wrote:
> In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
> is followed by mod_objcg_state()/mod_memcg_state(). Each of these
> function calls goes through a separate irq_save/irq_restore cycle. That
> is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
> that combines them with a single irq_save/irq_restore cycle.
>
> @@ -3292,6 +3296,25 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, 
> size_t size)
>   refill_obj_stock(objcg, size);
>  }
>  
> +void obj_cgroup_uncharge_mod_state(struct obj_cgroup *objcg, size_t size,
> +struct pglist_data *pgdat, int idx)

The optimization makes sense.

But please don't combine independent operations like this into a
single function. It makes for an unclear parameter list, it's a pain
in the behind to change the constituent operations later on, and it
has a habit of attracting more random bools over time. E.g. what if
the caller already has irqs disabled? What if it KNOWS that irqs are
enabled and it could use local_irq_disable() instead of save?

Just provide an __obj_cgroup_uncharge() that assumes irqs are
disabled, combine with the existing __mod_memcg_lruvec_state(), and
bubble the irq handling up to those callsites which know better.
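
I.e. something like this at the freeing hot path, with objcg, lruvec,
idx and size standing in for whatever the caller has at hand (untested
sketch):

	unsigned long flags;

	local_irq_save(flags);
	__obj_cgroup_uncharge(objcg, size);		/* new, expects irqs off */
	__mod_memcg_lruvec_state(lruvec, idx, -size);	/* existing, same rule */
	local_irq_restore(flags);

That's one irq_save/irq_restore cycle, but uncharging and the stat
update remain separate operations with clear parameter lists.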


Re: [PATCH] sched,psi: fix the 'int' underflow for psi

2021-04-15 Thread Johannes Weiner
On Thu, Apr 15, 2021 at 07:59:41PM +0530, Charan Teja Reddy wrote:
> psi_group_cpu->tasks, represented by an unsigned int, stores the number
> of tasks that could be stalled on a psi resource (io/mem/cpu).
> Decrementing these counters at zero leads to wrapping, which in turn
> leads to psi_group_cpu->state_mask being set with the respective
> pressure state. This can result in unnecessary time sampling for the
> pressure state and thus cause spurious psi events. This can further
> lead to wrong actions being taken in userland based on these psi
> events.
> Though psi_bug is set under these conditions, that is just for debug
> purposes. Fix it by decrementing the ->tasks count only when it is
> non-zero.

Makes sense, it's more graceful in the event of a bug.

But what motivates this change? Is it something you hit recently with
an upstream kernel and we should investigate?

> Signed-off-by: Charan Teja Reddy 
> ---
>  kernel/sched/psi.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 967732c..f925468 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -718,7 +718,8 @@ static void psi_group_change(struct psi_group *group, int 
> cpu,
>   groupc->tasks[3], clear, set);
>   psi_bug = 1;
>   }
> - groupc->tasks[t]--;
> + if (groupc->tasks[t])
> + groupc->tasks[t]--;

There is already a branch on the tasks to signal the bug. How about:

if (groupc->tasks[t]) {
groupc->tasks[t]--;
} else if (!psi_bug) {
printk_deferred(...
psi_bug = 1;
}


Re: [External] Re: [PATCH 4/7] mm: memcontrol: simplify lruvec_holds_page_lru_lock

2021-04-14 Thread Johannes Weiner
On Wed, Apr 14, 2021 at 06:00:42PM +0800, Muchun Song wrote:
> On Wed, Apr 14, 2021 at 5:44 PM Michal Hocko  wrote:
> >
> > On Tue 13-04-21 14:51:50, Muchun Song wrote:
> > > We already have a helper lruvec_memcg() to get the memcg from lruvec, we
> > > do not need to do it ourselves in the lruvec_holds_page_lru_lock(). So use
> > > lruvec_memcg() instead. And if mem_cgroup_disabled() returns false, the
> > > page_memcg(page) (the LRU pages) cannot be NULL. So remove the odd logic
> > > of "memcg = page_memcg(page) ? : root_mem_cgroup". And use lruvec_pgdat
> > > to simplify the code. We can have a single definition for this function
> > > that works for !CONFIG_MEMCG, CONFIG_MEMCG + mem_cgroup_disabled() and
> > > CONFIG_MEMCG.
> >
> > Neat. While you are at it wouldn't it make sense to rename the function
> > as well. I do not want to bikeshed but this is really a misnomer. It
> > doesn't check anything about locking. page_belongs_lruvec?
> 
> Right. lruvec_holds_page_lru_lock is used to check whether
> the page belongs to the lruvec. page_belongs_lruvec is
> obviously clearer. I can rename it to
> page_belongs_lruvec in the next version.

This sounds strange to me, I think 'belongs' needs a 'to' to be
correct, so page_belongs_to_lruvec(). Still kind of a mouthful.

page_matches_lruvec()?

page_from_lruvec()?


Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Johannes Weiner
Hello Yu,

On Tue, Apr 13, 2021 at 12:56:17AM -0600, Yu Zhao wrote:
> What's new in v2
> 
> Special thanks to Jens Axboe for reporting a regression in buffered
> I/O and helping test the fix.
> 
> This version includes the support of tiers, which represent levels of
> usage from file descriptors only. Pages accessed N times via file
> descriptors belong to tier order_base_2(N). Each generation contains
> at most MAX_NR_TIERS tiers, and they require additional MAX_NR_TIERS-2
> bits in page->flags. In contrast to moving across generations which
> requires the lru lock, moving across tiers only involves an atomic
> operation on page->flags and therefore has a negligible cost. A
> feedback loop modeled after the well-known PID controller monitors the
> refault rates across all tiers and decides when to activate pages from
> which tiers, on the reclaim path.

Could you elaborate a bit more on the difference between generations
and tiers?

A refault, a page table reference, or a buffered read through a file
descriptor ultimately all boil down to a memory access. The value of
having that memory resident and the cost of bringing it in from
backing storage should be the same regardless of how it's accessed by
userspace; and whether it's an in-memory reference or a non-resident
reference should have the same relative impact on the page's age.

With that context, I don't understand why file descriptor refs and
refaults get such special treatment. Could you shed some light here?

> This feedback model has a few advantages over the current feedforward
> model:
> 1) It has a negligible overhead in the buffered I/O access path
>because activations are done in the reclaim path.

This is useful if the workload isn't reclaim bound, but it can be
hazardous to defer work to reclaim, too.

If you go through the git history, there have been several patches to
soften access recognition inside reclaim because it can come with
large latencies when page reclaim kicks in after a longer period with
no memory pressure and doesn't have uptodate reference information -
to the point where eating a few extra IOs tends to add less latency to
the workload than waiting for reclaim to refresh its aging data.

Could you elaborate a bit more on the tradeoff here?

> Highlights from the discussions on v1
> =
> Thanks to Ying Huang and Dave Hansen for the comments and suggestions
> on page table scanning.
> 
> A simple worst-case scenario test did not find page table scanning
> underperforms the rmap because of the following optimizations:
> 1) It will not scan page tables from processes that have been sleeping
>since the last scan.
> 2) It will not scan PTE tables under non-leaf PMD entries that do not
>have the accessed bit set, when
>CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
> 3) It will not zigzag between the PGD table and the same PMD or PTE
>table spanning multiple VMAs. In other words, it finishes all the
>VMAs with the range of the same PMD or PTE table before it returns
>to the PGD table. This optimizes workloads that have large numbers
>of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5.
> 
> TLDR
> 
> The current page reclaim is too expensive in terms of CPU usage and
> often making poor choices about what to evict. We would like to offer
> an alternative framework that is performant, versatile and
> straightforward.
> 
> Repo
> 
> git fetch https://linux-mm.googlesource.com/page-reclaim 
> refs/changes/73/1173/1
> 
> Gerrit https://linux-mm-review.googlesource.com/c/page-reclaim/+/1173
> 
> Background
> ==
> DRAM is a major factor in total cost of ownership, and improving
> memory overcommit brings a high return on investment.

RAM cost on one hand.

On the other, paging backends have seen a revolutionary explosion in
iop/s capacity from solid state devices and CPUs that allow in-memory
compression at scale, so a higher rate of paging (semi-random IO) and
thus larger levels of overcommit are possible than ever before.

There is a lot of new opportunity here.

> Over the past decade of research and experimentation in memory
> overcommit, we observed a distinct trend across millions of servers
> and clients: the size of page cache has been decreasing because of
> the growing popularity of cloud storage. Nowadays anon pages account
> for more than 90% of our memory consumption and page cache contains
> mostly executable pages.

This gives the impression that because the number of setups heavily
using the page cache has reduced somewhat, its significance is waning
as well. I don't think that's true. I think we'll continue to have
mainstream workloads for which the page cache is significant.

Yes, the importance of paging anon memory more efficiently (or paging
it at all again, for that matter), has increased dramatically. But IMO
not because it's more prevalent, but rather because of the increase in
paging capacity from the hardware side. 

Re: [PATCH 6/7] mm: memcontrol: move obj_cgroup_uncharge_pages() out of css_set_lock

2021-04-13 Thread Johannes Weiner
On Tue, Apr 13, 2021 at 02:51:52PM +0800, Muchun Song wrote:
> The css_set_lock is used to guard the list of inherited objcgs. So there
> is no need to uncharge kernel memory under css_set_lock. Just move it
> out of the lock.
> 
> Signed-off-by: Muchun Song 

Acked-by: Johannes Weiner 


Re: [RFC PATCH v2 00/18] Use obj_cgroup APIs to charge the LRU pages

2021-04-12 Thread Johannes Weiner
On Fri, Apr 09, 2021 at 06:29:46PM -0700, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 08:29:41PM +0800, Muchun Song wrote:
> > Since the following patchsets applied. All the kernel memory are charged
> > with the new APIs of obj_cgroup.
> > 
> > [v17,00/19] The new cgroup slab memory controller
> > [v5,0/7] Use obj_cgroup APIs to charge kmem pages
> > 
> > But user memory allocations (LRU pages) pinning memcgs for a long time -
> > it exists at a larger scale and is causing recurring problems in the real
> > world: page cache doesn't get reclaimed for a long time, or is used by the
> > second, third, fourth, ... instance of the same job that was restarted into
> > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> > and make page reclaim very inefficient.
> > 
> > We can convert LRU pages and most other raw memcg pins to the objcg 
> > direction
> > to fix this problem, and then the LRU pages will not pin the memcgs.
> > 
> > This patchset aims to make the LRU pages to drop the reference to memory
> > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> > of the dying cgroups will not increase if we run the following test script.
> > 
> > ```bash
> > #!/bin/bash
> > 
> > cat /proc/cgroups | grep memory
> > 
> > cd /sys/fs/cgroup/memory
> > 
> > for i in range{1..500}
> > do
> > mkdir test
> > echo $$ > test/cgroup.procs
> > sleep 60 &
> > echo $$ > cgroup.procs
> > echo `cat test/cgroup.procs` > cgroup.procs
> > rmdir test
> > done
> > 
> > cat /proc/cgroups | grep memory
> > ```
> > 
> > Patch 1 aims to fix page charging in page replacement.
> > Patch 2-5 are code cleanup and simplification.
> > Patch 6-18 convert LRU pages pin to the objcg direction.
> > 
> > Any comments are welcome. Thanks.
> 
> Indeed the problem exists for a long time and it would be nice to fix it.
> However I'm against merging the patchset in the current form (there are some
> nice fixes/clean-ups, which can/must be applied independently). Let me explain
> my concerns:
> 
> Back to the new slab controller discussion obj_cgroup was suggested by 
> Johannes
> as a union of two concepts:
> 1) reparenting (basically an auto-pointer to a memcg in c++ terms)
> 2) byte-sized accounting
> 
> I was initially against this union because I anticipated that the reparenting
> part would be useful separately. And time told it was true.

"The idea of moving stocks and leftovers to the memcg_ptr/obj_cgroup
level is really good."

https://lore.kernel.org/lkml/20191025200020.ga8...@castle.dhcp.thefacebook.com/

If you recall, the main concern was how the byte charging interface
was added to the existing page charging interface, instead of being
layered on top of it. I suggested to do that and, since there was no
other user for the indirection pointer, just include it in the API.

It made sense at the time, and you seemed to agree. But I also agree
it makes sense to factor it out now that more users are materializing.

> I still think obj_cgroup API must be significantly reworked before being
> applied outside of the kmem area: reparenting part must be separated
> and moved to the cgroup core level to be used not only in the memcg
> context but also for other controllers, which are facing similar problems.
> Spilling obj_cgroup API in the current form over all memcg code will
> make it more complicated and will delay it, given the amount of changes
> and the number of potential code conflicts.
> 
> I'm working on the generalization of obj_cgroup API (as described above)
> and expect to have some patches next week.

Yeah, splitting the byte charging API from the reference API and
making the latter cgroup-generic makes sense. I'm looking forward to
your patches.

And yes, the conflicts between that work and Muchun's patches would be
quite large. However, most of them would come down to renames, since
the access rules and refcounting sites will remain the same, so it
shouldn't be too bad to rebase Muchun's patches on yours. And we can
continue reviewing his patches for correctness for now.

Thanks


Re: [PATCH V12 0/3] Charge loop device i/o to issuing cgroup

2021-04-12 Thread Johannes Weiner
It looks like all feedback has been addressed and there hasn't been
any new activity on it in a while.

As per the suggestion last time [1], Andrew, Jens, could this go
through the -mm tree to deal with the memcg conflicts?

[1] 
https://lore.kernel.org/lkml/CALvZod6FMQQC17Zsu9xoKs=dfwajdmc2qk3yidpuuqhx8te...@mail.gmail.com/

On Fri, Apr 02, 2021 at 12:16:31PM -0700, Dan Schatzberg wrote:
> No major changes, rebased on top of latest mm tree
> 
> Changes since V12:
> 
> * Small change to get_mem_cgroup_from_mm to avoid needing
>   get_active_memcg
> 
> Changes since V11:
> 
> * Removed WQ_MEM_RECLAIM flag from loop workqueue. Technically, this
>   can be driven by writeback, but this was causing a warning in xfs
>   and likely other filesystems aren't equipped to be driven by reclaim
>   at the VFS layer.
> * Included a small fix from Colin Ian King.
> * reworked get_mem_cgroup_from_mm to institute the necessary charge
>   priority.
> 
> Changes since V10:
> 
> * Added page-cache charging to mm: Charge active memcg when no mm is set
> 
> Changes since V9:
> 
> * Rebased against linus's branch which now includes Roman Gushchin's
>   patch this series is based off of
> 
> Changes since V8:
> 
> * Rebased on top of Roman Gushchin's patch
>   (https://lkml.org/lkml/2020/8/21/1464) which provides the nesting
>   support for setting active memcg. Dropped the patch from this series
>   that did the same thing.
> 
> Changes since V7:
> 
> * Rebased against linus's branch
> 
> Changes since V6:
> 
> * Added separate spinlock for worker synchronization
> * Minor style changes
> 
> Changes since V5:
> 
> * Fixed a missing css_put when failing to allocate a worker
> * Minor style changes
> 
> Changes since V4:
> 
> Only patches 1 and 2 have changed.
> 
> * Fixed irq lock ordering bug
> * Simplified loop detach
> * Added support for nesting memalloc_use_memcg
> 
> Changes since V3:
> 
> * Fix race on loop device destruction and deferred worker cleanup
> * Ensure charge on shmem_swapin_page works just like getpage
> * Minor style changes
> 
> Changes since V2:
> 
> * Deferred destruction of workqueue items so in the common case there
>   is no allocation needed
> 
> Changes since V1:
> 
> * Split out and reordered patches so cgroup charging changes are
>   separate from kworker -> workqueue change
> 
> * Add mem_css to struct loop_cmd to simplify logic
> 
> The loop device runs all i/o to the backing file on a separate kworker
> thread which results in all i/o being charged to the root cgroup. This
> allows a loop device to be used to trivially bypass resource limits
> and other policy. This patch series fixes this gap in accounting.
> 
> A simple script to demonstrate this behavior on cgroupv2 machine:
> 
> '''
> #!/bin/bash
> set -e
> 
> CGROUP=/sys/fs/cgroup/test.slice
> LOOP_DEV=/dev/loop0
> 
> if [[ ! -d $CGROUP ]]
> then
> sudo mkdir $CGROUP
> fi
> 
> grep oom_kill $CGROUP/memory.events
> 
> # Set a memory limit, write more than that limit to tmpfs -> OOM kill
> sudo unshare -m bash -c "
> echo \$\$ > $CGROUP/cgroup.procs;
> echo 0 > $CGROUP/memory.swap.max;
> echo 64M > $CGROUP/memory.max;
> mount -t tmpfs -o size=512m tmpfs /tmp;
> dd if=/dev/zero of=/tmp/file bs=1M count=256" || true
> 
> grep oom_kill $CGROUP/memory.events
> 
> # Set a memory limit, write more than that limit through loopback
> # device -> no OOM kill
> sudo unshare -m bash -c "
> echo \$\$ > $CGROUP/cgroup.procs;
> echo 0 > $CGROUP/memory.swap.max;
> echo 64M > $CGROUP/memory.max;
> mount -t tmpfs -o size=512m tmpfs /tmp;
> truncate -s 512m /tmp/backing_file
> losetup $LOOP_DEV /tmp/backing_file
> dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
> losetup -D $LOOP_DEV" || true
> 
> grep oom_kill $CGROUP/memory.events
> '''
> 
> Naively charging cgroups could result in priority inversions through
> the single kworker thread in the case where multiple cgroups are
> reading/writing to the same loop device. This patch series does some
> minor modification to the loop driver so that each cgroup can make
> forward progress independently to avoid this inversion.
> 
> With this patch series applied, the above script triggers OOM kills
> when writing through the loop device as expected.
> 
> Dan Schatzberg (3):
>   loop: Use worker per cgroup instead of kworker
>   mm: Charge active memcg when no mm is set
>   loop: Charge i/o to mem and blk cg
> 
>  drivers/block/loop.c   | 244 ++---
>  drivers/block/loop.h   |  15 ++-
>  include/linux/memcontrol.h |   6 +
>  kernel/cgroup/cgroup.c |   1 +
>  mm/filemap.c   |   2 +-
>  mm/memcontrol.c|  49 +---
>  mm/shmem.c |   4 +-
>  7 files changed, 253 insertions(+), 68 deletions(-)
> 
> -- 
> 2.30.2
> 
> 


Re: [RFC PATCH v2 09/18] mm: vmscan: remove noinline_for_stack

2021-04-09 Thread Johannes Weiner
On Fri, Apr 09, 2021 at 08:29:50PM +0800, Muchun Song wrote:
> The noinline_for_stack is introduced by commit 666356297ec4 ("vmscan:
> set up pagevec as late as possible in shrink_inactive_list()"), its
> purpose is to delay the allocation of pagevec as late as possible to
> save stack memory. But the commit 2bcf88796381 ("mm: take pagevecs off
> reclaim stack") replace pagevecs by lists of pages_to_free. So we do
> not need noinline_for_stack, just remove it (let the compiler decide
> whether to inline).
> 
> Signed-off-by: Muchun Song 

Good catch.

Acked-by: Johannes Weiner 

Since this patch is somewhat independent of the rest of the series,
you may want to put it in the very beginning, or even submit it
separately, to keep the main series as compact as possible. Reviewers
can be more hesitant to get involved with larger series ;)


Re: [RFC PATCH v2 06/18] mm: memcontrol: move the objcg infrastructure out of CONFIG_MEMCG_KMEM

2021-04-09 Thread Johannes Weiner
On Fri, Apr 09, 2021 at 08:29:47PM +0800, Muchun Song wrote:
> Because memory allocations pinning memcgs for a long time - it exists
> at a larger scale and is causing recurring problems in the real world:
> page cache doesn't get reclaimed for a long time, or is used by the
> second, third, fourth, ... instance of the same job that was restarted
> into a new cgroup every time. Unreclaimable dying cgroups pile up,
> waste memory, and make page reclaim very inefficient.
> 
> We can convert LRU pages and most other raw memcg pins to the objcg
> direction to fix this problem, and then the page->memcg will always
> point to an object cgroup pointer.
> 
> Therefore, the infrastructure of objcg no longer only serves
> CONFIG_MEMCG_KMEM. In this patch, we move the infrastructure of the
> objcg out of the scope of the CONFIG_MEMCG_KMEM so that the LRU pages
> can reuse it to charge pages.

Just an observation on this:

We actually may want to remove CONFIG_MEMCG_KMEM altogether at this
point. It used to be an optional feature, but nowadays it's not
configurable anymore, and always on unless slob is configured.

We've also added more than just slab accounting to it, like kernel
stack pages, and it all gets disabled on slob configs just because it
doesn't support slab object tracking.

We could probably replace CONFIG_MEMCG_KMEM with CONFIG_MEMCG in most
places, and add a couple of !CONFIG_SLOB checks in the slab callbacks.

But that's beyond the scope of your patch series, so I'm also okay
with this patch here.

> We know that the LRU pages are not accounted at the root level. But the
> page->memcg_data points to the root_mem_cgroup. So the page->memcg_data
> of the LRU pages always points to a valid pointer. But the root_mem_cgroup
> does not have an object cgroup. If we use obj_cgroup APIs to charge the
> LRU pages, we should set the page->memcg_data to a root object cgroup. So
> we also allocate an object cgroup for the root_mem_cgroup and introduce
> root_obj_cgroup to cache its value just like root_mem_cgroup.
> 
> Signed-off-by: Muchun Song 

Overall, the patch makes sense to me. A few comments below:

> @@ -252,9 +253,14 @@ struct cgroup_subsys_state *vmpressure_to_css(struct 
> vmpressure *vmpr)
>   return container_of(vmpr, struct mem_cgroup, vmpressure)->css;
>  }
>  
> -#ifdef CONFIG_MEMCG_KMEM
>  extern spinlock_t css_set_lock;
>  
> +static inline bool obj_cgroup_is_root(struct obj_cgroup *objcg)
> +{
> +   return objcg == root_obj_cgroup;
> +}

This function, and by extension root_obj_cgroup, aren't used by this
patch. Please move them to the patch that adds users for them.

> @@ -298,6 +304,20 @@ static void obj_cgroup_release(struct percpu_ref *ref)
>   percpu_ref_exit(ref);
>   kfree_rcu(objcg, rcu);
>  }
> +#else
> +static void obj_cgroup_release(struct percpu_ref *ref)
> +{
> + struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
> + unsigned long flags;
> +
> + spin_lock_irqsave(&css_set_lock, flags);
> + list_del(&objcg->list);
> + spin_unlock_irqrestore(&css_set_lock, flags);
> +
> + percpu_ref_exit(ref);
> + kfree_rcu(objcg, rcu);
> +}
> +#endif

Having two separate functions for if and else is good when the else
branch is a completely empty dummy function. In this case you end up
duplicating code, so it's better to have just one function and put the
ifdef around the nr_charged_bytes handling in it.
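
To illustrate, a minimal sketch of the single-function shape, assuming
the KMEM-only part is just the nr_charged_bytes flush (its exact
contents are elided and only indicated by a comment, since they depend
on the rest of the series):

```c
static void obj_cgroup_release(struct percpu_ref *ref)
{
	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
	unsigned long flags;

#ifdef CONFIG_MEMCG_KMEM
	/*
	 * Only the slab object side tracks byte-sized charges; flush
	 * objcg->nr_charged_bytes back to the memcg here before the
	 * objcg disappears.
	 */
#endif

	spin_lock_irqsave(&css_set_lock, flags);
	list_del(&objcg->list);
	spin_unlock_irqrestore(&css_set_lock, flags);

	percpu_ref_exit(ref);
	kfree_rcu(objcg, rcu);
}
```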

> @@ -318,10 +338,14 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
>   return objcg;
>  }
>  
> -static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> -   struct mem_cgroup *parent)
> +static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
>  {
>   struct obj_cgroup *objcg, *iter;
> + struct mem_cgroup *parent;
> +
> + parent = parent_mem_cgroup(memcg);
> + if (!parent)
> + parent = root_mem_cgroup;
>  
>   objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
>  
> @@ -342,6 +366,27 @@ static void memcg_reparent_objcgs(struct mem_cgroup 
> *memcg,
>   percpu_ref_kill(&objcg->refcnt);
>  }
>  
> +static int memcg_obj_cgroup_alloc(struct mem_cgroup *memcg)
> +{
> + struct obj_cgroup *objcg;
> +
> + objcg = obj_cgroup_alloc();
> + if (!objcg)
> + return -ENOMEM;
> +
> + objcg->memcg = memcg;
> + rcu_assign_pointer(memcg->objcg, objcg);
> +
> + return 0;
> +}
> +
> +static void memcg_obj_cgroup_free(struct mem_cgroup *memcg)
> +{
> + if (unlikely(memcg->objcg))
> + memcg_reparent_objcgs(memcg);
> +}

It's confusing to have a 'free' function not free the object it's
called on.

But rather than search for a fitting name, I think it might be better
to just fold both of these short functions into their only callsites.

Also, since memcg->objcg is reparented, and the pointer cleared, on
offlining, when could this ever be non-NULL? This deserves a comment.
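
As a sketch of the fold, assuming the onlining callback is the only
caller (the callsite name here is illustrative; the actual hook in the
series may differ):

```c
static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
{
	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
	struct obj_cgroup *objcg;

	/* Formerly memcg_obj_cgroup_alloc(); small enough to live here. */
	objcg = obj_cgroup_alloc();
	if (!objcg)
		return -ENOMEM;

	objcg->memcg = memcg;
	rcu_assign_pointer(memcg->objcg, objcg);

	/* ... rest of the onlining work ... */
	return 0;
}
```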

> @@ -3444,7 +3489,6 @@ static u64 mem_cgroup_read_u64(struct 
> 

Re: [RFC PATCH v2 05/18] mm: memcontrol: simplify the logic of objcg pinning memcg

2021-04-09 Thread Johannes Weiner
On Fri, Apr 09, 2021 at 08:29:46PM +0800, Muchun Song wrote:
> The obj_cgroup_release() and memcg_reparent_objcgs() are serialized by
> the css_set_lock. We do not need to care about objcg->memcg being
> released in the process of obj_cgroup_release(). So there is no need
> to pin memcg before releasing objcg. Remove those pinning logic to
> simplify the code.

Hm yeah, it's not clear to me why inherited objcgs pinned the memcg in
the first place, since they are reparented during memcg deletion and
therefore have no actual impact on the memcg's lifetime.

> There are only two places that modifies the objcg->memcg. One is the
> initialization to objcg->memcg in the memcg_online_kmem(), another
> is objcgs reparenting in the memcg_reparent_objcgs(). It is also
> impossible for the two to run in parallel. So xchg() is unnecessary
> and it is enough to use WRITE_ONCE().

Good catch.

> Signed-off-by: Muchun Song 

Looks like a nice cleanup / simplification:

Acked-by: Johannes Weiner 


Re: [RFC PATCH v2 04/18] mm: memcontrol: simplify lruvec_holds_page_lru_lock

2021-04-09 Thread Johannes Weiner
On Fri, Apr 09, 2021 at 08:29:45PM +0800, Muchun Song wrote:
> We already have a helper lruvec_memcg() to get the memcg from lruvec, we
> do not need to do it ourselves in the lruvec_holds_page_lru_lock(). So use
> lruvec_memcg() instead. And if mem_cgroup_disabled() returns false, the
> page_memcg(page) (the LRU pages) cannot be NULL. So remove the odd logic
> of "memcg = page_memcg(page) ? : root_mem_cgroup". And use lruvec_pgdat
> to simplify the code. We can have a single definition for this function
> that works for !CONFIG_MEMCG, CONFIG_MEMCG + mem_cgroup_disabled() and
> CONFIG_MEMCG.
> 
> Signed-off-by: Muchun Song 

Looks good to me.

Acked-by: Johannes Weiner 

If you haven't done so yet, please make sure to explicitly test with
all three config combinations, just because the dummy abstractions for
memcg disabled or compiled out tend to be paper thin and don't always
behave the way you might expect when you do more complicated things.

Something like

boot
cat sparsefile >/dev/null (> ram size to fill memory and reclaim)
echo 1 >/proc/sys/vm/compact_memory

should exercise this new function in a couple of important scenarios.


[PATCH] mm: page_counter: mitigate consequences of a page_counter underflow

2021-04-08 Thread Johannes Weiner
When the unsigned page_counter underflows, even just by a few pages, a
cgroup will not be able to run anything afterwards and trigger the OOM
killer in a loop.

Underflows shouldn't happen, but when they do in practice, we may just
be off by a small amount that doesn't interfere with the normal
operation - consequences don't need to be that dire.

Reset the page_counter to 0 upon underflow. We'll issue a warning that
the accounting will be off and then try to keep limping along.

[ We used to do this with the original res_counter, where it was a
  more straight-forward correction inside the spinlock section. I
  didn't carry it forward into the lockless page counters for
  simplicity, but it turns out this is quite useful in practice. ]

Signed-off-by: Johannes Weiner 
---
 mm/page_counter.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/page_counter.c b/mm/page_counter.c
index c6860f51b6c6..7d83641eb86b 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -52,9 +52,13 @@ void page_counter_cancel(struct page_counter *counter, 
unsigned long nr_pages)
long new;
 
new = atomic_long_sub_return(nr_pages, &counter->usage);
-   propagate_protected_usage(counter, new);
/* More uncharges than charges? */
-   WARN_ON_ONCE(new < 0);
+   if (WARN_ONCE(new < 0, "page_counter underflow: %ld nr_pages=%lu\n",
+ new, nr_pages)) {
+   new = 0;
+   atomic_long_set(&counter->usage, new);
+   }
+   propagate_protected_usage(counter, new);
 }
 
 /**
-- 
2.31.1



Re: [PATCH v2] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files

2021-04-02 Thread Johannes Weiner
On Thu, Apr 01, 2021 at 10:58:33PM -0400, Josh Hunt wrote:
> Currently only root can write files under /proc/pressure. Relax this to
> allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be
> able to write to these files.
> 
> Signed-off-by: Josh Hunt 
> Acked-by: Johannes Weiner 

v2 looks good to me. Thanks


Re: [External] Re: [RFC PATCH 00/15] Use obj_cgroup APIs to charge the LRU pages

2021-04-02 Thread Johannes Weiner
On Thu, Apr 01, 2021 at 10:15:45AM -0700, Shakeel Butt wrote:
> On Thu, Apr 1, 2021 at 9:08 AM Muchun Song  wrote:
> >
> [...]
> > > The zombie issue is a pretty urgent concern that has caused several
> > > production emergencies now. It needs a fix sooner rather than later.
> >
> > Thank you very much for clarifying the problem for me. I do agree
> > with you. This issue should be fixed ASAP. Using objcg is a good
> > choice. Dying objcg should not be a problem. Because the size of
> > objcg is so small compared to memcg.
> >
> 
> Just wanted to say out loud that yes this patchset will reduce the
> memcg zombie issue but this is not the final destination. We should
> continue the discussions on sharing/reusing scenarios.

Absolutely. I think it's an important discussion to have.

My only concern is that Muchun's patches fix a regression, which
admittedly has built over a few years, but is a regression nonetheless
that can leave machines in undesirable states and may require reboots.

The sharing and reuse semantics on the other hand have been the same
since the beginning of cgroups. Users have had some time to structure
their requirements around these semantics :-)

If there were a concrete proposal or an RFC on the table for how
sharing and reusing could be implemented, and this proposal would be
in direct conflict with the reparenting patches, I would say let's try
to figure out a way whether the bug could be fixed in a way that is
compatible with such another imminent change.

But we shouldn't hold up a bug fix to start planning a new feature.


Re: [RFC PATCH 04/15] mm: memcontrol: use lruvec_memcg in lruvec_holds_page_lru_lock

2021-04-02 Thread Johannes Weiner
On Tue, Mar 30, 2021 at 06:15:20PM +0800, Muchun Song wrote:
> We already have a helper lruvec_memcg() to get the memcg from lruvec, we
> do not need to do it ourselves in the lruvec_holds_page_lru_lock(). So use
> lruvec_memcg() instead.
> 
> Signed-off-by: Muchun Song 
> ---
>  include/linux/memcontrol.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index a35a22994cf7..6e3283828391 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -744,20 +744,20 @@ static inline struct lruvec 
> *mem_cgroup_page_lruvec(struct page *page)
>   return mem_cgroup_lruvec(memcg, pgdat);
>  }
>  
> +static inline struct mem_cgroup *lruvec_memcg(struct lruvec *lruvec);

Please reorder the functions instead to avoid forward decls.

>  static inline bool lruvec_holds_page_lru_lock(struct page *page,
> struct lruvec *lruvec)
>  {
>   pg_data_t *pgdat = page_pgdat(page);
>   const struct mem_cgroup *memcg;
> - struct mem_cgroup_per_node *mz;
>  
>   if (mem_cgroup_disabled())
>   return lruvec == &pgdat->__lruvec;
>  
> - mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
>   memcg = page_memcg(page) ? : root_mem_cgroup;
>  
> - return lruvec->pgdat == pgdat && mz->memcg == memcg;
> + return lruvec->pgdat == pgdat && lruvec_memcg(lruvec) == memcg;

Looks reasonable to me, but I wonder if there is more we can do.

lruvec_memcg() already handles CONFIG_MEMCG and mem_cgroup_disabled()
combinations, and there is also a lruvec_pgdat() which does.

One thing that is odd is page_memcg(page) ? : root_mem_cgroup. How can
lruvec pages not have a page_memcg()? mem_cgroup_page_lruvec() has:

memcg = page_memcg(page);
VM_WARN_ON_ONCE_PAGE(!memcg && !mem_cgroup_disabled(), page);

Unless I'm missing something, we should be able to have a single
definition for this function that works for !CONFIG_MEMCG,
CONFIG_MEMCG + mem_cgroup_disabled() and CONFIG_MEMCG:

lruvec_holds_page_lru_lock()
{
	return lruvec_pgdat(lruvec) == page_pgdat(page) &&
	       lruvec_memcg(lruvec) == page_memcg(page);
}


Re: [RFC PATCH 03/15] mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec

2021-04-02 Thread Johannes Weiner
On Tue, Mar 30, 2021 at 06:15:19PM +0800, Muchun Song wrote:
> All the callers of mem_cgroup_page_lruvec() just pass page_pgdat(page)
> as the 2nd parameter to it (except isolate_migratepages_block()). But
> for isolate_migratepages_block(), the page_pgdat(page) is also equal
> to the local variable of @pgdat. So mem_cgroup_page_lruvec() do not
> need the pgdat parameter. Just remove it to simplify the code.
> 
> Signed-off-by: Muchun Song 

There is a minor benefit in not doing the page_pgdat() lookup twice.
But I think it only really matters on 32-bit NUMA, where we have more
than one node but not enough space in page->flags to store the nid,
and so page_to_nid() is out-of-line and needs to go through section.

Those setups aren't really around anymore, and certainly don't justify
the hassle (and potential bugs) of the awkward interface.

Acked-by: Johannes Weiner 


Re: [RFC PATCH 02/15] mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm

2021-04-02 Thread Johannes Weiner
On Tue, Mar 30, 2021 at 06:15:18PM +0800, Muchun Song wrote:
> When mm is NULL, we do not need to hold rcu lock and call css_tryget for
> the root memcg. And we also do not need to check !mm in every loop of
> while. So bail out early when !mm.
> 
> Signed-off-by: Muchun Song 

Good observation.

Acked-by: Johannes Weiner 


Re: [RFC PATCH 01/15] mm: memcontrol: fix page charging in page replacement

2021-04-02 Thread Johannes Weiner
On Tue, Mar 30, 2021 at 06:15:17PM +0800, Muchun Song wrote:
> The pages aren't accounted at the root level, so do not charge the page
> to the root memcg in page replacement. Although we do not display the
> value (mem_cgroup_usage) so there shouldn't be any actual problem, but
> there is a WARN_ON_ONCE in the page_counter_cancel(). Who knows if it
> will trigger? So it is better to fix it.
> 
> Signed-off-by: Muchun Song 

Acked-by: Johannes Weiner 


Re: [PATCH] psi: allow unprivileged users with CAP_SYS_RESOURCE to write psi files

2021-04-01 Thread Johannes Weiner
On Thu, Apr 01, 2021 at 08:47:33AM +0200, Peter Zijlstra wrote:
> On Wed, Mar 31, 2021 at 11:31:56PM -0400, Josh Hunt wrote:
> > Currently only root can write files under /proc/pressure. Relax this to
> > allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be
> > able to write to these files.
> > 
> > Signed-off-by: Josh Hunt 
> 
> I suppose that's ok, but lets also ask Johannes.

The write creates a kthread that runs as SCHED_FIFO. For userspace
threads this is reserved for CAP_SYS_NICE tasks, but it's a kernel
thread and not arbitrary code, so I suppose CAP_SYS_RESOURCE is fine.

Acked-by: Johannes Weiner 


Re: [PATCH v5 00/27] Memory Folios

2021-04-01 Thread Johannes Weiner
On Thu, Apr 01, 2021 at 05:05:37AM +, Al Viro wrote:
> On Tue, Mar 30, 2021 at 10:09:29PM +0100, Matthew Wilcox wrote:
> 
> > That's a very Intel-centric way of looking at it.  Other architectures
> > support a multitude of page sizes, from the insane ia64 (4k, 8k, 16k, then
> > every power of four up to 4GB) to more reasonable options like (4k, 32k,
> > 256k, 2M, 16M, 128M).  But we (in software) shouldn't constrain ourselves
> > to thinking in terms of what the hardware currently supports.  Google
> > have data showing that for their workloads, 32kB is the goldilocks size.
> > I'm sure for some workloads, it's much higher and for others it's lower.
> > But for almost no workload is 4kB the right choice any more, and probably
> > hasn't been since the late 90s.
> 
> Out of curiosity I looked at the distribution of file sizes in the
> kernel tree:
> 71455 files total
> 0--4Kb        36702
> 4--8Kb        11820
> 8--16Kb       10066
> 16--32Kb       6984
> 32--64Kb       3804
> 64--128Kb      1498
> 128--256Kb      393
> 256--512Kb      108
> 512Kb--1Mb       35
> 1--2Mb           25
> 2--4Mb            5
> 4--6Mb            7
> 6--8Mb            4
> 12Mb              2
> 14Mb              1
> 16Mb              1
> 
> ... incidentally, everything bigger than 1.2Mb lives^Wshambles under
> drivers/gpu/drm/amd/include/asic_reg/
> 
> Page size Footprint
> 4Kb   1128Mb
> 8Kb   1324Mb
> 16Kb  1764Mb
> 32Kb  2739Mb
> 64Kb  4832Mb
> 128Kb 9191Mb
> 256Kb 18062Mb
> 512Kb 35883Mb
> 1Mb   71570Mb
> 2Mb   142958Mb
> 
> So for kernel builds (as well as grep over the tree, etc.) uniform 2Mb pages
> would be... interesting.

Right, I don't see us getting rid of 4k cache entries anytime
soon. Even 32k pages would double the footprint here.

The issue is just that at the other end of the spectrum we have IO
devices that do 10GB/s, which corresponds to 2.6 million pages per
second. At such data rates we are currently CPU-limited because of the
pure transaction overhead in page reclaim. Workloads like this tend to
use much larger files, and would benefit from a larger paging unit.
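
For reference, the back-of-the-envelope behind that figure, assuming
4KiB pages and reading 10GB/s as roughly 10GiB/s:

```latex
\frac{10 \times 2^{30}\ \text{B/s}}{4096\ \text{B/page}} \approx 2.6 \times 10^{6}\ \text{pages/s}
```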

Likewise, most production workloads in cloud servers have enormous
anonymous regions and large executables that greatly benefit from
fewer page table levels and bigger TLB entries.

Today, fragmentation prevents the page allocator from producing 2MB
blocks at a satisfactory rate and allocation latency. It's not
feasible to allocate 2M inside page faults for example; getting huge
page coverage for the page cache will be even more difficult.

I'm not saying we should get rid of 4k cache entries. Rather, I'm
wondering out loud whether longer-term we'd want to change the default
page size to 2M, and implement the 4k cache entries, which we clearly
continue to need, with a slab style allocator on top. The idea being
that it'll do a better job at grouping cache entries with other cache
entries of a similar lifetime than the untyped page allocator does
naturally, and so make fragmentation a whole lot more manageable.

(I'm using x86 page sizes as examples because they matter to me. But
there is an architecture independent discrepancy between the smallest
cache entries we must continue to support, and larger blocks / huge
pages that we increasingly rely on as first class pages.)


Re: [PATCH v5 00/27] Memory Folios

2021-03-31 Thread Johannes Weiner
On Tue, Mar 30, 2021 at 10:09:29PM +0100, Matthew Wilcox wrote:
> On Tue, Mar 30, 2021 at 03:30:54PM -0400, Johannes Weiner wrote:
> > Hi Willy,
> > 
> > On Mon, Mar 29, 2021 at 05:58:32PM +0100, Matthew Wilcox wrote:
> > > I'm going to respond to some points in detail below, but there are a
> > > couple of overarching themes that I want to bring out up here.
> > > 
> > > Grand Vision
> > > 
> > > 
> > > I haven't outlined my long-term plan.  Partly because it is a _very_
> > > long way off, and partly because I think what I'm doing stands on its
> > > own.  But some of the points below bear on this, so I'll do it now.
> > > 
> > > Eventually, I want to make struct page optional for allocations.  It's too
> > > small for some things (allocating page tables, for example), and overly
> > > large for others (allocating a 2MB page, networking page_pool).  I don't
> > > want to change its size in the meantime; having a struct page refer to
> > > PAGE_SIZE bytes is something that's quite deeply baked in.
> > 
> > Right, I think it's overloaded and it needs to go away from many
> > contexts it's used in today.
> > 
> > I think it describes a real physical thing, though, and won't go away
> > as a concept. More on that below.
> 
> I'm at least 90% with you on this, and we're just quibbling over details
> at this point, I think.
> 
> > > In broad strokes, I think that having a Power Of Two Allocator
> > > with Descriptor (POTAD) is a useful foundational allocator to have.
> > > The specific allocator that we call the buddy allocator is very clever for
> > > the 1990s, but touches too many cachelines to be good with today's CPUs.
> > > The generalisation of the buddy allocator to the POTAD lets us allocate
> > > smaller quantities (eg a 512 byte block) and allocate descriptors which
> > > differ in size from a struct page.  For an extreme example, see xfs_buf
> > > which is 360 bytes and is the descriptor for an allocation between 512
> > > and 65536 bytes.
> > 
> > I actually disagree with this rather strongly. If anything, the buddy
> > allocator has turned out to be a pretty poor fit for the foundational
> > allocator.
> > 
> > On paper, it is elegant and versatile in serving essentially arbitrary
> > memory blocks. In practice, we mostly just need 4k and 2M chunks from
> > it. And it sucks at the 2M ones because of the fragmentation caused by
> > the ungrouped 4k blocks.
> 
> That's a very Intel-centric way of looking at it.  Other architectures
> support a multitude of page sizes, from the insane ia64 (4k, 8k, 16k, then
> every power of four up to 4GB) to more reasonable options like (4k, 32k,
> 256k, 2M, 16M, 128M).  But we (in software) shouldn't constrain ourselves
> to thinking in terms of what the hardware currently supports.  Google
> have data showing that for their workloads, 32kB is the goldilocks size.
> I'm sure for some workloads, it's much higher and for others it's lower.
> But for almost no workload is 4kB the right choice any more, and probably
> hasn't been since the late 90s.

You missed my point entirely.

It's not about the exact page sizes, it's about the fragmentation
issue when you mix variable-sized blocks without lifetime grouping.

Anyway, we digressed quite far here. My argument was simply that it's
conceivable we'll switch to a default allocation block and page size
that is larger than the smallest paging size supported by the CPU and
the kernel. (Various architectures might support multiple page sizes,
but once you pick one, that's the smallest quantity the kernel pages.)

That makes "bundle of pages" a short-sighted abstraction, and folio a
poor name for pageable units.

I might be wrong about what happens to PAGE_SIZE eventually (even
though your broader arguments around allocator behavior and
fragmentation don't seem to line up with my observations from
production systems, or the evolution of how we manage allocations of
different sizes) - but you also haven't made a good argument why the
API *should* continue to imply we're dealing with one or more pages.

Yes, it's a bit bikesheddy. But you're proposing an abstraction for
one of the most fundamental data structures in the operating system,
with tens of thousands of instances in almost all core subsystems.

"Bundle of pages (for now) with filesystem data (and maybe anon data
since it's sort of convenient in terms of data structure, for now)"
just doesn't make me go "Yeah, that's it."

I would understand cache_entry for the cache; mem for cache and file
(that discussion trailed off); pageable if we want to imply a sizing
a

Re: [RFC PATCH 00/15] Use obj_cgroup APIs to charge the LRU pages

2021-03-31 Thread Johannes Weiner
On Tue, Mar 30, 2021 at 03:05:42PM -0700, Roman Gushchin wrote:
> On Tue, Mar 30, 2021 at 05:30:10PM -0400, Johannes Weiner wrote:
> > On Tue, Mar 30, 2021 at 11:58:31AM -0700, Roman Gushchin wrote:
> > > On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote:
> > > > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song  
> > > > wrote:
> > > > >
> > > > > Since the following patchsets applied. All the kernel memory are 
> > > > > charged
> > > > > with the new APIs of obj_cgroup.
> > > > >
> > > > > [v17,00/19] The new cgroup slab memory controller
> > > > > [v5,0/7] Use obj_cgroup APIs to charge kmem pages
> > > > >
> > > > > But user memory allocations (LRU pages) pinning memcgs for a long 
> > > > > time -
> > > > > it exists at a larger scale and is causing recurring problems in the 
> > > > > real
> > > > > world: page cache doesn't get reclaimed for a long time, or is used 
> > > > > by the
> > > > > second, third, fourth, ... instance of the same job that was 
> > > > > restarted into
> > > > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste 
> > > > > memory,
> > > > > and make page reclaim very inefficient.
> > > > >
> > > > > We can convert LRU pages and most other raw memcg pins to the objcg 
> > > > > direction
> > > > > to fix this problem, and then the LRU pages will not pin the memcgs.
> > > > >
> > > > > This patchset aims to make the LRU pages to drop the reference to 
> > > > > memory
> > > > > cgroup by using the APIs of obj_cgroup. Finally, we can see that the 
> > > > > number
> > > > > of the dying cgroups will not increase if we run the following test 
> > > > > script.
> > > > >
> > > > > ```bash
> > > > > #!/bin/bash
> > > > >
> > > > > cat /proc/cgroups | grep memory
> > > > >
> > > > > cd /sys/fs/cgroup/memory
> > > > >
> > > > > for i in range{1..500}
> > > > > do
> > > > > mkdir test
> > > > > echo $$ > test/cgroup.procs
> > > > > sleep 60 &
> > > > > echo $$ > cgroup.procs
> > > > > echo `cat test/cgroup.procs` > cgroup.procs
> > > > > rmdir test
> > > > > done
> > > > >
> > > > > cat /proc/cgroups | grep memory
> > > > > ```
> > > > >
> > > > > Patch 1 aims to fix page charging in page replacement.
> > > > > Patch 2-5 are code cleanup and simplification.
> > > > > Patch 6-15 convert LRU pages pin to the objcg direction.
> > > > 
> > > > The main concern I have with *just* reparenting LRU pages is that for
> > > > the long running systems, the root memcg will become a dumping ground.
> > > > In addition a job running multiple times on a machine will see
> > > > inconsistent memory usage if it re-accesses the file pages which were
> > > > reparented to the root memcg.
> > > 
> > > I agree, but also the reparenting is not the perfect thing in a 
> > > combination
> > > with any memory protections (e.g. memory.low).
> > > 
> > > Imagine the following configuration:
> > > workload.slice
> > > - workload_gen_1.service   memory.min = 30G
> > > - workload_gen_2.service   memory.min = 30G
> > > - workload_gen_3.service   memory.min = 30G
> > >   ...
> > > 
> > > Parent cgroup and several generations of the child cgroup, protected by a 
> > > memory.low.
> > > Once the memory is getting reparented, it's not protected anymore.
> > 
> > That doesn't sound right.
> > 
> > A deleted cgroup today exerts no control over its abandoned
> > pages. css_reset() will blow out any control settings.
> 
> I know. Currently it works in the following way: once cgroup gen_1 is deleted,
> it's memory is not protected anymore, so eventually it's getting evicted and
> re-faulted as gen_2 (or gen_N) memory. Muchun's patchset doesn't change this,
> of course. But long-term we likely wanna re-charge such pages to new cgroups
> and avoid unnecessary evictions and re-faults. Switching to obj_cgroups 
> doesn't
> help and l

Re: [RFC PATCH 00/15] Use obj_cgroup APIs to charge the LRU pages

2021-03-30 Thread Johannes Weiner
On Tue, Mar 30, 2021 at 11:58:31AM -0700, Roman Gushchin wrote:
> On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote:
> > On Tue, Mar 30, 2021 at 3:20 AM Muchun Song  
> > wrote:
> > >
> > > Since the following patchsets applied. All the kernel memory are charged
> > > with the new APIs of obj_cgroup.
> > >
> > > [v17,00/19] The new cgroup slab memory controller
> > > [v5,0/7] Use obj_cgroup APIs to charge kmem pages
> > >
> > > But user memory allocations (LRU pages) pinning memcgs for a long time -
> > > it exists at a larger scale and is causing recurring problems in the real
> > > world: page cache doesn't get reclaimed for a long time, or is used by the
> > > second, third, fourth, ... instance of the same job that was restarted 
> > > into
> > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste 
> > > memory,
> > > and make page reclaim very inefficient.
> > >
> > > We can convert LRU pages and most other raw memcg pins to the objcg 
> > > direction
> > > to fix this problem, and then the LRU pages will not pin the memcgs.
> > >
> > > This patchset aims to make the LRU pages to drop the reference to memory
> > > cgroup by using the APIs of obj_cgroup. Finally, we can see that the 
> > > number
> > > of the dying cgroups will not increase if we run the following test 
> > > script.
> > >
> > > ```bash
> > > #!/bin/bash
> > >
> > > cat /proc/cgroups | grep memory
> > >
> > > cd /sys/fs/cgroup/memory
> > >
> > > for i in range{1..500}
> > > do
> > > mkdir test
> > > echo $$ > test/cgroup.procs
> > > sleep 60 &
> > > echo $$ > cgroup.procs
> > > echo `cat test/cgroup.procs` > cgroup.procs
> > > rmdir test
> > > done
> > >
> > > cat /proc/cgroups | grep memory
> > > ```
> > >
> > > Patch 1 aims to fix page charging in page replacement.
> > > Patch 2-5 are code cleanup and simplification.
> > > Patch 6-15 convert LRU pages pin to the objcg direction.
> > 
> > The main concern I have with *just* reparenting LRU pages is that for
> > the long running systems, the root memcg will become a dumping ground.
> > In addition a job running multiple times on a machine will see
> > inconsistent memory usage if it re-accesses the file pages which were
> > reparented to the root memcg.
> 
> I agree, but also the reparenting is not the perfect thing in a combination
> with any memory protections (e.g. memory.low).
> 
> Imagine the following configuration:
> workload.slice
> - workload_gen_1.service   memory.min = 30G
> - workload_gen_2.service   memory.min = 30G
> - workload_gen_3.service   memory.min = 30G
>   ...
> 
> Parent cgroup and several generations of the child cgroup, protected by a 
> memory.low.
> Once the memory is getting reparented, it's not protected anymore.

That doesn't sound right.

A deleted cgroup today exerts no control over its abandoned
pages. css_reset() will blow out any control settings.

If you're talking about protection previously inherited by
workload.slice, that continues to apply as it always has.

None of this is really accidental. Per definition the workload.slice
control domain includes workload_gen_1.service. And per definition,
the workload_gen_1.service domain ceases to exist when you delete it.

There are no (or shouldn't be any!) semantic changes from the physical
unlinking from a dead control domain.

> Also, I'm somewhat concerned about the interaction of the reparenting
> with the writeback and dirty throttling. How does it work together?

What interaction specifically?

When you delete a cgroup that had both the block and the memory
controller enabled, the control domain of both goes away and it
becomes subject to whatever control domain is above it (if any).

A higher control domain in turn takes a recursive view of the subtree,
see mem_cgroup_wb_stats(), so when control is exerted, it applies
regardless of how and where pages are physically linked in children.

It's also already possible to enable e.g. block control only at a very
high level and memory control down to a lower level. Per design this
code can live with different domain sizes for memory and block.


Re: [RFC PATCH 00/15] Use obj_cgroup APIs to charge the LRU pages

2021-03-30 Thread Johannes Weiner
On Tue, Mar 30, 2021 at 11:34:11AM -0700, Shakeel Butt wrote:
> On Tue, Mar 30, 2021 at 3:20 AM Muchun Song  wrote:
> >
> > Since the following patchsets applied. All the kernel memory are charged
> > with the new APIs of obj_cgroup.
> >
> > [v17,00/19] The new cgroup slab memory controller
> > [v5,0/7] Use obj_cgroup APIs to charge kmem pages
> >
> > But user memory allocations (LRU pages) pinning memcgs for a long time -
> > it exists at a larger scale and is causing recurring problems in the real
> > world: page cache doesn't get reclaimed for a long time, or is used by the
> > second, third, fourth, ... instance of the same job that was restarted into
> > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> > and make page reclaim very inefficient.
> >
> > We can convert LRU pages and most other raw memcg pins to the objcg 
> > direction
> > to fix this problem, and then the LRU pages will not pin the memcgs.
> >
> > This patchset aims to make the LRU pages to drop the reference to memory
> > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> > of the dying cgroups will not increase if we run the following test script.
> >
> > ```bash
> > #!/bin/bash
> >
> > cat /proc/cgroups | grep memory
> >
> > cd /sys/fs/cgroup/memory
> >
> > for i in range{1..500}
> > do
> > mkdir test
> > echo $$ > test/cgroup.procs
> > sleep 60 &
> > echo $$ > cgroup.procs
> > echo `cat test/cgroup.procs` > cgroup.procs
> > rmdir test
> > done
> >
> > cat /proc/cgroups | grep memory
> > ```
> >
> > Patch 1 aims to fix page charging in page replacement.
> > Patch 2-5 are code cleanup and simplification.
> > Patch 6-15 convert LRU pages pin to the objcg direction.
> 
> The main concern I have with *just* reparenting LRU pages is that for
> the long running systems, the root memcg will become a dumping ground.
> In addition a job running multiple times on a machine will see
> inconsistent memory usage if it re-accesses the file pages which were
> reparented to the root memcg.

I don't understand how Muchun's patches are supposed to *change* the
behavior the way you are describing it. This IS today's behavior.

We have hierarchical accounting, and a page that belongs to a leaf
cgroup will automatically belong to all its parents.

Further, if you delete a cgroup today, the abandoned cache will stay
physically linked to that cgroup, but that zombie group no longer acts
as a control structure: it imposes no limit and no protection; the
pages will be reclaimed as if it WERE linked to the parent.

For all intents and purposes, when you delete a cgroup today, its
remaining pages ARE dumped onto the parent.

The only difference is that today they pointlessly pin the leaf cgroup
as a holding vessel - which is then round-robin'd from the parent
during reclaim in order to pretend that all these child pages actually
ARE linked to the parent's LRU list.

Remember how we used to have every page physically linked to multiple
lrus? The leaf cgroup and the root?

All pages always belong to the (virtual) LRU list of all ancestor
cgroups. The only thing Muchun changes is that they no longer pin a
cgroup that has no semantical meaning anymore (because it's neither
visible to the user nor exerts any contol over the pages anymore).

Maybe I'm missing something that either you or Roman can explain to
me. But this series looks like a (rare) pure win.

Whether you like the current semantics is a separate discussion IMO.

> Please note that I do agree with the mentioned problem and we do see
> this issue in our fleet. Internally we have a "memcg mount option"
> feature which couples a file system with a memcg and all file pages
> allocated on that file system will be charged to that memcg. Multiple
> instances (concurrent or subsequent) of the job will use that file
> system (with a dedicated memcg) without leaving the zombies behind. I
> am not pushing for this solution as it comes with its own intricacies
> (e.g. if memcg coupled with a file system has a limit, the oom
> behavior would be awkward and therefore internally we don't put a
> limit on such memcgs). Though I want this to be part of discussion.

Right, you disconnect memory from the tasks that are allocating it,
and so you can't assign culpability when you need to.

OOM is one thing, but there are also CPU cycles and IO bandwidth
consumed during reclaim.

> I think the underlying reasons behind this issue are:
> 
> 1) Filesystem shared by disjoint jobs.
> 2) For job dedicated filesystems, the lifetime of the filesystem is
> different from the lifetime of the job.

There is also the case of deleting a cgroup just to recreate it right
after for the same job. Many job managers do this on restart right now
- like systemd, and what we're using in our fleet. This seems
avoidable by recycling a group for another instance of the same job.

Sharing is a more difficult discussion. If you 

Re: [PATCH v5 00/27] Memory Folios

2021-03-30 Thread Johannes Weiner
 refcount on a descriptor without knowing whether it's a file or
> anon page.  Or neither (eg device driver memory mapped to userspace.
> Or vmalloc memory mapped to userspace.  Or ...)

The rmap code is all about the page type specifics, but once you get
into mmap, page reclaim, page migration, we're dealing with fully
fungible blocks of memory.

I do like the idea of using actual language typing for the different
things struct page can be today (fs page), but with a common type to
manage the fungible block of memory backing it (allocation state, LRU
& aging state, mmap state etc.)

New types for the former are an easier sell. We all agree that there
are too many details of the page - including the compound page
implementation detail - inside the cache library, fs code and drivers.

It's a slightly tougher sell to say that the core VM code itself
(outside the cache library) needs a tighter abstraction for the struct
page building block and the compound page structure. At least at this
time while we're still sorting out how it all may work down the line.
Certainly, we need something to describe fungible memory blocks:
either a struct page that can be 4k and 2M compound, or a new thing
that can be backed by a 2M struct page or a 4k struct smallpage. We
don't know yet, so I would table the new abstraction type for this.

I generally don't think we want a new type that does everything that
the overloaded struct page already does PLUS the compound
abstraction. Whatever name we pick for it, it'll always be difficult
to wrap your head around such a beast.

IMO starting with an explicit page cache descriptor that resolves to
struct page inside core VM code (and maybe ->fault) for now makes the
most sense: it greatly mitigates the PAGE_SIZE and tail page issue
right away, and it's not in conflict with, but rather helps work
toward, replacing the fungible memory unit behind it.

There isn't too much overlap or generic code between cache and anon
pages such that sharing a common descriptor would be a huge win (most
overlap is at the fungible memory block level, and the physical struct
page layout of course), so I don't think we should aim for a generic
abstraction for both.

As drivers go, I think there are slightly different requirements to
filesystems, too. For filesystems, when the VM can finally do it (and
the file range permits it), I assume we want to rather transparently
increase the unit of data transfer from 4k to 2M. Most drivers that
currently hardcode alloc_page() or PAGE_SIZE OTOH probably don't want
us to bump their allocation sizes.

There ARE instances where drivers allocate pages based on buffer_size
/ PAGE_SIZE and then interact with virtual memory. Those are true VM
objects that could grow transparently if PAGE_SIZE grows, and IMO they
should share the "fungible memory block" abstraction the VM uses.

But there are also many instances where PAGE_SIZE just means 4096 is a
good size for me, and struct page is useful for refcounting. Those
just shouldn't use whatever the VM or the cache layer are using and
stop putting additional burden on an already tricky abstraction.

> On Fri, Mar 26, 2021 at 01:48:15PM -0400, Johannes Weiner wrote:
> > On Wed, Mar 24, 2021 at 06:24:21AM +, Matthew Wilcox wrote:
> > > On Tue, Mar 23, 2021 at 08:29:16PM -0400, Johannes Weiner wrote:
> > > > On Mon, Mar 22, 2021 at 06:47:44PM +, Matthew Wilcox wrote:
> > > > > On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote:
> > > One of the patches I haven't posted yet starts to try to deal with 
> > > kmap()/mem*()/kunmap():
> > > 
> > > mm: Add kmap_local_folio
> > > 
> > > This allows us to map a portion of a folio.  Callers can only expect
> > > to access up to the next page boundary.
> > > 
> > > Signed-off-by: Matthew Wilcox (Oracle) 
> > > 
> > > diff --git a/include/linux/highmem-internal.h 
> > > b/include/linux/highmem-internal.h
> > > index 7902c7d8b55f..55a29c9d562f 100644
> > > --- a/include/linux/highmem-internal.h
> > > +++ b/include/linux/highmem-internal.h
> > > @@ -73,6 +73,12 @@ static inline void *kmap_local_page(struct page *page)
> > > return __kmap_local_page_prot(page, kmap_prot);
> > >  }
> > >  
> > > +static inline void *kmap_local_folio(struct folio *folio, size_t offset)
> > > +{
> > > +   struct page *page = &folio->page + offset / PAGE_SIZE;
> > > +   return __kmap_local_page_prot(page, kmap_prot) + offset % 
> > > PAGE_SIZE;
> > > +}
> > > 
> > > Partly I haven't shared that one because I'm not 100% sure that 'byte
> > > offset relative to start of folio' is the correct interface.  I'm looking
> > > at some users an

Re: [PATCH] sched/psi.c: Rudimentary typo fixes

2021-03-26 Thread Johannes Weiner
On Fri, Mar 26, 2021 at 06:12:33PM +0530, Bhaskar Chowdhury wrote:
> 
> s/possible/possible/
> s/ exceution/execution/
> s/manupulations/manipulations/
> 
> Signed-off-by: Bhaskar Chowdhury 
> ---
>  kernel/sched/psi.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 967732c0766c..316ebc57a115 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -59,7 +59,7 @@
>   * states, we would have to conclude a CPU SOME pressure number of
>   * 100%, since *somebody* is waiting on a runqueue at all
>   * times. However, that is clearly not the amount of contention the
> - * workload is experiencing: only one out of 256 possible exceution
> + * workload is experiencing: only one out of 256 possible execution

I thought this was the French spelling.

Joking aside, the corrections look right, but I also had no trouble
understanding what those sentences mean. Typos happen, plus we have a
lot of non-native speakers who won't use perfect spelling or grammar.

So for me, the bar is "this can be easily understood", not "needs to
be perfect English", and I'd say let's skip patches like these unless
they fix something truly unintelligible.



Re: [PATCH v5 00/27] Memory Folios

2021-03-26 Thread Johannes Weiner
On Wed, Mar 24, 2021 at 06:24:21AM +, Matthew Wilcox wrote:
> On Tue, Mar 23, 2021 at 08:29:16PM -0400, Johannes Weiner wrote:
> > On Mon, Mar 22, 2021 at 06:47:44PM +, Matthew Wilcox wrote:
> > > On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote:
> One of the patches I haven't posted yet starts to try to deal with 
> kmap()/mem*()/kunmap():
> 
> mm: Add kmap_local_folio
> 
> This allows us to map a portion of a folio.  Callers can only expect
> to access up to the next page boundary.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> 
> diff --git a/include/linux/highmem-internal.h 
> b/include/linux/highmem-internal.h
> index 7902c7d8b55f..55a29c9d562f 100644
> --- a/include/linux/highmem-internal.h
> +++ b/include/linux/highmem-internal.h
> @@ -73,6 +73,12 @@ static inline void *kmap_local_page(struct page *page)
> return __kmap_local_page_prot(page, kmap_prot);
>  }
>  
> +static inline void *kmap_local_folio(struct folio *folio, size_t offset)
> +{
> +   struct page *page = &folio->page + offset / PAGE_SIZE;
> +   return __kmap_local_page_prot(page, kmap_prot) + offset % PAGE_SIZE;
> +}
> 
> Partly I haven't shared that one because I'm not 100% sure that 'byte
> offset relative to start of folio' is the correct interface.  I'm looking
> at some users and thinking that maybe 'byte offset relative to start
> of file' might be better.  Or perhaps that's just filesystem-centric
> thinking.

Right, this doesn't seem specific to files just because they would be
the primary users of it.

> > But for that to work, we'll need the allocator to produce huge pages
> > at the necessary rate, too. The current implementation likely won't
> > scale. Compaction is expensive enough that we have to weigh when to
> > allocate huge pages for long-lived anon regions, let alone allocate
> > them for streaming IO cache entries.
> 
> Heh, I have that as a work item for later this year -- give the page
> allocator per-cpu lists of compound pages, not just order-0 pages.
> That'll save us turning compound pages back into buddy pages, only to
> turn them into compound pages again.
> 
> I also have a feeling that the page allocator either needs to become a
> sub-allocator of an allocator that deals in, say, 1GB chunks of memory,
> or it needs to become reluctant to break up larger orders.  eg if the
> dcache asks for just one more dentry, it should have to go through at
> least one round of reclaim before we choose to break up a high-order
> page to satisfy that request.

Slub already allocates higher-order pages for dentries:

slabinfo - version: 2.1
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
dentry            133350 133350    192   42    2 : tunables    0    0    0 : slabdata   3175   3175      0

                                               ^ here

and it could avoid even more internal fragmentation with bigger
orders. It only doesn't because of the overhead of allocating them.

If the default block size in the allocator were 2M, we'd also get slab
packing at that granularity, and we wouldn't have to worry about small
objects breaking huge pages any more than we worry about slab objects
fragmenting 4k pages today.
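
A rough illustration with the 192-byte dentries from the slabinfo
above, ignoring per-slab metadata:

```latex
\begin{aligned}
\text{8 KiB slab: }  & 42 \times 192 = 8064       &&\Rightarrow 128\ \text{B unused} \;(\approx 1.6\%)\\
\text{2 MiB slab: }  & 10922 \times 192 = 2097024 &&\Rightarrow 128\ \text{B unused} \;(\approx 0.006\%)
\end{aligned}
```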

> > But if the overwhelming number of requests going to the page allocator
> > are larger than 4k pages - anon regions? check. page cache? likely a
> > sizable share. slub? check. network? check - does it even make sense
> > to have that as the default block size for the page allocator anymore?
> > Or even allocate struct page at this granularity?
> 
> Yep, others have talked about that as well.  I think I may even have said
> a few times at LSFMM, "What if we just make PAGE_SIZE 2MB?".  After all,
> my first 386 Linux system was 4-8MB of RAM (it got upgraded).  The 16GB
> laptop that I now have is 2048 times more RAM, so 4x the number of pages
> that system had.
> 
> But people seem attached to being able to use smaller page sizes.
> There's that pesky "compatibility" argument.

Right, that's why I'm NOT saying we should eliminate the support for
4k chunks in the page cache and page tables. That's still useful if
you have lots of small files.

I'm just saying it doesn't have to be the default that everything is
primarily optimized for. We can make the default allocation size of
the allocator correspond to a hugepage and have a secondary allocator
level for 4k chunks. Like slab, but fixed-size and highmem-aware.

It makes sense to make struct page 2M as well. It would save a ton of
memory on average and reduce the pressure we have on struct page's
size today.
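
Roughly, assuming the common 64-byte struct page:

```latex
\frac{64\ \text{B}}{4\ \text{KiB}} \approx 1.6\%\ \text{of RAM in page structs}
\qquad\text{vs.}\qquad
\frac{64\ \text{B}}{2\ \text{MiB}} \approx 0.003\%
```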

And we really don't need struct page at 4k just to support this unit
of paging when necesary: page tables don't care, they use pfns and can
poin

Re: [mm, net-next v2] mm: net: memcg accounting for TCP rx zerocopy

2021-03-25 Thread Johannes Weiner
On Thu, Mar 25, 2021 at 10:02:28AM +0100, Michal Hocko wrote:
> On Wed 24-03-21 15:49:15, Arjun Roy wrote:
> > On Wed, Mar 24, 2021 at 2:24 PM Johannes Weiner  wrote:
> > >
> > > On Wed, Mar 24, 2021 at 10:12:46AM +0100, Michal Hocko wrote:
> > > > On Tue 23-03-21 11:47:54, Arjun Roy wrote:
> > > > > On Tue, Mar 23, 2021 at 7:34 AM Michal Hocko  wrote:
> > > > > >
> > > > > > On Wed 17-03-21 18:12:55, Johannes Weiner wrote:
> > > > > > [...]
> > > > > > > Here is an idea of how it could work:
> > > > > > >
> > > > > > > struct page already has
> > > > > > >
> > > > > > > struct {/* page_pool used by netstack */
> > > > > > > /**
> > > > > > >  * @dma_addr: might require a 64-bit 
> > > > > > > value even on
> > > > > > >  * 32-bit architectures.
> > > > > > >  */
> > > > > > > dma_addr_t dma_addr;
> > > > > > > };
> > > > > > >
> > > > > > > and as you can see from its union neighbors, there is quite a bit 
> > > > > > > more
> > > > > > > room to store private data necessary for the page pool.
> > > > > > >
> > > > > > > When a page's refcount hits zero and it's a networking page, we 
> > > > > > > can
> > > > > > > feed it back to the page pool instead of the page allocator.
> > > > > > >
> > > > > > > From a first look, we should be able to use the PG_owner_priv_1 
> > > > > > > page
> > > > > > > flag for network pages (see how this flag is overloaded, we can 
> > > > > > > add a
> > > > > > > PG_network alias). With this, we can identify the page in 
> > > > > > > __put_page()
> > > > > > > and __release_page(). These functions are already aware of 
> > > > > > > different
> > > > > > > types of pages and do their respective cleanup handling. We can
> > > > > > > similarly make network a first-class citizen and hand pages back 
> > > > > > > to
> > > > > > > the network allocator from in there.
> > > > > >
> > > > > > For compound pages we have a concept of destructors. Maybe we can 
> > > > > > extend
> > > > > > that for order-0 pages as well. The struct page is heavily packed 
> > > > > > and
> > > > > > compound_dtor shares the storage without other metadata
> > > > > > 	int           pages;                   /*    16     4 */
> > > > > > 	unsigned char compound_dtor;           /*    16     1 */
> > > > > > 	atomic_t      hpage_pinned_refcount;   /*    16     4 */
> > > > > > 	pgtable_t     pmd_huge_pte;            /*    16     8 */
> > > > > > 	void *        zone_device_data;        /*    16     8 */
> > > > > >
> > > > > > But none of those should really require to be valid when a page is 
> > > > > > freed
> > > > > > unless I am missing something. It would really require to check 
> > > > > > their
> > > > > > users whether they can leave the state behind. But if we can 
> > > > > > establish a
> > > > > > contract that compound_dtor can be always valid when a page is freed
> > > > > > this would be really a nice and useful abstraction because you 
> > > > > > wouldn't
> > > > > > have to care about the specific type of page.
> > >
> > > Yeah technically nobody should leave these fields behind, but it
> > > sounds pretty awkward to manage an overloaded destructor with a
> > > refcounted object:
> > >
> > > Either every put would have to check ref==1 before to see if it will
> > > be the one to free the page, and then set up the destructor before
> > > pu

Re: [mm, net-next v2] mm: net: memcg accounting for TCP rx zerocopy

2021-03-24 Thread Johannes Weiner
On Wed, Mar 24, 2021 at 10:12:46AM +0100, Michal Hocko wrote:
> On Tue 23-03-21 11:47:54, Arjun Roy wrote:
> > On Tue, Mar 23, 2021 at 7:34 AM Michal Hocko  wrote:
> > >
> > > On Wed 17-03-21 18:12:55, Johannes Weiner wrote:
> > > [...]
> > > > Here is an idea of how it could work:
> > > >
> > > > struct page already has
> > > >
> > > > struct {/* page_pool used by netstack */
> > > > /**
> > > >  * @dma_addr: might require a 64-bit value even 
> > > > on
> > > >  * 32-bit architectures.
> > > >  */
> > > > dma_addr_t dma_addr;
> > > > };
> > > >
> > > > and as you can see from its union neighbors, there is quite a bit more
> > > > room to store private data necessary for the page pool.
> > > >
> > > > When a page's refcount hits zero and it's a networking page, we can
> > > > feed it back to the page pool instead of the page allocator.
> > > >
> > > > From a first look, we should be able to use the PG_owner_priv_1 page
> > > > flag for network pages (see how this flag is overloaded, we can add a
> > > > PG_network alias). With this, we can identify the page in __put_page()
> > > > and __release_page(). These functions are already aware of different
> > > > types of pages and do their respective cleanup handling. We can
> > > > similarly make network a first-class citizen and hand pages back to
> > > > the network allocator from in there.
> > >
> > > For compound pages we have a concept of destructors. Maybe we can extend
> > > that for order-0 pages as well. The struct page is heavily packed and
> > > compound_dtor shares the storage with other metadata
> > > int            pages;                  /*    16     4 */
> > > unsigned char  compound_dtor;          /*    16     1 */
> > > atomic_t       hpage_pinned_refcount;  /*    16     4 */
> > > pgtable_t      pmd_huge_pte;           /*    16     8 */
> > > void *         zone_device_data;       /*    16     8 */
> > >
> > > But none of those should really require to be valid when a page is freed
> > > unless I am missing something. It would really require to check their
> > > users whether they can leave the state behind. But if we can establish a
> > > contract that compound_dtor can be always valid when a page is freed
> > > this would be really a nice and useful abstraction because you wouldn't
> > > have to care about the specific type of page.

Yeah technically nobody should leave these fields behind, but it
sounds pretty awkward to manage an overloaded destructor with a
refcounted object:

Either every put would have to check ref==1 first, to see if it will
be the one to free the page, and then set up the destructor before
putting the final ref. But that means we can't support lockless
tryget() schemes like we have in the page cache with a destructor.

Or you'd have to set up the destructor every time an overloaded field
reverts to its null state, e.g. hpage_pinned_refcount goes back to 0.

Neither of those sound practical to me.
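
To illustrate the first option, a rough sketch (page->dtor and the
helper are made up for illustration, they don't exist):

void put_page_with_dtor(struct page *page, void (*dtor)(struct page *))
{
	/* "am I the last user?" - check, then install the destructor */
	if (page_ref_count(page) == 1)
		page->dtor = dtor;	/* hypothetical field */

	/*
	 * A lockless speculative tryget - like the page cache's
	 * page_cache_get_speculative() - can slip in between the check
	 * and the put. Then the other side drops the final reference,
	 * possibly before ->dtor was set up, and frees the page without
	 * ever running it.
	 */
	put_page(page);
}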

> > > But maybe I am just overlooking the real complexity there.
> > > --
> > 
> > For now probably the easiest way is to have network pages be first
> > class with a specific flag as previously discussed and have concrete
> > handling for it, rather than trying to establish the contract across
> > page types.
> 
> If you are going to claim a page flag then it would be much better to
> have it more generic. Flags are really scarce and if all you care about
> is PageHasDestructor() and provide one via page->dtor then the similar
> mechanism can be reused by somebody else. Or does anything prevent that?

I was suggesting to alias PG_owner_priv_1, which currently isn't used
on network pages. We don't need to allocate a brandnew page flag.
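
Roughly like the other PG_owner_priv_1 users do it in page-flags.h
(sketch only, and the PF_NO_TAIL policy is a guess):

	/* in enum pageflags: */
	PG_network = PG_owner_priv_1,	/* page owned by the network stack */

	/* plus the usual accessors: */
	PAGEFLAG(Network, network, PF_NO_TAIL)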

I agree that a generic destructor for order-0 pages would be nice, but
due to the decentralized nature of refcounting the only way I think it
would work in practice is by adding a new field to struct page that is
not in conflict with any existing ones.

Comparably, creating a network type page consumes no additional space.


Re: [PATCH v5 00/27] Memory Folios

2021-03-23 Thread Johannes Weiner
On Mon, Mar 22, 2021 at 06:47:44PM +, Matthew Wilcox wrote:
> On Mon, Mar 22, 2021 at 01:59:24PM -0400, Johannes Weiner wrote:
> > On Sat, Mar 20, 2021 at 05:40:37AM +, Matthew Wilcox (Oracle) wrote:
> > > This series introduces the 'struct folio' as a replacement for
> > > head-or-base pages.  This initial set reduces the kernel size by
> > > approximately 6kB, although its real purpose is adding infrastructure
> > > to enable further use of the folio.
> > > 
> > > The intent is to convert all filesystems and some device drivers to work
> > > in terms of folios.  This series contains a lot of explicit conversions,
> > > but it's important to realise it's removing a lot of implicit conversions
> > > in some relatively hot paths.  There will be very few conversions from
> > > folios when this work is completed; filesystems, the page cache, the
> > > LRU and so on will generally only deal with folios.
> > 
> > If that is the case, shouldn't there in the long term only be very
> > few, easy to review instances of things like compound_head(),
> > PAGE_SIZE etc. deep in the heart of MM? And everybody else should 1)
> > never see tail pages and 2) never assume a compile-time page size?
> 
> I don't know exactly where we get to eventually.  There are definitely
> some aspects of the filesystem<->mm interface which are page-based
> (eg ->fault needs to look up the exact page, regardless of its
> head/tail/base nature), while ->readpage needs to talk in terms of
> folios.

I can imagine we'd eventually want fault handlers that can also fill
in larger chunks of data if the file is of the right size and the MM
is able to (and policy/heuristics determine to) go with a huge page.

> > What are the higher-level places that in the long-term should be
> > dealing with tail pages at all? Are there legit ones besides the page
> > allocator, THP splitting internals & pte-mapped compound pages?
> 
> I can't tell.  I think this patch maybe illustrates some of the
> problems, but maybe it's just an intermediate problem:
> 
> https://git.infradead.org/users/willy/pagecache.git/commitdiff/047e9185dc146b18f56c6df0b49fe798f1805c7b
> 
> It deals mostly in terms of folios, but when it needs to kmap() and
> memcmp(), then it needs to work in terms of pages.  I don't think it's
> avoidable (maybe we bury the "dealing with pages" inside a kmap()
> wrapper somewhere, but I'm not sure that's better).

Yeah it'd be nice to get low-level, PAGE_SIZE pages out of there. We
may be able to just kmap whole folios too, which are more likely to be
small pages on highmem systems anyway.

> > Some compound_head() that are currently in the codebase are already
> > unnecessary. Like the one in activate_page().
> 
> Right!  And it's hard to find & remove them without very careful analysis,
> or particularly deep knowledge.  With folios, we can remove them without
> terribly deep thought.

True. It definitely also helps mark the places that have been
converted from the top down and which ones haven't. Without that you
need to think harder about the context ("How would a tail page even
get here?" vs. "No page can get here, only folios" ;-))

Again, I think that's something that would automatically be better in
the long term when compound_page() and PAGE_SIZE themselves would
stand out like sore thumbs. But you raise a good point: there is such
an overwhelming amount of them right now that it's difficult to do
this without a clearer marker and help from the type system.

> > And looking at grep, I wouldn't be surprised if only the page table
> > walkers need the page_compound() that mark_page_accessed() does. We
> > would be better off if they did the translation once and explicitly in
> > the outer scope, where it's clear they're dealing with a pte-mapped
> > compound page, instead of having a series of rather low level helpers
> > (page flags testing, refcount operations, LRU operations, stat
> > accounting) all trying to be clever but really just obscuring things
> > and imposing unnecessary costs on the vast majority of cases.
> > 
> > So I fully agree with the motivation behind this patch. But I do
> > wonder why it's special-casing the common case instead of the rare
> > case. It comes at a huge cost. Short term, the churn of replacing
> > 'page' with 'folio' in pretty much all instances is enormous.
> 
> Because people (think they) know what a page is.  It's PAGE_SIZE bytes
> long, it occupies one PTE, etc, etc.  A folio is new and instead of
> changing how something familiar (a page) behaves, we're asking them
> to think about something new instead that behaves a lot li

Re: [PATCH] mm: page_alloc: fix memcg accounting leak in speculative cache lookup

2021-03-23 Thread Johannes Weiner
On Fri, Mar 19, 2021 at 06:52:58PM -0700, Hugh Dickins wrote:
> On Fri, 19 Mar 2021, Johannes Weiner wrote:
> 
> > When the freeing of a higher-order page block (non-compound) races
> > with a speculative page cache lookup, __free_pages() needs to leave
> > the first order-0 page in the chunk to the lookup but free the buddy
> > pages that the lookup doesn't know about separately.
> > 
> > However, if such a higher-order page is charged to a memcg (e.g. !vmap
> > kernel stack), only the first page of the block has page->memcg
> > set. That means we'll uncharge only one order-0 page from the entire
> > block, and leak the remainder.
> > 
> > Add a split_page_memcg() to __free_pages() right before it starts
> > taking the higher-order page apart and freeing its individual
> > constituent pages. This ensures all of them will have the memcg
> > linkage set up for correct uncharging. Also update the comments a bit
> > to clarify what exactly is happening to the page during that race.
> > 
> > This bug is old and has its roots in the speculative page cache patch
> > and adding cgroup accounting of kernel pages. There are no known user
> > reports. A backport to stable is therefore not warranted.
> > 
> > Reported-by: Matthew Wilcox 
> > Signed-off-by: Johannes Weiner 
> 
> Acked-by: Hugh Dickins 
> 
> to the split_page_memcg() addition etc, but a doubt just hit me on the
> original e320d3012d25 ("mm/page_alloc.c: fix freeing non-compound pages"):
> see comment below.
> 
> > ---
> >  mm/page_alloc.c | 33 +++--
> >  1 file changed, 27 insertions(+), 6 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c53fe4fa10bf..f4bd56656402 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5112,10 +5112,9 @@ static inline void free_the_page(struct page *page, 
> > unsigned int order)
> >   * the allocation, so it is easy to leak memory.  Freeing more memory
> >   * than was allocated will probably emit a warning.
> >   *
> > - * If the last reference to this page is speculative, it will be released
> > - * by put_page() which only frees the first page of a non-compound
> > - * allocation.  To prevent the remaining pages from being leaked, we free
> > - * the subsequent pages here.  If you want to use the page's reference
> > + * This function isn't a put_page(). Don't let the put_page_testzero()
> > + * fool you, it's only to deal with speculative cache references. It
> > + * WILL free pages directly. If you want to use the page's reference
> >   * count to decide when to free the allocation, you should allocate a
> >   * compound page, and use put_page() instead of __free_pages().
> >   *
> > @@ -5124,11 +5123,33 @@ static inline void free_the_page(struct page *page, 
> > unsigned int order)
> >   */
> >  void __free_pages(struct page *page, unsigned int order)
> >  {
> > -   if (put_page_testzero(page))
> > +   /*
> > +* Drop the base reference from __alloc_pages and free. In
> > +* case there is an outstanding speculative reference, from
> > +* e.g. the page cache, it will put and free the page later.
> > +*/
> > +   if (likely(put_page_testzero(page))) {
> > free_the_page(page, order);
> > -   else if (!PageHead(page))
> > +   return;
> > +   }
> > +
> > +   /*
> > +* The speculative reference will put and free the page.
> > +*
> > +* However, if the speculation was into a higher-order page
> > +* chunk that isn't marked compound, the other side will know
> > +* nothing about our buddy pages and only free the order-0
> > +* page at the start of our chunk! We must split off and free
> > +* the buddy pages here.
> > +*
> > +* The buddy pages aren't individually refcounted, so they
> > +* can't have any pending speculative references themselves.
> > +*/
> > +   if (!PageHead(page) && order > 0) {
> 
> The put_page_testzero() has released our reference to the first
> subpage of page: it's now under the control of the racing speculative
> lookup.  So it seems to me unsafe to be checking PageHead(page) here:
> if it was actually a compound page, PageHead might already be cleared
> by now, and we doubly free its tail pages below?  I think we need to
> use a "bool compound = PageHead(page)" on entry to __free_pages().

That's a good point.
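
Something like this on top of the patch, then (sketch):

void __free_pages(struct page *page, unsigned int order)
{
	/* latch this before the put: the racing final put of a compound
	 * page may free it and clear PageHead under us */
	bool compound = PageHead(page);

	if (likely(put_page_testzero(page))) {
		free_the_page(page, order);
		return;
	}

	if (!compound && order > 0) {
		split_page_memcg(page, 1 << order);
		while (order-- > 0)
			free_the_page(page + (1 << order), order);
	}
}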

> And would it be wrong to fix that too in this patch?

All aboard the mm-page_alloc-fix-stuff.patch!

No, I think

Re: [mm, net-next v2] mm: net: memcg accounting for TCP rx zerocopy

2021-03-23 Thread Johannes Weiner
On Mon, Mar 22, 2021 at 02:35:11PM -0700, Arjun Roy wrote:
> To make sure we're on the same page, then, here's a tentative
> mechanism - I'd rather get buy in before spending too much time on
> something that wouldn't pass muster afterwards.
> 
> A) An opt-in mechanism, that a driver needs to explicitly support, in
> order to get properly accounted receive zerocopy.

Yep, opt-in makes sense. That allows piece-by-piece conversion and
avoids us having to have a flag day.

> B) Failure to opt-in (e.g. unchanged old driver) can either lead to
> unaccounted zerocopy (ie. existing behaviour) or, optionally,
> effectively disabled zerocopy (ie. any call to zerocopy will return
> something like EINVAL) (perhaps controlled via some sysctl, which
> either lets zerocopy through or not with/without accounting).

I'd suggest letting it fail gracefully (i.e. no -EINVAL) to not
disturb existing/working setups during the transition period. But the
exact policy is easy to change later on if we change our minds on it.

> The proposed mechanism would involve:
> 1) Some way of marking a page as being allocated by a driver that has
> decided to opt into this mechanism. Say, a page flag, or a memcg flag.

Right. I would stress it should not be a memcg flag or any direct
channel from the network to memcg, as this would limit its usefulness
while having the same maintenance overhead.

It should make the network page a first class MM citizen - like an LRU
page or a slab page - which can be accounted and introspected as such,
including from the memcg side.

So definitely a page flag.

> 2) A callback provided by the driver, that takes a struct page*, and
> returns a boolean. The value of the boolean being true indicates that
> any and all refs on the page are held by the driver. False means there
> exists at least one reference that is not held by the driver.

I was thinking the PageNetwork flag would cover this, but maybe I'm
missing something?

> 3) A branch in put_page() that, for pages marked thus, will consult
> the driver callback and if it returns true, will uncharge the memcg
> for the page.

The way I picture it, put_page() (and release_pages) should do this:

void __put_page(struct page *page)
{
if (is_zone_device_page(page)) {
put_dev_pagemap(page->pgmap);

/*
 * The page belongs to the device that created pgmap. Do
 * not return it to page allocator.
 */
return;
}
+
+   if (PageNetwork(page)) {
+   put_page_network(page);
+   /* Page belongs to the network stack, not the page allocator */
+   return;
+   }

if (unlikely(PageCompound(page)))
__put_compound_page(page);
else
__put_single_page(page);
}

where put_page_network() is the network-side callback that uncharges
the page.

(..and later can be extended to do all kinds of things when informed
that the page has been freed: update statistics (mod_page_state), put
it on a private network freelist, or ClearPageNetwork() and give it
back to the page allocator etc.

But for starters it can set_page_count(page, 1) after the uncharge to
retain the current silent recycling behavior.)
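
In other words, a minimal sketch of put_page_network() under those
assumptions - the uncharge call here is a placeholder for whichever
charging scheme the rx path ends up using:

void put_page_network(struct page *page)
{
	/* drop the memcg charge taken when the rx path got the page */
	mem_cgroup_uncharge(page);

	/* starter behavior: silent recycling - hand the page back to
	 * the driver pool holding its single base reference */
	set_page_count(page, 1);
}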

> The anonymous struct you defined above is part of a union that I think
> normally is one qword in length (well, could be more depending on the
> typedefs I saw there) and I think that can be co-opted to provide the
> driver callback - though, it might require growing the struct by one
> more qword since there may be drivers like mlx5 that are already using
> the field already in there  for dma_addr.

The page cache / anonymous struct it's shared with is 5 words (double
linked list pointers, mapping, index, private), and the network struct
is currently one word, so you can add 4 words to a PageNetwork() page
without increasing the size of struct page. That should be plenty of
space to store auxiliary data for drivers, right?

> Anyways, the callback could then be used by the driver to handle the
> other accounting quirks you mentioned, without needing to scan the
> full pool.

Right.

> Of course there are corner cases and such to properly account for, but
> I just wanted to provide a really rough sketch to see if this
> (assuming it were properly implemented) was what you had in mind. If
> so I can put together a v3 patch.

Yeah, makes perfect sense. We can keep iterating like this any time
you feel you accumulate too many open questions. Not just for MM but
also for the networking folks - although I suspect that the first step
would be mostly about the MM infrastructure, and I'm not sure how much
they care about the internals there ;)

> Per my response to Andrew earlier, this would make it even more
> confusing whether this is to be applied against net-next or mm trees.
> But that's a bridge to cross when we get to it.

The mm tree includes -next, so it should be a safe development target

Re: [PATCH v5 00/27] Memory Folios

2021-03-22 Thread Johannes Weiner
On Sat, Mar 20, 2021 at 05:40:37AM +, Matthew Wilcox (Oracle) wrote:
> Managing memory in 4KiB pages is a serious overhead.  Many benchmarks
> exist which show the benefits of a larger "page size".  As an example,
> an earlier iteration of this idea which used compound pages got a 7%
> performance boost when compiling the kernel using kernbench without any
> particular tuning.
> 
> Using compound pages or THPs exposes a serious weakness in our type
> system.  Functions are often unprepared for compound pages to be passed
> to them, and may only act on PAGE_SIZE chunks.  Even functions which are
> aware of compound pages may expect a head page, and do the wrong thing
> if passed a tail page.
> 
> There have been efforts to label function parameters as 'head' instead
> of 'page' to indicate that the function expects a head page, but this
> leaves us with runtime assertions instead of using the compiler to prove
> that nobody has mistakenly passed a tail page.  Calling a struct page
> 'head' is also inaccurate as they will work perfectly well on base pages.
> The term 'nottail' has not proven popular.
> 
> We also waste a lot of instructions ensuring that we're not looking at
> a tail page.  Almost every call to PageFoo() contains one or more hidden
> calls to compound_head().  This also happens for get_page(), put_page()
> and many more functions.  There does not appear to be a way to tell gcc
> that it can cache the result of compound_head(), nor is there a way to
> tell it that compound_head() is idempotent.
> 
> This series introduces the 'struct folio' as a replacement for
> head-or-base pages.  This initial set reduces the kernel size by
> approximately 6kB, although its real purpose is adding infrastructure
> to enable further use of the folio.
> 
> The intent is to convert all filesystems and some device drivers to work
> in terms of folios.  This series contains a lot of explicit conversions,
> but it's important to realise it's removing a lot of implicit conversions
> in some relatively hot paths.  There will be very few conversions from
> folios when this work is completed; filesystems, the page cache, the
> LRU and so on will generally only deal with folios.

If that is the case, shouldn't there in the long term only be very
few, easy to review instances of things like compound_head(),
PAGE_SIZE etc. deep in the heart of MM? And everybody else should 1)
never see tail pages and 2) never assume a compile-time page size?

What are the higher-level places that in the long-term should be
dealing with tail pages at all? Are there legit ones besides the page
allocator, THP splitting internals & pte-mapped compound pages?

I do agree that the current confusion around which layer sees which
types of pages is a problem. But I also think a lot of it is the
result of us being in a transitional period where we've added THP in
more places but not all code and data structures are or were fully
native yet, and so we had things leak out or into where maybe they
shouldn't be to make things work in the short term.

But this part is already getting better, and has gotten better, with
the page cache (largely?) going native for example.

Some compound_head() that are currently in the codebase are already
unnecessary. Like the one in activate_page().

And looking at grep, I wouldn't be surprised if only the page table
walkers need the page_compound() that mark_page_accessed() does. We
would be better off if they did the translation once and explicitly in
the outer scope, where it's clear they're dealing with a pte-mapped
compound page, instead of having a series of rather low level helpers
(page flags testing, refcount operations, LRU operations, stat
accounting) all trying to be clever but really just obscuring things
and imposing unnecessary costs on the vast majority of cases.

So I fully agree with the motivation behind this patch. But I do
wonder why it's special-casing the common case instead of the rare
case. It comes at a huge cost. Short term, the churn of replacing
'page' with 'folio' in pretty much all instances is enormous.

And longer term, I'm not convinced folio is the abstraction we want
throughout the kernel. If nobody should be dealing with tail pages in
the first place, why are we making everybody think in 'folios'? Why
does a filesystem care that huge pages are composed of multiple base
pages internally? This feels like an implementation detail leaking out
of the MM code. The vast majority of places should be thinking 'page'
with a size of 'page_size()'. Including most parts of the MM itself.

The compile-time check is nice, but I'm not sure it would be that much
more effective at catching things than a few centrally placed warns
inside PageFoo(), get_page() etc. and other things that should not
encounter tail pages in the first place (with __helpers for the few
instances that do). And given the invasiveness of this change, they
ought to be very drastically better at it, and obviously so, IMO.
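
For illustration, the kind of centrally placed warn I mean (sketch
only, not how these helpers are actually generated today):

static __always_inline int PageDirty(struct page *page)
{
	/* callers should never be looking at a tail page here */
	VM_WARN_ON_ONCE(PageTail(page));
	return test_bit(PG_dirty, &page->flags);
}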

Re: [PATCH 2/2] mm: memcontrol: deprecate swapaccounting=0 mode

2021-03-22 Thread Johannes Weiner
On Fri, Mar 19, 2021 at 05:37:05PM -0700, Hugh Dickins wrote:
> On Fri, 19 Mar 2021, Johannes Weiner wrote:
> 
> > The swapaccounting= commandline option already does very little
> > today. To close a trivial containment failure case, the swap ownership
> > tracking part of the swap controller has recently become mandatory
> > (see commit 2d1c498072de ("mm: memcontrol: make swap tracking an
> > integral part of memory control") for details), which makes up the
> > majority of the work during swapout, swapin, and the swap slot map.
> > 
> > The only thing left under this flag is the page_counter operations and
> > the visibility of the swap control files in the first place, which are
> > rather meager savings. There also aren't many scenarios, if any, where
> > controlling the memory of a cgroup while allowing it unlimited access
> > to a global swap space is a workable resource isolation stragegy.
> > 
> > On the other hand, there have been several bugs and confusion around
> > the many possible swap controller states (cgroup1 vs cgroup2 behavior,
> > memory accounting without swap accounting, memcg runtime disabled).
> > 
> > This puts the maintenance overhead of retaining the toggle above its
> > practical benefits. Deprecate it.
> > 
> > Suggested-by: Shakeel Butt 
> > Signed-off-by: Johannes Weiner 
> 
> This crashes, and needs a fix: see below (plus some nits).
> 
> But it's a very welcome cleanup: just getting rid of all those
> !cgroup_memory_noswap double negatives is a relief in itself.
> 
> It does suggest eliminating CONFIG_MEMCG_SWAP altogether (just
> using #ifdef CONFIG_SWAP instead, in those parts of CONFIG_MEMCG code);
> but you're right that's a separate cleanup, and not nearly so worthwhile
> as this one (I notice CONFIG_MEMCG_SWAP in some of the arch defconfigs,
> and don't know whether whoever removes CONFIG_MEMCG_SWAP would be
> obligated to remove those too).

2d1c498072de ("mm: memcontrol: make swap tracking an integral part of
memory control") made it invisible to the user already, I only kept
the symbol for convenience in the Makefile:

obj-$(CONFIG_MEMCG_SWAP) += swap_cgroup.o

But I guess we could replace it with

ifdef CONFIG_SWAP
obj-$(CONFIG_MEMCG) += swap_cgroup.o
endif

and I agree, everywhere else it's currently used would be nicer
without it. I'll send a separate patch.

> > @@ -99,7 +92,11 @@ static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
> >  /* Whether legacy memory+swap accounting is active */
> >  static bool do_memsw_account(void)
> >  {
> > -   return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && 
> > !cgroup_memory_noswap;
> > +   /* cgroup2 doesn't do mem+swap accounting */
> > +   if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > +   return false;
> > +
> > +   return true;
> 
> Nit: I'm not fond of the "if (boolean()) return true; else return false;"
> codestyle, and would prefer the straightforward
> 
>   return !cgroup_subsys_on_dfl(memory_cgrp_subsys);
> 
> but you've chosen otherwise, so, okay.

I prefer a series of branches if a single expression becomes unwieldy,
and individual conditions deserve their own comments.

But it's indeed pretty silly in this case, I'll make it a single line.
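
i.e. (sketch of what I'll fold in):

/* Whether legacy memory+swap accounting is active */
static bool do_memsw_account(void)
{
	/* cgroup2 doesn't do mem+swap accounting */
	return !cgroup_subsys_on_dfl(memory_cgrp_subsys);
}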

> > @@ -7124,7 +7121,7 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup 
> > *memcg)
> >  {
> > long nr_swap_pages = get_nr_swap_pages();
> >  
> > -   if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > +   if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> 
> That needs to check mem_cgroup_disabled() where it used to check
> cgroup_memory_noswap.  The convolutions are confusing (which is
> precisely why this is such a welcome cleanup), but without a
> mem_cgroup_disabled() (or NULL memcg) check there, the
> cgroup_disable=memory case oopses on NULLish pointer dereference
> when mem_cgroup_get_nr_swap_pages() is called from get_scan_count().
> 
> (This little function has been repeatedly troublesome that way.)

Sigh, yes, this will hopefully be the last instance of this bug...

Thanks for catching that, I'll fix it up.

> > @@ -7141,7 +7138,7 @@ bool mem_cgroup_swap_full(struct page *page)
> >  
> > if (vm_swap_full())
> > return true;
> > -   if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > +   if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> 
> Would it now be better to say "if (do_memsw_account())" there?

Yes, I'll change that.
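
i.e. something like (sketch):

	if (vm_swap_full())
		return true;
	if (do_memsw_account())
		return false;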

> Or would it be better to eliminate do_memsw_account() altogether,
> and just use !cgroup_subsys_on_dfl(memory_cgrp_subsy

Re: [PATCH v5 1/7] mm: memcontrol: slab: fix obtain a reference to a freeing memcg

2021-03-22 Thread Johannes Weiner
On Sat, Mar 20, 2021 at 12:38:14AM +0800, Muchun Song wrote:
> The rcu_read_lock/unlock only can guarantee that the memcg will not be
> freed, but it cannot guarantee the success of css_get (which is in the
> refill_stock when cached memcg changed) to memcg.
> 
>   rcu_read_lock()
>   memcg = obj_cgroup_memcg(old)
>   __memcg_kmem_uncharge(memcg)
>   refill_stock(memcg)
>   if (stock->cached != memcg)
>   // css_get can change the ref counter from 0 back to 1.
>   css_get(&memcg->css)
>   rcu_read_unlock()
> 
> This fix is very like the commit:
> 
>   eefbfa7fd678 ("mm: memcg/slab: fix use after free in obj_cgroup_charge")
> 
> Fix this by holding a reference to the memcg which is passed to the
> __memcg_kmem_uncharge() before calling __memcg_kmem_uncharge().
> 
> Fixes: 3de7d4f25a74 ("mm: memcg/slab: optimize objcg stock draining")
> Signed-off-by: Muchun Song 

Acked-by: Johannes Weiner 

Good catch! Did you trigger the WARN_ON() in
percpu_ref_kill_and_confirm() during testing?


Re: [PATCH v5 6/7] mm: memcontrol: inline __memcg_kmem_{un}charge() into obj_cgroup_{un}charge_pages()

2021-03-22 Thread Johannes Weiner
On Sat, Mar 20, 2021 at 12:38:19AM +0800, Muchun Song wrote:
> There is only one user of __memcg_kmem_charge(), so manually inline
> __memcg_kmem_charge() to obj_cgroup_charge_pages(). Similarly manually
> inline __memcg_kmem_uncharge() into obj_cgroup_uncharge_pages() and
> call obj_cgroup_uncharge_pages() in obj_cgroup_release().
> 
> This is just code cleanup without any functionality changes.
> 
> Signed-off-by: Muchun Song 

Acked-by: Johannes Weiner 


Re: [PATCH] psi: reduce calls to sched_clock() in psi

2021-03-22 Thread Johannes Weiner
On Sun, Mar 21, 2021 at 01:51:56PM -0700, Shakeel Butt wrote:
> We noticed that the cost of psi increases with the increase in the
> levels of the cgroups. Particularly the cost of cpu_clock() sticks out
> as the kernel calls it multiple times as it traverses up the cgroup
> tree. This patch reduces the calls to cpu_clock().
> 
> Performed perf bench on Intel Broadwell with 3 levels of cgroup.
> 
> Before the patch:
> 
> $ perf bench sched all
>  # Running sched/messaging benchmark...
>  # 20 sender and receiver processes per group
>  # 10 groups == 400 processes run
> 
>  Total time: 0.747 [sec]
> 
>  # Running sched/pipe benchmark...
>  # Executed 100 pipe operations between two processes
> 
>  Total time: 3.516 [sec]
> 
>3.516689 usecs/op
>  284358 ops/sec
> 
> After the patch:
> 
> $ perf bench sched all
>  # Running sched/messaging benchmark...
>  # 20 sender and receiver processes per group
>  # 10 groups == 400 processes run
> 
>  Total time: 0.640 [sec]
> 
>  # Running sched/pipe benchmark...
>  # Executed 100 pipe operations between two processes
> 
>  Total time: 3.329 [sec]
> 
>3.329820 usecs/op
>  300316 ops/sec
> 
> Signed-off-by: Shakeel Butt 

Acked-by: Johannes Weiner 


Re: [PATCH 2/2] mm: memcontrol: deprecate swapaccounting=0 mode

2021-03-19 Thread Johannes Weiner
On Fri, Mar 19, 2021 at 06:49:55AM -0700, Shakeel Butt wrote:
> On Thu, Mar 18, 2021 at 10:49 PM Johannes Weiner  wrote:
> >
> > The swapaccounting= commandline option already does very little
> > today. To close a trivial containment failure case, the swap ownership
> > tracking part of the swap controller has recently become mandatory
> > (see commit 2d1c498072de ("mm: memcontrol: make swap tracking an
> > integral part of memory control") for details), which makes up the
> > majority of the work during swapout, swapin, and the swap slot map.
> >
> > The only thing left under this flag is the page_counter operations and
> > the visibility of the swap control files in the first place, which are
> > rather meager savings. There also aren't many scenarios, if any, where
> > controlling the memory of a cgroup while allowing it unlimited access
> > to a global swap space is a workable resource isolation stragegy.
> 
> *strategy

Will fix :)

> > On the other hand, there have been several bugs and confusion around
> > the many possible swap controller states (cgroup1 vs cgroup2 behavior,
> > memory accounting without swap accounting, memcg runtime disabled).
> >
> > This puts the maintenance overhead of retaining the toggle above its
> > practical benefits. Deprecate it.
> >
> > Suggested-by: Shakeel Butt 
> > Signed-off-by: Johannes Weiner 
> [...]
> >
> >  static int __init setup_swap_account(char *s)
> >  {
> > -   if (!strcmp(s, "1"))
> > -   cgroup_memory_noswap = false;
> > -   else if (!strcmp(s, "0"))
> > -   cgroup_memory_noswap = true;
> > -   return 1;
> > +   pr_warn_once("The swapaccount= commandline option is deprecated. "
> > +"Please report your usecase to linux...@kvack.org if 
> > you "
> > +"depend on this functionality.\n");
> > +   return 0;
> 
> What's the difference between returning 0 or 1 here?

It signals whether the parameter is "recognized" by the kernel as a
valid thing to pass, or whether it's unknown. If it's unknown, it'll
be passed on to the init system (which ignores it), so this resembles
the behavior we'll have when we remove the __setup in the future.

If somebody can make an argument why we should go with one over the
other, I'll happily go with that.

> >  __setup("swapaccount=", setup_swap_account);
> >
> > @@ -7291,27 +7287,13 @@ static struct cftype memsw_files[] = {
> > { },/* terminate */
> >  };
> >
> > -/*
> > - * If mem_cgroup_swap_init() is implemented as a subsys_initcall()
> > - * instead of a core_initcall(), this could mean cgroup_memory_noswap still
> > - * remains set to false even when memcg is disabled via 
> > "cgroup_disable=memory"
> > - * boot parameter. This may result in premature OOPS inside
> > - * mem_cgroup_get_nr_swap_pages() function in corner cases.
> > - */
> >  static int __init mem_cgroup_swap_init(void)
> >  {
> > -   /* No memory control -> no swap control */
> > -   if (mem_cgroup_disabled())
> > -   cgroup_memory_noswap = true;
> > -
> > -   if (cgroup_memory_noswap)
> > -   return 0;
> > -
> 
> Do we need a mem_cgroup_disabled() check here?

cgroup_add_cftypes() implies this check from the cgroup side and will
just do nothing and return success. So we don't need it now.

But it is something we'd have to remember to add if we do add more
code to this function later on.

Either way works for me. I can add it if people think it's better.
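
For reference, with the check added it would read (sketch):

static int __init mem_cgroup_swap_init(void)
{
	/* not strictly needed today - cgroup_add_cftypes() bails out on
	 * its own when the controller is disabled - but it would protect
	 * anything added to this function later */
	if (mem_cgroup_disabled())
		return 0;

	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
	WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys, memsw_files));

	return 0;
}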

> 
> > WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
> > WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys, memsw_files));
> >
> > return 0;
> >  }
> > -core_initcall(mem_cgroup_swap_init);
> > +subsys_initcall(mem_cgroup_swap_init);
> >
> >  #endif /* CONFIG_MEMCG_SWAP */
> > --
> > 2.30.1
> >


[PATCH] mm: page_alloc: fix memcg accounting leak in speculative cache lookup

2021-03-19 Thread Johannes Weiner
When the freeing of a higher-order page block (non-compound) races
with a speculative page cache lookup, __free_pages() needs to leave
the first order-0 page in the chunk to the lookup but free the buddy
pages that the lookup doesn't know about separately.

However, if such a higher-order page is charged to a memcg (e.g. !vmap
kernel stack), only the first page of the block has page->memcg
set. That means we'll uncharge only one order-0 page from the entire
block, and leak the remainder.

Add a split_page_memcg() to __free_pages() right before it starts
taking the higher-order page apart and freeing its individual
constituent pages. This ensures all of them will have the memcg
linkage set up for correct uncharging. Also update the comments a bit
to clarify what exactly is happening to the page during that race.

This bug is old and has its roots in the speculative page cache patch
and adding cgroup accounting of kernel pages. There are no known user
reports. A backport to stable is therefore not warranted.

Reported-by: Matthew Wilcox 
Signed-off-by: Johannes Weiner 
---
 mm/page_alloc.c | 33 +++--
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c53fe4fa10bf..f4bd56656402 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5112,10 +5112,9 @@ static inline void free_the_page(struct page *page, 
unsigned int order)
  * the allocation, so it is easy to leak memory.  Freeing more memory
  * than was allocated will probably emit a warning.
  *
- * If the last reference to this page is speculative, it will be released
- * by put_page() which only frees the first page of a non-compound
- * allocation.  To prevent the remaining pages from being leaked, we free
- * the subsequent pages here.  If you want to use the page's reference
+ * This function isn't a put_page(). Don't let the put_page_testzero()
+ * fool you, it's only to deal with speculative cache references. It
+ * WILL free pages directly. If you want to use the page's reference
  * count to decide when to free the allocation, you should allocate a
  * compound page, and use put_page() instead of __free_pages().
  *
@@ -5124,11 +5123,33 @@ static inline void free_the_page(struct page *page, 
unsigned int order)
  */
 void __free_pages(struct page *page, unsigned int order)
 {
-   if (put_page_testzero(page))
+   /*
+* Drop the base reference from __alloc_pages and free. In
+* case there is an outstanding speculative reference, from
+* e.g. the page cache, it will put and free the page later.
+*/
+   if (likely(put_page_testzero(page))) {
free_the_page(page, order);
-   else if (!PageHead(page))
+   return;
+   }
+
+   /*
+* The speculative reference will put and free the page.
+*
+* However, if the speculation was into a higher-order page
+* chunk that isn't marked compound, the other side will know
+* nothing about our buddy pages and only free the order-0
+* page at the start of our chunk! We must split off and free
+* the buddy pages here.
+*
+* The buddy pages aren't individually refcounted, so they
+* can't have any pending speculative references themselves.
+*/
+   if (!PageHead(page) && order > 0) {
+   split_page_memcg(page, 1 << order);
while (order-- > 0)
free_the_page(page + (1 << order), order);
+   }
 }
 EXPORT_SYMBOL(__free_pages);
 
-- 
2.30.1



[PATCH 1/2] mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled

2021-03-18 Thread Johannes Weiner
Since commit 2d1c498072de ("mm: memcontrol: make swap tracking an
integral part of memory control"), the cgroup swap arrays are used to
track memory ownership at the time of swap readahead and swapoff, even
if swap space *accounting* has been turned off by the user via
swapaccount=0 (which sets cgroup_memory_noswap).

However, the patch was overzealous: by simply dropping the
cgroup_memory_noswap conditionals in the swapon, swapoff and uncharge
path, it caused the cgroup arrays being allocated even when the memory
controller as a whole is disabled. This is a waste of that memory.

Restore mem_cgroup_disabled() checks, implied previously by
cgroup_memory_noswap, in the swapon, swapoff, and swap_entry_free
callbacks.

Fixes: 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of 
memory control")
Reported-by: Hugh Dickins 
Signed-off-by: Johannes Weiner 
---
 mm/memcontrol.c  | 3 +++
 mm/swap_cgroup.c | 6 ++
 2 files changed, 9 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 668d1d7c2645..49bdcf603af1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7101,6 +7101,9 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned 
int nr_pages)
struct mem_cgroup *memcg;
unsigned short id;
 
+   if (mem_cgroup_disabled())
+   return;
+
id = swap_cgroup_record(entry, 0, nr_pages);
rcu_read_lock();
memcg = mem_cgroup_from_id(id);
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index 7f34343c075a..08c3246f9269 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -171,6 +171,9 @@ int swap_cgroup_swapon(int type, unsigned long max_pages)
unsigned long length;
struct swap_cgroup_ctrl *ctrl;
 
+   if (mem_cgroup_disabled())
+   return 0;
+
length = DIV_ROUND_UP(max_pages, SC_PER_PAGE);
array_size = length * sizeof(void *);
 
@@ -206,6 +209,9 @@ void swap_cgroup_swapoff(int type)
unsigned long i, length;
struct swap_cgroup_ctrl *ctrl;
 
+   if (mem_cgroup_disabled())
+   return;
+
mutex_lock(&swap_cgroup_mutex);
ctrl = &swap_cgroup_ctrl[type];
map = ctrl->map;
-- 
2.30.1



[PATCH 2/2] mm: memcontrol: deprecate swapaccounting=0 mode

2021-03-18 Thread Johannes Weiner
The swapaccounting= commandline option already does very little
today. To close a trivial containment failure case, the swap ownership
tracking part of the swap controller has recently become mandatory
(see commit 2d1c498072de ("mm: memcontrol: make swap tracking an
integral part of memory control") for details), which makes up the
majority of the work during swapout, swapin, and the swap slot map.

The only thing left under this flag is the page_counter operations and
the visibility of the swap control files in the first place, which are
rather meager savings. There also aren't many scenarios, if any, where
controlling the memory of a cgroup while allowing it unlimited access
to a global swap space is a workable resource isolation stragegy.

On the other hand, there have been several bugs and confusion around
the many possible swap controller states (cgroup1 vs cgroup2 behavior,
memory accounting without swap accounting, memcg runtime disabled).

This puts the maintenance overhead of retaining the toggle above its
practical benefits. Deprecate it.

Suggested-by: Shakeel Butt 
Signed-off-by: Johannes Weiner 
---
 .../admin-guide/kernel-parameters.txt |  5 --
 include/linux/memcontrol.h|  4 --
 mm/memcontrol.c   | 48 ++-
 3 files changed, 15 insertions(+), 42 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 942bbef8f128..986d45dd8c37 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5322,11 +5322,6 @@
This parameter controls use of the Protected
Execution Facility on pSeries.
 
-   swapaccount=[0|1]
-   [KNL] Enable accounting of swap in memory resource
-   controller if no parameter or 1 is given or disable
-   it if 0 is given (See 
Documentation/admin-guide/cgroup-v1/memory.rst)
-
swiotlb=[ARM,IA-64,PPC,MIPS,X86]
Format: {  | force | noforce }
 -- Number of I/O TLB slabs
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4064c9dda534..ef9613538d36 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -874,10 +874,6 @@ struct mem_cgroup *mem_cgroup_get_oom_group(struct 
task_struct *victim,
struct mem_cgroup *oom_domain);
 void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
 
-#ifdef CONFIG_MEMCG_SWAP
-extern bool cgroup_memory_noswap;
-#endif
-
 void lock_page_memcg(struct page *page);
 void unlock_page_memcg(struct page *page);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 49bdcf603af1..b036c4fb0fa7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -85,13 +85,6 @@ static bool cgroup_memory_nosocket;
 /* Kernel memory accounting disabled? */
 static bool cgroup_memory_nokmem;
 
-/* Whether the swap controller is active */
-#ifdef CONFIG_MEMCG_SWAP
-bool cgroup_memory_noswap __read_mostly;
-#else
-#define cgroup_memory_noswap   1
-#endif
-
 #ifdef CONFIG_CGROUP_WRITEBACK
 static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
 #endif
@@ -99,7 +92,11 @@ static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
 /* Whether legacy memory+swap accounting is active */
 static bool do_memsw_account(void)
 {
-   return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && 
!cgroup_memory_noswap;
+   /* cgroup2 doesn't do mem+swap accounting */
+   if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+   return false;
+
+   return true;
 }
 
 #define THRESHOLDS_EVENTS_TARGET 128
@@ -7019,7 +7016,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t 
entry)
if (!mem_cgroup_is_root(memcg))
page_counter_uncharge(&memcg->memory, nr_entries);
 
-   if (!cgroup_memory_noswap && memcg != swap_memcg) {
+   if (memcg != swap_memcg) {
if (!mem_cgroup_is_root(swap_memcg))
page_counter_charge(&swap_memcg->memsw, nr_entries);
page_counter_uncharge(&memcg->memsw, nr_entries);
@@ -7073,7 +7070,7 @@ int mem_cgroup_try_charge_swap(struct page *page, 
swp_entry_t entry)
 
memcg = mem_cgroup_id_get_online(memcg);
 
-   if (!cgroup_memory_noswap && !mem_cgroup_is_root(memcg) &&
+   if (!mem_cgroup_is_root(memcg) &&
!page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
memcg_memory_event(memcg, MEMCG_SWAP_MAX);
memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
@@ -7108,7 +7105,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned 
int nr_pages)
rcu_read_lock();
memcg = mem_cgroup_from_id(id);
if (memcg) {
-   if (!cgroup_memory_noswap && !mem_

Re: [PATCH v4 4/5] mm: memcontrol: use obj_cgroup APIs to charge kmem pages

2021-03-18 Thread Johannes Weiner
On Thu, Mar 18, 2021 at 07:06:57PM +0800, Muchun Song wrote:
> Since Roman series "The new cgroup slab memory controller" applied. All
> slab objects are charged via the new APIs of obj_cgroup. The new APIs
> introduce a struct obj_cgroup to charge slab objects. It prevents
> long-living objects from pinning the original memory cgroup in the memory.
> But there are still some corner objects (e.g. allocations larger than
> order-1 page on SLUB) which are not charged via the new APIs. Those
> objects (include the pages which are allocated from buddy allocator
> directly) are charged as kmem pages which still hold a reference to
> the memory cgroup.
> 
> We want to reuse the obj_cgroup APIs to charge the kmem pages.
> If we do that, we should store an object cgroup pointer to
> page->memcg_data for the kmem pages.
> 
> Finally, page->memcg_data will have 3 different meanings.
> 
>   1) For the slab pages, page->memcg_data points to an object cgroups
>  vector.
> 
>   2) For the kmem pages (exclude the slab pages), page->memcg_data
>  points to an object cgroup.
> 
>   3) For the user pages (e.g. the LRU pages), page->memcg_data points
>  to a memory cgroup.
> 
> We do not change the behavior of page_memcg() and page_memcg_rcu().
> They are also suitable for LRU pages and kmem pages. Why?
> 
> Because memory allocations pinning memcgs for a long time - it exists
> at a larger scale and is causing recurring problems in the real world:
> page cache doesn't get reclaimed for a long time, or is used by the
> second, third, fourth, ... instance of the same job that was restarted
> into a new cgroup every time. Unreclaimable dying cgroups pile up,
> waste memory, and make page reclaim very inefficient.
> 
> We can convert LRU pages and most other raw memcg pins to the objcg
> direction to fix this problem, and then the page->memcg will always
> point to an object cgroup pointer. At that time, LRU pages and kmem
> pages will be treated the same. The implementation of page_memcg()
> will remove the kmem page check.
> 
> This patch aims to charge the kmem pages by using the new APIs of
> obj_cgroup. Finally, the page->memcg_data of the kmem page points to
> an object cgroup. We can use the __page_objcg() to get the object
> cgroup associated with a kmem page. Or we can use page_memcg()
> to get the memory cgroup associated with a kmem page, but caller must
> ensure that the returned memcg won't be released (e.g. acquire the
> rcu_read_lock or css_set_lock).
> 
> Signed-off-by: Muchun Song 

Thanks for the updated version, looks good to me!

Acked-by: Johannes Weiner 


Re: [PATCH v4 3/5] mm: memcontrol: change ug->dummy_page only if memcg changed

2021-03-18 Thread Johannes Weiner
On Thu, Mar 18, 2021 at 07:06:56PM +0800, Muchun Song wrote:
> Just like assignment to ug->memcg, we only need to update ug->dummy_page
> if memcg changed. So move it to there. This is a very small optimization.
> 
> Signed-off-by: Muchun Song 

Acked-by: Johannes Weiner 


Re: [PATCH v4 2/5] mm: memcontrol: directly access page->memcg_data in mm/page_alloc.c

2021-03-18 Thread Johannes Weiner
On Thu, Mar 18, 2021 at 07:06:55PM +0800, Muchun Song wrote:
> The page_memcg() is not suitable for use by page_expected_state() and
> page_bad_reason(). Because it can BUG_ON() for the slab pages when
> CONFIG_DEBUG_VM is enabled. As neither lru, nor kmem, nor slab page
> should have anything left in there by the time the page is freed, what
> we care about is whether the value of page->memcg_data is 0. So just
> directly access page->memcg_data here.
> 
> Signed-off-by: Muchun Song 

Acked-by: Johannes Weiner 


Re: [PATCH v2 2/2] mm/memcg: set memcg when split page

2021-03-18 Thread Johannes Weiner
On Thu, Mar 18, 2021 at 03:05:00PM +0100, Michal Hocko wrote:
> On Thu 11-03-21 12:37:20, Hugh Dickins wrote:
> > On Thu, 11 Mar 2021, Michal Hocko wrote:
> > > On Thu 11-03-21 10:21:39, Johannes Weiner wrote:
> > > > On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> > > > > Johannes, Hugh,
> > > > > 
> > > > > what do you think about this approach? If we want to stick with
> > > > > split_page approach then we need to update the missing place Matthew 
> > > > > has
> > > > > pointed out.
> > > > 
> > > > I find the __free_pages() code quite tricky as well. But for that
> > > > reason I would actually prefer to initiate the splitting in there,
> > > > since that's the place where we actually split the page, rather than
> > > > spread the handling of this situation further out.
> > > > 
> > > > The race condition shouldn't be hot, so I don't think we need to be as
> > > > efficient about setting page->memcg_data only on the higher-order
> > > > buddies as in Willy's scratch patch. We can call split_page_memcg(),
> > > > which IMO should actually help document what's happening to the page.
> > > > 
> > > > I think that function could also benefit a bit more from step-by-step
> > > > documentation about what's going on. The kerneldoc is helpful, but I
> > > > don't think it does justice to how tricky this race condition is.
> > > > 
> > > > Something like this?
> > > > 
> > > > void __free_pages(struct page *page, unsigned int order)
> > > > {
> > > > /*
> > > >  * Drop the base reference from __alloc_pages and free. In
> > > >  * case there is an outstanding speculative reference, from
> > > >  * e.g. the page cache, it will put and free the page later.
> > > >  */
> > > > if (likely(put_page_testzero(page))) {
> > > > free_the_page(page, order);
> > > > return;
> > > > }
> > > > 
> > > > /*
> > > >  * The speculative reference will put and free the page.
> > > >  *
> > > >  * However, if the speculation was into a higher-order page
> > > >  * that isn't marked compound, the other side will know
> > > >  * nothing about our buddy pages and only free the order-0
> > > >  * page at the start of our chunk! We must split off and free
> > > >  * the buddy pages here.
> > > >  *
> > > >  * The buddy pages aren't individually refcounted, so they
> > > >  * can't have any pending speculative references themselves.
> > > >  */
> > > > if (!PageHead(page) && order > 0) {
> > > > split_page_memcg(page, 1 << order);
> > > > while (order-- > 0)
> > > > free_the_page(page + (1 << order), order);
> > > > }
> > > > }
> > > 
> > > Fine with me. Mathew was concerned about more places that do something
> > > similar but I would say that if we find out more places we might
> > > reconsider and currently stay with a reasonably clear model that it is
> > > only head patch that carries the memcg information and split_page_memcg
> > > is necessary to break such page into smaller pieces.
> > 
> > I agree: I do like Johannes' suggestion best, now that we already
> > have split_page_memcg().  Not too worried about contrived use of
> > free_unref_page() here; and whether non-compound high-order pages
> > should be perpetuated is a different discussion.
> 
> Matthew, are you planning to post a patch with suggested changes or
> should I do it?

I'll post a proper patch.


Re: [PATCH] memcg: set page->private before calling swap_readpage

2021-03-18 Thread Johannes Weiner
On Wed, Mar 17, 2021 at 06:59:59PM -0700, Shakeel Butt wrote:
> The function swap_readpage() (and other functions it call) extracts swap
> entry from page->private. However for SWP_SYNCHRONOUS_IO, the kernel
> skips the swapcache and thus we need to manually set the page->private
> with the swap entry before calling swap_readpage().
> 
> Signed-off-by: Shakeel Butt 
> Reported-by: Heiko Carstens 

LGTM


Re: [PATCH] memcg: set page->private before calling swap_readpage

2021-03-18 Thread Johannes Weiner
On Thu, Mar 18, 2021 at 03:01:25PM +0100, Michal Hocko wrote:
> On Wed 17-03-21 18:59:59, Shakeel Butt wrote:
> > The function swap_readpage() (and other functions it call) extracts swap
> > entry from page->private. However for SWP_SYNCHRONOUS_IO, the kernel
> > skips the swapcache and thus we need to manually set the page->private
> > with the swap entry before calling swap_readpage().
> 
> One thing that is not really clear to me is whether/why this is only
> needed with your patch. Can you expand a bit on that please? Maybe I am
> just missing something obvious but I just do not see any connection.

It was always needed, his original patch erroneously removed it.


Re: [mm, net-next v2] mm: net: memcg accounting for TCP rx zerocopy

2021-03-17 Thread Johannes Weiner
On Tue, Mar 16, 2021 at 11:05:11PM -0700, Arjun Roy wrote:
> On Tue, Mar 16, 2021 at 3:27 AM Johannes Weiner  wrote:
> >
> > Hello,
> >
> > On Mon, Mar 15, 2021 at 09:16:45PM -0700, Arjun Roy wrote:
> > > From: Arjun Roy 
> > >
> > > TCP zerocopy receive is used by high performance network applications
> > > to further scale. For RX zerocopy, the memory containing the network
> > > data filled by the network driver is directly mapped into the address
> > > space of high performance applications. To keep the TLB cost low,
> > > these applications unmap the network memory in big batches. So, this
> > > memory can remain mapped for long time. This can cause a memory
> > > isolation issue as this memory becomes unaccounted after getting
> > > mapped into the application address space. This patch adds the memcg
> > > accounting for such memory.
> > >
> > > Accounting the network memory comes with its own unique challenges.
> > > The high performance NIC drivers use page pooling to reuse the pages
> > > to eliminate/reduce expensive setup steps like IOMMU. These drivers
> > > keep an extra reference on the pages and thus we can not depend on the
> > > page reference for the uncharging. The page in the pool may keep a
> > > memcg pinned for arbitrary long time or may get used by other memcg.
> >
> > The page pool knows when a page is unmapped again and becomes
> > available for recycling, right? Essentially the 'free' phase of that
> > private allocator. That's where the uncharge should be done.
> >
> 
> In general, no it does not.  The page pool, how it operates and whether it
> exists in the first place, is an optimization that a given NIC driver can 
> choose
> to make - and so there's no generic plumbing that ties page unmap events to
> something that a page pool could subscribe to that I am aware of. All it can 
> do
> is check, at a given point, whether it can reuse a page or not, typically by
> checking the current page refcount.
> 
> A couple of examples for drivers with such a mechanism - mlx5:
> (https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c#L248)

if (page_ref_count(cache->page_cache[cache->head].page) != 1)

So IIUC it co-opts the page count used by the page allocator, offset
by the base ref of the pool. That means it doesn't control the put
path and won't be aware of when pages are used and unused.
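
In other words, the recycling decision boils down to something like
this (sketch, not the actual driver code):

static bool pool_page_reusable(struct page *page)
{
	/* reusable only if the pool's base reference is the last one */
	return page_ref_count(page) == 1;
}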

How does it free those pages back to the system eventually? Does it
just do a series of put_page() on pages in the pool when something
determines it is too big?

However it does it, it found some way to live without a
destructor. But now we need one.

> Or intel fm10k:
> (https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/intel/fm10k/fm10k_main.c#L207)
> 
> Note that typically map count is not checked (maybe because page-flipping
> receive zerocopy did not exist as a consideration when the driver was 
> written).
> 
> So given that the page pool is essentially checking on demand for whether a 
> page
> is usable or not - since there is no specific plumbing invoked when a page is
> usable again (i.e. unmapped, in this case) - we opted to hook into when the
> mapcount is decremented inside unmap() path.

The problem is that the page isn't reusable just because it's
unmapped. The user may have vmspliced those pages into a pipe and be
pinning them long after the unmap.

I don't know whether that happens in real life, but it's a legitimate
thing to do. At the very least it would be an avenue for abuse.

> > For one, it's more aligned with the usual memcg charge lifetime rules.
> >
> > But also it doesn't add what is essentially a private driver callback
> > to the generic file unmapping path.
> >
> 
> I understand the concern, and share it - the specific thing we'd like to avoid
> is to have driver specific code in the unmap path, and not in the least 
> because
> any given driver could do its own thing.
> 
> Rather, we consider this mechanism that we added as generic to zerocopy 
> network
> data reception - that it does the right thing, no matter what the driver is
> doing. This would be transparent to the driver, in other words - all the 
> driver
> has to do is to continue doing what it was before, using page->refcnt == 1 to
> decide whether it can use a page or if it is not already in use.
> 
> 
> Consider this instead as a broadly applicable networking feature adding a
> callback to the unmap path, instead of a particular driver. And while it is 
> just
> TCP at present, it fundamentally isn't limited to TCP.
> 
> I do have a request for clarifica

Re: [mm, net-next v2] mm: net: memcg accounting for TCP rx zerocopy

2021-03-16 Thread Johannes Weiner
Hello,

On Mon, Mar 15, 2021 at 09:16:45PM -0700, Arjun Roy wrote:
> From: Arjun Roy 
> 
> TCP zerocopy receive is used by high performance network applications
> to further scale. For RX zerocopy, the memory containing the network
> data filled by the network driver is directly mapped into the address
> space of high performance applications. To keep the TLB cost low,
> these applications unmap the network memory in big batches. So, this
> memory can remain mapped for long time. This can cause a memory
> isolation issue as this memory becomes unaccounted after getting
> mapped into the application address space. This patch adds the memcg
> accounting for such memory.
> 
> Accounting the network memory comes with its own unique challenges.
> The high performance NIC drivers use page pooling to reuse the pages
> to eliminate/reduce expensive setup steps like IOMMU. These drivers
> keep an extra reference on the pages and thus we can not depend on the
> page reference for the uncharging. The page in the pool may keep a
> memcg pinned for arbitrary long time or may get used by other memcg.

The page pool knows when a page is unmapped again and becomes
available for recycling, right? Essentially the 'free' phase of that
private allocator. That's where the uncharge should be done.

For one, it's more aligned with the usual memcg charge lifetime rules.

But also it doesn't add what is essentially a private driver callback
to the generic file unmapping path.

Finally, this will eliminate the need for making up a new charge type
(MEMCG_DATA_SOCK) and allow using the standard kmem charging API.
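
As a rough sketch of that placement - the pool hooks are hypothetical,
the memcg calls are the existing kmem API, and which memcg to charge
(the receiving task's vs. the socket's) is a separate question:

	/* when the pool hands a page to the rx/zerocopy path */
	if (__memcg_kmem_charge_page(page, GFP_KERNEL, 0))
		goto drop;	/* charge failed, don't use this page */

	...

	/* when the pool later finds the page unused and recycles it */
	__memcg_kmem_uncharge_page(page, 0);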

> This patch decouples the uncharging of the page from the refcnt and
> associates it with the map count i.e. the page gets uncharged when the
> last address space unmaps it. Now the question is, what if the driver
> drops its reference while the page is still mapped? That is fine as
> the address space also holds a reference to the page i.e. the
> reference count can not drop to zero before the map count.
> 
> Signed-off-by: Arjun Roy 
> Co-developed-by: Shakeel Butt 
> Signed-off-by: Shakeel Butt 
> Signed-off-by: Eric Dumazet 
> Signed-off-by: Soheil Hassas Yeganeh 
> ---
> 
> Changelog since v1:
> - Pages accounted for in this manner are now tracked via MEMCG_SOCK.
> - v1 allowed for a brief period of double-charging, now we have a
>   brief period of under-charging to avoid undue memory pressure.

I'm afraid we'll have to go back to v1.

Let's address the issues raised with it:

1. The NR_FILE_MAPPED accounting. It is longstanding Linux behavior
   that driver pages mapped into userspace are accounted as file
   pages, because userspace is actually doing mmap() against a driver
   file/fd (as opposed to an anon mmap). That is how they show up in
   vmstat, in meminfo, and in the per process stats. There is no
   reason to make memcg deviate from this. If we don't like it, it
   should be taken on by changing vm_insert_page() - not trick rmap
   into thinking these arent memcg pages and then fixing it up with
   additional special-cased accounting callbacks.

   v1 did this right, it charged the pages the way we handle all other
   userspace pages: before rmap, and then let the generic VM code do
   the accounting for us with the cgroup-aware vmstat infrastructure.

2. The double charging. Could you elaborate how much we're talking
   about in any given batch? Is this a problem worth worrying about?

   The way I see it, any conflict here is caused by the pages being
   counted in the SOCK counter already, but not actually *tracked* on
   a per page basis. If it's worth addressing, we should look into
   fixing the root cause over there first if possible, before trying
   to work around it here.

   The newly-added GFP_NOFAIL is especially worrisome. The pages
   should be charged before we make promises to userspace, not be
   force-charged when it's too late.

   We have sk context when charging the inserted pages. Can we
   uncharge MEMCG_SOCK after each batch of inserts? That's only 32
   pages worth of overcharging, so not more than the regular charge
   batch memcg is using.

   An even better way would be to do charge stealing where we reuse
   the existing MEMCG_SOCK charges and don't have to get any new ones
   at all - just set up page->memcg and remove the charge from the sk.

   But yeah, it depends a bit if this is a practical concern.

Thanks,
Johannes


[PATCH] mm: memcontrol: switch to rstat fix

2021-03-15 Thread Johannes Weiner
Fix a sleep in atomic section problem: wb_writeback() takes a spinlock
and calls wb_over_bg_thresh() -> mem_cgroup_wb_stats, but the regular
rstat flushing function called from in there does lockbreaking and may
sleep. Switch to the atomic variant, cgroup_rstat_irqsafe().

To be consistent with other memcg flush calls, but without adding
another memcg wrapper, inline and drop memcg_flush_vmstats() instead.

Signed-off-by: Johannes Weiner 
---
 mm/memcontrol.c | 15 +--
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f7fb12d3c2fc..9091913ec877 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -757,11 +757,6 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
return mz;
 }
 
-static void memcg_flush_vmstats(struct mem_cgroup *memcg)
-{
-   cgroup_rstat_flush(memcg->css.cgroup);
-}
-
 /**
  * __mod_memcg_state - update cgroup memory statistics
  * @memcg: the memory cgroup
@@ -1572,7 +1567,7 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
 *
 * Current memory state:
 */
-   memcg_flush_vmstats(memcg);
+   cgroup_rstat_flush(memcg->css.cgroup);
 
for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
u64 size;
@@ -3523,7 +3518,7 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
unsigned long val;
 
if (mem_cgroup_is_root(memcg)) {
-   memcg_flush_vmstats(memcg);
+   cgroup_rstat_flush(memcg->css.cgroup);
val = memcg_page_state(memcg, NR_FILE_PAGES) +
memcg_page_state(memcg, NR_ANON_MAPPED);
if (swap)
@@ -3925,7 +3920,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
int nid;
struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-   memcg_flush_vmstats(memcg);
+   cgroup_rstat_flush(memcg->css.cgroup);
 
for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
seq_printf(m, "%s=%lu", stat->name,
@@ -3997,7 +3992,7 @@ static int memcg_stat_show(struct seq_file *m, void *v)
 
BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-   memcg_flush_vmstats(memcg);
+   cgroup_rstat_flush(memcg->css.cgroup);
 
for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
unsigned long nr;
@@ -4500,7 +4495,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
struct mem_cgroup *parent;
 
-   memcg_flush_vmstats(memcg);
+   cgroup_rstat_flush_irqsafe(memcg->css.cgroup);
 
*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
-- 
2.30.1



Re: [External] Re: [PATCH v3 2/4] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page

2021-03-12 Thread Johannes Weiner
On Fri, Mar 12, 2021 at 02:42:45PM -0800, Shakeel Butt wrote:
> Hi Johannes,
> 
> On Fri, Mar 12, 2021 at 11:23 AM Johannes Weiner  wrote:
> >
> [...]
> >
> > Longer term we most likely need it there anyway. The issue you are
> > describing in the cover letter - allocations pinning memcgs for a long
> > time - it exists at a larger scale and is causing recurring problems
> > in the real world: page cache doesn't get reclaimed for a long time,
> > or is used by the second, third, fourth, ... instance of the same job
> > that was restarted into a new cgroup every time. Unreclaimable dying
> > cgroups pile up, waste memory, and make page reclaim very inefficient.
> >
> 
> For the scenario described above, do we really want to reparent the
> page cache pages? Shouldn't we recharge the pages to the second,
> third, fourth and so on, memcgs? My concern is that we will see a big
> chunk of page cache pages charged to root and will only get reclaimed
> on global pressure.

Sorry, I'm proposing to reparent to the ancestor, not root. It's an
optimization, not a change in user-visible behavior.

As far as the user can tell, the pages already belong to the parent
after deletion: they'll show up in the parent's stats, naturally, and
they will get reclaimed as part of the parent being reclaimed.

The dead cgroup doesn't even have its own limit anymore after
.css_reset() has run. And we already physically reparent slab objects
in memcg_reparent_objcgs() and memcg_drain_all_list_lrus().

I'm just saying we should do the same thing for LRU pages.


Re: [PATCH v3 4/4] mm: memcontrol: move PageMemcgKmem to the scope of CONFIG_MEMCG_KMEM

2021-03-12 Thread Johannes Weiner
On Tue, Mar 09, 2021 at 06:07:17PM +0800, Muchun Song wrote:
> The page only can be marked as kmem when CONFIG_MEMCG_KMEM is enabled.
> So move PageMemcgKmem() to the scope of the CONFIG_MEMCG_KMEM.
> 
> As a bonus, on !CONFIG_MEMCG_KMEM build some code can be compiled out.
> 
> Signed-off-by: Muchun Song 

Acked-by: Johannes Weiner 


Re: [External] Re: [PATCH v3 2/4] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page

2021-03-12 Thread Johannes Weiner
Hello Muchun,

On Fri, Mar 12, 2021 at 03:14:07PM +0800, Muchun Song wrote:
> On Thu, Mar 11, 2021 at 9:12 PM Johannes Weiner  wrote:
> > > @@ -358,14 +358,26 @@ enum page_memcg_data_flags {
> > >
> > >  #define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
> > >
> > > +/* Return true for charged page, otherwise false. */
> > > +static inline bool page_memcg_charged(struct page *page)
> > > +{
> > > + unsigned long memcg_data = page->memcg_data;
> > > +
> > > + VM_BUG_ON_PAGE(PageSlab(page), page);
> > > + VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
> > > +
> > > + return !!memcg_data;
> > > +}
> >
> > This is mostly used right before a page_memcg_check(), which makes it
> > somewhat redundant except for the VM_BUG_ON_PAGE() for slab pages.
> 
> Should I rename page_memcg_charged to page_memcg_raw?
> And use page_memcg_raw to check whether the page is charged.
> 
> static inline bool page_memcg_charged(struct page *page)
> {
> return page->memcg_data;
> }

You can just directly access page->memcg_data in places where you'd
use this helper. I think it's only the two places in mm/page_alloc.c,
and they already have CONFIG_MEMCG in place, so raw access works.
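
E.g. one of those callsites, roughly, with raw access (sketch from
memory, details may differ):

#ifdef CONFIG_MEMCG
	if (unlikely(page->memcg_data))
		bad_reason = "page still charged to cgroup";
#endif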

> > But it's also a bit of a confusing name: slab pages are charged too,
> > but this function would crash if you called it on one.
> >
> > In light of this, and in light of what I wrote above about hopefully
> > converting more and more allocations from raw memcg pins to
> > reparentable objcg, it would be better to have
> >
> > page_memcg() for 1:1 page-memcg mappings, i.e. LRU & kmem
> 
> Sorry. I do not get the point. Because in the next patch, the kmem
> page will use objcg to charge memory. So the page_memcg()
> should not be suitable for the kmem pages. So I add a VM_BUG_ON
> in the page_memcg() to catch invalid usage.
> 
> So I changed some page_memcg() calling to page_memcg_check()
> in this patch, but you suggest using page_memcg().

It would be better if page_memcg() worked on LRU and kmem pages. I'm
proposing to change its implementation.

The reason is that page_memcg_check() allows everything and does no
sanity checking. You need page_memcg_charged() for the sanity checks
that it's LRU or kmem, but that's a bit difficult to understand, and
it's possible people will add more callsites to page_memcg_check()
without the page_memcg_charged() checks. It makes the code less safe.

We should discourage page_memcg_check() and make page_memcg() more
useful instead.

> I am very confused. Are you worried about the extra overhead brought
> by calling page_memcg_rcu()? In the next patch, I will remove
> page_memcg_check() calling and use objcg APIs.

I'm just worried about the usability of the interface. It should be
easy to use, and make it obvious if there is a user bug.

For example, in your next patch, mod_lruvec_page_state does this:

	if (PageMemcgKmem(head)) {
		rcu_read_lock();
		memcg = obj_cgroup_memcg(page_objcg(page));
	} else {
		memcg = page_memcg(head);
		/*
		 * Untracked pages have no memcg, no lruvec. Update only the
		 * node.
		 */
		if (!memcg) {
			__mod_node_page_state(pgdat, idx, val);
			return;
		}
	}

	lruvec = mem_cgroup_lruvec(memcg, pgdat);
	__mod_lruvec_state(lruvec, idx, val);

	if (PageMemcgKmem(head))
		rcu_read_unlock();

I'm proposing to implement page_memcg() in a way where you can do this:

	rcu_read_lock();
	memcg = page_memcg(page);
	if (!memcg) {
		rcu_read_unlock();
		__mod_node_page_state();
		return;
	}
	lruvec = mem_cgroup_lruvec(memcg, pgdat);
	__mod_lruvec_state(lruvec, idx, val);
	rcu_read_unlock();

[ page_memcg() is:

	if (PageMemcgKmem(page))
		return obj_cgroup_memcg(__page_objcg(page));
	else
		return __page_memcg(page);

  and __page_objcg() and __page_memcg() do the raw page->memcg_data
  translation and the VM_BUG_ON_PAGE() checks for MEMCG_DATA_* ]

This is a lot simpler and less error prone.

It does take rcu_read_lock() for LRU pages too, which strictly it
doesn't need to right now. But it's cheap enough (and actually saves a
branch).

Longer term we most likely need it there anyway. The issue you are
describing in the cover letter - allocations pinning memcgs for a long
time - it exists at a larger scale and is causing recurring problems
in the real world: page cache doesn't get reclaimed for a long time,
or is used by the second, third, fourth, ... instance of the same job
that was restarted into a new cgroup every time. Unreclaimable dying
cgroups pile up, waste memory, and make page reclaim very inefficient.

Re: [External] Re: [PATCH v3 3/4] mm: memcontrol: use obj_cgroup APIs to charge kmem pages

2021-03-12 Thread Johannes Weiner
On Fri, Mar 12, 2021 at 05:22:55PM +0800, Muchun Song wrote:
> On Thu, Mar 11, 2021 at 6:05 AM Johannes Weiner  wrote:
> > > @@ -6828,7 +6857,7 @@ static void uncharge_batch(const struct 
> > > uncharge_gather *ug)
> > >
> > >  static void uncharge_page(struct page *page, struct uncharge_gather *ug)
> > >  {
> > > - unsigned long nr_pages;
> > > + unsigned long nr_pages, nr_kmem;
> > >   struct mem_cgroup *memcg;
> > >
> > >   VM_BUG_ON_PAGE(PageLRU(page), page);
> > > @@ -6836,34 +6865,44 @@ static void uncharge_page(struct page *page, 
> > > struct uncharge_gather *ug)
> > >   if (!page_memcg_charged(page))
> > >   return;
> > >
> > > + nr_pages = compound_nr(page);
> > >   /*
> > >* Nobody should be changing or seriously looking at
> > > -  * page memcg at this point, we have fully exclusive
> > > -  * access to the page.
> > > +  * page memcg or objcg at this point, we have fully
> > > +  * exclusive access to the page.
> > >*/
> > > - memcg = page_memcg_check(page);
> > > + if (PageMemcgKmem(page)) {
> > > + struct obj_cgroup *objcg;
> > > +
> > > + objcg = page_objcg(page);
> > > + memcg = obj_cgroup_memcg_get(objcg);
> > > +
> > > + page->memcg_data = 0;
> > > + obj_cgroup_put(objcg);
> > > + nr_kmem = nr_pages;
> > > + } else {
> > > + memcg = page_memcg(page);
> > > + page->memcg_data = 0;
> > > + nr_kmem = 0;
> > > + }
> >
> > Why is all this moved above the uncharge_batch() call?
> 
> Before calling obj_cgroup_put(), we need set page->memcg_data
> to zero. So I move "page->memcg_data = 0" to here.

Yeah, it makes sense to keep those together, but we can move them both
down to after the uncharge, right?

> > It separates the pointer manipulations from the refcounting, which
> > makes the code very difficult to follow.
> >
> > > +
> > >   if (ug->memcg != memcg) {
> > >   if (ug->memcg) {
> > >   uncharge_batch(ug);
> > >   uncharge_gather_clear(ug);
> > >   }
> > >   ug->memcg = memcg;
> > > + ug->dummy_page = page;
> >
> > Why this change?
> 
> Just like ug->memcg, we do not need to set
> ug->dummy_page in every loop.

Ah, okay. That's a reasonable change, it's just confusing because I
thought this was a requirement for the new code to work. But I didn't
see how it relied on that, and it made me think I'm not understanding
your code ;) It's better to split that into a separate patch.

> I will rework the code in the next version.

Thanks!


Re: [PATCH v2 2/2] mm/memcg: set memcg when split page

2021-03-11 Thread Johannes Weiner
On Thu, Mar 11, 2021 at 09:37:02AM +0100, Michal Hocko wrote:
> Johannes, Hugh,
> 
> what do you think about this approach? If we want to stick with
> split_page approach then we need to update the missing place Matthew has
> pointed out.

I find the __free_pages() code quite tricky as well. But for that
reason I would actually prefer to initiate the splitting in there,
since that's the place where we actually split the page, rather than
spread the handling of this situation further out.

The race condition shouldn't be hot, so I don't think we need to be as
efficient about setting page->memcg_data only on the higher-order
buddies as in Willy's scratch patch. We can call split_page_memcg(),
which IMO should actually help document what's happening to the page.

I think that function could also benefit a bit more from step-by-step
documentation about what's going on. The kerneldoc is helpful, but I
don't think it does justice to how tricky this race condition is.

Something like this?

void __free_pages(struct page *page, unsigned int order)
{
	/*
	 * Drop the base reference from __alloc_pages and free. In
	 * case there is an outstanding speculative reference, from
	 * e.g. the page cache, it will put and free the page later.
	 */
	if (likely(put_page_testzero(page))) {
		free_the_page(page, order);
		return;
	}

	/*
	 * The speculative reference will put and free the page.
	 *
	 * However, if the speculation was into a higher-order page
	 * that isn't marked compound, the other side will know
	 * nothing about our buddy pages and only free the order-0
	 * page at the start of our chunk! We must split off and free
	 * the buddy pages here.
	 *
	 * The buddy pages aren't individually refcounted, so they
	 * can't have any pending speculative references themselves.
	 */
	if (!PageHead(page) && order > 0) {
		split_page_memcg(page, 1 << order);
		while (order-- > 0)
			free_the_page(page + (1 << order), order);
	}
}


Re: [PATCH v3 2/4] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page

2021-03-11 Thread Johannes Weiner
On Tue, Mar 09, 2021 at 06:07:15PM +0800, Muchun Song wrote:
> We want to reuse the obj_cgroup APIs to charge the kmem pages.
> If we do that, we should store an object cgroup pointer to
> page->memcg_data for the kmem pages.
> 
> Finally, page->memcg_data can have 3 different meanings.
> 
>   1) For the slab pages, page->memcg_data points to an object cgroups
>  vector.
> 
>   2) For the kmem pages (exclude the slab pages), page->memcg_data
>  points to an object cgroup.
> 
>   3) For the user pages (e.g. the LRU pages), page->memcg_data points
>  to a memory cgroup.
> 
> Currently we always get the memory cgroup associated with a page via
> page_memcg() or page_memcg_rcu(). page_memcg_check() is special, it
> has to be used in cases when it's not known if a page has an
> associated memory cgroup pointer or an object cgroups vector. Because
> the page->memcg_data of the kmem page is not pointing to a memory
> cgroup in the later patch, the page_memcg() and page_memcg_rcu()
> cannot be applicable for the kmem pages. In this patch, make
> page_memcg() and page_memcg_rcu() no longer apply to the kmem pages.
> We do not change the behavior of the page_memcg_check(), it is also
> applicable for the kmem pages.
> 
> In the end, there are 3 helpers to get the memcg associated with a page.
> Usage is as follows.
> 
>   1) Get the memory cgroup associated with a non-kmem page (e.g. the LRU
>  pages).
> 
>  - page_memcg()
>  - page_memcg_rcu()
> 
>   2) Get the memory cgroup associated with a page. It has to be used in
>  cases when it's not known if a page has an associated memory cgroup
>  pointer or an object cgroups vector. Returns NULL for slab pages or
>  uncharged pages. Otherwise, returns memory cgroup for charged pages
>  (e.g. the kmem pages, the LRU pages).
> 
>  - page_memcg_check()
> 
> In some place, we use page_memcg() to check whether the page is charged.
> Now introduce page_memcg_charged() helper to do that.
> 
> This is a preparation for reparenting the kmem pages.
> 
> Signed-off-by: Muchun Song 

I'm pretty excited about the direction this series is taking us. The
direct/raw pinning of memcgs from objects and allocations of various
lifetimes has been causing chronic problems with dying cgroups piling
up, consuming memory, and gumming up the works in everything that
needs to iterate the cgroup tree (page reclaim comes to mind).

The more allocation types we can convert to objcg, the better.

This patch in particular looks mostly good to me too. Some comments
inline:

> ---
>  include/linux/memcontrol.h | 33 +++--
>  mm/memcontrol.c| 23 +--
>  mm/page_alloc.c|  4 ++--
>  3 files changed, 42 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e6dc793d587d..83cbcdcfcc92 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -358,14 +358,26 @@ enum page_memcg_data_flags {
>  
>  #define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
>  
> +/* Return true for charged page, otherwise false. */
> +static inline bool page_memcg_charged(struct page *page)
> +{
> + unsigned long memcg_data = page->memcg_data;
> +
> + VM_BUG_ON_PAGE(PageSlab(page), page);
> + VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
> +
> + return !!memcg_data;
> +}

This is mostly used right before a page_memcg_check(), which makes it
somewhat redundant except for the VM_BUG_ON_PAGE() for slab pages.

But it's also a bit of a confusing name: slab pages are charged too,
but this function would crash if you called it on one.

In light of this, and in light of what I wrote above about hopefully
converting more and more allocations from raw memcg pins to
reparentable objcg, it would be better to have

page_memcg() for 1:1 page-memcg mappings, i.e. LRU & kmem
page_objcg() for 1:n page-memcg mappings, i.e. slab pages
page_memcg_check() for the very rare ambiguous cases
drop page_memcg_rcu() since page_memcg() is now rcu-safe

If we wanted to optimize, we could identify places that could do a
page_memcg_raw() that does page->memcg_data & ~MEMCG_DATA_FLAGS_MASK -
without READ_ONCE and without the kmem branch. However, I think the
stat functions are probably the hottest path when it comes to that,
and they now already include the kmem branch*.

* Roman mentioned splitting up the stats interface to optimize that,
  but I think we should be careful optimizing prematurely here. It's a
  bit of a maintainability concern, and it would also get in the way
  of easily converting more allocation types to objcg.
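
To make that concrete, such a raw accessor would be something like this
(sketch only, per the above):

static inline struct mem_cgroup *page_memcg_raw(struct page *page)
{
	return (struct mem_cgroup *)(page->memcg_data & ~MEMCG_DATA_FLAGS_MASK);
}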

> @@ -855,10 +855,11 @@ void __mod_lruvec_page_state(struct page *page, enum 
> node_stat_item idx,
>int val)
>  {
>   struct page *head = compound_head(page); /* rmap on tail pages */
> - struct mem_cgroup *memcg = page_memcg(head);
> + 

Re: [PATCH v3 1/4] mm: memcontrol: introduce obj_cgroup_{un}charge_pages

2021-03-11 Thread Johannes Weiner
On Tue, Mar 09, 2021 at 06:07:14PM +0800, Muchun Song wrote:
> We know that the unit of slab object charging is bytes, the unit of
> kmem page charging is PAGE_SIZE. If we want to reuse obj_cgroup APIs
> to charge the kmem pages, we should pass PAGE_SIZE (as third parameter)
> to obj_cgroup_charge(). Because the size is already PAGE_SIZE, we can
> skip touch the objcg stock. And obj_cgroup_{un}charge_pages() are
> introduced to charge in units of page level.
> 
> In the later patch, we also can reuse those two helpers to charge or
> uncharge a number of kernel pages to a object cgroup. This is just
> a code movement without any functional changes.
> 
> Signed-off-by: Muchun Song 
> Acked-by: Roman Gushchin 

Acked-by: Johannes Weiner 


Re: [PATCH v3 3/4] mm: memcontrol: use obj_cgroup APIs to charge kmem pages

2021-03-10 Thread Johannes Weiner
Hello Muchun,

On Tue, Mar 09, 2021 at 06:07:16PM +0800, Muchun Song wrote:
> @@ -6806,11 +6823,23 @@ static inline void uncharge_gather_clear(struct 
> uncharge_gather *ug)
>  static void uncharge_batch(const struct uncharge_gather *ug)
>  {
>   unsigned long flags;
> + unsigned long nr_pages;
>  
> - if (!mem_cgroup_is_root(ug->memcg)) {
> - page_counter_uncharge(>memcg->memory, ug->nr_pages);
> + /*
> +  * The kmem pages can be reparented to the root memcg, in
> +  * order to prevent the memory counter of root memcg from
> +  * increasing indefinitely. We should decrease the memory
> > +  * counter when uncharging.
> +  */
> + if (mem_cgroup_is_root(ug->memcg))
> + nr_pages = ug->nr_kmem;
> + else
> + nr_pages = ug->nr_pages;

Correct or not, I find this unreadable. We're uncharging nr_kmem on
the root, and nr_pages against leaf groups?

It implies several things that might not be immediately obvious to the
reader of this function. Namely, that nr_kmem is a subset of nr_pages.
Or that we don't *want* to account LRU pages for the root cgroup.

The old code followed a very simple pattern: the root memcg's page
counters aren't touched.

This is no longer true: we modify them depending on very specific
circumstances. But that's too clever for the stupid uncharge_batch()
which is only supposed to flush a number of accumulators into their
corresponding page counters.

This distinction really needs to be moved down to uncharge_page() now.

> @@ -6828,7 +6857,7 @@ static void uncharge_batch(const struct uncharge_gather 
> *ug)
>  
>  static void uncharge_page(struct page *page, struct uncharge_gather *ug)
>  {
> - unsigned long nr_pages;
> + unsigned long nr_pages, nr_kmem;
>   struct mem_cgroup *memcg;
>  
>   VM_BUG_ON_PAGE(PageLRU(page), page);
> @@ -6836,34 +6865,44 @@ static void uncharge_page(struct page *page, struct 
> uncharge_gather *ug)
>   if (!page_memcg_charged(page))
>   return;
>  
> + nr_pages = compound_nr(page);
>   /*
>* Nobody should be changing or seriously looking at
> -  * page memcg at this point, we have fully exclusive
> -  * access to the page.
> +  * page memcg or objcg at this point, we have fully
> +  * exclusive access to the page.
>*/
> - memcg = page_memcg_check(page);
> + if (PageMemcgKmem(page)) {
> + struct obj_cgroup *objcg;
> +
> + objcg = page_objcg(page);
> + memcg = obj_cgroup_memcg_get(objcg);
> +
> + page->memcg_data = 0;
> + obj_cgroup_put(objcg);
> + nr_kmem = nr_pages;
> + } else {
> + memcg = page_memcg(page);
> + page->memcg_data = 0;
> + nr_kmem = 0;
> + }

Why is all this moved above the uncharge_batch() call?

It separates the pointer manipulations from the refcounting, which
makes the code very difficult to follow.

> +
>   if (ug->memcg != memcg) {
>   if (ug->memcg) {
>   uncharge_batch(ug);
>   uncharge_gather_clear(ug);
>   }
>   ug->memcg = memcg;
> + ug->dummy_page = page;

Why this change?

>   /* pairs with css_put in uncharge_batch */
>   css_get(>memcg->css);
>   }
>  
> - nr_pages = compound_nr(page);
>   ug->nr_pages += nr_pages;
> + ug->nr_kmem += nr_kmem;
> + ug->pgpgout += !nr_kmem;

Oof.

Yes, this pgpgout line is an equivalent transformation for counting
LRU compound pages. But unless you already know that, it's completely
impossible to understand what the intent here is.

Please avoid clever tricks like this. If you need to check whether the
page is kmem, test PageMemcgKmem() instead of abusing the counters as
boolean flags. This is supposed to be read by human beings, too.

> - if (PageMemcgKmem(page))
> - ug->nr_kmem += nr_pages;
> - else
> - ug->pgpgout++;
> -
> - ug->dummy_page = page;
> - page->memcg_data = 0;
> - css_put(>memcg->css);
> + css_put(>css);

Sorry, these two functions are no longer readable after your changes.

Please retain the following sequence as discrete steps:

1. look up memcg from the page
2. flush existing batch if memcg changed
3. add page's various counts to the current batch
4. clear page->memcg and decrease the reference count to whatever it was
pointing to

And as per above, step 3. is where we should check whether to uncharge
the memcg's page counter at all:

	if (PageMemcgKmem(page)) {
		ug->nr_pages += nr_pages;
		ug->nr_kmem += nr_pages;
	} else {
		/* LRU pages aren't accounted at the root level */
		if (!mem_cgroup_is_root(memcg))
			ug->nr_pages += nr_pages;
		ug->pgpgout++;
	}

In fact, it might be a good idea to rename ug->nr_pages to

Re: [PATCH V6] x86/mm: Tracking linear mapping split events

2021-03-08 Thread Johannes Weiner
On Fri, Mar 05, 2021 at 04:57:15PM -0800, Andrew Morton wrote:
> On Thu, 18 Feb 2021 15:57:44 -0800 Saravanan D  wrote:
> 
> > To help with debugging the sluggishness caused by TLB miss/reload,
> > we introduce monotonic hugepage [direct mapped] split event counts since
> > system state: SYSTEM_RUNNING to be displayed as part of
> > /proc/vmstat in x86 servers
> >
> > ...
> >
> > --- a/arch/x86/mm/pat/set_memory.c
> > +++ b/arch/x86/mm/pat/set_memory.c
> > @@ -120,6 +120,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >  #ifdef CONFIG_SWAP
> > SWAP_RA,
> > SWAP_RA_HIT,
> > +#endif
> > +#ifdef CONFIG_X86
> > +   DIRECT_MAP_LEVEL2_SPLIT,
> > +   DIRECT_MAP_LEVEL3_SPLIT,
> >  #endif
> > NR_VM_EVENT_ITEMS
> >  };
> 
> This is the first appearance of arch-specific fields in /proc/vmstat.
> 
> I don't really see a problem with this - vmstat is basically a dumping
> ground of random developer stuff.  But was this the best place in which
> to present this data?

IMO it's a big plus for discoverability.

One of the first things I tend to do when triaging mysterious memory
issues is going to /proc/vmstat and seeing if anything looks abnormal.
There is value in making that file comprehensive for all things that
could indicate memory-related pathologies.

The impetus for adding these is a real-world tlb regression caused by
kprobes chewing up the direct mapping that took longer to debug than
necessary. We have the /proc/meminfo lines on the DirectMap, but those
are more useful when you already have a theory - they simply don't
make problems immediately stand out the same way.


[tip: sched/core] psi: Pressure states are unlikely

2021-03-06 Thread tip-bot2 for Johannes Weiner
The following commit has been merged into the sched/core branch of tip:

Commit-ID: fddc8bab531e217806b84906681324377d741c6c
Gitweb:
https://git.kernel.org/tip/fddc8bab531e217806b84906681324377d741c6c
Author:Johannes Weiner 
AuthorDate:Wed, 03 Mar 2021 11:46:58 +08:00
Committer: Ingo Molnar 
CommitterDate: Sat, 06 Mar 2021 12:40:23 +01:00

psi: Pressure states are unlikely

Move the unlikely branches out of line. This eliminates undesirable
jumps during wakeup and sleeps for workloads that aren't under any
sort of resource pressure.

Signed-off-by: Johannes Weiner 
Signed-off-by: Chengming Zhou 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Link: 
https://lkml.kernel.org/r/20210303034659.91735-4-zhouchengm...@bytedance.com
---
 kernel/sched/psi.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 0fe6ff6..3907a6b 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -219,17 +219,17 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
 {
switch (state) {
case PSI_IO_SOME:
-   return tasks[NR_IOWAIT];
+   return unlikely(tasks[NR_IOWAIT]);
case PSI_IO_FULL:
-   return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
+   return unlikely(tasks[NR_IOWAIT] && !tasks[NR_RUNNING]);
case PSI_MEM_SOME:
-   return tasks[NR_MEMSTALL];
+   return unlikely(tasks[NR_MEMSTALL]);
case PSI_MEM_FULL:
-   return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
+   return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
case PSI_CPU_SOME:
-   return tasks[NR_RUNNING] > tasks[NR_ONCPU];
+   return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
case PSI_CPU_FULL:
-   return tasks[NR_RUNNING] && !tasks[NR_ONCPU];
+   return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]);
case PSI_NONIDLE:
return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
tasks[NR_RUNNING];
@@ -729,7 +729,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 * task in a cgroup is in_memstall, the corresponding groupc
 * on that cpu is in PSI_MEM_FULL state.
 */
-   if (groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)
+   if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
state_mask |= (1 << PSI_MEM_FULL);
 
groupc->state_mask = state_mask;


Re: [PATCH v3] memcg: charge before adding to swapcache on swapin

2021-03-05 Thread Johannes Weiner
On Fri, Mar 05, 2021 at 08:42:00AM -0800, Shakeel Butt wrote:
> On Fri, Mar 5, 2021 at 8:25 AM Johannes Weiner  wrote:
> >
> [...]
> > I'd also rename cgroup_memory_noswap to cgroup_swapaccount - to match
> > the commandline and (hopefully) make a bit clearer what it effects.
> 
> Do we really need to keep supporting "swapaccount=0"? Is swap
> page_counter really a performance issue for systems with memcg and
> swap? To me, deprecating "swapaccount=0" simplifies already
> complicated code.

Now that you mention it, it's probably really not worth it.

I'll replace my cleanup patch with a removal patch that eliminates
everything behind swapaccount= except for a deprecation warning...


Re: [PATCH v3] memcg: charge before adding to swapcache on swapin

2021-03-05 Thread Johannes Weiner
On Fri, Mar 05, 2021 at 12:06:31AM -0800, Hugh Dickins wrote:
> On Wed, 3 Mar 2021, Shakeel Butt wrote:
> 
> > Currently the kernel adds the page, allocated for swapin, to the
> > swapcache before charging the page. This is fine but now we want a
> > per-memcg swapcache stat which is essential for folks who wants to
> > transparently migrate from cgroup v1's memsw to cgroup v2's memory and
> > swap counters. In addition charging a page before exposing it to other
> > parts of the kernel is a step in the right direction.
> > 
> > To correctly maintain the per-memcg swapcache stat, this patch has
> > adopted to charge the page before adding it to swapcache. One
> > challenge in this option is the failure case of add_to_swap_cache() on
> > which we need to undo the mem_cgroup_charge(). Specifically undoing
> > mem_cgroup_uncharge_swap() is not simple.
> > 
> > To resolve the issue, this patch introduces transaction like interface
> > to charge a page for swapin. The function mem_cgroup_charge_swapin_page()
> > initiates the charging of the page and mem_cgroup_finish_swapin_page()
> > completes the charging process. So, the kernel starts the charging
> > process of the page for swapin with mem_cgroup_charge_swapin_page(),
> > adds the page to the swapcache and on success completes the charging
> > process with mem_cgroup_finish_swapin_page().
> > 
> > Signed-off-by: Shakeel Butt 
> 
> Quite apart from helping with the stat you want, what you've ended
> up with here is a nice cleanup in several different ways (and I'm
> glad Johannes talked you out of __GFP_NOFAIL: much better like this).
> I'll say
> 
> Acked-by: Hugh Dickins 
> 
> but I am quite unhappy with the name mem_cgroup_finish_swapin_page():
> it doesn't finish the swapin, it doesn't finish the page, and I'm
> not persuaded by your paragraph above that there's any "transaction"
> here (if there were, I'd suggest "commit" instead of "finish"'; and
> I'd get worried by the css_put before it's called - but no, that's
> fine, it's independent).
> 
> How about complementing mem_cgroup_charge_swapin_page() with
> mem_cgroup_uncharge_swapin_swap()?  I think that describes well
> what it does, at least in the do_memsw_account() case, and I hope
> we can overlook that it does nothing at all in the other case.

Yes, that's better. The sequence is still somewhat mysterious for
people not overly familiar with memcg swap internals, but it's much
clearer for people who are.

I mildly prefer swapping the swapin bit:

mem_cgroup_swapin_charge_page()
mem_cgroup_swapin_uncharge_swap()

But either way works for me.

> And it really doesn't need a page argument: both places it's called
> have just allocated an order-0 page, there's no chance of a THP here;
> but you might have some idea of future expansion, or matching
> put_swap_page() - I won't object if you prefer to pass in the page.

Agreed.

> > + * mem_cgroup_finish_swapin_page - complete the swapin page charge 
> > transaction
> > + * @page: page charged for swapin
> > + * @entry: swap entry for which the page is charged
> > + *
> > + * This function completes the transaction of charging the page allocated 
> > for
> > + * swapin.
> > + */
> > +void mem_cgroup_finish_swapin_page(struct page *page, swp_entry_t entry)
> > +{
> > /*
> >  * Cgroup1's unified memory+swap counter has been charged with the
> >  * new swapcache page, finish the transfer by uncharging the swap
> > @@ -6760,20 +6796,14 @@ int mem_cgroup_charge(struct page *page, struct 
> > mm_struct *mm, gfp_t gfp_mask)
> >  * correspond 1:1 to page and swap slot lifetimes: we charge the
> >  * page to memory here, and uncharge swap when the slot is freed.
> >  */
> > -   if (do_memsw_account() && PageSwapCache(page)) {
> > -   swp_entry_t entry = { .val = page_private(page) };
> > +   if (!mem_cgroup_disabled() && do_memsw_account()) {
> 
> I understand why you put that !mem_cgroup_disabled() check in there,
> but I have a series of observations on that.
> 
> First I was going to say that it would be better left to
> mem_cgroup_uncharge_swap() itself.
> 
> Then I was going to say that I think it's already covered here
> by the cgroup_memory_noswap check inside do_memsw_account().
> 
> Then, going back to mem_cgroup_uncharge_swap(), I realized that 5.8's
> 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of
> memory control") removed the do_swap_account or cgroup_memory_noswap
> checks from mem_cgroup_uncharge_swap() and swap_cgroup_swapon() and
> swap_cgroup_swapoff() - so since then we have been allocating totally
> unnecessary swap_cgroup arrays when mem_cgroup_disabled() (and
> mem_cgroup_uncharge_swap() has worked by reading the zalloced array).
> 
> I think, or am I confused? If I'm right on that, one of us ought to
> send another patch putting back, either cgroup_memory_noswap checks
> or mem_cgroup_disabled() checks in those three places - I suspect the
> static key mem_cgroup_disabled() is 

Re: [PATCH v2 2/2] mm/memcg: set memcg when split page

2021-03-04 Thread Johannes Weiner
On Thu, Mar 04, 2021 at 07:40:53AM +, Zhou Guanghui wrote:
> As described in the split_page function comment, for the non-compound
> high order page, the sub-pages must be freed individually. If the
> memcg of the first page is valid, the tail pages cannot be uncharged
> when they are freed.
> 
> For example, when alloc_pages_exact is used to allocate 1MB continuous
> physical memory, 2MB is charged(kmemcg is enabled and __GFP_ACCOUNT is
> set). When make_alloc_exact free the unused 1MB and free_pages_exact
> free the applied 1MB, actually, only 4KB(one page) is uncharged.
> 
> Therefore, the memcg of the tail page needs to be set when split page.
> 
> Signed-off-by: Zhou Guanghui 

Acked-by: Johannes Weiner 


Re: [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg

2021-03-04 Thread Johannes Weiner
On Thu, Mar 04, 2021 at 07:40:52AM +, Zhou Guanghui wrote:
> Rename mem_cgroup_split_huge_fixup to split_page_memcg and explicitly
> pass in page number argument.
> 
> In this way, the interface name is more common and can be used by
> potential users. In addition, the complete info(memcg and flag) of
> the memcg needs to be set to the tail pages.
> 
> Signed-off-by: Zhou Guanghui 

Acked-by: Johannes Weiner 


Re: [PATCH v3] memcg: charge before adding to swapcache on swapin

2021-03-04 Thread Johannes Weiner
On Wed, Mar 03, 2021 at 05:42:29PM -0800, Shakeel Butt wrote:
> Currently the kernel adds the page, allocated for swapin, to the
> swapcache before charging the page. This is fine but now we want a
> per-memcg swapcache stat which is essential for folks who wants to
> transparently migrate from cgroup v1's memsw to cgroup v2's memory and
> swap counters. In addition charging a page before exposing it to other
> parts of the kernel is a step in the right direction.
> 
> To correctly maintain the per-memcg swapcache stat, this patch has
> adopted to charge the page before adding it to swapcache. One
> challenge in this option is the failure case of add_to_swap_cache() on
> which we need to undo the mem_cgroup_charge(). Specifically undoing
> mem_cgroup_uncharge_swap() is not simple.
> 
> To resolve the issue, this patch introduces transaction like interface
> to charge a page for swapin. The function mem_cgroup_charge_swapin_page()
> initiates the charging of the page and mem_cgroup_finish_swapin_page()
> completes the charging process. So, the kernel starts the charging
> process of the page for swapin with mem_cgroup_charge_swapin_page(),
> adds the page to the swapcache and on success completes the charging
> process with mem_cgroup_finish_swapin_page().
> 
> Signed-off-by: Shakeel Butt 

The patch looks good to me, I have just a minor documentation nit
below. But with that addressed, please add:

Acked-by: Johannes Weiner 

> +/**
> + * mem_cgroup_charge_swapin_page - charge a newly allocated page for swapin
> + * @page: page to charge
> + * @mm: mm context of the victim
> + * @gfp: reclaim mode
> + * @entry: swap entry for which the page is allocated
> + *
> + * This function marks the start of the transaction of charging the page for
> + * swapin. Complete the transaction with mem_cgroup_finish_swapin_page().
> + *
> + * Returns 0 on success. Otherwise, an error code is returned.
> + */
> +int mem_cgroup_charge_swapin_page(struct page *page, struct mm_struct *mm,
> +   gfp_t gfp, swp_entry_t entry)
> +{
> + struct mem_cgroup *memcg;
> + unsigned short id;
> + int ret;
>  
> - if (!memcg)
> - memcg = get_mem_cgroup_from_mm(mm);
> + if (mem_cgroup_disabled())
> + return 0;
>  
> - ret = try_charge(memcg, gfp_mask, nr_pages);
> - if (ret)
> - goto out_put;
> + id = lookup_swap_cgroup_id(entry);
> + rcu_read_lock();
> + memcg = mem_cgroup_from_id(id);
> + if (!memcg || !css_tryget_online(>css))
> + memcg = get_mem_cgroup_from_mm(mm);
> + rcu_read_unlock();
>  
> - css_get(>css);
> - commit_charge(page, memcg);
> + ret = __mem_cgroup_charge(page, memcg, gfp);
>  
> - local_irq_disable();
> - mem_cgroup_charge_statistics(memcg, page, nr_pages);
> - memcg_check_events(memcg, page);
> - local_irq_enable();
> + css_put(>css);
> + return ret;
> +}
>  
> +/*
> + * mem_cgroup_finish_swapin_page - complete the swapin page charge 
> transaction
> + * @page: page charged for swapin
> + * @entry: swap entry for which the page is charged
> + *
> + * This function completes the transaction of charging the page allocated for
> + * swapin.

It's possible somebody later needs to change things around in the
swapin path and it's not immediately obvious when exactly these two
functions need to be called in the swapin sequence.

Maybe add here and above that charge_swapin_page needs to be called
before we try adding the page to the swapcache, and finish_swapin_page
needs to be called when swapcache insertion has been successful?
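
Something like this as a usage sketch (simplified, error unwinding
abbreviated, labels made up):

	page = alloc_page_vma(gfp_mask, vma, addr);
	if (mem_cgroup_charge_swapin_page(page, mm, gfp_mask, entry))
		goto fail;		/* page not exposed anywhere yet */

	if (add_to_swap_cache(page, entry, gfp_mask, NULL))
		goto fail_uncharge;	/* undo the page charge */

	mem_cgroup_finish_swapin_page(page, entry);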


[tip: sched/core] psi: Pressure states are unlikely

2021-03-04 Thread tip-bot2 for Johannes Weiner
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 24f3cb558f59debe0e9159459bb9627b51b47c17
Gitweb:
https://git.kernel.org/tip/24f3cb558f59debe0e9159459bb9627b51b47c17
Author:Johannes Weiner 
AuthorDate:Wed, 03 Mar 2021 11:46:58 +08:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 04 Mar 2021 09:56:02 +01:00

psi: Pressure states are unlikely

Move the unlikely branches out of line. This eliminates undesirable
jumps during wakeup and sleeps for workloads that aren't under any
sort of resource pressure.

Signed-off-by: Johannes Weiner 
Signed-off-by: Chengming Zhou 
Signed-off-by: Peter Zijlstra (Intel) 
Link: 
https://lkml.kernel.org/r/20210303034659.91735-4-zhouchengm...@bytedance.com
---
 kernel/sched/psi.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 0fe6ff6..3907a6b 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -219,17 +219,17 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
 {
switch (state) {
case PSI_IO_SOME:
-   return tasks[NR_IOWAIT];
+   return unlikely(tasks[NR_IOWAIT]);
case PSI_IO_FULL:
-   return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
+   return unlikely(tasks[NR_IOWAIT] && !tasks[NR_RUNNING]);
case PSI_MEM_SOME:
-   return tasks[NR_MEMSTALL];
+   return unlikely(tasks[NR_MEMSTALL]);
case PSI_MEM_FULL:
-   return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
+   return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
case PSI_CPU_SOME:
-   return tasks[NR_RUNNING] > tasks[NR_ONCPU];
+   return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
case PSI_CPU_FULL:
-   return tasks[NR_RUNNING] && !tasks[NR_ONCPU];
+   return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]);
case PSI_NONIDLE:
return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
tasks[NR_RUNNING];
@@ -729,7 +729,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 * task in a cgroup is in_memstall, the corresponding groupc
 * on that cpu is in PSI_MEM_FULL state.
 */
-   if (groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)
+   if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
state_mask |= (1 << PSI_MEM_FULL);
 
groupc->state_mask = state_mask;


Re: [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization

2021-03-03 Thread Johannes Weiner
On Wed, Mar 03, 2021 at 05:05:15PM +0100, Peter Zijlstra wrote:
> On Wed, Mar 03, 2021 at 04:32:18PM +0100, Peter Zijlstra wrote:
> 
> > Yes, I can do that. Thanks!
> 
> Please double check the patches as found here:
> 
>   
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/core
> 
> I've manually edited the tags.

Looks good to me, thanks!


Re: [PATCH v2] mm: memcontrol: fix kernel stack account

2021-03-03 Thread Johannes Weiner
On Wed, Mar 03, 2021 at 05:39:56PM +0800, Muchun Song wrote:
> For simplification 991e7673859e ("mm: memcontrol: account kernel stack
> per node") has changed the per zone vmalloc backed stack pages
> accounting to per node. By doing that we have lost a certain precision
> because those pages might live in different NUMA nodes. In the end
> NR_KERNEL_STACK_KB exported to the userspace might be over estimated on
> some nodes while underestimated on others.
> 
> This doesn't impose any real problem to correctness of the kernel
> behavior as the counter is not used for any internal processing but it
> can cause some confusion to the userspace.
> 
> Address the problem by accounting each vmalloc backing page to its own
> node.
> 
> Fixes: 991e7673859e ("mm: memcontrol: account kernel stack per node")
> Signed-off-by: Muchun Song 
> Reviewed-by: Shakeel Butt 

With changes proposed by Shakeel and Michal, this looks good to me.

Acked-by: Johannes Weiner 

I guess the BUG_ON() was inspired by memcg_charge_kernel_stack() - not
really your fault for following that example. But yeah, please drop it
from your patch. Thanks!


Re: [PATCH v2 0/4] psi: Add PSI_CPU_FULL state and some code optimization

2021-03-03 Thread Johannes Weiner
On Wed, Mar 03, 2021 at 11:46:55AM +0800, Chengming Zhou wrote:
> This patch series is RESEND of the previous patches on psi subsystem. A few
> weeks passed since the last review, so I put them together and resend for
> more convenient review and merge.
> 
> Patch 1 add PSI_CPU_FULL state means all non-idle tasks in a cgroup are 
> delayed
> on the CPU resource which used by others outside of the cgroup or throttled
> by the cgroup cpu.max configuration.
> 
> Patch 2 use ONCPU state and the current in_memstall flag to detect reclaim,
> remove the hook in timer tick to make code more concise and maintainable.
> And patch 3 adds unlikely() annotations to move the pressure state branches
> out of line to eliminate undesirable jumps during wakeup and sleeps.
> 
> Patch 4 optimize the voluntary sleep switch by remove one call of
> psi_group_change() for every common cgroup ancestor of the two tasks.
> 
> Chengming Zhou (3):
>   psi: Add PSI_CPU_FULL state
>   psi: Use ONCPU state tracking machinery to detect reclaim
>   psi: Optimize task switch inside shared cgroups
> 
> Johannes Weiner (1):
>   psi: pressure states are unlikely

Peter, would you mind routing these through the sched tree for 5.13?


Re: [PATCH v2 2/4] psi: Use ONCPU state tracking machinery to detect reclaim

2021-03-03 Thread Johannes Weiner
On Wed, Mar 03, 2021 at 11:46:57AM +0800, Chengming Zhou wrote:
> Move the reclaim detection from the timer tick to the task state
> tracking machinery using the recently added ONCPU state. And we
> also add task psi_flags changes checking in the psi_task_switch()
> optimization to update the parents properly.
> 
> In terms of performance and cost, this ONCPU task state tracking
> is not cheaper than previous timer tick in aggregate. But the code is
> simpler and shorter this way, so it's a maintainability win. And
> Johannes did some testing with perf bench, the performance and cost
> changes would be acceptable for real workloads.
> 
> Thanks to Johannes Weiner for pointing out the psi_task_switch()
> optimization things and the clearer changelog.
> 
> Signed-off-by: Muchun Song 
> Signed-off-by: Chengming Zhou 

Acked-by: Johannes Weiner 


Re: [PATCH v2 4/4] psi: Optimize task switch inside shared cgroups

2021-03-03 Thread Johannes Weiner
On Wed, Mar 03, 2021 at 11:46:59AM +0800, Chengming Zhou wrote:
> The commit 36b238d57172 ("psi: Optimize switching tasks inside shared
> cgroups") only update cgroups whose state actually changes during a
> task switch only in task preempt case, not in task sleep case.
> 
> We actually don't need to clear and set TSK_ONCPU state for common cgroups
> of next and prev task in sleep case, that can save many psi_group_change
> especially when most activity comes from one leaf cgroup.
> 
> sleep before:
> psi_dequeue()
>   while ((group = iterate_groups(prev)))  # all ancestors
> psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
> psi_task_switch()
>   while ((group = iterate_groups(next)))  # all ancestors
> psi_group_change(next, .set=TSK_ONCPU)
> 
> sleep after:
> psi_dequeue()
>   nop
> psi_task_switch()
>   while ((group = iterate_groups(next)))  # until (prev & next)
> psi_group_change(next, .set=TSK_ONCPU)
>   while ((group = iterate_groups(prev)))  # all ancestors
> psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)
> 
> When a voluntary sleep switches to another task, we remove one call of
> psi_group_change() for every common cgroup ancestor of the two tasks.
> 
> Signed-off-by: Muchun Song 
> Signed-off-by: Chengming Zhou 

Acked-by: Johannes Weiner 


Re: [PATCH] mm/memcg: set memcg when split pages

2021-03-03 Thread Johannes Weiner
On Tue, Mar 02, 2021 at 12:24:41PM -0800, Hugh Dickins wrote:
> On Tue, 2 Mar 2021, Michal Hocko wrote:
> > [Cc Johannes for awareness and fixup Nick's email]
> > 
> > On Tue 02-03-21 01:34:51, Zhou Guanghui wrote:
> > > When split page, the memory cgroup info recorded in first page is
> > > not copied to tail pages. In this case, when the tail pages are
> > > freed, the uncharge operation is not performed. As a result, the
> > > usage of this memcg keeps increasing, and the OOM may occur.
> > > 
> > > So, the copying of first page's memory cgroup info to tail pages
> > > is needed when split page.
> > 
> > I was not aware that alloc_pages_exact is used for accounted allocations
> > but git grep told me otherwise so this is not a theoretical one. Both
> > users (arm64 and s390 kvm) are quite recent AFAICS. split_page is also
> > used in dma allocator but I got lost in indirection so I have no idea
> > whether there are any users there.
> 
> Yes, it's a bit worrying that such a low-level thing as split_page()
> can now get caught up in memcg accounting, but I suppose that's okay.
> 
> I feel rather strongly that whichever way it is done, THP splitting
> and split_page() should use the same interface to memcg.
> 
> And a look at mem_cgroup_split_huge_fixup() suggests that nowadays
> there need to be css_get()s too - or better, a css_get_many().
> 
> Its #ifdef CONFIG_TRANSPARENT_HUGEPAGE should be removed, rename
> it mem_cgroup_split_page_fixup(), and take order from caller.

+1

There is already a split_page_owner() in both these places as well
which does a similar thing. Maybe we can match that by calling it
split_page_memcg() and having it take a nr of pages?
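
That is, matching the existing call at the two split sites - a sketch of
the proposed interface, not existing code:

	split_page_owner(page, nr);
	split_page_memcg(page, nr);	/* propagate memcg_data to the tail pages */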

> Though I've never much liked that separate pass: would it be
> better page by page, like this copy_page_memcg() does?  Though
> mem_cgroup_disabled() and css_getting make that less appealing.

Agreed on both counts. mem_cgroup_disabled() is a jump label and would
be okay, IMO, but the refcounting - though it is (usually) per-cpu -
adds at least two branches and rcu read locking.


Re: [PATCH] mm: introduce clear all vm events counters

2021-03-02 Thread Johannes Weiner
On Tue, Mar 02, 2021 at 04:00:34PM +0530, pi...@codeaurora.org wrote:
> On 2021-03-01 20:41, Johannes Weiner wrote:
> > On Mon, Mar 01, 2021 at 04:19:26PM +0530, Pintu Kumar wrote:
> > > At times there is a need to regularly monitor vm counters while we
> > > reproduce some issue, or it could be as simple as gathering some
> > > system
> > > statistics when we run some scenario and every time we like to start
> > > from
> > > beginning.
> > > The current steps are:
> > > Dump /proc/vmstat
> > > Run some scenario
> > > Dump /proc/vmstat again
> > > Generate some data or graph
> > > reboot and repeat again
> > 
> > You can subtract the first vmstat dump from the second to get the
> > event delta for the scenario run. That's what I do, and I'd assume
> > most people are doing. Am I missing something?
> 
> Thanks so much for your comments.
> Yes in most cases it works.
> 
> But I guess there are sometimes where we need to compare with fresh data
> (just like reboot) at least for some of the counters.
> Suppose we wanted to monitor pgalloc_normal and pgfree.

Hopefully these would already be balanced out pretty well before you
run a test, or there is a risk that whatever outstanding allocations
there are can cause a large number of frees during your test that
don't match up to your recorded allocation events. Resetting to zero
doesn't eliminate the risk of such background noise.

> Or, suppose we want to monitor until the field becomes non-zero..
> Or, how certain values are changing compared to fresh reboot.
> Or, suppose we want to reset all counters after boot and start capturing
> fresh stats.

Again, there simply is no mathematical difference between

reset events to 0
run test
look at events - 0

and

read events baseline
run test
look at events - baseline

> Some of the counters could be growing too large and too fast. Will there be
> chances of overflow ?
> Then resetting using this could help without rebooting.

Overflows are just a fact of life on 32 bit systems. However, they can
also be trivially handled - you can always subtract a ulong start
state from a ulong end state and get a reliable delta of up to 2^32
events, whether the end state has overflowed or not.
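
For example (userspace sketch; the /proc/vmstat parsing helper is made
up, the point being that unsigned subtraction yields the right delta
modulo 2^32 even if the counter wrapped in between):

	unsigned long start, end, delta;

	start = read_vmstat("pgfree");	/* hypothetical parser helper */
	/* ... run the scenario ... */
	end = read_vmstat("pgfree");

	delta = end - start;		/* correct even across an overflow */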

The bottom line is that the benefit of this patch adds a minor
convenience for something that can already be done in userspace. But
the downside is that there would be one more possible source of noise
for kernel developers to consider when looking at a bug report. Plus
the extra code and user interface that need to be maintained.

I don't think we should merge this patch.


Re: [PATCH 2/5] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page

2021-03-01 Thread Johannes Weiner
Muchun, can you please reduce the CC list to mm/memcg folks only for
the next submission? I think probably 80% of the current recipients
don't care ;-)

On Mon, Mar 01, 2021 at 10:11:45AM -0800, Shakeel Butt wrote:
> On Sun, Feb 28, 2021 at 10:25 PM Muchun Song  wrote:
> >
> > We want to reuse the obj_cgroup APIs to reparent the kmem pages when
> > the memcg offlined. If we do this, we should store an object cgroup
> > pointer to page->memcg_data for the kmem pages.
> >
> > Finally, page->memcg_data can have 3 different meanings.
> >
> >   1) For the slab pages, page->memcg_data points to an object cgroups
> >  vector.
> >
> >   2) For the kmem pages (exclude the slab pages), page->memcg_data
> >  points to an object cgroup.
> >
> >   3) For the user pages (e.g. the LRU pages), page->memcg_data points
> >  to a memory cgroup.
> >
> > Currently we always get the memcg associated with a page via page_memcg
> > or page_memcg_rcu. page_memcg_check is special, it has to be used in
> > cases when it's not known if a page has an associated memory cgroup
> > pointer or an object cgroups vector. Because page->memcg_data of a
> > kmem page will no longer point to a memory cgroup after a later patch,
> > page_memcg and page_memcg_rcu will no longer be applicable to the kmem
> > pages. In this patch, we introduce page_memcg_kmem to get the memcg
> > associated with the kmem pages. And make page_memcg and page_memcg_rcu
> > no longer apply to the kmem pages.
> >
> > In the end, there are 4 helpers to get the memcg associated with a
> > page. The usage is as follows.
> >
> >   1) Get the memory cgroup associated with a non-kmem page (e.g. the LRU
> >  pages).
> >
> >  - page_memcg()
> >  - page_memcg_rcu()
> 
> Can you rename these to page_memcg_lru[_rcu] to make it explicit that
> they are for LRU pages?

The next patch removes page_memcg_kmem() again to replace it with
page_objcg(). That should (luckily) remove the need for this
distinction and keep page_memcg() simple and obvious.

It would be better to not introduce page_memcg_kmem() in the first
place in this patch, IMO.
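
In rough terms, I'd expect the end state to include a helper of this
shape (sketch only, based on the memcg_data layout described above - not
the actual patch):

    static inline struct obj_cgroup *page_objcg(struct page *page)
    {
            unsigned long memcg_data = page->memcg_data;

            /* Only valid for kmem pages; slab pages carry an objcg vector. */
            VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
            VM_BUG_ON_PAGE(!(memcg_data & MEMCG_DATA_KMEM), page);

            return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
    }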


Re: [PATCH] mm: introduce clear all vm events counters

2021-03-01 Thread Johannes Weiner
On Mon, Mar 01, 2021 at 04:19:26PM +0530, Pintu Kumar wrote:
> At times there is a need to regularly monitor vm counters while we
> reproduce some issue, or it could be as simple as gathering some system
> statistics when we run some scenario and every time we would like to start
> from the beginning.
> The current steps are:
> Dump /proc/vmstat
> Run some scenario
> Dump /proc/vmstat again
> Generate some data or graph
> reboot and repeat again

You can subtract the first vmstat dump from the second to get the
event delta for the scenario run. That's what I do, and I'd assume
most people are doing. Am I missing something?
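
FWIW, the userspace side doesn't need to be more than something like this
(rough sketch, not a polished tool; assumes /proc/vmstat fits into the
fixed-size table):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_ITEMS 512

    struct item {
            char name[64];
            unsigned long val;
    };

    static int read_vmstat(struct item *items)
    {
            FILE *f = fopen("/proc/vmstat", "r");
            int n = 0;

            if (!f)
                    return -1;
            while (n < MAX_ITEMS &&
                   fscanf(f, "%63s %lu", items[n].name, &items[n].val) == 2)
                    n++;
            fclose(f);
            return n;
    }

    int main(int argc, char **argv)
    {
            static struct item before[MAX_ITEMS], after[MAX_ITEMS];
            int nb, na, i, j;

            nb = read_vmstat(before);
            if (argc > 1)
                    system(argv[1]);        /* run the scenario */
            na = read_vmstat(after);
            if (nb < 0 || na < 0)
                    return 1;

            for (i = 0; i < na; i++)
                    for (j = 0; j < nb; j++)
                            if (!strcmp(after[i].name, before[j].name)) {
                                    /* unsigned delta, correct across a wrap */
                                    printf("%s %lu\n", after[i].name,
                                           after[i].val - before[j].val);
                                    break;
                            }
            return 0;
    }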


Re: [v8 PATCH 00/13] Make shrinker's nr_deferred memcg aware

2021-03-01 Thread Johannes Weiner
Hello Yang,

On Thu, Feb 25, 2021 at 09:00:16AM -0800, Yang Shi wrote:
> Hi Andrew,
> 
> Just checking in whether this series is on your radar. The patch 1/13
> ~ patch 12/13 have been reviewed and acked. Vlastimil had had some
> comments on patch 13/13, I'm not sure if he is going to continue
> reviewing that one. I hope the last patch could get into the -mm tree
> along with the others so that it can get a broader test. What do you
> think about it?

The merge window for 5.12 is/has been open, which is when maintainers
are busy getting everything from the previous development cycle ready
to send upstream. Usually, only fixes but no new features are picked
up during that time. If you don't hear back, try resending in a week.

That reminds me, I also have patches I need to resend :)


Re: [PATCH] memcg: cleanup root memcg checks

2021-02-24 Thread Johannes Weiner
On Tue, Feb 23, 2021 at 12:56:25PM -0800, Shakeel Butt wrote:
> Replace the implicit checking of root memcg with explicit root memcg
> checking i.e. !css->parent with mem_cgroup_is_root().
> 
> Signed-off-by: Shakeel Butt 
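
The cleanup is essentially of this shape (illustrative sketch, not an
actual hunk from the patch):

    /* before: implicit root-memcg check */
    if (!memcg->css.parent)
            return;

    /* after: explicit and self-documenting */
    if (mem_cgroup_is_root(memcg))
            return;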

Acked-by: Johannes Weiner 


Re: [PATCH] memcg: enable memcg oom-kill for __GFP_NOFAIL

2021-02-24 Thread Johannes Weiner
On Tue, Feb 23, 2021 at 12:43:37PM -0800, Shakeel Butt wrote:
> In the era of async memcg oom-killer, the commit a0d8b00a3381 ("mm:
> memcg: do not declare OOM from __GFP_NOFAIL allocations") added the code
> to skip memcg oom-killer for __GFP_NOFAIL allocations. The reason was
> that the __GFP_NOFAIL callers will not enter async oom synchronization
> path and will keep the task marked as in memcg oom. At that time the
> tasks marked in memcg oom can bypass the memcg limits and the oom
> synchronization would happen later, in a subsequent userspace-triggered
> page fault, thus letting the task marked as under memcg oom bypass the
> memcg limit for an arbitrary amount of time.
> 
> With the synchronous memcg oom-killer (commit 29ef680ae7c21 ("memcg,
> oom: move out_of_memory back to the charge path")) and not letting the
> task marked under memcg oom to bypass the memcg limits (commit
> 1f14c1ac19aa4 ("mm: memcg: do not allow task about to OOM kill to bypass
> the limit")), we can again allow __GFP_NOFAIL allocations to trigger
> memcg oom-kill. This will make memcg oom behavior closer to page
> allocator oom behavior.
> 
> Signed-off-by: Shakeel Butt 

Acked-by: Johannes Weiner 


Re: [PATCH v2 3/3] mm: Fix missing mem cgroup soft limit tree updates

2021-02-23 Thread Johannes Weiner
On Mon, Feb 22, 2021 at 10:38:27AM -0800, Tim Chen wrote:
> Johannes,
> 
> Thanks for your feedback.  Since Michal has concerns about the overhead
> this patch could incur, I think we'll hold the patch for now.  If later
> on Michal think that this patch is a good idea, I'll incorporate these
> changes you suggested.

That works for me.

Thanks!


Re: [PATCH] memcg: charge before adding to swapcache on swapin

2021-02-19 Thread Johannes Weiner
On Fri, Feb 19, 2021 at 02:44:05PM -0800, Shakeel Butt wrote:
> Currently the kernel adds the page, allocated for swapin, to the
> swapcache before charging the page. This is fine but now we want a
> per-memcg swapcache stat which is essential for folks who want to
> transparently migrate from cgroup v1's memsw to cgroup v2's memory and
> swap counters.
> 
> To correctly maintain the per-memcg swapcache stat, one option which
> this patch has adopted is to charge the page before adding it to
> swapcache. One challenge in this option is the failure case of
> add_to_swap_cache() on which we need to undo the mem_cgroup_charge().
> Specifically undoing mem_cgroup_uncharge_swap() is not simple.
> 
> This patch circumvents this specific issue by removing the failure path
> of add_to_swap_cache() by providing __GFP_NOFAIL. Please note that in
> this specific situation ENOMEM was the only possible failure of
> add_to_swap_cache() which is removed by using __GFP_NOFAIL.
> 
> Another option was to use __mod_memcg_lruvec_state(NR_SWAPCACHE) in
> mem_cgroup_charge() but then we need to take care of the do_swap_page() case
> where synchronous swap devices bypass the swapcache. The do_swap_page()
> already does hackery to set and reset PageSwapCache bit to make
> mem_cgroup_charge() execute the swap accounting code and then we would
> need to add an additional parameter to tell it not to touch the NR_SWAPCACHE
> stat, as that code path bypasses the swapcache.
> 
> This patch adds a memcg charging API explicitly for swapin pages and
> cleaned up do_swap_page() to not set and reset PageSwapCache bit.
> 
> Signed-off-by: Shakeel Butt 

The patch makes sense to me. While it extends the charge interface, I
actually quite like that it charges the page earlier - before putting
it into wider circulation. It's a step in the right direction.

But IMO the semantics of mem_cgroup_charge_swapin_page() are a bit too
fickle: the __GFP_NOFAIL in add_to_swap_cache() works around it, but
having a must-not-fail-after-this line makes the code tricky to work
on and error prone.

It would be nicer to do a proper transaction sequence.

> @@ -497,16 +497,15 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>   __SetPageLocked(page);
>   __SetPageSwapBacked(page);
>  
> - /* May fail (-ENOMEM) if XArray node allocation failed. */
> - if (add_to_swap_cache(page, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow)) {
> - put_swap_page(page, entry);
> + if (mem_cgroup_charge_swapin_page(page, NULL, gfp_mask, entry))
>   goto fail_unlock;
> - }
>  
> - if (mem_cgroup_charge(page, NULL, gfp_mask)) {
> - delete_from_swap_cache(page);
> - goto fail_unlock;
> - }
> + /*
> +  * Use __GFP_NOFAIL to not worry about undoing the changes done by
> +  * mem_cgroup_charge_swapin_page() on failure of add_to_swap_cache().
> +  */
> + add_to_swap_cache(page, entry,
> +   (gfp_mask|__GFP_NOFAIL) & GFP_RECLAIM_MASK, &shadow);

How about:

mem_cgroup_charge_swapin_page()
add_to_swap_cache()
mem_cgroup_finish_swapin_page()

where finish_swapin_page() only uncharges the swap entry (on cgroup1)
once the swap->memory transition is complete?
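
Something like this, roughly (sketch only; mem_cgroup_charge_swapin_page()
is from this patch, mem_cgroup_finish_swapin_page() is the not-yet-existing
helper proposed above):

    /* Charge the page up front; the swap entry stays charged for now. */
    if (mem_cgroup_charge_swapin_page(page, NULL, gfp_mask, entry))
            goto fail_unlock;

    /*
     * May still fail (-ENOMEM), but backing out is simple: nothing about
     * the swap entry has been committed yet, and the page charge is
     * dropped when the page is freed on the error path.
     */
    if (add_to_swap_cache(page, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow)) {
            put_swap_page(page, entry);
            goto fail_unlock;
    }

    /* The swap->memory transition is complete: drop the swap charge. */
    mem_cgroup_finish_swapin_page(page, entry);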

Otherwise the patch looks good to me.


Re: [PATCH v3 4/8] cgroup: rstat: support cgroup1

2021-02-18 Thread Johannes Weiner
On Thu, Feb 18, 2021 at 04:45:11PM +0100, Michal Koutný wrote:
> On Wed, Feb 17, 2021 at 03:52:59PM -0500, Johannes Weiner wrote:
> > In this case, we're talking about a relatively small data structure
> > and the overhead is per mountpoint.
> IIUC, it is per cgroup of each mountpoint. But I still accept
> the argument above. Furthermore, this can be changed later.

Oops, you're right of course.

> > The default root group has statically preallocated percpu data before
> > and after this patch. See cgroup.c:
> I stand corrected, the comment is still valid.
> 
> Therefore,
> Reviewed-by: Michal Koutný 

Thanks for your reviews, Michal!


Re: [PATCH v2 3/3] mm: Fix missing mem cgroup soft limit tree updates

2021-02-17 Thread Johannes Weiner
On Wed, Feb 17, 2021 at 12:41:36PM -0800, Tim Chen wrote:
> On a per-node basis, the mem cgroup soft limit tree on each node tracks
> how much a cgroup has exceeded its memory soft limit and sorts
> the cgroups by their excess usage.  On page release, the trees are not
> updated right away, until we have gathered a batch of pages belonging to
> the same cgroup. This reduces the frequency of updating the soft limit tree
> and locking of the tree and associated cgroup.
> 
> However, the batch of pages could contain pages from multiple nodes but
> only the soft limit tree from one node would get updated.  Change the
> logic so that we update the tree in batch of pages, with each batch of
> pages all in the same mem cgroup and memory node.  An update is issued for
> the batch of pages of a node collected till now whenever we encounter
> a page belonging to a different node.  Note that this batching for
> the same node logic is only relevant for v1 cgroup that has a memory
> soft limit.
> 
> Reviewed-by: Ying Huang 
> Signed-off-by: Tim Chen 
> ---
>  mm/memcontrol.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d72449eeb85a..8bddee75f5cb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6804,6 +6804,7 @@ struct uncharge_gather {
>   unsigned long pgpgout;
>   unsigned long nr_kmem;
>   struct page *dummy_page;
> + int nid;
>  };
>  
>  static inline void uncharge_gather_clear(struct uncharge_gather *ug)
> @@ -6849,7 +6850,13 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
>* exclusive access to the page.
>*/
>  
> - if (ug->memcg != page_memcg(page)) {
> + if (ug->memcg != page_memcg(page) ||
> + /*
> +  * Update soft limit tree used in v1 cgroup in page batch for
> +  * the same node. Relevant only to v1 cgroup with a soft limit.
> +  */
> + (ug->dummy_page && ug->nid != page_to_nid(page) &&
> +  ug->memcg->soft_limit != PAGE_COUNTER_MAX)) {

Sorry, I used weird phrasing in my last email.

Can you please preface the checks you're adding with a
!cgroup_subsys_on_dfl(memory_cgrp_subsys) to static branch for
cgroup1? The uncharge path is pretty hot, and this would avoid the
runtime overhead on cgroup2 at least, which doesn't have the SL.

Also, do we need the ug->dummy_page check? It's only NULL on the first
loop - where ug->memcg is NULL as well and the branch is taken anyway.

The soft limit check is also slightly cheaper than the nid check, as
page_to_nid() might be out-of-line, so we should do it first. This?

/*
 * Batch-uncharge all pages of the same memcg.
 *
 * Unless we're looking at a cgroup1 with a softlimit
 * set: the soft limit trees are maintained per-node
 * and updated on uncharge (via dummy_page), so keep
 * batches confined to a single node as well.
 */
if (ug->memcg != page_memcg(page) ||
(!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
 ug->memcg->soft_limit != PAGE_COUNTER_MAX &&
 ug->nid != page_to_nid(page)))


Re: [PATCH v3 4/8] cgroup: rstat: support cgroup1

2021-02-17 Thread Johannes Weiner
On Wed, Feb 17, 2021 at 06:42:32PM +0100, Michal Koutný wrote:
> Hello.
> 
> On Tue, Feb 09, 2021 at 11:33:00AM -0500, Johannes Weiner wrote:
> > @@ -1971,10 +1978,14 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
> > if (ret)
> > goto destroy_root;
> >  
> > -   ret = rebind_subsystems(root, ss_mask);
> > +   ret = cgroup_rstat_init(root_cgrp);
> Would it make sense to do cgroup_rstat_init() only if there's a subsys
> in ss_mask that makes use of rstat?
> (On legacy systems there could be individual hierarchy for each
> controller so the rstat space can be saved.)

It's possible, but I don't think it's worth the trouble.

It would have to be done from rebind_subsystems(), as remount can add
more subsystems to an existing cgroup1 root. That in turn means we'd
have to have separate init paths for cgroup1 and cgroup2.

While we split cgroup1 and cgroup2 paths where necessary in the code,
it's a significant maintenance burden and a not unlikely source of
subtle errors (see the recent 'fix swap undercounting in cgroup2').

In this case, we're talking about a relatively small data structure
and the overhead is per mountpoint. Comparatively, we're allocating
the full vmstats structures for cgroup1 groups which barely use them,
and cgroup1 softlimit tree structures for each cgroup2 group.

So I don't think it's a good tradeoff. Subtle bugs that require kernel
patches are more disruptive to the user experience than the amount of
memory in question here.

> > @@ -285,8 +285,6 @@ void __init cgroup_rstat_boot(void)
> >  
> > for_each_possible_cpu(cpu)
> > raw_spin_lock_init(per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu));
> > -
> > -   BUG_ON(cgroup_rstat_init(&cgrp_dfl_root.cgrp));
> >  }
> Regardless of the suggestion above, this removal obsoletes the comment
> in cgroup_rstat_init():
> 
>  	int cpu;
>  
> -	/* the root cgrp has rstat_cpu preallocated */
>  	if (!cgrp->rstat_cpu) {
>  		cgrp->rstat_cpu = alloc_percpu(struct cgroup_rstat_cpu);

Oh, I'm not removing the init call, I'm merely moving it from
cgroup_rstat_boot() to cgroup_setup_root().

The default root group has statically preallocated percpu data before
and after this patch. See cgroup.c:

  static DEFINE_PER_CPU(struct cgroup_rstat_cpu, cgrp_dfl_root_rstat_cpu);

  /* the default hierarchy */
  struct cgroup_root cgrp_dfl_root = { .cgrp.rstat_cpu = &cgrp_dfl_root_rstat_cpu };
  EXPORT_SYMBOL_GPL(cgrp_dfl_root);

