Re: Reply: general protection fault in refill_obj_stock

2024-04-13 Thread Shakeel Butt
On Tue, Apr 02, 2024 at 09:50:54AM +0800, Ubisectech Sirius wrote:
> > On Mon, Apr 01, 2024 at 03:04:46PM +0800, Ubisectech Sirius wrote:
> > Hello.
> > We are the Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. 
> > Recently, our team discovered an issue in Linux kernel 6.7. Attached to 
> > the email was a PoC file for the issue.
> 
> > Thank you for the report!
> 
> > I tried to compile and run your test program for about half an hour
> > on a virtual machine running 6.7 with enabled KASAN, but wasn't able
> > to reproduce the problem.
> 
> > Can you, please, share a bit more information? How long does it take
> > to reproduce? Do you mind sharing your kernel config? Is there anything 
> > special
> > about your setup? What are exact steps to reproduce the problem?
> > Is this problem reproducible on 6.6?
> 
> Hi. 
> The .config of Linux kernel 6.7 has been sent to you as an attachment. And the 
> problem is reproducible on 6.6.

Can you please share the reproducer of this issue?

> 
> > It's interesting that the problem looks like use-after-free for the objcg 
> > pointer
> > but happens in the context of udev-systemd, which I believe should be 
> > fairly stable
> > and its cgroup is not going anywhere.
> 
> > Thanks!
> 





Re: [PATCH net-next v3 2/3] net: introduce abstraction for network memory

2024-01-10 Thread Shakeel Butt
On Thu, Jan 4, 2024 at 1:44 PM Jakub Kicinski  wrote:
>
[...]
>
> You seem to be trying hard to make struct netmem a thing.
> Perhaps you have a reason I'm not getting?

Mina already went with your suggestion and that is fine. To me, struct
netmem is more aesthetically aligned with the existing struct
encoded_page approach, but I don't have a strong opinion one way or
the other. However, it seems like you have a stronger preference for
the __bitwise approach. Is there a technical reason, or is it just
aesthetics?



Re: [PATCH net-next v3 3/3] net: add netmem_ref to skb_frag_t

2023-12-21 Thread Shakeel Butt
On Wed, Dec 20, 2023 at 01:45:02PM -0800, Mina Almasry wrote:
> diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
> index 65d1f6755f98..3180a54b2c68 100644
> --- a/net/kcm/kcmsock.c
> +++ b/net/kcm/kcmsock.c
> @@ -636,9 +636,15 @@ static int kcm_write_msgs(struct kcm_sock *kcm)
>   for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
>   msize += skb_shinfo(skb)->frags[i].bv_len;

Don't you need to cast the above to bio_vec to get bv_len? skb_frag_t
does not have bv_len anymore.
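
For reference, a minimal sketch of how the size accumulation could stay
away from bv_len entirely by using the existing skb_frag_size()
accessor (just an illustration, not necessarily what this patch should
do):

	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
		msize += skb_frag_size(&skb_shinfo(skb)->frags[i]);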

>  
> + /* The cast to struct bio_vec* here assumes the frags are
> +  * struct page based. WARN if there is no page in this skb.
> +  */
> + DEBUG_NET_WARN_ON_ONCE(
> + !skb_frag_page(_shinfo(skb)->frags[0]));
> +
>   iov_iter_bvec(_iter, ITER_SOURCE,
> -   skb_shinfo(skb)->frags, skb_shinfo(skb)->nr_frags,
> -   msize);
> +   (const struct bio_vec *)skb_shinfo(skb)->frags,
> +   skb_shinfo(skb)->nr_frags, msize);
>   iov_iter_advance(_iter, txm->frag_offset);
>  
>   do {
> -- 
> 2.43.0.472.g3155946c3a-goog
> 



Re: [PATCH net-next v3 2/3] net: introduce abstraction for network memory

2023-12-21 Thread Shakeel Butt
On Wed, Dec 20, 2023 at 01:45:01PM -0800, Mina Almasry wrote:
> Add the netmem_ref type, an abstraction for network memory.
> 
> To add support for new memory types to the net stack, we must first
> abstract the current memory type. Currently parts of the net stack
> use struct page directly:
> 
> - page_pool
> - drivers
> - skb_frag_t
> 
> Originally the plan was to reuse struct page* for the new memory types,
> and to set the LSB on the page* to indicate it's not really a page.
> However, for compiler type checking we need to introduce a new type.
> 
> netmem_ref is introduced to abstract the underlying memory type. Currently
> it's a no-op abstraction that is always a struct page underneath. In
> parallel there is an undergoing effort to add support for devmem to the
> net stack:
> 
> https://lore.kernel.org/netdev/20231208005250.2910004-1-almasrym...@google.com/
> 
> Signed-off-by: Mina Almasry 
> 
> ---
> 
> v3:
> 
> - Modify struct netmem from a union of struct page + new types to an opaque
>   netmem_ref type.  I went with:
> 
>   +typedef void *__bitwise netmem_ref;
> 
>   rather than this that Jakub recommended:
> 
>   +typedef unsigned long __bitwise netmem_ref;
> 
>   Because with the latter the compiler issues warnings to cast NULL to
>   netmem_ref. I hope that's ok.
> 

Can you share what the warning was? You might just need the __force
attribute, though you might end up needing __force in a lot of places.
I wonder if you can just follow the struct encoded_page example
verbatim here.
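
For reference, a minimal sketch of the unsigned long flavor with the
casts confined to a pair of helpers (helper names here are illustrative
and not taken from the series):

	typedef unsigned long __bitwise netmem_ref;

	/* Keep the __force casts in one place so callers stay warning-free. */
	static inline netmem_ref page_to_netmem(struct page *page)
	{
		return (__force netmem_ref)page;
	}

	static inline struct page *netmem_to_page(netmem_ref netmem)
	{
		return (__force struct page *)netmem;
	}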




Re: [PATCH net-next v3 1/3] vsock/virtio: use skb_frag_*() helpers

2023-12-21 Thread Shakeel Butt
On Wed, Dec 20, 2023 at 1:45 PM Mina Almasry  wrote:
>
> Minor fix for virtio: code wanting to access the fields inside an skb
> frag should use the skb_frag_*() helpers, instead of accessing the
> fields directly. This allows for extensions where the underlying
> memory is not a page.
>
> Signed-off-by: Mina Almasry 

Reviewed-by: Shakeel Butt 



Re: [RFC] memory reserve for userspace oom-killer

2021-04-20 Thread Shakeel Butt
On Mon, Apr 19, 2021 at 11:46 PM Michal Hocko  wrote:
>
> On Mon 19-04-21 18:44:02, Shakeel Butt wrote:
[...]
> > memory.min. However, a new allocation from the userspace oom-killer can
> > still get stuck in reclaim, and policy-rich oom-killers do trigger
> > new allocations through syscalls or even the heap.
>
> Can you be more specific please?
>

To decide when to kill, the oom-killer has to read a lot of metrics.
It has to open a lot of files to read them, and there will definitely
be new allocations involved in those operations. For example, reading
memory.stat does a page-size allocation. Similarly, to take action the
oom-killer may have to read the cgroup.procs file, which again
allocates internally.
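
As a rough illustration of what a single decision involves (the cgroup
paths below are made up; each read also triggers allocations inside the
kernel on top of the userspace buffers):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	/* Sketch: every oom decision re-reads metrics, and each read allocates. */
	static long read_cgroup_file(const char *path, char *buf, size_t len)
	{
		int fd = open(path, O_RDONLY);
		long n;

		if (fd < 0)
			return -1;
		n = read(fd, buf, len - 1);	/* e.g. memory.stat does a page-size allocation */
		close(fd);
		if (n >= 0)
			buf[n] = '\0';
		return n;
	}

	int main(void)
	{
		char buf[8192];

		read_cgroup_file("/sys/fs/cgroup/job/memory.stat", buf, sizeof(buf));
		read_cgroup_file("/sys/fs/cgroup/job/cgroup.procs", buf, sizeof(buf));
		return 0;
	}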

Regarding sophisticated oom policies, I can give one example of our
cluster-level policy. For robustness, many user-facing jobs run a lot
of instances in a cluster to handle failures. Such jobs tolerate some
amount of failures, but they still require that the number of running
instances not fall below some threshold. Normally killing such jobs is
fine, but we do want to make sure that we do not violate their
cluster-level agreement. So, the userspace oom-killer may dynamically
need to confirm whether such a job can be killed.

[...]
> > To reliably solve this problem, we need to give guaranteed memory to
> > the userspace oom-killer.
>
> There is nothing like that. Even memory reserves are a finite resource
> which can be consumed as it is sharing those reserves with other users
> who are not necessarily coordinated. So before we start discussing
> making this even more muddy by handing over memory reserves to the
> userspace we should really examine whether pre-allocation is something
> that will not work.
>

We actually explored whether we could restrict the oom-killer to
syscalls that do not do memory allocations. We concluded that is
neither practical nor maintainable. Whatever list we come up with will
soon be outdated. In addition, converting all the must-have syscalls
to not do allocations is not possible/practical.

> > At the moment we are contemplating between
> > the following options and I would like to get some feedback.
> >
> > 1. prctl(PF_MEMALLOC)
> >
> > The idea is to give userspace oom-killer (just one thread which is
> > finding the appropriate victims and will be sending SIGKILLs) access
> > to MEMALLOC reserves. Most of the time the preallocation, mlock and
> > memory.min will be good enough but for rare occasions, when the
> > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> > protect it from reclaim and let the allocation dip into the memory
> > reserves.
>
> I do not think that handing over an unlimited ticket to the memory
> reserves to userspace is a good idea. Even the in kernel oom killer is
> bound to a partial access to reserves. So if we really want this then
> it should be in sync with and bound by the ALLOC_OOM.
>

Makes sense.

> > The misuse of this feature would be risky but it can be limited to
> > privileged applications. Userspace oom-killer is the only appropriate
> > user of this feature. This option is simple to implement.
> >
> > 2. Mempool
> >
> > The idea is to preallocate mempool with a given amount of memory for
> > userspace oom-killer. Preferably this will be per-thread and
> > oom-killer can preallocate mempool for its specific threads. The core
> > page allocator can check before going to the reclaim path if the task
> > has private access to the mempool and return page from it if yes.
>
> Could you elaborate some more on how this would be controlled from the
> userspace? A dedicated syscall? A driver?
>

I was thinking of simply prctl(SET_MEMPOOL, bytes) to assign a mempool
to a thread (not shared between threads) and prctl(RESET_MEMPOOL) to
free the mempool.
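
To make that concrete, usage would look roughly like the following
(the PR_* command values are hypothetical and used purely for
illustration, they do not exist in any kernel):

	/* Hypothetical prctl commands, for illustration only. */
	#define PR_SET_MEMPOOL		0x53504d31
	#define PR_RESET_MEMPOOL	0x53504d32

	/* Reserve 2MB of pre-charged memory for this thread only. */
	prctl(PR_SET_MEMPOOL, 2 * 1024 * 1024, 0, 0, 0);

	/* ... oom-killer work that may allocate ... */

	prctl(PR_RESET_MEMPOOL, 0, 0, 0, 0);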

> > This option would be more complicated than the previous option as the
> > lifecycle of the page from the mempool would be more sophisticated.
> > Additionally the current mempool does not handle higher order pages
> > and we might need to extend it to allow such allocations. Though this
> > feature might have more use-cases and it would be less risky than the
> > previous option.
>
> I would tend to agree.
>
> > Another idea I had was to use kthread based oom-killer and provide the
> > policies through eBPF program. Though I am not sure how to make it
> > monitor arbitrary metrics and if that can be done without any
> > allocations.
>
> A kernel module or eBPF to implement oom decisions has already been
> discussed few years back. But I am afraid this would be hard to wire in
> for anything except for the victim selection. I am not sure it is
> maint

[RFC] memory reserve for userspace oom-killer

2021-04-19 Thread Shakeel Butt
Proposal: Provide memory guarantees to userspace oom-killer.

Background:

Issues with kernel oom-killer:
1. Very conservative and prefer to reclaim. Applications can suffer
for a long time.
2. Borrows the context of the allocator which can be resource limited
(low sched priority or limited CPU quota).
3. Serialized by global lock.
4. Very simplistic oom victim selection policy.

These issues are resolved through userspace oom-killer by:
1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to
early detect suffering.
2. Independent process context which can be given dedicated CPU quota
and high scheduling priority.
3. Can be more aggressive as required.
4. Can implement sophisticated business logic/policies.

Android's LMKD and Facebook's oomd are the prime examples of userspace
oom-killers. One of the biggest challenges for userspace oom-killers is
that they may have to function under intense memory pressure and are
prone to getting stuck in memory reclaim themselves. Current userspace
oom-killers aim to avoid this situation by preallocating user memory
and protecting themselves from global reclaim with mlock or memory.min.
However, a new allocation from the userspace oom-killer can still get
stuck in reclaim, and policy-rich oom-killers do trigger new
allocations through syscalls or even the heap.

Our attempt at a userspace oom-killer faces similar challenges.
Particularly at the tail, on very highly utilized machines, we have
observed the userspace oom-killer failing spectacularly in direct
reclaim in many possible ways. We have seen the oom-killer stuck in
direct reclaim throttling, and stuck in reclaim while allocations from
interrupts keep stealing the reclaimed memory. We have even observed
systems where all the processes were stuck in
throttle_direct_reclaim() and only kswapd was running, with the
interrupts stealing the memory reclaimed by kswapd.

To reliably solve this problem, we need to give guaranteed memory to
the userspace oom-killer. At the moment we are weighing the following
options and I would like to get some feedback.

1. prctl(PF_MEMALLOC)

The idea is to give the userspace oom-killer (just one thread, the one
finding the appropriate victims and sending SIGKILLs) access to
MEMALLOC reserves. Most of the time the preallocation, mlock and
memory.min will be good enough, but for the rare occasions when the
userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
protect it from reclaim and let the allocation dip into the memory
reserves.

Misuse of this feature would be risky, but it can be limited to
privileged applications. A userspace oom-killer is the only appropriate
user of this feature. This option is simple to implement.
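
As a rough sketch of the kernel side (the entry point below is
hypothetical; a real version would likely need to be bounded to a
portion of the reserves rather than granting full PF_MEMALLOC):

	/* Sketch only: hypothetical helper wired up behind a new prctl(). */
	static int oom_killer_enable_memalloc(void)
	{
		if (!capable(CAP_SYS_ADMIN))
			return -EPERM;
		/* Allocations from this thread skip reclaim and may dip into reserves. */
		current->flags |= PF_MEMALLOC;
		return 0;
	}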

2. Mempool

The idea is to preallocate a mempool with a given amount of memory for
the userspace oom-killer. Preferably this will be per-thread, and the
oom-killer can preallocate a mempool for its specific threads. Before
going to the reclaim path, the core page allocator can check whether
the task has private access to a mempool and return a page from it if
so.
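
A very rough sketch of where such a check could sit before the
allocator enters reclaim (the task field is hypothetical and the
existing mempool API is used only for illustration):

	/* Sketch only: consult a hypothetical per-task pool before reclaiming. */
	static struct page *try_private_mempool(struct task_struct *task, unsigned int order)
	{
		/* ->oom_mempool would be a new, hypothetical mempool_t * field. */
		if (!task->oom_mempool || order > 0)	/* elements here are single pages */
			return NULL;
		return mempool_alloc(task->oom_mempool, GFP_NOWAIT);
	}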

This option would be more complicated than the previous one, as the
lifecycle of a page from the mempool would be more sophisticated.
Additionally, the current mempool implementation does not handle
higher-order pages, and we might need to extend it to allow such
allocations. That said, this feature might have more use cases, and it
would be less risky than the previous option.

Another idea I had was to use a kthread-based oom-killer and provide
the policies through an eBPF program, though I am not sure how to make
it monitor arbitrary metrics and whether that can be done without any
allocations.

Please do provide feedback on these approaches.

thanks,
Shakeel


Re: [PATCH net-next v3 2/5] mm: add a signature in struct page

2021-04-19 Thread Shakeel Butt
On Mon, Apr 19, 2021 at 8:43 AM Ilias Apalodimas
 wrote:
>
[...]
> > Pages mapped into the userspace have their refcnt elevated, so the
> > page_ref_count() check by the drivers indicates to not reuse such
> > pages.
> >
>
> When tcp_zerocopy_receive() is invoked it will call 
> tcp_zerocopy_vm_insert_batch()
> which will end up doing a get_page().
> What you are saying is that once the zerocopy is done though, 
> skb_release_data()
> won't be called, but instead put_page() will be? If that's the case then we 
> are
> indeed leaking DMA mappings and memory. That sounds weird though, since the
> refcnt will be one in that case (zerocopy will do +1/-1 once it's done), so 
> who
> eventually frees the page?
> If kfree_skb() (or any wrapper that calls skb_release_data()) is called
> eventually, we'll end up properly recycling the page into our pool.
>

From what I understand (Eric, please correct me if I'm wrong), for
simple cases there are 3 page references taken: one by the driver, a
second by the skb, and a third by the page table.

In tcp_zerocopy_receive(), tcp_zerocopy_vm_insert_batch() gets one
page ref through insert_page_into_pte_locked(). However, before
returning from tcp_zerocopy_receive(), the skb references are dropped
through tcp_recv_skb(). So only once the user unmaps the page and
drops the page ref can that page be reused by the driver.

In my understanding, for zerocopy RX, skb_release_data() is called on
the pages while they are still mapped into userspace. So
skb_release_data() might not be the right place to recycle the page
for zerocopy. The email chain at [1] has some discussion on how to
bundle the recycling of pages with their lifetime.

[1] 
https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.k...@gmail.com/


Re: [PATCH v4 1/5] mm/memcg: Move mod_objcg_state() to memcontrol.c

2021-04-19 Thread Shakeel Butt
On Sun, Apr 18, 2021 at 5:00 PM Waiman Long  wrote:
>
> The mod_objcg_state() function is moved from mm/slab.h to mm/memcontrol.c
> so that further optimization can be done to it in later patches without
> exposing unnecessary details to other mm components.
>
> Signed-off-by: Waiman Long 

Reviewed-by: Shakeel Butt 


Re: [External] [PATCH v4 5/5] mm/memcg: Improve refill_obj_stock() performance

2021-04-19 Thread Shakeel Butt
On Sun, Apr 18, 2021 at 11:07 PM Muchun Song  wrote:
>
> On Mon, Apr 19, 2021 at 8:01 AM Waiman Long  wrote:
> >
> > There are two issues with the current refill_obj_stock() code. First of
> > all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to
> > atomically flush out remaining bytes to obj_cgroup, clear cached_objcg
> > and do a obj_cgroup_put(). It is likely that the same obj_cgroup will
> > be used again which leads to another call to drain_obj_stock() and
> > obj_cgroup_get() as well as atomically retrieve the available byte from
> > obj_cgroup. That is costly. Instead, we should just uncharge the excess
> > pages, reduce the stock bytes and be done with it. The drain_obj_stock()
> > function should only be called when obj_cgroup changes.
> >
> > Secondly, when charging an object of size not less than a page in
> > obj_cgroup_charge(), it is possible that the remaining bytes to be
> > refilled to the stock will overflow a page and cause refill_obj_stock()
> > to uncharge 1 page. To avoid the additional uncharge in this case,
> > a new overfill flag is added to refill_obj_stock() which will be set
> > when called from obj_cgroup_charge().
> >
> > Signed-off-by: Waiman Long 
> > ---
> >  mm/memcontrol.c | 23 +--
> >  1 file changed, 17 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index a6dd18f6d8a8..d13961352eef 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3357,23 +3357,34 @@ static bool obj_stock_flush_required(struct 
> > memcg_stock_pcp *stock,
> > return false;
> >  }
> >
> > -static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int 
> > nr_bytes)
> > +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int 
> > nr_bytes,
> > +bool overfill)
> >  {
> > unsigned long flags;
> > struct obj_stock *stock = get_obj_stock();
> > +   unsigned int nr_pages = 0;
> >
> > if (stock->cached_objcg != objcg) { /* reset if necessary */
> > -   drain_obj_stock(stock);
> > +   if (stock->cached_objcg)
> > +   drain_obj_stock(stock);
> > obj_cgroup_get(objcg);
> > stock->cached_objcg = objcg;
> > stock->nr_bytes = atomic_xchg(>nr_charged_bytes, 0);
> > }
> > stock->nr_bytes += nr_bytes;
> >
> > -   if (stock->nr_bytes > PAGE_SIZE)
> > -   drain_obj_stock(stock);
> > +   if (!overfill && (stock->nr_bytes > PAGE_SIZE)) {
> > +   nr_pages = stock->nr_bytes >> PAGE_SHIFT;
> > +   stock->nr_bytes &= (PAGE_SIZE - 1);
> > +   }
> >
> > put_obj_stock(flags);
> > +
> > +   if (nr_pages) {
> > +   rcu_read_lock();
> > +   __memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
> > +   rcu_read_unlock();
> > +   }
>
> It is not safe to call __memcg_kmem_uncharge() under rcu lock
> and without holding a reference to memcg. More details can refer
> to the following link.
>
> https://lore.kernel.org/linux-mm/20210319163821.20704-2-songmuc...@bytedance.com/
>
> In the above patchset, we introduce obj_cgroup_uncharge_pages to
> uncharge some pages from object cgroup. You can use this safe
> API.
>

I would recommend just rebasing the patch series over the latest mm tree.


Re: [PATCH net-next v3 2/5] mm: add a signature in struct page

2021-04-19 Thread Shakeel Butt
On Sun, Apr 18, 2021 at 10:12 PM Ilias Apalodimas
 wrote:
>
> On Wed, Apr 14, 2021 at 01:09:47PM -0700, Shakeel Butt wrote:
> > On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
> >  wrote:
> > >
> > [...]
> > > > >
> > > > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > > > can not be used.
> > > >
> > > > Yes it can, since it's going to be used as your default allocator for
> > > > payloads, which might end up on an SKB.
> > >
> > > I'm not sure we want or should "allow" page_pool be used for TCP RX
> > > zerocopy.
> > > For several reasons.
> > >
> > > (1) This implies mapping these pages page to userspace, which AFAIK
> > > means using page->mapping and page->index members (right?).
> > >
> >
> > No, only page->_mapcount is used.
> >
>
> I am not sure I like leaving out TCP RX zerocopy. Since we want driver to
> adopt the recycling mechanism we should try preserving the current
> functionality of the network stack.
>
> The question is how does it work with the current drivers that already have an
> internal page recycling mechanism.
>

I think the current drivers check page_ref_count(page) to decide
whether to reuse the already-allocated pages.

Some examples from the drivers:
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:ixgbe_can_reuse_rx_page()
drivers/net/ethernet/intel/igb/igb_main.c:igb_can_reuse_rx_page()
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:mlx5e_rx_cache_get()
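
The core of those checks boils down to roughly the following
(simplified sketch; the real drivers also track a pagecnt_bias instead
of comparing against 1 directly):

	/* Simplified sketch of the reuse heuristic in those drivers. */
	static bool rx_page_can_be_reused(struct page *page)
	{
		/* An extra reference, e.g. a userspace mapping from RX zerocopy,
		 * means someone else still uses the page: do not recycle it.
		 */
		return page_ref_count(page) == 1 && !page_is_pfmemalloc(page);
	}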

> > > (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> > > device) and also map this page into userspace.
> > >
> >
> > I think this is already the case i.e pages still DMA-mapped and also
> > mapped into userspace.
> >
> > > (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> > > zerocopy will bump the refcnt, which means the page_pool will not
> > > recycle the page when it see the elevated refcnt (it will instead
> > > release its DMA-mapping).
> >
> > Yes this is right but the userspace might have already consumed and
> > unmapped the page before the driver considers to recycle the page.
>
> Same question here. I'll have a closer look in a few days and make sure we are
> not breaking anything wrt zerocopy.
>

Pages mapped into the userspace have their refcnt elevated, so the
page_ref_count() check by the drivers indicates to not reuse such
pages.

> >
> > >
> > > (4) I remember vaguely that this code path for (TCP RX zerocopy) uses
> > > page->private for tricks.  And our patch [3/5] use page->private for
> > > storing xdp_mem_info.
> > >
> > > IMHO when the SKB travel into this TCP RX zerocopy code path, we should
> > > call page_pool_release_page() to release its DMA-mapping.
> > >
> >
> > I will let TCP RX zerocopy experts respond to this but from my high
> > level code inspection, I didn't see page->private usage.
>
> Shakeel are you aware of any 'easy' way I can have rx zerocopy running?
>

I would recommend tools/testing/selftests/net/tcp_mmap.c.


Re: [PATCH v2 5/8] mm: memcontrol: rename lruvec_holds_page_lru_lock to page_matches_lruvec

2021-04-16 Thread Shakeel Butt
On Thu, Apr 15, 2021 at 10:16 PM Muchun Song  wrote:
>
> lruvec_holds_page_lru_lock() doesn't check anything about locking and is
> used to check whether the page belongs to the lruvec. So rename it to
> page_matches_lruvec().
>
> Signed-off-by: Muchun Song 

Reviewed-by: Shakeel Butt 


Re: [PATCH net-next v3 2/5] mm: add a signature in struct page

2021-04-14 Thread Shakeel Butt
On Wed, Apr 14, 2021 at 12:42 PM Jesper Dangaard Brouer
 wrote:
>
[...]
> > >
> > > Can this page_pool be used for TCP RX zerocopy? If yes then PageType
> > > can not be used.
> >
> > Yes it can, since it's going to be used as your default allocator for
> > payloads, which might end up on an SKB.
>
> I'm not sure we want or should "allow" page_pool be used for TCP RX
> zerocopy.
> For several reasons.
>
> (1) This implies mapping these pages page to userspace, which AFAIK
> means using page->mapping and page->index members (right?).
>

No, only page->_mapcount is used.

> (2) It feels wrong (security wise) to keep the DMA-mapping (for the
> device) and also map this page into userspace.
>

I think this is already the case, i.e. pages are still DMA-mapped and
also mapped into userspace.

> (3) The page_pool is optimized for refcnt==1 case, and AFAIK TCP-RX
> zerocopy will bump the refcnt, which means the page_pool will not
> recycle the page when it see the elevated refcnt (it will instead
> release its DMA-mapping).

Yes, this is right, but userspace might have already consumed and
unmapped the page before the driver considers recycling it.

>
> (4) I remember vaguely that this code path for (TCP RX zerocopy) uses
> page->private for tricks.  And our patch [3/5] use page->private for
> storing xdp_mem_info.
>
> IMHO when the SKB travel into this TCP RX zerocopy code path, we should
> call page_pool_release_page() to release its DMA-mapping.
>

I will let the TCP RX zerocopy experts respond to this, but from my
high-level code inspection, I didn't see any page->private usage.


Re: [page-reclaim] Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Shakeel Butt
On Wed, Apr 14, 2021 at 6:52 AM Rik van Riel  wrote:
>
> On Wed, 2021-04-14 at 16:27 +0800, Huang, Ying wrote:
> > Yu Zhao  writes:
> >
> > > On Wed, Apr 14, 2021 at 12:15 AM Huang, Ying 
> > > wrote:
> > > >
> > > NUMA Optimization
> > > -
> > > Support NUMA policies and per-node RSS counters.
> > >
> > > We only can move forward one step at a time. Fair?
> >
> > You don't need to implement that now definitely.  But we can discuss
> > the
> > possible solution now.
>
> That was my intention, too. I want to make sure we don't
> end up "painting ourselves into a corner" by moving in some
> direction we have no way to get out of.
>
> The patch set looks promising, but we need some plan to
> avoid the worst case behaviors that forced us into rmap
> based scanning initially.
>
> > Note that it's possible that only some processes are bound to some
> > NUMA
> > nodes, while other processes aren't bound.
>
> For workloads like PostgresQL or Oracle, it is common
> to have maybe 70% of memory in a large shared memory
> segment, spread between all the NUMA nodes, and mapped
> into hundreds, if not thousands, of processes in the
> system.
>
> Now imagine we have an 8 node system, and memory
> pressure in the DMA32 zone of node 0.
>
> How will the current VM behave?
>
> What will the virtual scanning need to do?
>
> If we can come up with a solution to make virtual
> scanning scale for that kind of workload, great.
>
> If not ... if it turns out most of the benefits of
> the multigeneratinal LRU framework come from sorting
> the pages into multiple LRUs, and from being able
> to easily reclaim unmapped pages before having to
> scan mapped ones, could it be an idea to implement
> that first, independently from virtual scanning?
>
> I am all for improving our page reclaim system, I
> just want to make sure we don't revisit the old traps
> that forced us where we are today :)
>

One potential idea is to take a hybrid approach of rmap and virtual
scanning: if the number of pages targeted to be scanned is below some
threshold, do rmap-based scanning, otherwise virtual scanning. I think
we can experimentally find a good value for that threshold.
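
In pseudo-C the hybrid would be as simple as the following (all names
and the threshold are illustrative, not an actual implementation):

	/* Sketch: pick the scanning strategy based on how much needs scanning. */
	static void scan_pages(struct lruvec *lruvec, unsigned long nr_to_scan)
	{
		if (nr_to_scan < RMAP_SCAN_THRESHOLD)		/* tune experimentally */
			scan_via_rmap(lruvec, nr_to_scan);	/* cheap for small targets */
		else
			scan_via_page_tables(lruvec, nr_to_scan); /* amortizes for large scans */
	}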


Re: [PATCH 6/7] mm: memcontrol: move obj_cgroup_uncharge_pages() out of css_set_lock

2021-04-13 Thread Shakeel Butt
On Mon, Apr 12, 2021 at 11:58 PM Muchun Song  wrote:
>
> The css_set_lock is used to guard the list of inherited objcgs. So there
> is no need to uncharge kernel memory under css_set_lock. Just move it
> out of the lock.
>
> Signed-off-by: Muchun Song 

Reviewed-by: Shakeel Butt 


Re: [PATCH 5/7] mm: memcontrol: simplify the logic of objcg pinning memcg

2021-04-13 Thread Shakeel Butt
On Mon, Apr 12, 2021 at 11:58 PM Muchun Song  wrote:
>
> The obj_cgroup_release() and memcg_reparent_objcgs() are serialized by
> the css_set_lock. We do not need to care about objcg->memcg being
> released in the process of obj_cgroup_release(). So there is no need
> to pin memcg before releasing objcg. Remove those pinning logic to
> simplfy the code.
>
> There are only two places that modifies the objcg->memcg. One is the
> initialization to objcg->memcg in the memcg_online_kmem(), another
> is objcgs reparenting in the memcg_reparent_objcgs(). It is also
> impossible for the two to run in parallel. So xchg() is unnecessary
> and it is enough to use WRITE_ONCE().
>
> Signed-off-by: Muchun Song 
> Acked-by: Johannes Weiner 

Reviewed-by: Shakeel Butt 


Re: [PATCH 4/7] mm: memcontrol: simplify lruvec_holds_page_lru_lock

2021-04-13 Thread Shakeel Butt
On Mon, Apr 12, 2021 at 11:57 PM Muchun Song  wrote:
>
> We already have a helper lruvec_memcg() to get the memcg from lruvec, we
> do not need to do it ourselves in the lruvec_holds_page_lru_lock(). So use
> lruvec_memcg() instead. And if mem_cgroup_disabled() returns false, the
> page_memcg(page) (the LRU pages) cannot be NULL. So remove the odd logic
> of "memcg = page_memcg(page) ? : root_mem_cgroup". And use lruvec_pgdat
> to simplify the code. We can have a single definition for this function
> that works for !CONFIG_MEMCG, CONFIG_MEMCG + mem_cgroup_disabled() and
> CONFIG_MEMCG.
>
> Signed-off-by: Muchun Song 
> Acked-by: Johannes Weiner 

Reviewed-by: Shakeel Butt 


Re: [PATCH 3/7] mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec

2021-04-13 Thread Shakeel Butt
On Mon, Apr 12, 2021 at 11:57 PM Muchun Song  wrote:
>
> All the callers of mem_cgroup_page_lruvec() just pass page_pgdat(page)
> as the 2nd parameter to it (except isolate_migratepages_block()). But
> for isolate_migratepages_block(), the page_pgdat(page) is also equal
> to the local variable of @pgdat. So mem_cgroup_page_lruvec() do not
> need the pgdat parameter. Just remove it to simplify the code.
>
> Signed-off-by: Muchun Song 
> Acked-by: Johannes Weiner 

Reviewed-by: Shakeel Butt 


Re: [PATCH v2 5/5] mm/memcg: Optimize user context object stock access

2021-04-12 Thread Shakeel Butt
On Mon, Apr 12, 2021 at 4:10 PM Shakeel Butt  wrote:
>
> On Mon, Apr 12, 2021 at 3:55 PM Waiman Long  wrote:
> >
> > Most kmem_cache_alloc() calls are from user context. With instrumentation
> > enabled, the measured amount of kmem_cache_alloc() calls from non-task
> > context was about 0.01% of the total.
> >
> > The irq disable/enable sequence used in this case to access content
> > from object stock is slow.  To optimize for user context access, there
> > are now two object stocks for task context and interrupt context access
> > respectively.
> >
> > The task context object stock can be accessed after disabling preemption
> > which is cheap in non-preempt kernel. The interrupt context object stock
> > can only be accessed after disabling interrupt. User context code can
> > access interrupt object stock, but not vice versa.
> >
> > The mod_objcg_state() function is also modified to make sure that memcg
> > and lruvec stat updates are done with interrupted disabled.
> >
> > The downside of this change is that there are more data stored in local
> > object stocks and not reflected in the charge counter and the vmstat
> > arrays.  However, this is a small price to pay for better performance.
> >
> > Signed-off-by: Waiman Long 
> > Acked-by: Roman Gushchin 
> > ---
> >  mm/memcontrol.c | 73 +++--
> >  1 file changed, 59 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 69f728383efe..29f2df76644a 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2229,7 +2229,8 @@ struct obj_stock {
> >  struct memcg_stock_pcp {
> > struct mem_cgroup *cached; /* this never be root cgroup */
> > unsigned int nr_pages;
> > -   struct obj_stock obj;
> > +   struct obj_stock task_obj;
> > +   struct obj_stock irq_obj;
> >
> > struct work_struct work;
> > unsigned long flags;
> > @@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct 
> > memcg_stock_pcp *stock,
> >  }
> >  #endif
> >
> > +/*
> > + * Most kmem_cache_alloc() calls are from user context. The irq 
> > disable/enable
> > + * sequence used in this case to access content from object stock is slow.
> > + * To optimize for user context access, there are now two object stocks for
> > + * task context and interrupt context access respectively.
> > + *
> > + * The task context object stock can be accessed by disabling preemption 
> > only
> > + * which is cheap in non-preempt kernel. The interrupt context object stock
> > + * can only be accessed after disabling interrupt. User context code can
> > + * access interrupt object stock, but not vice versa.
> > + */
> >  static inline struct obj_stock *current_obj_stock(void)
> >  {
> > struct memcg_stock_pcp *stock = this_cpu_ptr(_stock);
> >
> > -   return >obj;
> > +   return in_task() ? >task_obj : >irq_obj;
> > +}
> > +
> > +#define get_obj_stock(flags)   \
> > +({ \
> > +   struct memcg_stock_pcp *stock;  \
> > +   struct obj_stock *obj_stock;\
> > +   \
> > +   if (in_task()) {\
> > +   preempt_disable();  \
> > +   (flags) = -1L;  \
> > +   stock = this_cpu_ptr(_stock); \
>
> The above line was missing in the previous version.
>
> > +   obj_stock = >task_obj;   \
> > +   } else {\
> > +   local_irq_save(flags);  \
> > +   stock = this_cpu_ptr(_stock); \
> > +   obj_stock = >irq_obj;\
> > +   }   \
> > +   obj_stock;  \
> > +})
> > +
> > +static inline void put_obj_stock(unsigned long flags)
> > +{
> > +   if (flags == -1L)
> > +   preempt_enable();
> > +   else
> > +   local_irq_restore(flags);
> >  }
> >
> >  /**
> > @@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct 
> > *dummy)
> > local_irq_save(flags);
> >
> > stock = this_cpu_ptr(_stock);
> > 

Re: [PATCH v2 5/5] mm/memcg: Optimize user context object stock access

2021-04-12 Thread Shakeel Butt
On Mon, Apr 12, 2021 at 3:55 PM Waiman Long  wrote:
>
> Most kmem_cache_alloc() calls are from user context. With instrumentation
> enabled, the measured amount of kmem_cache_alloc() calls from non-task
> context was about 0.01% of the total.
>
> The irq disable/enable sequence used in this case to access content
> from object stock is slow.  To optimize for user context access, there
> are now two object stocks for task context and interrupt context access
> respectively.
>
> The task context object stock can be accessed after disabling preemption
> which is cheap in non-preempt kernel. The interrupt context object stock
> can only be accessed after disabling interrupt. User context code can
> access interrupt object stock, but not vice versa.
>
> The mod_objcg_state() function is also modified to make sure that memcg
> and lruvec stat updates are done with interrupted disabled.
>
> The downside of this change is that there are more data stored in local
> object stocks and not reflected in the charge counter and the vmstat
> arrays.  However, this is a small price to pay for better performance.
>
> Signed-off-by: Waiman Long 
> Acked-by: Roman Gushchin 
> ---
>  mm/memcontrol.c | 73 +++--
>  1 file changed, 59 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 69f728383efe..29f2df76644a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2229,7 +2229,8 @@ struct obj_stock {
>  struct memcg_stock_pcp {
> struct mem_cgroup *cached; /* this never be root cgroup */
> unsigned int nr_pages;
> -   struct obj_stock obj;
> +   struct obj_stock task_obj;
> +   struct obj_stock irq_obj;
>
> struct work_struct work;
> unsigned long flags;
> @@ -2254,11 +2255,48 @@ static bool obj_stock_flush_required(struct 
> memcg_stock_pcp *stock,
>  }
>  #endif
>
> +/*
> + * Most kmem_cache_alloc() calls are from user context. The irq 
> disable/enable
> + * sequence used in this case to access content from object stock is slow.
> + * To optimize for user context access, there are now two object stocks for
> + * task context and interrupt context access respectively.
> + *
> + * The task context object stock can be accessed by disabling preemption only
> + * which is cheap in non-preempt kernel. The interrupt context object stock
> + * can only be accessed after disabling interrupt. User context code can
> + * access interrupt object stock, but not vice versa.
> + */
>  static inline struct obj_stock *current_obj_stock(void)
>  {
> struct memcg_stock_pcp *stock = this_cpu_ptr(_stock);
>
> -   return >obj;
> +   return in_task() ? >task_obj : >irq_obj;
> +}
> +
> +#define get_obj_stock(flags)   \
> +({ \
> +   struct memcg_stock_pcp *stock;  \
> +   struct obj_stock *obj_stock;\
> +   \
> +   if (in_task()) {\
> +   preempt_disable();  \
> +   (flags) = -1L;  \
> +   stock = this_cpu_ptr(_stock); \

The above line was missing in the previous version.

> +   obj_stock = >task_obj;   \
> +   } else {\
> +   local_irq_save(flags);  \
> +   stock = this_cpu_ptr(_stock); \
> +   obj_stock = >irq_obj;\
> +   }   \
> +   obj_stock;  \
> +})
> +
> +static inline void put_obj_stock(unsigned long flags)
> +{
> +   if (flags == -1L)
> +   preempt_enable();
> +   else
> +   local_irq_restore(flags);
>  }
>
>  /**
> @@ -2327,7 +2365,9 @@ static void drain_local_stock(struct work_struct *dummy)
> local_irq_save(flags);
>
> stock = this_cpu_ptr(_stock);
> -   drain_obj_stock(>obj);
> +   drain_obj_stock(>irq_obj);
> +   if (in_task())
> +   drain_obj_stock(>task_obj);
> drain_stock(stock);
> clear_bit(FLUSHING_CACHED_CHARGE, >flags);
>
> @@ -3183,7 +3223,7 @@ static inline void mod_objcg_state(struct obj_cgroup 
> *objcg,
> memcg = obj_cgroup_memcg(objcg);
> if (pgdat)
> lruvec = mem_cgroup_lruvec(memcg, pgdat);
> -   __mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> +   mod_memcg_lruvec_state(memcg, lruvec, idx, nr);
> rcu_read_unlock();
>  }
>
> @@ -3193,7 +3233,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, 
> unsigned int nr_bytes)
> unsigned long flags;
> bool ret = false;
>
> -   local_irq_save(flags);
> +   stock = get_obj_stock(flags);
>
> stock = current_obj_stock();

The above is redundant.

> 

Re: [PATCH v2 4/5] mm/memcg: Separate out object stock data into its own struct

2021-04-12 Thread Shakeel Butt
On Mon, Apr 12, 2021 at 3:55 PM Waiman Long  wrote:
>
> The object stock data stored in struct memcg_stock_pcp are independent
> of the other page based data stored there. Separating them out into
> their own struct to highlight the independency.
>
> Signed-off-by: Waiman Long 
> Acked-by: Roman Gushchin 

Reviewed-by: Shakeel Butt 


Re: [PATCH v2 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp

2021-04-12 Thread Shakeel Butt
On Mon, Apr 12, 2021 at 3:55 PM Waiman Long  wrote:
>
> Before the new slab memory controller with per object byte charging,
> charging and vmstat data update happen only when new slab pages are
> allocated or freed. Now they are done with every kmem_cache_alloc()
> and kmem_cache_free(). This causes additional overhead for workloads
> that generate a lot of alloc and free calls.
>
> The memcg_stock_pcp is used to cache byte charge for a specific
> obj_cgroup to reduce that overhead. To further reducing it, this patch
> makes the vmstat data cached in the memcg_stock_pcp structure as well
> until it accumulates a page size worth of update or when other cached
> data change.
>
> On a 2-socket Cascade Lake server with instrumentation enabled and this
> patch applied, it was found that about 17% (946796 out of 5515184) of the
> time when __mod_obj_stock_state() is called leads to an actual call to
> mod_objcg_state() after initial boot. When doing parallel kernel build,
> the figure was about 16% (21894614 out of 139780628). So caching the
> vmstat data reduces the number of calls to mod_objcg_state() by more
> than 80%.
>
> Signed-off-by: Waiman Long 

Reviewed-by: Shakeel Butt 


Re: [PATCH 1/5] mm/memcg: Pass both memcg and lruvec to mod_memcg_lruvec_state()

2021-04-12 Thread Shakeel Butt
On Fri, Apr 9, 2021 at 4:19 PM Waiman Long  wrote:
>
> The caller of mod_memcg_lruvec_state() has both memcg and lruvec readily
> available. So both of them are now passed to mod_memcg_lruvec_state()
> and __mod_memcg_lruvec_state(). The __mod_memcg_lruvec_state() is
> updated to allow either of the two parameters to be set to null. This
> makes mod_memcg_lruvec_state() equivalent to mod_memcg_state() if lruvec
> is null.
>
> Signed-off-by: Waiman Long 

Similar to Roman's suggestion: rather than describing what this patch
is doing, the 'why' would be better in the changelog.

Reviewed-by: Shakeel Butt 


Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

2021-04-12 Thread Shakeel Butt
On Fri, Apr 9, 2021 at 4:26 PM Tim Chen  wrote:
>
>
> On 4/8/21 4:52 AM, Michal Hocko wrote:
>
> >> The top tier memory used is reported in
> >>
> >> memory.toptier_usage_in_bytes
> >>
> >> The amount of top tier memory usable by each cgroup without
> >> triggering page reclaim is controlled by the
> >>
> >> memory.toptier_soft_limit_in_bytes
> >
>
> Michal,
>
> Thanks for your comments.  I will like to take a step back and
> look at the eventual goal we envision: a mechanism to partition the
> tiered memory between the cgroups.
>
> A typical use case may be a system with two set of tasks.
> One set of task is very latency sensitive and we desire instantaneous
> response from them. Another set of tasks will be running batch jobs
> were latency and performance is not critical.   In this case,
> we want to carve out enough top tier memory such that the working set
> of the latency sensitive tasks can fit entirely in the top tier memory.
> The rest of the top tier memory can be assigned to the background tasks.
>
> To achieve such cgroup based tiered memory management, we probably want
> something like the following.
>
> For generalization let's say that there are N tiers of memory t_0, t_1 ... 
> t_N-1,
> where tier t_0 sits at the top and demotes to the lower tier.
> We envision for this top tier memory t0 the following knobs and counters
> in the cgroup memory controller
>
> memory_t0.current   Current usage of tier 0 memory by the cgroup.
>
> memory_t0.min   If tier 0 memory used by the cgroup falls below this 
> low
> boundary, the memory will not be subjected to demotion
> to lower tiers to free up memory at tier 0.
>
> memory_t0.low   Above this boundary, the tier 0 memory will be 
> subjected
> to demotion.  The demotion pressure will be 
> proportional
> to the overage.
>
> memory_t0.high  If tier 0 memory used by the cgroup exceeds this high
> boundary, allocation of tier 0 memory by the cgroup 
> will
> be throttled. The tier 0 memory used by this cgroup
> will also be subjected to heavy demotion.
>
> memory_t0.max   This will be a hard usage limit of tier 0 memory on 
> the cgroup.
>
> If needed, memory_t[12...].current/min/low/high for additional tiers can be 
> added.
> This follows closely with the design of the general memory controller 
> interface.
>
> Will such an interface looks sane and acceptable with everyone?
>

I have a couple of questions. Let's suppose we have a two socket
system. Node 0 (DRAM+CPUs), Node 1 (DRAM+CPUs), Node 2 (PMEM on socket
0 along with Node 0) and Node 3 (PMEM on socket 1 along with Node 1).
Based on the tier definition of this patch series, tier_0: {node_0,
node_1} and tier_1: {node_2, node_3}.

My questions are:

1) Can we assume that the cost of access within a tier will always be
less than the cost of access across tiers? (node_0 <-> node_1 vs
node_0 <-> node_2)
2) If yes to (1), is that assumption future proof? Will the future
systems with DRAM over CXL support have the same characteristics?
3) Will the cost of access from tier_0 to tier_1 be uniform? (node_0
<-> node_2 vs node_0 <-> node_3). For jobs running on node_0, node_3
might be a third tier, and similarly for jobs running on node_1,
node_2 might be a third tier.

The reason I am asking these questions is that statically partitioning
memory nodes into tiers will inherently add platform-specific
assumptions to the user API.

Assumptions like:
1) Access within tier is always cheaper than across tier.
2) Access from tier_i to tier_i+1 has uniform cost.

The reason I am more inclined towards NUMA-centric control is that we
don't have to make these assumptions, though usability will be more
difficult. Greg (CCed) has some ideas on making it better, and we will
share our proposal after polishing it a bit more.


Re: [PATCH 2/5] mm/memcg: Introduce obj_cgroup_uncharge_mod_state()

2021-04-12 Thread Shakeel Butt
On Fri, Apr 9, 2021 at 4:19 PM Waiman Long  wrote:
>
> In memcg_slab_free_hook()/pcpu_memcg_free_hook(), obj_cgroup_uncharge()
> is followed by mod_objcg_state()/mod_memcg_state(). Each of these
> function call goes through a separate irq_save/irq_restore cycle. That
> is inefficient.  Introduce a new function obj_cgroup_uncharge_mod_state()
> that combines them with a single irq_save/irq_restore cycle.
>
> Signed-off-by: Waiman Long 

Reviewed-by: Shakeel Butt 


Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

2021-04-12 Thread Shakeel Butt
On Thu, Apr 8, 2021 at 4:52 AM Michal Hocko  wrote:
>
[...]
>
> What I am trying to say (and I have brought that up when demotion has been
> discussed at LSFMM) is that the implementation shouldn't be PMEM aware.
> The specific technology shouldn't be imprinted into the interface.
> Fundamentally you are trying to balance memory among NUMA nodes as we do
> not have other abstraction to use. So rather than talking about top,
> secondary, nth tier we have different NUMA nodes with different
> characteristics and you want to express your "priorities" for them.
>

I am also inclined towards the NUMA-based approach. It makes the
solution more general, and even existing systems with multiple NUMA
nodes and DRAM can take advantage of it (where it makes sense).


Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

2021-04-12 Thread Shakeel Butt
On Thu, Apr 8, 2021 at 1:50 PM Yang Shi  wrote:
>
[...]

> >
> > The low and min limits have semantics similar to the v1's soft limit
> > for this situation i.e. letting the low priority job occupy top tier
> > memory and depending on reclaim to take back the excess top tier
> > memory use of such jobs.
>
> I don't get why low priority jobs can *not* use top tier memory?

I am saying low priority jobs can use top tier memory. The only
question is whether to limit them upfront (using limits) or reclaim
from them later (using min/low/soft limit).

> I can
> think of it may incur latency overhead for high priority jobs. If it
> is not allowed, it could be restricted by cpuset without introducing
> in any new interfaces.
>
> I'm supposed the memory utilization could be maximized by allowing all
> jobs allocate memory from all applicable nodes, then let reclaimer (or
> something new if needed)

Most probably something new, as we do want to consider unevictable
memory as well.

> do the job to migrate the memory to proper
> nodes by time. We could achieve some kind of balance between memory
> utilization and resource isolation.
>

The tradeoff between utilization and isolation should be decided by the user/admin.


Re: [External] Re: [RFC PATCH v2 09/18] mm: vmscan: remove noinline_for_stack

2021-04-10 Thread Shakeel Butt
On Fri, Apr 9, 2021 at 9:34 PM Muchun Song  wrote:
>
> On Sat, Apr 10, 2021 at 2:31 AM Johannes Weiner  wrote:
> >
> > On Fri, Apr 09, 2021 at 08:29:50PM +0800, Muchun Song wrote:
> > > The noinline_for_stack is introduced by commit 666356297ec4 ("vmscan:
> > > set up pagevec as late as possible in shrink_inactive_list()"), its
> > > purpose is to delay the allocation of pagevec as late as possible to
> > > save stack memory. But the commit 2bcf88796381 ("mm: take pagevecs off
> > > reclaim stack") replace pagevecs by lists of pages_to_free. So we do
> > > not need noinline_for_stack, just remove it (let the compiler decide
> > > whether to inline).
> > >
> > > Signed-off-by: Muchun Song 
> >
> > Good catch.
> >
> > Acked-by: Johannes Weiner 
> >
> > Since this patch is somewhat independent of the rest of the series,
> > you may want to put it in the very beginning, or even submit it
> > separately, to keep the main series as compact as possible. Reviewers
> > can be more hesitant to get involved with larger series ;)
>
> OK. I will gather all the cleanup patches into a separate series.
> Thanks for your suggestion.

That would be best.

For this patch:

Reviewed-by: Shakeel Butt 


Re: [RFC PATCH v2 02/18] mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm

2021-04-10 Thread Shakeel Butt
On Fri, Apr 9, 2021 at 5:32 AM Muchun Song  wrote:
>
> When mm is NULL, we do not need to hold rcu lock and call css_tryget for
> the root memcg. And we also do not need to check !mm in every loop of
> while. So bail out early when !mm.
>
> Signed-off-by: Muchun Song 
> Acked-by: Johannes Weiner 

Reviewed-by: Shakeel Butt 


Re: [RFC PATCH v2 01/18] mm: memcontrol: fix page charging in page replacement

2021-04-10 Thread Shakeel Butt
On Fri, Apr 9, 2021 at 5:32 AM Muchun Song  wrote:
>
> The pages aren't accounted at the root level, so do not charge the page
> to the root memcg in page replacement. Although we do not display the
> value (mem_cgroup_usage) so there shouldn't be any actual problem, but
> there is a WARN_ON_ONCE in the page_counter_cancel(). Who knows if it
> will trigger? So it is better to fix it.
>
> Signed-off-by: Muchun Song 
> Acked-by: Johannes Weiner 

Reviewed-by: Shakeel Butt 


Re: [PATCH net-next v3 2/5] mm: add a signature in struct page

2021-04-10 Thread Shakeel Butt
On Sat, Apr 10, 2021 at 9:16 AM Ilias Apalodimas
 wrote:
>
> Hi Matthew
>
> On Sat, Apr 10, 2021 at 04:48:24PM +0100, Matthew Wilcox wrote:
> > On Sat, Apr 10, 2021 at 12:37:58AM +0200, Matteo Croce wrote:
> > > This is needed by the page_pool to avoid recycling a page not allocated
> > > via page_pool.
> >
> > Is the PageType mechanism more appropriate to your needs?  It wouldn't
> > be if you use page->_mapcount (ie mapping it to userspace).
>
> Interesting!
> Please keep in mind this was written ~2018 and was stale on my branches for
> quite some time.  So back then I did try to use PageType, but had not free
> bits.  Looking at it again though, it's cleaned up.  So yes I think this can
> be much much cleaner.  Should we go and define a new PG_pagepool?
>
>

Can this page_pool be used for TCP RX zerocopy? If yes then PageType
can not be used.

There is a recent discussion [1] on memcg accounting of TCP RX
zerocopy and I am wondering if this work can somehow help in that
regard. I will take a look at the series.

[1] 
https://lore.kernel.org/linux-mm/20210316013003.25271-1-arjunroy.k...@gmail.com/


Re: [RFC bpf-next 0/1] bpf: Add page cache iterator

2021-04-08 Thread Shakeel Butt
On Wed, Apr 7, 2021 at 2:47 PM Daniel Xu  wrote:
>
> There currently does not exist a way to answer the question: "What is in
> the page cache?". There are various heuristics and counters but nothing
> that can tell you anything like:
>
>   * 3M from /home/dxu/foo.txt
>   * 5K from ...
>   * etc.
>

So you need the list of a filesystem's inodes that are in memory. With
that list, you could open, map, and do mincore() on them?

I don't know of a way to get that inode list.
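
The open/map/mincore part itself is straightforward once a path is
known; a minimal sketch:

	#include <fcntl.h>
	#include <stdlib.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <unistd.h>

	/* Sketch: count how many pages of a file are resident in the page cache. */
	static long resident_pages(const char *path)
	{
		long page_size = sysconf(_SC_PAGESIZE), resident = 0;
		unsigned char *vec;
		struct stat st;
		size_t i, npages;
		void *map;
		int fd;

		fd = open(path, O_RDONLY);
		if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
			return -1;

		npages = (st.st_size + page_size - 1) / page_size;
		map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
		vec = malloc(npages);
		if (map == MAP_FAILED || !vec)
			return -1;	/* sketch: skips cleanup on this error path */

		if (mincore(map, st.st_size, vec) == 0)
			for (i = 0; i < npages; i++)
				resident += vec[i] & 1;

		free(vec);
		munmap(map, st.st_size);
		close(fd);
		return resident;
	}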

> The answer to the question is particularly useful in the stacked
> container world. Stacked containers implies multiple containers are run
> on the same physical host. Memory is precious resource on some (if not
> most) of these systems. On these systems, it's useful to know how much
> duplicated data is in the page cache. Once you know the answer, you can
> do something about it. One possible technique would be bind mount common
> items from the root host into each container.
>

Oh you want to share page cache between different containers/jobs?
That would complicate charging.


Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

2021-04-08 Thread Shakeel Butt
On Thu, Apr 8, 2021 at 11:01 AM Yang Shi  wrote:
>
> On Thu, Apr 8, 2021 at 10:19 AM Shakeel Butt  wrote:
> >
> > Hi Tim,
> >
> > On Mon, Apr 5, 2021 at 11:08 AM Tim Chen  wrote:
> > >
> > > Traditionally, all memory is DRAM.  Some DRAM might be closer/faster than
> > > others NUMA wise, but a byte of media has about the same cost whether it
> > > is close or far.  But, with new memory tiers such as Persistent Memory
> > > (PMEM).  there is a choice between fast/expensive DRAM and slow/cheap
> > > PMEM.
> > >
> > > The fast/expensive memory lives in the top tier of the memory hierachy.
> > >
> > > Previously, the patchset
> > > [PATCH 00/10] [v7] Migrate Pages in lieu of discard
> > > https://lore.kernel.org/linux-mm/20210401183216.443c4...@viggo.jf.intel.com/
> > > provides a mechanism to demote cold pages from DRAM node into PMEM.
> > >
> > > And the patchset
> > > [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory 
> > > tiering system
> > > https://lore.kernel.org/linux-mm/20210311081821.138467-1-ying.hu...@intel.com/
> > > provides a mechanism to promote hot pages in PMEM to the DRAM node
> > > leveraging autonuma.
> > >
> > > The two patchsets together keep the hot pages in DRAM and colder pages
> > > in PMEM.
> >
> > Thanks for working on this as this is becoming more and more important
> > particularly in the data centers where memory is a big portion of the
> > cost.
> >
> > I see you have responded to Michal and I will add my more specific
> > response there. Here I wanted to give my high level concern regarding
> > using v1's soft limit like semantics for top tier memory.
> >
> > This patch series aims to distribute/partition top tier memory between
> > jobs of different priorities. We want high priority jobs to have
> > preferential access to the top tier memory and we don't want low
> > priority jobs to hog the top tier memory.
> >
> > Using v1's soft limit like behavior can potentially cause high
> > priority jobs to stall to make enough space on top tier memory on
> > their allocation path and I think this patchset is aiming to reduce
> > that impact by making kswapd do that work. However I think the more
> > concerning issue is the low priority job hogging the top tier memory.
> >
> > The possible ways the low priority job can hog the top tier memory are
> > by allocating non-movable memory or by mlocking the memory. (Oh there
> > is also pinning the memory but I don't know if there is a user api to
> > pin memory?) For the mlocked memory, you need to either modify the
> > reclaim code or use a different mechanism for demoting cold memory.
>
> Do you mean long term pin? RDMA should be able to simply pin the
> memory for weeks. A lot of transient pins come from Direct I/O. They
> should be less concerned.
>
> The low priority jobs should be able to be restricted by cpuset, for
> example, just keep them on second tier memory nodes. Then all the
> above problems are gone.
>

Yes, that's an extreme way to overcome the issue, but we can do
something less extreme by just (hard) limiting the top tier memory
usage of low priority jobs.

> >
> > Basically I am saying we should put the upfront control (limit) on the
> > usage of top tier memory by the jobs.
>
> This sounds similar to what I talked about in LSFMM 2019
> (https://lwn.net/Articles/787418/). We used to have some potential
> usecase which divides DRAM:PMEM ratio for different jobs or memcgs
> when I was with Alibaba.
>
> In the first place I thought about per NUMA node limit, but it was
> very hard to configure it correctly for users unless you know exactly
> about your memory usage and hot/cold memory distribution.
>
> I'm wondering, just off the top of my head, if we could extend the
> semantic of low and min limit. For example, just redefine low and min
> to "the limit on top tier memory". Then we could have low priority
> jobs have 0 low/min limit.
>

The low and min limits have semantics similar to v1's soft limit for
this situation, i.e. letting the low priority job occupy top tier
memory and depending on reclaim to take back such a job's excess top
tier memory.

I have some thoughts on NUMA node limits which I will share in the other thread.


Re: [RFC PATCH v1 00/11] Manage the top tier memory in a tiered memory

2021-04-08 Thread Shakeel Butt
Hi Tim,

On Mon, Apr 5, 2021 at 11:08 AM Tim Chen  wrote:
>
> Traditionally, all memory is DRAM.  Some DRAM might be closer/faster than
> others NUMA wise, but a byte of media has about the same cost whether it
> is close or far.  But, with new memory tiers such as Persistent Memory
> (PMEM).  there is a choice between fast/expensive DRAM and slow/cheap
> PMEM.
>
> The fast/expensive memory lives in the top tier of the memory hierachy.
>
> Previously, the patchset
> [PATCH 00/10] [v7] Migrate Pages in lieu of discard
> https://lore.kernel.org/linux-mm/20210401183216.443c4...@viggo.jf.intel.com/
> provides a mechanism to demote cold pages from DRAM node into PMEM.
>
> And the patchset
> [PATCH 0/6] [RFC v6] NUMA balancing: optimize memory placement for memory 
> tiering system
> https://lore.kernel.org/linux-mm/20210311081821.138467-1-ying.hu...@intel.com/
> provides a mechanism to promote hot pages in PMEM to the DRAM node
> leveraging autonuma.
>
> The two patchsets together keep the hot pages in DRAM and colder pages
> in PMEM.

Thanks for working on this, as it is becoming more and more important,
particularly in data centers where memory is a big portion of the
cost.

I see you have responded to Michal, and I will add my more specific
response there. Here I wanted to give my high-level concern regarding
using v1's soft-limit-like semantics for top tier memory.

This patch series aims to distribute/partition top tier memory between
jobs of different priorities. We want high priority jobs to have
preferential access to the top tier memory and we don't want low
priority jobs to hog the top tier memory.

Using v1's soft limit like behavior can potentially cause high
priority jobs to stall to make enough space on top tier memory on
their allocation path and I think this patchset is aiming to reduce
that impact by making kswapd do that work. However I think the more
concerning issue is the low priority job hogging the top tier memory.

The possible ways the low priority job can hog the top tier memory are
by allocating non-movable memory or by mlocking the memory. (Oh there
is also pinning the memory but I don't know if there is a user api to
pin memory?) For the mlocked memory, you need to either modify the
reclaim code or use a different mechanism for demoting cold memory.

Basically I am saying we should put the upfront control (limit) on the
usage of top tier memory by the jobs.


Re: [PATCH] mm: page_counter: mitigate consequences of a page_counter underflow

2021-04-08 Thread Shakeel Butt
On Thu, Apr 8, 2021 at 7:31 AM Johannes Weiner  wrote:
>
> When the unsigned page_counter underflows, even just by a few pages, a
> cgroup will not be able to run anything afterwards and trigger the OOM
> killer in a loop.
>
> Underflows shouldn't happen, but when they do in practice, we may just
> be off by a small amount that doesn't interfere with the normal
> operation - consequences don't need to be that dire.
>
> Reset the page_counter to 0 upon underflow. We'll issue a warning that
> the accounting will be off and then try to keep limping along.
>
> [ We used to do this with the original res_counter, where it was a
>   more straight-forward correction inside the spinlock section. I
>   didn't carry it forward into the lockless page counters for
>   simplicity, but it turns out this is quite useful in practice. ]
>
> Signed-off-by: Johannes Weiner 

Reviewed-by: Shakeel Butt 
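
For reference, the mitigation described in the changelog boils down to
clamping the counter in page_counter_cancel(). A minimal sketch of that
idea (my reconstruction for illustration; the exact warning text and
placement may differ from the final patch):

```c
void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
{
	long new;

	new = atomic_long_sub_return(nr_pages, &counter->usage);
	/* More uncharges than charges? Warn once and limp along from 0. */
	if (WARN_ONCE(new < 0, "page_counter underflow: %ld nr_pages=%lu\n",
		      new, nr_pages)) {
		new = 0;
		atomic_long_set(&counter->usage, new);
	}
	propagate_protected_usage(counter, new);
}
```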


Re: High kmalloc-32 slab cache consumption with 10k containers

2021-04-07 Thread Shakeel Butt
On Wed, Apr 7, 2021 at 4:55 AM Michal Hocko  wrote:
>
> On Mon 05-04-21 11:18:48, Bharata B Rao wrote:
> > Hi,
> >
> > When running 10k (more-or-less-empty-)containers on a bare-metal Power9
> > server (160 CPUs, 2 NUMA nodes, 256G memory), it is seen that memory
> > consumption increases quite a lot (around 172G) when the containers are
> > running. Most of it comes from slab (149G) and within slab, the majority of
> > it comes from kmalloc-32 cache (102G)
>
> Is this 10k cgroups a testing environment or does anybody really use that
> in production? I would be really curious to hear how that behaves when
> those containers are not idle. E.g. global memory reclaim iterating over
> 10k memcgs will likely be very visible. I do remember playing with
> similar setups few years back and the overhead was very high.
> --

I can tell about our environment. A couple of thousand memcgs (~2k)
is very normal on our machines, as a machine can be running 100+ jobs
(and each job can manage their own sub-memcgs). However each job can
have a high number of mounts. There is no local disk and each package
of the job is remotely mounted (a bit more complicated).

We do have issues with global memory reclaim, but mostly the proactive
reclaim makes global reclaim a tail issue (and at the tail it often
creates havoc).


Re: [PATCH] mm: memcontrol: fix forget to obtain the ref to objcg in split_page_memcg

2021-04-02 Thread Shakeel Butt
On Fri, Apr 2, 2021 at 6:04 PM Andrew Morton  wrote:
>
> On Wed, 31 Mar 2021 20:35:02 -0700 Roman Gushchin  wrote:
>
> > On Thu, Apr 01, 2021 at 11:31:16AM +0800, Miaohe Lin wrote:
> > > On 2021/4/1 11:01, Muchun Song wrote:
> > > > Christian Borntraeger reported a warning about "percpu ref
> > > > (obj_cgroup_release) <= 0 (-1) after switching to atomic".
> > > > Because we forgot to obtain the reference to the objcg and
> > > > wrongly obtain the reference of memcg.
> > > >
> > > > Reported-by: Christian Borntraeger 
> > > > Signed-off-by: Muchun Song 
> > >
> > > Thanks for the patch.
> > > Is a Fixes tag needed?
> >
> > No, as the original patch hasn't been merged into the Linus's tree yet.
> > So the fix can be simply squashed.
>
> Help.  Which is "the original patch"?

"mm: memcontrol: use obj_cgroup APIs to charge kmem pages"


Re: [PATCH v7 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-04-01 Thread Shakeel Butt
CC: Hugh Dickins

On Wed, Mar 31, 2021 at 9:37 PM Alistair Popple  wrote:
>
> On Wednesday, 31 March 2021 10:57:46 PM AEDT Jason Gunthorpe wrote:
> > On Wed, Mar 31, 2021 at 03:15:47PM +1100, Alistair Popple wrote:
> > > On Wednesday, 31 March 2021 2:56:38 PM AEDT John Hubbard wrote:
> > > > On 3/30/21 3:56 PM, Alistair Popple wrote:
> > > > ...
> > > > >> +1 for renaming "munlock*" items to "mlock*", where applicable. good grief.
> > > > >
> > > > > At least the situation was weird enough to prompt further investigation :)
> > > > >
> > > > > Renaming to mlock* doesn't feel like the right solution to me either though. I
> > > > > am not sure if you saw me responding to myself earlier but I am thinking
> > > > > renaming try_to_munlock() -> page_mlocked() and try_to_munlock_one() ->
> > > > > page_mlock_one() might be better. Thoughts?
> > > > >
> > > >
> > > > Quite confused by this naming idea. Because: try_to_munlock() returns
> > > > void, so a boolean-style name such as "page_mlocked()" is already not a
> > > > good fit.
> > > >
> > > > Even more important, though, is that try_to_munlock() is mlock-ing the
> > > > page, right? Is there some subtle point I'm missing? It really is doing
> > > > an mlock to the best of my knowledge here. Although the kerneldoc
> > > > comment for try_to_munlock() seems questionable too:
> > >
> > > It's mlocking the page if it turns out it still needs to be locked after
> > > unlocking it. But I don't think you're missing anything.
> >
> > It is really searching all VMA's to see if the VMA flag is set and if
> > any are found then it mlocks the page.
> >
> > But presenting this rountine in its simplified form raises lots of
> > questions:
> >
> >  - What locking is being used to read the VMA flag?
> >  - Why do we need to manipulate global struct page flags under the
> >page table locks of a single VMA?
>
> I was wondering that and questioned it in an earlier version of this series. I
> have done some digging and the commit log for b87537d9e2fe ("mm: rmap use pte
> lock not mmap_sem to set PageMlocked") provides the original justification.
>
> It's fairly long so I won't quote it here but the summary seems to be that
> among other things the combination of page lock and ptl makes this safe. I
> have yet to verify if everything there still holds and is sensible, but the
> last paragraph certainly is :-)
>
> "Stopped short of separating try_to_munlock_one() from try_to_munmap_one()
> on this occasion, but that's probably the sensible next step - with a
> rename, given that try_to_munlock()'s business is to try to set Mlocked."
>
> >  - Why do we need to check for huge pages inside the VMA loop, not
> >before going to the rmap? PageTransCompoundHead() is not sensitive to
> >the PTEs. (and what happens if the huge page breaks up concurrently?)
> >  - Why do we clear the mlock bit then run around to try and set it?
>
> I don't have an answer for that as I'm not (yet) across all the mlock code
> paths, but I'm hoping this patch at least won't change anything.
>

It would be good to ask the person who has the most answers?

Hugh, the thread started at
https://lore.kernel.org/dri-devel/20210326000805.2518-4-apop...@nvidia.com/


Re: [External] Re: [RFC PATCH 00/15] Use obj_cgroup APIs to charge the LRU pages

2021-04-01 Thread Shakeel Butt
On Thu, Apr 1, 2021 at 9:08 AM Muchun Song  wrote:
>
[...]
> > The zombie issue is a pretty urgent concern that has caused several
> > production emergencies now. It needs a fix sooner rather than later.
>
> Thank you very much for clarifying the problem for me. I do agree
> with you. This issue should be fixed ASAP. Using objcg is a good
> choice. Dying objcg should not be a problem. Because the size of
> objcg is so small compared to memcg.
>

Just wanted to say out loud that yes this patchset will reduce the
memcg zombie issue but this is not the final destination. We should
continue the discussions on sharing/reusing scenarios.

Muchun, can you please also CC Hugh Dickins and Alex Shi in the next
version of your patchset?


Re: [PATCH] mm: memcontrol: fix forget to obtain the ref to objcg in split_page_memcg

2021-03-31 Thread Shakeel Butt
On Wed, Mar 31, 2021 at 8:02 PM Muchun Song  wrote:
>
> Christian Borntraeger reported a warning about "percpu ref
> (obj_cgroup_release) <= 0 (-1) after switching to atomic".
> Because we forgot to obtain the reference to the objcg and
> wrongly obtain the reference of memcg.
>
> Reported-by: Christian Borntraeger 
> Signed-off-by: Muchun Song 

Looks good to me.

Reviewed-by: Shakeel Butt 


Re: [PATCH mmotm] mm: vmscan: fix shrinker_rwsem in free_shrinker_info()

2021-03-31 Thread Shakeel Butt
On Tue, Mar 30, 2021 at 4:44 PM Hugh Dickins  wrote:
>
> Lockdep warns mm/vmscan.c: suspicious rcu_dereference_protected() usage!
> when free_shrinker_info() is called from mem_cgroup_css_free(): there it
> is called with no locking, whereas alloc_shrinker_info() calls it with
> down_write of shrinker_rwsem - which seems appropriate.  Rearrange that
> so free_shrinker_info() can manage the shrinker_rwsem for itself.
>
> Link: https://lkml.kernel.org/r/20210317140615.GB28839@xsang-OptiPlex-9020
> Reported-by: kernel test robot 
> Signed-off-by: Hugh Dickins 
> Cc: Yang Shi 
> ---
> Sorry, I've made no attempt to work out precisely where in the series
> the locking went missing, nor tried to fit this in as a fix on top of
> mm-vmscan-add-shrinker_info_protected-helper.patch
> which Oliver reported (and which you notated in mmotm's "series" file).
> This patch just adds the fix to the end of the series, after
> mm-vmscan-shrink-deferred-objects-proportional-to-priority.patch

The patch "mm: vmscan: add shrinker_info_protected() helper" replaces
rcu_dereference_protected(shrinker_info, true) with
rcu_dereference_protected(shrinker_info,
lockdep_is_held(&shrinker_rwsem)).

I think we don't really need shrinker_rwsem in free_shrinker_info()
which is called from css_free(). The bits of the map have already been
'reparented' in css_offline. I think we can remove
lockdep_is_held(&shrinker_rwsem) for free_shrinker_info().
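
To make the suggestion concrete, something like the below is what I
have in mind; just a sketch relying on css_free() exclusivity instead
of shrinker_rwsem, not a tested patch:

```c
void free_shrinker_info(struct mem_cgroup *memcg)
{
	struct mem_cgroup_per_node *pn;
	struct shrinker_info *info;
	int nid;

	for_each_node(nid) {
		pn = memcg->nodeinfo[nid];
		/*
		 * Called from css_free(): the css is already offline and the
		 * shrinker bits were reparented in css_offline(), so nobody
		 * else can reach this shrinker_info concurrently.
		 */
		info = rcu_dereference_protected(pn->shrinker_info, true);
		kvfree(info);
		rcu_assign_pointer(pn->shrinker_info, NULL);
	}
}
```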

>
>  mm/vmscan.c |   10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> --- mmotm/mm/vmscan.c   2021-03-28 17:26:54.935553064 -0700
> +++ linux/mm/vmscan.c   2021-03-30 15:55:13.374459559 -0700
> @@ -249,18 +249,20 @@ void free_shrinker_info(struct mem_cgrou
> struct shrinker_info *info;
> int nid;
>
> +   down_write(&shrinker_rwsem);
> for_each_node(nid) {
> pn = memcg->nodeinfo[nid];
> info = shrinker_info_protected(memcg, nid);
> kvfree(info);
> rcu_assign_pointer(pn->shrinker_info, NULL);
> }
> +   up_write(&shrinker_rwsem);
>  }
>
>  int alloc_shrinker_info(struct mem_cgroup *memcg)
>  {
> struct shrinker_info *info;
> -   int nid, size, ret = 0;
> +   int nid, size;
> int map_size, defer_size = 0;
>
> down_write(&shrinker_rwsem);
> @@ -270,9 +272,9 @@ int alloc_shrinker_info(struct mem_cgrou
> for_each_node(nid) {
> info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
> if (!info) {
> +   up_write(&shrinker_rwsem);
> free_shrinker_info(memcg);
> -   ret = -ENOMEM;
> -   break;
> +   return -ENOMEM;
> }
> info->nr_deferred = (atomic_long_t *)(info + 1);
> info->map = (void *)info->nr_deferred + defer_size;
> @@ -280,7 +282,7 @@ int alloc_shrinker_info(struct mem_cgrou
> }
> up_write(&shrinker_rwsem);
>
> -   return ret;
> +   return 0;
>  }
>
>  static inline bool need_expand(int nr_max)


Re: [RFC PATCH 00/15] Use obj_cgroup APIs to charge the LRU pages

2021-03-30 Thread Shakeel Butt
On Tue, Mar 30, 2021 at 2:10 PM Johannes Weiner  wrote:
>
[...]
> > The main concern I have with *just* reparenting LRU pages is that for
> > the long running systems, the root memcg will become a dumping ground.
> > In addition a job running multiple times on a machine will see
> > inconsistent memory usage if it re-accesses the file pages which were
> > reparented to the root memcg.
>
> I don't understand how Muchun's patches are supposed to *change* the
> behavior the way you are describing it. This IS today's behavior.
>
> We have hierarchical accounting, and a page that belongs to a leaf
> cgroup will automatically belong to all its parents.
>
> Further, if you delete a cgroup today, the abandoned cache will stay
> physically linked to that cgroup, but that zombie group no longer acts
> as a control structure: it imposes no limit and no protection; the
> pages will be reclaimed as if it WERE linked to the parent.
>
> For all intents and purposes, when you delete a cgroup today, its
> remaining pages ARE dumped onto the parent.
>
> The only difference is that today they pointlessly pin the leaf cgroup
> as a holding vessel - which is then round-robin'd from the parent
> during reclaim in order to pretend that all these child pages actually
> ARE linked to the parent's LRU list.
>
> Remember how we used to have every page physically linked to multiple
> lrus? The leaf cgroup and the root?
>
> All pages always belong to the (virtual) LRU list of all ancestor
> cgroups. The only thing Muchun changes is that they no longer pin a
> cgroup that has no semantical meaning anymore (because it's neither
> visible to the user nor exerts any contol over the pages anymore).
>

Indeed you are right. Even if the physical representation of the tree
has changed, the logical picture remains the same.

[Subconsciously I was sad that we will lose the information about the
origin memcg of the page for debugging purposes but then I thought if
we really need it then we can just add that metadata in the obj_cgroup
object. So, never mind.]

> Maybe I'm missing something that either you or Roman can explain to
> me. But this series looks like a (rare) pure win.
>
> Whether you like the current semantics is a separate discussion IMO.
>
> > Please note that I do agree with the mentioned problem and we do see
> > this issue in our fleet. Internally we have a "memcg mount option"
> > feature which couples a file system with a memcg and all file pages
> > allocated on that file system will be charged to that memcg. Multiple
> > instances (concurrent or subsequent) of the job will use that file
> > system (with a dedicated memcg) without leaving the zombies behind. I
> > am not pushing for this solution as it comes with its own intricacies
> > (e.g. if memcg coupled with a file system has a limit, the oom
> > behavior would be awkward and therefore internally we don't put a
> > limit on such memcgs). Though I want this to be part of discussion.
>
> Right, you disconnect memory from the tasks that are allocating it,
> and so you can't assign culpability when you need to.
>
> OOM is one thing, but there are also CPU cycles and IO bandwidth
> consumed during reclaim.
>

We didn't really have any issue regarding CPU or IO but that might be
due to our unique setup (i.e. no local disk).

> > I think the underlying reasons behind this issue are:
> >
> > 1) Filesystem shared by disjoint jobs.
> > 2) For job dedicated filesystems, the lifetime of the filesystem is
> > different from the lifetime of the job.
>
> There is also the case of deleting a cgroup just to recreate it right
> after for the same job. Many job managers do this on restart right now
> - like systemd, and what we're using in our fleet. This seems
> avoidable by recycling a group for another instance of the same job.

I was bundling the scenario you mentioned with (2) i.e. the filesystem
persists across multiple subsequent instances of the same job.

>
> Sharing is a more difficult discussion. If you access a page that you
> share with another cgroup, it may or may not be subject to your own or
> your buddy's memory limits. The only limit it is guaranteed to be
> subjected to is that of your parent. So One thing I could imagine is,
> instead of having a separate cgroup outside the hierarchy, we would
> reparent live pages the second they are accessed from a foreign
> cgroup. And reparent them until you reach the first common ancestor.
>
> This way, when you mount a filesystem shared by two jobs, you can put
> them into a joint subtree, and the root level of this subtree captures
> all the memory (as well as the reclaim CPU and IO) used by the two
> jobs - the private portions and the shared portions - and doesn't make
> them the liability of jobs in the system that DON'T share the same fs.

I will give more thought on this idea and see where it goes.

>
> But again, this is a useful discussion to have, but I don't quite see
> why it's relevant to Muchun's patches. They're 

Re: [RFC PATCH 00/15] Use obj_cgroup APIs to charge the LRU pages

2021-03-30 Thread Shakeel Butt
On Tue, Mar 30, 2021 at 3:20 AM Muchun Song  wrote:
>
> Since the following patchsets applied. All the kernel memory are charged
> with the new APIs of obj_cgroup.
>
> [v17,00/19] The new cgroup slab memory controller
> [v5,0/7] Use obj_cgroup APIs to charge kmem pages
>
> But user memory allocations (LRU pages) pinning memcgs for a long time -
> it exists at a larger scale and is causing recurring problems in the real
> world: page cache doesn't get reclaimed for a long time, or is used by the
> second, third, fourth, ... instance of the same job that was restarted into
> a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> and make page reclaim very inefficient.
>
> We can convert LRU pages and most other raw memcg pins to the objcg direction
> to fix this problem, and then the LRU pages will not pin the memcgs.
>
> This patchset aims to make the LRU pages to drop the reference to memory
> cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> of the dying cgroups will not increase if we run the following test script.
>
> ```bash
> #!/bin/bash
>
> cat /proc/cgroups | grep memory
>
> cd /sys/fs/cgroup/memory
>
> for i in range{1..500}
> do
> mkdir test
> echo $$ > test/cgroup.procs
> sleep 60 &
> echo $$ > cgroup.procs
> echo `cat test/cgroup.procs` > cgroup.procs
> rmdir test
> done
>
> cat /proc/cgroups | grep memory
> ```
>
> Patch 1 aims to fix page charging in page replacement.
> Patch 2-5 are code cleanup and simplification.
> Patch 6-15 convert LRU pages pin to the objcg direction.

The main concern I have with *just* reparenting LRU pages is that for
the long running systems, the root memcg will become a dumping ground.
In addition a job running multiple times on a machine will see
inconsistent memory usage if it re-accesses the file pages which were
reparented to the root memcg.

Please note that I do agree with the mentioned problem and we do see
this issue in our fleet. Internally we have a "memcg mount option"
feature which couples a file system with a memcg and all file pages
allocated on that file system will be charged to that memcg. Multiple
instances (concurrent or subsequent) of the job will use that file
system (with a dedicated memcg) without leaving the zombies behind. I
am not pushing for this solution as it comes with its own intricacies
(e.g. if memcg coupled with a file system has a limit, the oom
behavior would be awkward and therefore internally we don't put a
limit on such memcgs). Though I want this to be part of discussion.

I think the underlying reasons behind this issue are:

1) Filesystem shared by disjoint jobs.
2) For job dedicated filesystems, the lifetime of the filesystem is
different from the lifetime of the job.

For now, we have two potential solutions to the
zombies-due-to-offlined-LRU-pages i.e. (1) reparent and (2) pairing
memcg and filesystem. For reparenting, the cons are inconsistent
memory usage and root memcg potentially becoming dumping ground. For
pairing, the oom behavior is awkward which is the same for any type of
remote charging.

I am wondering how we can resolve the cons for each. To resolve the
root-memcg-dump issue in the reparenting, maybe we uncharge the page
when it reaches root and charge the next accessor. For
inconsistent memory usage, either users accept the inconsistency or
force reclaim before offline which will reduce the benefit of sharing
filesystem with subsequent instances of the job. For the pairing,
maybe we can punt to the user/admin to not set a limit on such memcg
to avoid awkward oom situations.

Thoughts?


Re: [External] [PATCH 2/3] mm: Charge active memcg when no mm is set

2021-03-29 Thread Shakeel Butt
On Mon, Mar 29, 2021 at 9:13 AM Muchun Song  wrote:
>
> On Mon, Mar 29, 2021 at 10:49 PM Dan Schatzberg
>  wrote:
[...]
>
> Since remote memcg must hold a reference, we do not
> need to do something like get_active_memcg() does.
> Just use css_get to obtain a ref, it is simpler. Just
> Like below.
>
> +   if (unlikely(!mm)) {
> +   memcg = active_memcg();
> +   if (unlikely(memcg)) {
> +   /* remote memcg must hold a ref. */
> +   css_get(memcg);
> +   return memcg;
> +   }
>

I second Muchun's suggestion.


Re: [mm, net-next v2] mm: net: memcg accounting for TCP rx zerocopy

2021-03-24 Thread Shakeel Butt
On Wed, Mar 24, 2021 at 1:39 PM Arjun Roy  wrote:
>
> On Wed, Mar 24, 2021 at 2:12 AM Michal Hocko  wrote:
> >
> > On Tue 23-03-21 11:47:54, Arjun Roy wrote:
> > > On Tue, Mar 23, 2021 at 7:34 AM Michal Hocko  wrote:
> > > >
> > > > On Wed 17-03-21 18:12:55, Johannes Weiner wrote:
> > > > [...]
> > > > > Here is an idea of how it could work:
> > > > >
> > > > > struct page already has
> > > > >
> > > > > struct {/* page_pool used by netstack */
> > > > > /**
> > > > >  * @dma_addr: might require a 64-bit value 
> > > > > even on
> > > > >  * 32-bit architectures.
> > > > >  */
> > > > > dma_addr_t dma_addr;
> > > > > };
> > > > >
> > > > > and as you can see from its union neighbors, there is quite a bit more
> > > > > room to store private data necessary for the page pool.
> > > > >
> > > > > When a page's refcount hits zero and it's a networking page, we can
> > > > > feed it back to the page pool instead of the page allocator.
> > > > >
> > > > > From a first look, we should be able to use the PG_owner_priv_1 page
> > > > > flag for network pages (see how this flag is overloaded, we can add a
> > > > > PG_network alias). With this, we can identify the page in __put_page()
> > > > > and __release_page(). These functions are already aware of different
> > > > > types of pages and do their respective cleanup handling. We can
> > > > > similarly make network a first-class citizen and hand pages back to
> > > > > the network allocator from in there.
> > > >
> > > > For compound pages we have a concept of destructors. Maybe we can extend
> > > > that for order-0 pages as well. The struct page is heavily packed and
> > > > compound_dtor shares the storage without other metadata
> > > >         int pages;                       /*    16     4 */
> > > >         unsigned char compound_dtor;     /*    16     1 */
> > > >         atomic_t hpage_pinned_refcount;  /*    16     4 */
> > > >         pgtable_t pmd_huge_pte;          /*    16     8 */
> > > >         void *zone_device_data;          /*    16     8 */
> > > >
> > > > But none of those should really require to be valid when a page is freed
> > > > unless I am missing something. It would really require to check their
> > > > users whether they can leave the state behind. But if we can establish a
> > > > contract that compound_dtor can be always valid when a page is freed
> > > > this would be really a nice and useful abstraction because you wouldn't
> > > > have to care about the specific type of page.
> > > >
> > > > But maybe I am just overlooking the real complexity there.
> > > > --
> > >
> > > For now probably the easiest way is to have network pages be first
> > > class with a specific flag as previously discussed and have concrete
> > > handling for it, rather than trying to establish the contract across
> > > page types.
> >
> > If you are going to claim a page flag then it would be much better to
> > have it more generic. Flags are really scarce and if all you care about
> > is PageHasDestructor() and provide one via page->dtor then the similar
> > mechanism can be reused by somebody else. Or does anything prevent that?
>
> The way I see it - the fundamental want here is, for some arbitrary
> page that we are dropping a reference on, to be able to tell that the
> provenance of the page is some network driver's page pool. If we added
> an enum target to compound_dtor, if we examine that offset in the page
> and look at that value, what guarantee do we have that the page isn't
> instead some other kind of page, and the byte value there was just
> coincidentally the one we were looking for (but it wasn't a network
> driver pool page)?
>
> Existing users of compound_dtor seem to check first that a
> PageCompound() or PageHead() return true - the specific scenario here,
> of receiving network packets, those pages will tend to not be compound
> (and more specifically, compound pages are explicitly disallowed for
> TCP receive zerocopy).
>
> Given that's the case, the options seem to be:
> 1) Use a page flag - with the downside that they are a severely
> limited resource,
> 2) Use some bits inside page->memcg_data - this I believe Johannes had
> reasons against, and it isn't always the case that MEMCG support is
> enabled.
> 3) Use compound_dtor - but I think this would have problems for the
> prior reasons.

I don't think Michal is suggesting to use PageCompound() or
PageHead(). He is suggesting to add a more general page flag
(PageHasDestructor) and corresponding page->dtor, so other potential
users can use it too.
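
Purely for illustration, the generic variant would look something like
the sketch below. Note that PageHasDestructor() and page->dtor do not
exist today; the names and the hook point are made up here just to show
the shape of the idea:

```c
/*
 * Hypothetical sketch only: PageHasDestructor() and page->dtor are
 * invented names, not existing kernel APIs.
 */
void __put_page(struct page *page)
{
	if (PageHasDestructor(page)) {
		/* e.g. a page_pool / network driver takes the page back */
		page->dtor(page);
		return;
	}

	/* ...existing zone-device / compound / single page handling... */
	if (unlikely(PageCompound(page)))
		__put_compound_page(page);
	else
		__put_single_page(page);
}
```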


Re: [mm, net-next v2] mm: net: memcg accounting for TCP rx zerocopy

2021-03-24 Thread Shakeel Butt
On Tue, Mar 23, 2021 at 11:42 AM Arjun Roy  wrote:
>
[...]
>
> To summarize then, it seems to me that we're on the same page now.
> I'll put together a tentative v3 such that:
> 1. It uses pre-charging, as previously discussed.
> 2. It uses a page flag to delineate pages of a certain networking sort
> (ie. this mechanism).
> 3. It avails itself of up to 4 words of data inside struct page,
> inside the networking specific struct.
> 4. And it sets up this opt-in lifecycle notification for drivers that
> choose to use it, falling back to existing behaviour without.
>

Arjun, if you don't mind, can you explain what the lifetime of such a
page will look like?

For example:

Driver:
page = dev_alloc_page()
/* page has 1 ref */
dev_map_page(page)
/* I don't think dev_map_page() takes a ref on page, so the ref remains 1. */

On incoming traffic the page goes to an skb which then gets assigned
to a struct sock. Does the kernel increase the refcnt of the page on
these operations?

The page gets mapped into user space which increments its refcnt.

After processing the data, the application unmaps the page and its
refcnt will be decremented.

__put_page() will be called when the refcnt reaches 0, so the initial
refcnt which the driver acquired has to be transferred to the next
layer. I am trying to understand how that will work.


[tip: sched/core] psi: Reduce calls to sched_clock() in psi

2021-03-23 Thread tip-bot2 for Shakeel Butt
The following commit has been merged into the sched/core branch of tip:

Commit-ID: df77430639c9cf73559bac0f25084518bf9a812d
Gitweb:     https://git.kernel.org/tip/df77430639c9cf73559bac0f25084518bf9a812d
Author:     Shakeel Butt 
AuthorDate: Sun, 21 Mar 2021 13:51:56 -07:00
Committer: Peter Zijlstra 
CommitterDate: Tue, 23 Mar 2021 16:01:58 +01:00

psi: Reduce calls to sched_clock() in psi

We noticed that the cost of psi increases with the increase in the
levels of the cgroups. Particularly the cost of cpu_clock() sticks out
as the kernel calls it multiple times as it traverses up the cgroup
tree. This patch reduces the calls to cpu_clock().

Performed perf bench on Intel Broadwell with 3 levels of cgroup.

Before the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

 Total time: 0.747 [sec]

 # Running sched/pipe benchmark...
 # Executed 100 pipe operations between two processes

 Total time: 3.516 [sec]

   3.516689 usecs/op
 284358 ops/sec

After the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

 Total time: 0.640 [sec]

 # Running sched/pipe benchmark...
 # Executed 100 pipe operations between two processes

 Total time: 3.329 [sec]

   3.329820 usecs/op
 300316 ops/sec

Signed-off-by: Shakeel Butt 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Johannes Weiner 
Link: https://lkml.kernel.org/r/20210321205156.4186483-1-shake...@google.com
---
 kernel/sched/psi.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index c8480d7..b1b00e9 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -644,12 +644,10 @@ static void poll_timer_fn(struct timer_list *t)
wake_up_interruptible(&group->poll_wait);
 }
 
-static void record_times(struct psi_group_cpu *groupc, int cpu)
+static void record_times(struct psi_group_cpu *groupc, u64 now)
 {
u32 delta;
-   u64 now;
 
-   now = cpu_clock(cpu);
delta = now - groupc->state_start;
groupc->state_start = now;
 
@@ -676,7 +674,7 @@ static void record_times(struct psi_group_cpu *groupc, int 
cpu)
 }
 
 static void psi_group_change(struct psi_group *group, int cpu,
-unsigned int clear, unsigned int set,
+unsigned int clear, unsigned int set, u64 now,
 bool wake_clock)
 {
struct psi_group_cpu *groupc;
@@ -696,7 +694,7 @@ static void psi_group_change(struct psi_group *group, int 
cpu,
 */
write_seqcount_begin(&groupc->seq);
 
-   record_times(groupc, cpu);
+   record_times(groupc, now);
 
for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
if (!(m & (1 << t)))
@@ -788,12 +786,14 @@ void psi_task_change(struct task_struct *task, int clear, 
int set)
struct psi_group *group;
bool wake_clock = true;
void *iter = NULL;
+   u64 now;
 
if (!task->pid)
return;
 
psi_flags_change(task, clear, set);
 
+   now = cpu_clock(cpu);
/*
 * Periodic aggregation shuts off if there is a period of no
 * task changes, so we wake it back up if necessary. However,
@@ -806,7 +806,7 @@ void psi_task_change(struct task_struct *task, int clear, 
int set)
wake_clock = false;
 
while ((group = iterate_groups(task, &iter)))
-   psi_group_change(group, cpu, clear, set, wake_clock);
+   psi_group_change(group, cpu, clear, set, now, wake_clock);
 }
 
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -815,6 +815,7 @@ void psi_task_switch(struct task_struct *prev, struct 
task_struct *next,
struct psi_group *group, *common = NULL;
int cpu = task_cpu(prev);
void *iter;
+   u64 now = cpu_clock(cpu);
 
if (next->pid) {
bool identical_state;
@@ -836,7 +837,7 @@ void psi_task_switch(struct task_struct *prev, struct 
task_struct *next,
break;
}
 
-   psi_group_change(group, cpu, 0, TSK_ONCPU, true);
+   psi_group_change(group, cpu, 0, TSK_ONCPU, now, true);
}
}
 
@@ -858,7 +859,7 @@ void psi_task_switch(struct task_struct *prev, struct 
task_struct *next,
 
iter = NULL;
while ((group = iterate_groups(prev, &iter)) && group != common)
-   psi_group_change(group, cpu, clear, set, true);
+   psi_group_change(group, cpu, clear, set, now, true);
 
/*
 * TSK_ONCPU is handled up to the common ancestor. If we're 
tasked
@@ -867,7 +868,7 @@

[PATCH] psi: reduce calls to sched_clock() in psi

2021-03-21 Thread Shakeel Butt
We noticed that the cost of psi increases with the increase in the
levels of the cgroups. Particularly the cost of cpu_clock() sticks out
as the kernel calls it multiple times as it traverses up the cgroup
tree. This patch reduces the calls to cpu_clock().

Performed perf bench on Intel Broadwell with 3 levels of cgroup.

Before the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

 Total time: 0.747 [sec]

 # Running sched/pipe benchmark...
 # Executed 100 pipe operations between two processes

 Total time: 3.516 [sec]

   3.516689 usecs/op
 284358 ops/sec

After the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

 Total time: 0.640 [sec]

 # Running sched/pipe benchmark...
 # Executed 100 pipe operations between two processes

 Total time: 3.329 [sec]

   3.329820 usecs/op
 300316 ops/sec

Signed-off-by: Shakeel Butt 
---
 kernel/sched/psi.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index ee3c5b48622f..16348b269713 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -644,12 +644,10 @@ static void poll_timer_fn(struct timer_list *t)
wake_up_interruptible(&group->poll_wait);
 }
 
-static void record_times(struct psi_group_cpu *groupc, int cpu)
+static void record_times(struct psi_group_cpu *groupc, u64 now)
 {
u32 delta;
-   u64 now;
 
-   now = cpu_clock(cpu);
delta = now - groupc->state_start;
groupc->state_start = now;
 
@@ -676,7 +674,7 @@ static void record_times(struct psi_group_cpu *groupc, int 
cpu)
 }
 
 static void psi_group_change(struct psi_group *group, int cpu,
-unsigned int clear, unsigned int set,
+unsigned int clear, unsigned int set, u64 now,
 bool wake_clock)
 {
struct psi_group_cpu *groupc;
@@ -696,7 +694,7 @@ static void psi_group_change(struct psi_group *group, int 
cpu,
 */
write_seqcount_begin(&groupc->seq);
 
-   record_times(groupc, cpu);
+   record_times(groupc, now);
 
for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
if (!(m & (1 << t)))
@@ -788,12 +786,14 @@ void psi_task_change(struct task_struct *task, int clear, 
int set)
struct psi_group *group;
bool wake_clock = true;
void *iter = NULL;
+   u64 now;
 
if (!task->pid)
return;
 
psi_flags_change(task, clear, set);
 
+   now = cpu_clock(cpu);
/*
 * Periodic aggregation shuts off if there is a period of no
 * task changes, so we wake it back up if necessary. However,
@@ -806,7 +806,7 @@ void psi_task_change(struct task_struct *task, int clear, 
int set)
wake_clock = false;
 
while ((group = iterate_groups(task, &iter)))
-   psi_group_change(group, cpu, clear, set, wake_clock);
+   psi_group_change(group, cpu, clear, set, now, wake_clock);
 }
 
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -815,6 +815,7 @@ void psi_task_switch(struct task_struct *prev, struct 
task_struct *next,
struct psi_group *group, *common = NULL;
int cpu = task_cpu(prev);
void *iter;
+   u64 now = cpu_clock(cpu);
 
if (next->pid) {
bool identical_state;
@@ -836,7 +837,7 @@ void psi_task_switch(struct task_struct *prev, struct 
task_struct *next,
break;
}
 
-   psi_group_change(group, cpu, 0, TSK_ONCPU, true);
+   psi_group_change(group, cpu, 0, TSK_ONCPU, now, true);
}
}
 
@@ -858,7 +859,7 @@ void psi_task_switch(struct task_struct *prev, struct 
task_struct *next,
 
iter = NULL;
while ((group = iterate_groups(prev, &iter)) && group != common)
-   psi_group_change(group, cpu, clear, set, true);
+   psi_group_change(group, cpu, clear, set, now, true);
 
/*
 * TSK_ONCPU is handled up to the common ancestor. If we're 
tasked
@@ -867,7 +868,7 @@ void psi_task_switch(struct task_struct *prev, struct 
task_struct *next,
if (sleep) {
clear &= ~TSK_ONCPU;
for (; group; group = iterate_groups(prev, &iter))
-   psi_group_change(group, cpu, clear, set, true);
+   psi_group_change(group, cpu, clear, set, now, 
true);
}
}
 }
-- 
2.31.0.291.g576ba9dcdaf-goog



Re: [PATCH v5 6/7] mm: memcontrol: inline __memcg_kmem_{un}charge() into obj_cgroup_{un}charge_pages()

2021-03-19 Thread Shakeel Butt
On Fri, Mar 19, 2021 at 9:39 AM Muchun Song  wrote:
>
> There is only one user of __memcg_kmem_charge(), so manually inline
> __memcg_kmem_charge() to obj_cgroup_charge_pages(). Similarly manually
> inline __memcg_kmem_uncharge() into obj_cgroup_uncharge_pages() and
> call obj_cgroup_uncharge_pages() in obj_cgroup_release().
>
> This is just code cleanup without any functionality changes.
>
> Signed-off-by: Muchun Song 

Reviewed-by: Shakeel Butt 


Re: [PATCH v5 5/7] mm: memcontrol: use obj_cgroup APIs to charge kmem pages

2021-03-19 Thread Shakeel Butt
On Fri, Mar 19, 2021 at 9:39 AM Muchun Song  wrote:
>
> Since Roman series "The new cgroup slab memory controller" applied. All
> slab objects are charged via the new APIs of obj_cgroup. The new APIs
> introduce a struct obj_cgroup to charge slab objects. It prevents
> long-living objects from pinning the original memory cgroup in the memory.
> But there are still some corner objects (e.g. allocations larger than
> order-1 page on SLUB) which are not charged via the new APIs. Those
> objects (include the pages which are allocated from buddy allocator
> directly) are charged as kmem pages which still hold a reference to
> the memory cgroup.
>
> We want to reuse the obj_cgroup APIs to charge the kmem pages.
> If we do that, we should store an object cgroup pointer to
> page->memcg_data for the kmem pages.
>
> Finally, page->memcg_data will have 3 different meanings.
>
>   1) For the slab pages, page->memcg_data points to an object cgroups
>  vector.
>
>   2) For the kmem pages (exclude the slab pages), page->memcg_data
>  points to an object cgroup.
>
>   3) For the user pages (e.g. the LRU pages), page->memcg_data points
>  to a memory cgroup.
>
> We do not change the behavior of page_memcg() and page_memcg_rcu().
> They are also suitable for LRU pages and kmem pages. Why?
>
> Because memory allocations pinning memcgs for a long time - it exists
> at a larger scale and is causing recurring problems in the real world:
> page cache doesn't get reclaimed for a long time, or is used by the
> second, third, fourth, ... instance of the same job that was restarted
> into a new cgroup every time. Unreclaimable dying cgroups pile up,
> waste memory, and make page reclaim very inefficient.
>
> We can convert LRU pages and most other raw memcg pins to the objcg
> direction to fix this problem, and then the page->memcg will always
> point to an object cgroup pointer. At that time, LRU pages and kmem
> pages will be treated the same. The implementation of page_memcg()
> will remove the kmem page check.
>
> This patch aims to charge the kmem pages by using the new APIs of
> obj_cgroup. Finally, the page->memcg_data of the kmem page points to
> an object cgroup. We can use the __page_objcg() to get the object
> cgroup associated with a kmem page. Or we can use page_memcg()
> to get the memory cgroup associated with a kmem page, but caller must
> ensure that the returned memcg won't be released (e.g. acquire the
> rcu_read_lock or css_set_lock).
>
> Signed-off-by: Muchun Song 
> Acked-by: Johannes Weiner 

Reviewed-by: Shakeel Butt 
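
For readers following the thread: the three meanings of page->memcg_data
described above come down to a couple of flag bits stored in the low
bits of the pointer. A simplified sketch of the encoding (the
authoritative version is in the patch itself):

```c
#define MEMCG_DATA_OBJCGS	(1UL << 0)	/* slab page: objcg vector */
#define MEMCG_DATA_KMEM		(1UL << 1)	/* kmem page: objcg pointer */
#define MEMCG_DATA_FLAGS_MASK	(MEMCG_DATA_OBJCGS | MEMCG_DATA_KMEM)

/* Only valid for kmem (non-slab) pages. */
static inline struct obj_cgroup *__page_objcg(struct page *page)
{
	return (struct obj_cgroup *)(page->memcg_data & ~MEMCG_DATA_FLAGS_MASK);
}

/* Only valid for LRU pages, i.e. neither slab nor kmem. */
static inline struct mem_cgroup *__page_memcg(struct page *page)
{
	return (struct mem_cgroup *)(page->memcg_data & ~MEMCG_DATA_FLAGS_MASK);
}
```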


Re: [PATCH v5 1/7] mm: memcontrol: slab: fix obtain a reference to a freeing memcg

2021-03-19 Thread Shakeel Butt
On Fri, Mar 19, 2021 at 9:38 AM Muchun Song  wrote:
>
> The rcu_read_lock/unlock only can guarantee that the memcg will not be
> freed, but it cannot guarantee the success of css_get (which is in the
> refill_stock when cached memcg changed) to memcg.
>
>   rcu_read_lock()
>   memcg = obj_cgroup_memcg(old)
>   __memcg_kmem_uncharge(memcg)
>   refill_stock(memcg)
>   if (stock->cached != memcg)
>   // css_get can change the ref counter from 0 back to 1.
>   css_get(&memcg->css)
>   rcu_read_unlock()
>
> This fix is very like the commit:
>
>   eefbfa7fd678 ("mm: memcg/slab: fix use after free in obj_cgroup_charge")
>
> Fix this by holding a reference to the memcg which is passed to the
> __memcg_kmem_uncharge() before calling __memcg_kmem_uncharge().
>
> Fixes: 3de7d4f25a74 ("mm: memcg/slab: optimize objcg stock draining")
> Signed-off-by: Muchun Song 

Good catch.

Reviewed-by: Shakeel Butt 


Re: [PATCH 2/2] mm: memcontrol: deprecate swapaccounting=0 mode

2021-03-19 Thread Shakeel Butt
On Fri, Mar 19, 2021 at 10:36 AM Johannes Weiner  wrote:
>
> On Fri, Mar 19, 2021 at 06:49:55AM -0700, Shakeel Butt wrote:
> > On Thu, Mar 18, 2021 at 10:49 PM Johannes Weiner  wrote:
> > >
> > > The swapaccounting= commandline option already does very little
> > > today. To close a trivial containment failure case, the swap ownership
> > > tracking part of the swap controller has recently become mandatory
> > > (see commit 2d1c498072de ("mm: memcontrol: make swap tracking an
> > > integral part of memory control") for details), which makes up the
> > > majority of the work during swapout, swapin, and the swap slot map.
> > >
> > > The only thing left under this flag is the page_counter operations and
> > > the visibility of the swap control files in the first place, which are
> > > rather meager savings. There also aren't many scenarios, if any, where
> > > controlling the memory of a cgroup while allowing it unlimited access
> > > to a global swap space is a workable resource isolation stragegy.
> >
> > *strategy
>
> Will fix :)
>
> > > On the other hand, there have been several bugs and confusion around
> > > the many possible swap controller states (cgroup1 vs cgroup2 behavior,
> > > memory accounting without swap accounting, memcg runtime disabled).
> > >
> > > This puts the maintenance overhead of retaining the toggle above its
> > > practical benefits. Deprecate it.
> > >
> > > Suggested-by: Shakeel Butt 
> > > Signed-off-by: Johannes Weiner 
> > [...]
> > >
> > >  static int __init setup_swap_account(char *s)
> > >  {
> > > -   if (!strcmp(s, "1"))
> > > -   cgroup_memory_noswap = false;
> > > -   else if (!strcmp(s, "0"))
> > > -   cgroup_memory_noswap = true;
> > > -   return 1;
> > > +   pr_warn_once("The swapaccount= commandline option is deprecated. "
> > > +"Please report your usecase to linux...@kvack.org if 
> > > you "
> > > +"depend on this functionality.\n");
> > > +   return 0;
> >
> > What's the difference between returning 0 or 1 here?
>
> It signals whether the parameter is "recognized" by the kernel as a
> valid thing to pass, or whether it's unknown. If it's unknown, it'll
> be passed on to the init system (which ignores it), so this resembles
> the behavior we'll have when we remove the __setup in the future.
>
> If somebody can make an argument why we should go with one over the
> other, I'll happily go with that.
>
> > >  __setup("swapaccount=", setup_swap_account);
> > >
> > > @@ -7291,27 +7287,13 @@ static struct cftype memsw_files[] = {
> > > { },/* terminate */
> > >  };
> > >
> > > -/*
> > > - * If mem_cgroup_swap_init() is implemented as a subsys_initcall()
> > > - * instead of a core_initcall(), this could mean cgroup_memory_noswap 
> > > still
> > > - * remains set to false even when memcg is disabled via 
> > > "cgroup_disable=memory"
> > > - * boot parameter. This may result in premature OOPS inside
> > > - * mem_cgroup_get_nr_swap_pages() function in corner cases.
> > > - */
> > >  static int __init mem_cgroup_swap_init(void)
> > >  {
> > > -   /* No memory control -> no swap control */
> > > -   if (mem_cgroup_disabled())
> > > -   cgroup_memory_noswap = true;
> > > -
> > > -   if (cgroup_memory_noswap)
> > > -   return 0;
> > > -
> >
> > Do we need a mem_cgroup_disabled() check here?
>
> cgroup_add_cftypes() implies this check from the cgroup side and will
> just do nothing and return success. So we don't need it now.
>
> But it is something we'd have to remember to add if we do add more
> code to this function later on.
>
> Either way works for me. I can add it if people think it's better.
>

I am fine with either way. For this patch:

Reviewed-by: Shakeel Butt 


Re: [PATCH v10 0/3] Charge loop device i/o to issuing cgroup

2021-03-19 Thread Shakeel Butt
On Fri, Mar 19, 2021 at 8:51 AM Dan Schatzberg  wrote:
>
> On Thu, Mar 18, 2021 at 05:56:28PM -0700, Shakeel Butt wrote:
> >
> > We need something similar for mem_cgroup_swapin_charge_page() as well.
> >
> > It is better to take this series in mm tree and Jens is ok with that [1].
> >
> > [1] 
> > https://lore.kernel.org/linux-next/4fea89a5-0e18-0791-18a8-4c5907b0d...@kernel.dk/
>
> It sounds like there are no concerns about the loop-related work in
> the patch series. I'll rebase on the mm tree and resubmit.

One suggestion would be to make get_mem_cgroup_from_mm() more generic
(i.e. handle !mm && active_memcg() case) and avoid
get_mem_cgroup_from_current() as it might go away.
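
Something along these lines is what I mean; a rough sketch only, the
details of the active-memcg fallback would need to be settled in the
actual series:

```c
static __always_inline
struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
{
	struct mem_cgroup *memcg;

	if (unlikely(!mm)) {
		/* Remote charging: use the caller's active memcg, if any. */
		memcg = active_memcg();
		if (unlikely(memcg)) {
			/* The remote memcg must already hold a reference. */
			css_get(&memcg->css);
			return memcg;
		}
		mm = current->mm;
		if (unlikely(!mm))
			return root_mem_cgroup;
	}

	rcu_read_lock();
	do {
		memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
		if (unlikely(!memcg))
			memcg = root_mem_cgroup;
	} while (!css_tryget(&memcg->css));
	rcu_read_unlock();

	return memcg;
}
```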


Re: [PATCH 1/2] mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled

2021-03-19 Thread Shakeel Butt
On Thu, Mar 18, 2021 at 10:49 PM Johannes Weiner  wrote:
>
> Since commit 2d1c498072de ("mm: memcontrol: make swap tracking an
> integral part of memory control"), the cgroup swap arrays are used to
> track memory ownership at the time of swap readahead and swapoff, even
> if swap space *accounting* has been turned off by the user via
> swapaccount=0 (which sets cgroup_memory_noswap).
>
> However, the patch was overzealous: by simply dropping the
> cgroup_memory_noswap conditionals in the swapon, swapoff and uncharge
> path, it caused the cgroup arrays being allocated even when the memory
> controller as a whole is disabled. This is a waste of that memory.
>
> Restore mem_cgroup_disabled() checks, implied previously by
> cgroup_memory_noswap, in the swapon, swapoff, and swap_entry_free
> callbacks.
>
> Fixes: 2d1c498072de ("mm: memcontrol: make swap tracking an integral part of 
> memory control")
> Reported-by: Hugh Dickins 
> Signed-off-by: Johannes Weiner 

Reviewed-by: Shakeel Butt 


Re: [External] Re: [PATCH v4 4/5] mm: memcontrol: use obj_cgroup APIs to charge kmem pages

2021-03-19 Thread Shakeel Butt
On Thu, Mar 18, 2021 at 9:05 PM Muchun Song  wrote:
>
> On Fri, Mar 19, 2021 at 11:40 AM Shakeel Butt  wrote:
> >
> > On Thu, Mar 18, 2021 at 4:08 AM Muchun Song  
> > wrote:
> > >
> > [...]
> > >
> > > +static inline struct mem_cgroup *get_obj_cgroup_memcg(struct obj_cgroup 
> > > *objcg)
> >
> > I would prefer get_mem_cgroup_from_objcg().
>
> Inspired by obj_cgroup_memcg() which returns the memcg from objcg.
> So I introduce get_obj_cgroup_memcg() which obtains a reference of
> memcg on the basis of obj_cgroup_memcg().
>
> So that the names are more consistent. Just my thought.
>
> So should I rename it to get_mem_cgroup_from_objcg?
>

If you look at other functions which get reference on mem_cgroup, they
have the format of get_mem_cgroup_*. Similarly the current function to
get a reference on obj_cgroup is get_obj_cgroup_from_current().

So, from the name get_obj_cgroup_memcg(), it seems like we are getting
reference on obj_cgroup but the function is getting reference on
mem_cgroup.

> >
> > > +{
> > > +   struct mem_cgroup *memcg;
> > > +
> > > +   rcu_read_lock();
> > > +retry:
> > > +   memcg = obj_cgroup_memcg(objcg);
> > > +   if (unlikely(!css_tryget(&memcg->css)))
> > > +   goto retry;
> > > +   rcu_read_unlock();
> > > +
> > > +   return memcg;
> > > +}
> > > +
> > >  #ifdef CONFIG_MEMCG_KMEM
> > >  int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
> > >  gfp_t gfp, bool new_page)
> > > @@ -3070,15 +3088,8 @@ static int obj_cgroup_charge_pages(struct 
> > > obj_cgroup *objcg, gfp_t gfp,
> > > struct mem_cgroup *memcg;
> > > int ret;
> > >
> > > -   rcu_read_lock();
> > > -retry:
> > > -   memcg = obj_cgroup_memcg(objcg);
> > > -   if (unlikely(!css_tryget(&memcg->css)))
> > > -   goto retry;
> > > -   rcu_read_unlock();
> > > -
> > > +   memcg = get_obj_cgroup_memcg(objcg);
> > > ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
> >
> > Why not manually inline __memcg_kmem_charge() here? This is the only user.
> >
> > Similarly manually inline __memcg_kmem_uncharge() into
> > obj_cgroup_uncharge_pages() and call obj_cgroup_uncharge_pages() in
> > obj_cgroup_release().
>
> Good point. I will do this.
>
> >
> > > -
> > > css_put(&memcg->css);
> > >
> > > return ret;
> > > @@ -3143,18 +3154,18 @@ static void __memcg_kmem_uncharge(struct 
> > > mem_cgroup *memcg, unsigned int nr_page
> > >   */
> > >  int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
> > >  {
> > > -   struct mem_cgroup *memcg;
> > > +   struct obj_cgroup *objcg;
> > > int ret = 0;
> > >
> > > -   memcg = get_mem_cgroup_from_current();
> >
> > This was the only use of get_mem_cgroup_from_current(). Why not remove it?
>
> I saw a potential user.
>
> [PATCH v10 0/3] Charge loop device i/o to issuing cgroup
>
> To avoid reintroducing them. So I did not remove it.
>

Don't worry about that. Most probably that user would be changing this
function, so it would be better to introduce it from scratch.


Re: [PATCH 2/2] mm: memcontrol: deprecate swapaccounting=0 mode

2021-03-19 Thread Shakeel Butt
On Thu, Mar 18, 2021 at 10:49 PM Johannes Weiner  wrote:
>
> The swapaccounting= commandline option already does very little
> today. To close a trivial containment failure case, the swap ownership
> tracking part of the swap controller has recently become mandatory
> (see commit 2d1c498072de ("mm: memcontrol: make swap tracking an
> integral part of memory control") for details), which makes up the
> majority of the work during swapout, swapin, and the swap slot map.
>
> The only thing left under this flag is the page_counter operations and
> the visibility of the swap control files in the first place, which are
> rather meager savings. There also aren't many scenarios, if any, where
> controlling the memory of a cgroup while allowing it unlimited access
> to a global swap space is a workable resource isolation stragegy.

*strategy

>
> On the other hand, there have been several bugs and confusion around
> the many possible swap controller states (cgroup1 vs cgroup2 behavior,
> memory accounting without swap accounting, memcg runtime disabled).
>
> This puts the maintenance overhead of retaining the toggle above its
> practical benefits. Deprecate it.
>
> Suggested-by: Shakeel Butt 
> Signed-off-by: Johannes Weiner 
[...]
>
>  static int __init setup_swap_account(char *s)
>  {
> -   if (!strcmp(s, "1"))
> -   cgroup_memory_noswap = false;
> -   else if (!strcmp(s, "0"))
> -   cgroup_memory_noswap = true;
> -   return 1;
> +   pr_warn_once("The swapaccount= commandline option is deprecated. "
> +"Please report your usecase to linux...@kvack.org if you 
> "
> +"depend on this functionality.\n");
> +   return 0;

What's the difference between returning 0 or 1 here?

>  }
>  __setup("swapaccount=", setup_swap_account);
>
> @@ -7291,27 +7287,13 @@ static struct cftype memsw_files[] = {
> { },/* terminate */
>  };
>
> -/*
> - * If mem_cgroup_swap_init() is implemented as a subsys_initcall()
> - * instead of a core_initcall(), this could mean cgroup_memory_noswap still
> - * remains set to false even when memcg is disabled via 
> "cgroup_disable=memory"
> - * boot parameter. This may result in premature OOPS inside
> - * mem_cgroup_get_nr_swap_pages() function in corner cases.
> - */
>  static int __init mem_cgroup_swap_init(void)
>  {
> -   /* No memory control -> no swap control */
> -   if (mem_cgroup_disabled())
> -   cgroup_memory_noswap = true;
> -
> -   if (cgroup_memory_noswap)
> -   return 0;
> -

Do we need a mem_cgroup_disabled() check here?

> WARN_ON(cgroup_add_dfl_cftypes(_cgrp_subsys, swap_files));
> WARN_ON(cgroup_add_legacy_cftypes(_cgrp_subsys, memsw_files));
>
> return 0;
>  }
> -core_initcall(mem_cgroup_swap_init);
> +subsys_initcall(mem_cgroup_swap_init);
>
>  #endif /* CONFIG_MEMCG_SWAP */
> --
> 2.30.1
>


Re: [PATCH v4 4/5] mm: memcontrol: use obj_cgroup APIs to charge kmem pages

2021-03-18 Thread Shakeel Butt
On Thu, Mar 18, 2021 at 4:08 AM Muchun Song  wrote:
>
[...]
>
> +static inline struct mem_cgroup *get_obj_cgroup_memcg(struct obj_cgroup 
> *objcg)

I would prefer get_mem_cgroup_from_objcg().

> +{
> +   struct mem_cgroup *memcg;
> +
> +   rcu_read_lock();
> +retry:
> +   memcg = obj_cgroup_memcg(objcg);
> +   if (unlikely(!css_tryget(&memcg->css)))
> +   goto retry;
> +   rcu_read_unlock();
> +
> +   return memcg;
> +}
> +
>  #ifdef CONFIG_MEMCG_KMEM
>  int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
>  gfp_t gfp, bool new_page)
> @@ -3070,15 +3088,8 @@ static int obj_cgroup_charge_pages(struct obj_cgroup 
> *objcg, gfp_t gfp,
> struct mem_cgroup *memcg;
> int ret;
>
> -   rcu_read_lock();
> -retry:
> -   memcg = obj_cgroup_memcg(objcg);
> -   if (unlikely(!css_tryget(&memcg->css)))
> -   goto retry;
> -   rcu_read_unlock();
> -
> +   memcg = get_obj_cgroup_memcg(objcg);
> ret = __memcg_kmem_charge(memcg, gfp, nr_pages);

Why not manually inline __memcg_kmem_charge() here? This is the only user.

Similarly manually inline __memcg_kmem_uncharge() into
obj_cgroup_uncharge_pages() and call obj_cgroup_uncharge_pages() in
obj_cgroup_release().

> -
> css_put(&memcg->css);
>
> return ret;
> @@ -3143,18 +3154,18 @@ static void __memcg_kmem_uncharge(struct mem_cgroup 
> *memcg, unsigned int nr_page
>   */
>  int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
>  {
> -   struct mem_cgroup *memcg;
> +   struct obj_cgroup *objcg;
> int ret = 0;
>
> -   memcg = get_mem_cgroup_from_current();

This was the only use of get_mem_cgroup_from_current(). Why not remove it?

> -   if (memcg && !mem_cgroup_is_root(memcg)) {
> -   ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
> +   objcg = get_obj_cgroup_from_current();
> +   if (objcg) {
> +   ret = obj_cgroup_charge_pages(objcg, gfp, 1 << order);
> if (!ret) {
> -   page->memcg_data = (unsigned long)memcg |
> +   page->memcg_data = (unsigned long)objcg |
> MEMCG_DATA_KMEM;
> return 0;
> }
> -   css_put(&memcg->css);
> +   obj_cgroup_put(objcg);
> }
> return ret;
>  }
[...]
>  static void uncharge_page(struct page *page, struct uncharge_gather *ug)
>  {
> unsigned long nr_pages;
> +   struct mem_cgroup *memcg;
> +   struct obj_cgroup *objcg;
>
> VM_BUG_ON_PAGE(PageLRU(page), page);
>
> -   if (!page_memcg(page))
> -   return;
> -
> /*
>  * Nobody should be changing or seriously looking at
> -* page_memcg(page) at this point, we have fully
> +* page memcg or objcg at this point, we have fully
>  * exclusive access to the page.
>  */
> +   if (PageMemcgKmem(page)) {
> +   objcg = __page_objcg(page);
> +   memcg = get_obj_cgroup_memcg(objcg);

Can you add a comment that this get matches the put at the end of the
function and kmem pages do not hold memcg references anymore.

> +   } else {
> +   memcg = __page_memcg(page);
> +   }
> +
> +   if (!memcg)
> +   return;
>
> -   if (ug->memcg != page_memcg(page)) {
> +   if (ug->memcg != memcg) {
> if (ug->memcg) {
> uncharge_batch(ug);
> uncharge_gather_clear(ug);
> }
> -   ug->memcg = page_memcg(page);
> +   ug->memcg = memcg;
> ug->dummy_page = page;
>
> /* pairs with css_put in uncharge_batch */
> -   css_get(&ug->memcg->css);
> +   css_get(>css);
> }
>
> nr_pages = compound_nr(page);
> -   ug->nr_pages += nr_pages;
>
> -   if (PageMemcgKmem(page))
> +   if (PageMemcgKmem(page)) {
> +   ug->nr_memory += nr_pages;
> ug->nr_kmem += nr_pages;
> -   else
> +
> +   page->memcg_data = 0;
> +   obj_cgroup_put(objcg);
> +   } else {
> +   /* LRU pages aren't accounted at the root level */
> +   if (!mem_cgroup_is_root(memcg))
> +   ug->nr_memory += nr_pages;
> ug->pgpgout++;
>
> -   page->memcg_data = 0;
> -   css_put(&ug->memcg->css);
> +   page->memcg_data = 0;
> +   }
> +
> +   css_put(&memcg->css);
>  }
>
>  /**
> --
> 2.11.0
>


Re: [PATCH v4 3/5] mm: memcontrol: change ug->dummy_page only if memcg changed

2021-03-18 Thread Shakeel Butt
On Thu, Mar 18, 2021 at 4:08 AM Muchun Song  wrote:
>
> Just like assignment to ug->memcg, we only need to update ug->dummy_page
> if memcg changed. So move it to there. This is a very small optimization.
>
> Signed-off-by: Muchun Song 

Reviewed-by: Shakeel Butt 


Re: [PATCH v4 2/5] mm: memcontrol: directly access page->memcg_data in mm/page_alloc.c

2021-03-18 Thread Shakeel Butt
On Thu, Mar 18, 2021 at 4:08 AM Muchun Song  wrote:
>
> The page_memcg() is not suitable for use by page_expected_state() and
> page_bad_reason(). Because it can BUG_ON() for the slab pages when
> CONFIG_DEBUG_VM is enabled. As neither lru, nor kmem, nor slab page
> should have anything left in there by the time the page is freed, what
> we care about is whether the value of page->memcg_data is 0. So just
> directly access page->memcg_data here.
>
> Signed-off-by: Muchun Song 

Reviewed-by: Shakeel Butt 


Re: [PATCH v10 0/3] Charge loop device i/o to issuing cgroup

2021-03-18 Thread Shakeel Butt
On Thu, Mar 18, 2021 at 4:46 PM Andrew Morton  wrote:
>
> On Thu, 18 Mar 2021 10:00:17 -0600 Jens Axboe  wrote:
>
> > On 3/18/21 9:53 AM, Shakeel Butt wrote:
> > > On Wed, Mar 17, 2021 at 3:30 PM Jens Axboe  wrote:
> > >>
> > >> On 3/16/21 9:36 AM, Dan Schatzberg wrote:
> > >>> No major changes, just rebasing and resubmitting
> > >>
> > >> Applied for 5.13, thanks.
> > >>
> > >
> > > I have requested a couple of changes in the patch series. Can this
> > > applied series still be changed or new patches are required?
> >
> > I have nothing sitting on top of it for now, so as far as I'm concerned
> > we can apply a new series instead. Then we can also fold in that fix
> > from Colin that he posted this morning...
>
> The collision in memcontrol.c is a pain, but I guess as this is mainly
> a loop patch, the block tree is an appropriate route.
>
> Here's the collision between "mm: Charge active memcg when no mm is
> set" and Shakeels's
> https://lkml.kernel.org/r/20210305212639.775498-1-shake...@google.com
>
>
> --- mm/memcontrol.c
> +++ mm/memcontrol.c
> @@ -6728,8 +6730,15 @@ int mem_cgroup_charge(struct page *page, struct 
> mm_struct *mm, gfp_t gfp_mask)
> rcu_read_unlock();
> }
>
> -   if (!memcg)
> -   memcg = get_mem_cgroup_from_mm(mm);
> +   if (!memcg) {
> +   if (!mm) {
> +   memcg = get_mem_cgroup_from_current();
> +   if (!memcg)
> +   memcg = get_mem_cgroup_from_mm(current->mm);
> +   } else {
> +   memcg = get_mem_cgroup_from_mm(mm);
> +   }
> +   }
>
> ret = try_charge(memcg, gfp_mask, nr_pages);
> if (ret)
>
>
> Which I resolved thusly:
>
> int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> {
> struct mem_cgroup *memcg;
> int ret;
>
> if (mem_cgroup_disabled())
> return 0;
>
> if (!mm) {
> memcg = get_mem_cgroup_from_current();
> if (!memcg)
> memcg = get_mem_cgroup_from_mm(current->mm);
> } else {
> memcg = get_mem_cgroup_from_mm(mm);
> }
>
> ret = __mem_cgroup_charge(page, memcg, gfp_mask);
> css_put(>css);
>
> return ret;
> }
>

We need something similar for mem_cgroup_swapin_charge_page() as well.

It would be better to take this series through the mm tree, and Jens is ok with that [1].

[1] 
https://lore.kernel.org/linux-next/4fea89a5-0e18-0791-18a8-4c5907b0d...@kernel.dk/


Re: linux-next: manual merge of the akpm-current tree with the block tree

2021-03-18 Thread Shakeel Butt
On Wed, Mar 17, 2021 at 11:17 PM Stephen Rothwell  wrote:
>
> Hi all,
>
> Today's linux-next merge of the akpm-current tree got a conflict in:
>
>   mm/memcontrol.c
>
> between commit:
>
>   06d69d4c8669 ("mm: Charge active memcg when no mm is set")
>
> from the block tree and commit:
>
>   674788258a66 ("memcg: charge before adding to swapcache on swapin")
>
> from the akpm-current tree.
>
> I fixed it up (I think - see below) and can carry the fix as necessary.
> This is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
>
> --
> Cheers,
> Stephen Rothwell
>
> diff --cc mm/memcontrol.c
> index f05501669e29,668d1d7c2645..
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@@ -6691,65 -6549,73 +6550,80 @@@ out
>* @gfp_mask: reclaim mode
>*
>* Try to charge @page to the memcg that @mm belongs to, reclaiming
>  - * pages according to @gfp_mask if necessary.
>  + * pages according to @gfp_mask if necessary. if @mm is NULL, try to
>  + * charge to the active memcg.
>*
> +  * Do not use this for pages allocated for swapin.
> +  *
>* Returns 0 on success. Otherwise, an error code is returned.
>*/
>   int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t 
> gfp_mask)
>   {
> -   unsigned int nr_pages = thp_nr_pages(page);
> -   struct mem_cgroup *memcg = NULL;
> -   int ret = 0;
> +   struct mem_cgroup *memcg;
> +   int ret;
>
> if (mem_cgroup_disabled())
> -   goto out;
> +   return 0;
>
> -   if (PageSwapCache(page)) {
> -   swp_entry_t ent = { .val = page_private(page), };
> -   unsigned short id;
>  -  memcg = get_mem_cgroup_from_mm(mm);
> ++  if (!mm) {
> ++  memcg = get_mem_cgroup_from_current();
> ++  if (!memcg)
> ++  memcg = get_mem_cgroup_from_mm(current->mm);
> ++  } else {
> ++  memcg = get_mem_cgroup_from_mm(mm);
> ++  }
> +   ret = __mem_cgroup_charge(page, memcg, gfp_mask);
> +   css_put(>css);

Things are more complicated than this. First, we need a similar change
in mem_cgroup_swapin_charge_page(), but I am thinking of making
get_mem_cgroup_from_mm() more general so that no changes are needed in
these two functions.

Is it possible to get Dan's patch series into the mm tree? More
specifically, with the above two patches in the same tree, one of us can
rebase their patch over the other (I am fine with doing this myself).
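
For illustration, a rough sketch of one way get_mem_cgroup_from_mm() could be
generalized (illustrative only, not a posted patch; active_memcg() is the
existing remote-charging helper): fall back to the active memcg, then to
current->mm, and finally to the root memcg when no mm is passed, so callers
like mem_cgroup_charge() and mem_cgroup_swapin_charge_page() would not need
their own fallback logic.

struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
{
        struct mem_cgroup *memcg;

        if (mem_cgroup_disabled())
                return NULL;

        /* e.g. loop device I/O, disk probing on boot, acct() writes */
        if (unlikely(!mm)) {
                memcg = active_memcg();
                if (memcg) {
                        /* a remote memcg must already hold a reference */
                        css_get(&memcg->css);
                        return memcg;
                }
                mm = current->mm;
                if (unlikely(!mm))
                        return root_mem_cgroup;
        }

        rcu_read_lock();
        do {
                memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
                if (unlikely(!memcg))
                        memcg = root_mem_cgroup;
        } while (!css_tryget(&memcg->css));
        rcu_read_unlock();

        return memcg;
}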


Re: [PATCH v10 0/3] Charge loop device i/o to issuing cgroup

2021-03-18 Thread Shakeel Butt
On Wed, Mar 17, 2021 at 3:30 PM Jens Axboe  wrote:
>
> On 3/16/21 9:36 AM, Dan Schatzberg wrote:
> > No major changes, just rebasing and resubmitting
>
> Applied for 5.13, thanks.
>

I have requested a couple of changes in the patch series. Can the
applied series still be changed, or are new patches required?


[PATCH] memcg: set page->private before calling swap_readpage

2021-03-17 Thread Shakeel Butt
The function swap_readpage() (and other functions it calls) extracts the
swap entry from page->private. However, for SWP_SYNCHRONOUS_IO the kernel
skips the swapcache, so we need to manually set page->private to the swap
entry before calling swap_readpage().

Signed-off-by: Shakeel Butt 
Reported-by: Heiko Carstens 
---

Andrew, please squash this into "memcg: charge before adding to
swapcache on swapin" patch.

 mm/memory.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index aefd158ae1ea..b6f3410b5902 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3324,7 +3324,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
workingset_refault(page, shadow);
 
lru_cache_add(page);
+
+   /* To provide entry to swap_readpage() */
+   set_page_private(page, entry.val);
swap_readpage(page, true);
+   set_page_private(page, 0);
}
} else {
page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-- 
2.31.0.rc2.261.g7f71774620-goog



Re: [PATCH 3/3] loop: Charge i/o to mem and blk cg

2021-03-16 Thread Shakeel Butt
On Tue, Mar 16, 2021 at 8:37 AM Dan Schatzberg  wrote:
>
[...]
>
>  /* Support for loadable transfer modules */
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0c04d39a7967..fd5dd961d01f 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1187,6 +1187,17 @@ static inline struct mem_cgroup 
> *get_mem_cgroup_from_mm(struct mm_struct *mm)
> return NULL;
>  }
>
> +static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
> +{
> +   return NULL;
> +}

The above function has been removed.


Re: [PATCH 2/3] mm: Charge active memcg when no mm is set

2021-03-16 Thread Shakeel Butt
On Tue, Mar 16, 2021 at 8:37 AM Dan Schatzberg  wrote:
>
> memalloc_use_memcg() worked for kernel allocations but was silently
> ignored for user pages.

set_active_memcg()

>
> This patch establishes a precedence order for who gets charged:
>
> 1. If there is a memcg associated with the page already, that memcg is
>charged. This happens during swapin.
>
> 2. If an explicit mm is passed, mm->memcg is charged. This happens
>during page faults, which can be triggered in remote VMs (eg gup).
>
> 3. Otherwise consult the current process context. If it has configured
>a current->active_memcg, use that. Otherwise, current->mm->memcg.

It's a bit more sophisticated than current->active_memcg. It has been
extended to work in interrupt context as well.
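
For reference, a rough sketch of how the active-memcg lookup handles interrupt
context after that extension (simplified; treat it as illustrative rather than
the exact code):

static __always_inline struct mem_cgroup *active_memcg(void)
{
        if (in_interrupt())
                return this_cpu_read(int_active_memcg);
        else
                return current->active_memcg;
}

set_active_memcg() stashes the memcg in current->active_memcg for task context
and in the per-CPU int_active_memcg for interrupt context, which is why the
charging path cannot simply look at current->active_memcg.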

>
> Previously, if a NULL mm was passed to mem_cgroup_try_charge (case 3) it

mem_cgroup_charge()

> would always charge the root cgroup. Now it looks up the current
> active_memcg first (falling back to charging the root cgroup if not
> set).
>
> Signed-off-by: Dan Schatzberg 
> Acked-by: Johannes Weiner 
> Acked-by: Tejun Heo 
> Acked-by: Chris Down 
> Reviewed-by: Shakeel Butt 
> ---
>  mm/filemap.c|  2 +-
>  mm/memcontrol.c | 14 +++---
>  mm/shmem.c  |  4 ++--
>  3 files changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 43700480d897..5135f330f05c 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -843,7 +843,7 @@ noinline int __add_to_page_cache_locked(struct page *page,
> page->index = offset;
>
> if (!huge) {
> -   error = mem_cgroup_charge(page, current->mm, gfp);
> +   error = mem_cgroup_charge(page, NULL, gfp);
> if (error)
> goto error;
> charged = true;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e064ac0d850a..9a1b23ed3412 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6690,7 +6690,8 @@ void mem_cgroup_calculate_protection(struct mem_cgroup 
> *root,
>   * @gfp_mask: reclaim mode
>   *
>   * Try to charge @page to the memcg that @mm belongs to, reclaiming
> - * pages according to @gfp_mask if necessary.
> + * pages according to @gfp_mask if necessary. if @mm is NULL, try to
> + * charge to the active memcg.
>   *
>   * Returns 0 on success. Otherwise, an error code is returned.
>   */
> @@ -6726,8 +6727,15 @@ int mem_cgroup_charge(struct page *page, struct 
> mm_struct *mm, gfp_t gfp_mask)
> rcu_read_unlock();
> }
>
> -   if (!memcg)
> -   memcg = get_mem_cgroup_from_mm(mm);
> +   if (!memcg) {
> +   if (!mm) {
> +   memcg = get_mem_cgroup_from_current();
> +   if (!memcg)
> +   memcg = get_mem_cgroup_from_mm(current->mm);
> +   } else {
> +   memcg = get_mem_cgroup_from_mm(mm);
> +   }
> +   }

You will need to rebase to the latest mm tree. This code has changed.


Re: [mm, net-next v2] mm: net: memcg accounting for TCP rx zerocopy

2021-03-15 Thread Shakeel Butt
On Mon, Mar 15, 2021 at 9:20 PM Arjun Roy  wrote:
>
[...]
> >
>
> Apologies for the spam - looks like I forgot to rebase the first time
> I sent this out.
>
> Actually, on a related note, it's not 100% clear to me whether this
> patch (which in its current form, applies to net-next) should instead
> be based on the mm branch - but the most recent (clean) checkout of mm
> fails to build for me so net-next for now.
>

It is due to the "mm: page-writeback: simplify memcg handling in
test_clear_page_writeback()" patch in the mm tree. You would need to
reintroduce the variant of lock_page_memcg() that returns the memcg and
make __unlock_page_memcg() non-static.
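
Concretely, the suggestion is to bring back the earlier form of that
interface, which looked roughly like this (sketch only), so the zerocopy
accounting can happen under the same page/memcg lock:

        struct mem_cgroup *memcg;

        memcg = lock_page_memcg(page);  /* returns the page's memcg (or NULL) */
        if (memcg) {
                /* ... account the zerocopy-mapped page against memcg ... */
        }
        __unlock_page_memcg(memcg);     /* takes the memcg rather than the page */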


Re: [PATCH] mm: memcontrol: switch to rstat fix

2021-03-15 Thread Shakeel Butt
On Mon, Mar 15, 2021 at 4:41 PM Johannes Weiner  wrote:
>
> Fix a sleep in atomic section problem: wb_writeback() takes a spinlock
> and calls wb_over_bg_thresh() -> mem_cgroup_wb_stats, but the regular
> rstat flushing function called from in there does lockbreaking and may
> sleep. Switch to the atomic variant, cgroup_rstat_irqsafe().
>
> To be consistent with other memcg flush calls, but without adding
> another memcg wrapper, inline and drop memcg_flush_vmstats() instead.
>
> Signed-off-by: Johannes Weiner 

Reviewed-by: Shakeel Butt 


Re: [PATCH] vmscan: retry without cache trim mode if nothing scanned

2021-03-14 Thread Shakeel Butt
On Thu, Mar 11, 2021 at 12:52 AM Huang, Ying  wrote:
>
> Hi, Butt,
>
> Shakeel Butt  writes:
>
> > On Wed, Mar 10, 2021 at 4:47 PM Huang, Ying  wrote:
> >>
> >> From: Huang Ying 
> >>
> >> In shrink_node(), to determine whether to enable cache trim mode, the
> >> LRU size is gotten via lruvec_page_state().  That gets the value from
> >> a per-CPU counter (mem_cgroup_per_node->lruvec_stat[]).  The error of
> >> the per-CPU counter from CPU local counting and the descendant memory
> >> cgroups may cause some issues.  We run into this in 0-Day performance
> >> test.
> >>
> >> 0-Day uses the RAM file system as root file system, so the number of
> >> the reclaimable file pages is very small.  In the swap testing, the
> >> inactive file LRU list will become almost empty soon.  But the size of
> >> the inactive file LRU list gotten from the per-CPU counter may keep a
> >> much larger value (say, 33, 50, etc.).  This will enable cache trim
> >> mode, but nothing can be scanned in fact.  The following pattern
> >> repeats for long time in the test,
> >>
> >> priorityinactive_file_size  cache_trim_mode
> >> 12  33  0
> >> 11  33  0
> >> ...
> >> 6   33  0
> >> 5   33  1
> >> ...
> >> 1   33  1
> >>
> >> That is, the cache_trim_mode will be enabled wrongly when the scan
> >> priority decreases to 5.  And the problem will not be recovered for
> >> long time.
> >>
> >> It's hard to get the more accurate size of the inactive file list
> >> without much more overhead.  And it's hard to estimate the error of
> >> the per-CPU counter too, because there may be many descendant memory
> >> cgroups.  But after the actual scanning, if nothing can be scanned
> >> with the cache trim mode, it should be wrong to enable the cache trim
> >> mode.  So we can retry with the cache trim mode disabled.  This patch
> >> implement this policy.
> >
> > Instead of playing with the already complicated heuristics, we should
> > improve the accuracy of the lruvec stats. Johannes already fixed the
> > memcg stats using rstat infrastructure and Tejun has suggestions on
> > how to use rstat infrastructure efficiently for lruvec stats at
> > https://lore.kernel.org/linux-mm/ycfgr300eriez...@slm.duckdns.org/.
>
> Thanks for your information!  It should be better if we can improve the
> accuracy of lruvec stats without much overhead.  But that may be not a
> easy task.
>
> If my understanding were correct, what Tejun suggested is to add a fast
> read interface to rstat to be used in hot path.  And its accuracy is
> similar as that of traditional per-CPU counter.  But if we can regularly
> update the lruvec rstat with something like vmstat_update(), that should
> be OK for the issue described in this patch.
>

This is also my understanding. Tejun, please correct us if we misunderstood you.

BTW, Johannes was working on an rstat-based lruvec stats patch. Johannes,
are you planning to work on the optimization Tejun has suggested?


Re: [External] Re: [PATCH v3 2/4] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page

2021-03-12 Thread Shakeel Butt
On Fri, Mar 12, 2021 at 3:07 PM Johannes Weiner  wrote:
>
> On Fri, Mar 12, 2021 at 02:42:45PM -0800, Shakeel Butt wrote:
> > Hi Johannes,
> >
> > On Fri, Mar 12, 2021 at 11:23 AM Johannes Weiner  wrote:
> > >
> > [...]
> > >
> > > Longer term we most likely need it there anyway. The issue you are
> > > describing in the cover letter - allocations pinning memcgs for a long
> > > time - it exists at a larger scale and is causing recurring problems
> > > in the real world: page cache doesn't get reclaimed for a long time,
> > > or is used by the second, third, fourth, ... instance of the same job
> > > that was restarted into a new cgroup every time. Unreclaimable dying
> > > cgroups pile up, waste memory, and make page reclaim very inefficient.
> > >
> >
> > For the scenario described above, do we really want to reparent the
> > page cache pages? Shouldn't we recharge the pages to the second,
> > third, fourth and so on, memcgs? My concern is that we will see a big
> > chunk of page cache pages charged to root and will only get reclaimed
> > on global pressure.
>
> Sorry, I'm proposing to reparent to the ancestor, not root. It's an
> optimization, not a change in user-visible behavior.
>
> As far as the user can tell, the pages already belong to the parent
> after deletion: they'll show up in the parent's stats, naturally, and
> they will get reclaimed as part of the parent being reclaimed.
>
> The dead cgroup doesn't even have its own limit anymore after
> .css_reset() has run. And we already physically reparent slab objects
> in memcg_reparent_objcgs() and memcg_drain_all_list_lrus().
>
> I'm just saying we should do the same thing for LRU pages.

I understand the proposal and I agree it makes total sense when a job
is recycling sub-jobs/sub-containers.

I was talking about the recycling of top-level cgroups. Though for that
to be an issue, I suppose the file system has to be shared between the
jobs on the system. I was wondering: if a page cache page reaches the
root memcg after multiple reparentings, should the next access cause
that page to be charged to the accessor?


Re: [External] Re: [PATCH v3 2/4] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page

2021-03-12 Thread Shakeel Butt
Hi Johannes,

On Fri, Mar 12, 2021 at 11:23 AM Johannes Weiner  wrote:
>
[...]
>
> Longer term we most likely need it there anyway. The issue you are
> describing in the cover letter - allocations pinning memcgs for a long
> time - it exists at a larger scale and is causing recurring problems
> in the real world: page cache doesn't get reclaimed for a long time,
> or is used by the second, third, fourth, ... instance of the same job
> that was restarted into a new cgroup every time. Unreclaimable dying
> cgroups pile up, waste memory, and make page reclaim very inefficient.
>

For the scenario described above, do we really want to reparent the
page cache pages? Shouldn't we recharge the pages to the second,
third, fourth, and so on, memcgs? My concern is that we will see a big
chunk of page cache pages charged to the root that will only get
reclaimed under global pressure.


Re: [PATCH v3 4/4] mm: memcontrol: move PageMemcgKmem to the scope of CONFIG_MEMCG_KMEM

2021-03-11 Thread Shakeel Butt
On Tue, Mar 9, 2021 at 2:09 AM Muchun Song  wrote:
>
> The page only can be marked as kmem when CONFIG_MEMCG_KMEM is enabled.
> So move PageMemcgKmem() to the scope of the CONFIG_MEMCG_KMEM.
>
> As a bonus, on !CONFIG_MEMCG_KMEM build some code can be compiled out.
>
> Signed-off-by: Muchun Song 

Reviewed-by: Shakeel Butt 


Re: [PATCH v3 2/4] mm: memcontrol: make page_memcg{_rcu} only applicable for non-kmem page

2021-03-11 Thread Shakeel Butt
On Tue, Mar 9, 2021 at 2:09 AM Muchun Song  wrote:
>
> We want to reuse the obj_cgroup APIs to charge the kmem pages.
> If we do that, we should store an object cgroup pointer to
> page->memcg_data for the kmem pages.
>
> Finally, page->memcg_data can have 3 different meanings.

replace 'can' with 'will'

Other than that I think Roman and Johannes have already given very
good feedback.


Re: [PATCH v3 1/4] mm: memcontrol: introduce obj_cgroup_{un}charge_pages

2021-03-11 Thread Shakeel Butt
On Tue, Mar 9, 2021 at 2:09 AM Muchun Song  wrote:
>
> We know that the unit of slab object charging is bytes, the unit of
> kmem page charging is PAGE_SIZE. If we want to reuse obj_cgroup APIs
> to charge the kmem pages, we should pass PAGE_SIZE (as third parameter)
> to obj_cgroup_charge(). Because the size is already PAGE_SIZE, we can
> skip touch the objcg stock. And obj_cgroup_{un}charge_pages() are
> introduced to charge in units of page level.
>
> In the later patch, we also can reuse those two helpers to charge or
> uncharge a number of kernel pages to a object cgroup. This is just
> a code movement without any functional changes.
>
> Signed-off-by: Muchun Song 
> Acked-by: Roman Gushchin 

Reviewed-by: Shakeel Butt 


Re: [PATCH] vmscan: retry without cache trim mode if nothing scanned

2021-03-10 Thread Shakeel Butt
On Wed, Mar 10, 2021 at 4:47 PM Huang, Ying  wrote:
>
> From: Huang Ying 
>
> In shrink_node(), to determine whether to enable cache trim mode, the
> LRU size is gotten via lruvec_page_state().  That gets the value from
> a per-CPU counter (mem_cgroup_per_node->lruvec_stat[]).  The error of
> the per-CPU counter from CPU local counting and the descendant memory
> cgroups may cause some issues.  We run into this in 0-Day performance
> test.
>
> 0-Day uses the RAM file system as root file system, so the number of
> the reclaimable file pages is very small.  In the swap testing, the
> inactive file LRU list will become almost empty soon.  But the size of
> the inactive file LRU list gotten from the per-CPU counter may keep a
> much larger value (say, 33, 50, etc.).  This will enable cache trim
> mode, but nothing can be scanned in fact.  The following pattern
> repeats for long time in the test,
>
> priorityinactive_file_size  cache_trim_mode
> 12  33  0
> 11  33  0
> ...
> 6   33  0
> 5   33  1
> ...
> 1   33  1
>
> That is, the cache_trim_mode will be enabled wrongly when the scan
> priority decreases to 5.  And the problem will not be recovered for
> long time.
>
> It's hard to get the more accurate size of the inactive file list
> without much more overhead.  And it's hard to estimate the error of
> the per-CPU counter too, because there may be many descendant memory
> cgroups.  But after the actual scanning, if nothing can be scanned
> with the cache trim mode, it should be wrong to enable the cache trim
> mode.  So we can retry with the cache trim mode disabled.  This patch
> implement this policy.

Instead of playing with the already complicated heuristics, we should
improve the accuracy of the lruvec stats. Johannes already fixed the
memcg stats using rstat infrastructure and Tejun has suggestions on
how to use rstat infrastructure efficiently for lruvec stats at
https://lore.kernel.org/linux-mm/ycfgr300eriez...@slm.duckdns.org/.


Re: [v9 PATCH 13/13] mm: vmscan: shrink deferred objects proportional to priority

2021-03-10 Thread Shakeel Butt
On Wed, Mar 10, 2021 at 1:41 PM Yang Shi  wrote:
>
> On Wed, Mar 10, 2021 at 1:08 PM Shakeel Butt  wrote:
> >
> > On Wed, Mar 10, 2021 at 10:54 AM Yang Shi  wrote:
> > >
> > > On Wed, Mar 10, 2021 at 10:24 AM Shakeel Butt  wrote:
> > > >
> > > > On Wed, Mar 10, 2021 at 9:46 AM Yang Shi  wrote:
> > > > >
> > > > > The number of deferred objects might get windup to an absurd number, 
> > > > > and it
> > > > > results in clamp of slab objects.  It is undesirable for sustaining 
> > > > > workingset.
> > > > >
> > > > > So shrink deferred objects proportional to priority and cap 
> > > > > nr_deferred to twice
> > > > > of cache items.
> > > > >
> > > > > The idea is borrowed from Dave Chinner's patch:
> > > > > https://lore.kernel.org/linux-xfs/20191031234618.15403-13-da...@fromorbit.com/
> > > > >
> > > > > Tested with kernel build and vfs metadata heavy workload in our 
> > > > > production
> > > > > environment, no regression is spotted so far.
> > > >
> > > > Did you run both of these workloads in the same cgroup or separate 
> > > > cgroups?
> > >
> > > Both are covered.
> > >
> >
> > Have you tried just this patch i.e. without the first 12 patches?
>
> No. It could be applied without the first 12 patches, but I didn't
> test this combination specifically since I don't think it would have
> any difference from with the first 12 patches. I tested running the
> test case under root memcg, it seems equal to w/o the first 12 patches
> and the only difference is where to get nr_deferred.

I am trying to measure the impact of this patch independently. One
point I can think of is global reclaim. The first 12 patches do not aim
to improve global reclaim but this patch will. I am just wondering what
the negatives of this patch, if any, would be.


Re: [v9 PATCH 13/13] mm: vmscan: shrink deferred objects proportional to priority

2021-03-10 Thread Shakeel Butt
On Wed, Mar 10, 2021 at 10:54 AM Yang Shi  wrote:
>
> On Wed, Mar 10, 2021 at 10:24 AM Shakeel Butt  wrote:
> >
> > On Wed, Mar 10, 2021 at 9:46 AM Yang Shi  wrote:
> > >
> > > The number of deferred objects might get windup to an absurd number, and 
> > > it
> > > results in clamp of slab objects.  It is undesirable for sustaining 
> > > workingset.
> > >
> > > So shrink deferred objects proportional to priority and cap nr_deferred 
> > > to twice
> > > of cache items.
> > >
> > > The idea is borrowed from Dave Chinner's patch:
> > > https://lore.kernel.org/linux-xfs/20191031234618.15403-13-da...@fromorbit.com/
> > >
> > > Tested with kernel build and vfs metadata heavy workload in our production
> > > environment, no regression is spotted so far.
> >
> > Did you run both of these workloads in the same cgroup or separate cgroups?
>
> Both are covered.
>

Have you tried just this patch i.e. without the first 12 patches?


Re: [v9 PATCH 13/13] mm: vmscan: shrink deferred objects proportional to priority

2021-03-10 Thread Shakeel Butt
On Wed, Mar 10, 2021 at 9:46 AM Yang Shi  wrote:
>
> The number of deferred objects might get windup to an absurd number, and it
> results in clamp of slab objects.  It is undesirable for sustaining 
> workingset.
>
> So shrink deferred objects proportional to priority and cap nr_deferred to 
> twice
> of cache items.
>
> The idea is borrowed from Dave Chinner's patch:
> https://lore.kernel.org/linux-xfs/20191031234618.15403-13-da...@fromorbit.com/
>
> Tested with kernel build and vfs metadata heavy workload in our production
> environment, no regression is spotted so far.

Did you run both of these workloads in the same cgroup or separate cgroups?

>
> Signed-off-by: Yang Shi 
> ---
>  mm/vmscan.c | 46 +++---
>  1 file changed, 11 insertions(+), 35 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9a2dfeaa79f4..6a0a91b23597 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -662,7 +662,6 @@ static unsigned long do_shrink_slab(struct shrink_control 
> *shrinkctl,
>  */
> nr = xchg_nr_deferred(shrinker, shrinkctl);
>
> -   total_scan = nr;
> if (shrinker->seeks) {
> delta = freeable >> priority;
> delta *= 4;
> @@ -676,37 +675,9 @@ static unsigned long do_shrink_slab(struct 
> shrink_control *shrinkctl,
> delta = freeable / 2;
> }
>
> +   total_scan = nr >> priority;
> total_scan += delta;
> -   if (total_scan < 0) {
> -   pr_err("shrink_slab: %pS negative objects to delete nr=%ld\n",
> -  shrinker->scan_objects, total_scan);
> -   total_scan = freeable;
> -   next_deferred = nr;
> -   } else
> -   next_deferred = total_scan;
> -
> -   /*
> -* We need to avoid excessive windup on filesystem shrinkers
> -* due to large numbers of GFP_NOFS allocations causing the
> -* shrinkers to return -1 all the time. This results in a large
> -* nr being built up so when a shrink that can do some work
> -* comes along it empties the entire cache due to nr >>>
> -* freeable. This is bad for sustaining a working set in
> -* memory.
> -*
> -* Hence only allow the shrinker to scan the entire cache when
> -* a large delta change is calculated directly.
> -*/
> -   if (delta < freeable / 4)
> -   total_scan = min(total_scan, freeable / 2);
> -
> -   /*
> -* Avoid risking looping forever due to too large nr value:
> -* never try to free more than twice the estimate number of
> -* freeable entries.
> -*/
> -   if (total_scan > freeable * 2)
> -   total_scan = freeable * 2;
> +   total_scan = min(total_scan, (2 * freeable));
>
> trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
>freeable, delta, total_scan, priority);
> @@ -745,10 +716,15 @@ static unsigned long do_shrink_slab(struct 
> shrink_control *shrinkctl,
> cond_resched();
> }
>
> -   if (next_deferred >= scanned)
> -   next_deferred -= scanned;
> -   else
> -   next_deferred = 0;
> +   /*
> +* The deferred work is increased by any new work (delta) that wasn't
> +* done, decreased by old deferred work that was done now.
> +*
> +* And it is capped to two times of the freeable items.
> +*/
> +   next_deferred = max_t(long, (nr + delta - scanned), 0);
> +   next_deferred = min(next_deferred, (2 * freeable));
> +
> /*
>  * move the unused scan count back into the shrinker in a
>  * manner that handles concurrent updates.
> --
> 2.26.2
>
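
For illustration, a rough worked example of the new arithmetic with made-up
numbers (assuming the default shrinker->seeks, so delta is roughly
2 * (freeable >> priority); none of these values come from the patch or a
trace):

        freeable = 100000; nr = 500000; priority = 12;      /* DEF_PRIORITY */

        delta      = (freeable >> priority) * 2;             /* ~48 */
        total_scan = (nr >> priority) + delta;               /* 122 + 48 = 170 */
        /* old code: total_scan = nr + delta = 500048, clamped to freeable/2 = 50000 */

        scanned       = total_scan;                          /* assume all 170 scanned */
        next_deferred = max(nr + delta - scanned, 0);        /* 499878 */
        next_deferred = min(next_deferred, 2 * freeable);    /* capped at 200000 */

So a single reclaim pass no longer tries to empty the cache just because
nr_deferred has grown huge, and the leftover deferral is bounded by twice the
freeable count.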


Re: [v8 PATCH 12/13] mm: memcontrol: reparent nr_deferred when memcg offline

2021-03-08 Thread Shakeel Butt
On Tue, Feb 16, 2021 at 4:13 PM Yang Shi  wrote:
>
> Now shrinker's nr_deferred is per memcg for memcg aware shrinkers, add to 
> parent's
> corresponding nr_deferred when memcg offline.
>
> Acked-by: Vlastimil Babka 
> Acked-by: Kirill Tkhai 
> Acked-by: Roman Gushchin 
> Signed-off-by: Yang Shi 

Reviewed-by: Shakeel Butt 


Re: [v8 PATCH 11/13] mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers

2021-03-08 Thread Shakeel Butt
On Tue, Feb 16, 2021 at 4:13 PM Yang Shi  wrote:
>
> Now nr_deferred is available on per memcg level for memcg aware shrinkers, so 
> don't need
> allocate shrinker->nr_deferred for such shrinkers anymore.
>
> The prealloc_memcg_shrinker() would return -ENOSYS if !CONFIG_MEMCG or memcg 
> is disabled
> by kernel command line, then shrinker's SHRINKER_MEMCG_AWARE flag would be 
> cleared.
> This makes the implementation of this patch simpler.
>
> Acked-by: Vlastimil Babka 
> Reviewed-by: Kirill Tkhai 
> Acked-by: Roman Gushchin 
> Signed-off-by: Yang Shi 

Reviewed-by: Shakeel Butt 


Re: [v8 PATCH 05/13] mm: vmscan: use kvfree_rcu instead of call_rcu

2021-03-08 Thread Shakeel Butt
On Mon, Mar 8, 2021 at 12:22 PM Yang Shi  wrote:
>
> On Mon, Mar 8, 2021 at 8:49 AM Roman Gushchin  wrote:
> >
> > On Sun, Mar 07, 2021 at 10:13:04PM -0800, Shakeel Butt wrote:
> > > On Tue, Feb 16, 2021 at 4:13 PM Yang Shi  wrote:
> > > >
> > > > Using kvfree_rcu() to free the old shrinker_maps instead of call_rcu().
> > > > We don't have to define a dedicated callback for call_rcu() anymore.
> > > >
> > > > Signed-off-by: Yang Shi 
> > > > ---
> > > >  mm/vmscan.c | 7 +--
> > > >  1 file changed, 1 insertion(+), 6 deletions(-)
> > > >
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 2e753c2516fa..c2a309acd86b 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -192,11 +192,6 @@ static inline int shrinker_map_size(int nr_items)
> > > > return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned 
> > > > long));
> > > >  }
> > > >
> > > > -static void free_shrinker_map_rcu(struct rcu_head *head)
> > > > -{
> > > > -   kvfree(container_of(head, struct memcg_shrinker_map, rcu));
> > > > -}
> > > > -
> > > >  static int expand_one_shrinker_map(struct mem_cgroup *memcg,
> > > >int size, int old_size)
> > > >  {
> > > > @@ -219,7 +214,7 @@ static int expand_one_shrinker_map(struct 
> > > > mem_cgroup *memcg,
> > > > memset((void *)new->map + old_size, 0, size - old_size);
> > > >
> > > > rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, 
> > > > new);
> > > > -   call_rcu(>rcu, free_shrinker_map_rcu);
> > > > +   kvfree_rcu(old);
> > >
> > > Please use kvfree_rcu(old, rcu) instead of kvfree_rcu(old). The single
> > > param can call synchronize_rcu().
> >
> > Oh, I didn't know about this difference. Thank you for noticing!
>
> BTW, I think I could keep you and Kirill's acked-by with this change
> (using two params form kvfree_rcu) since the change seems trivial.

Once you change, you can add:

Reviewed-by: Shakeel Butt 


Re: [v8 PATCH 09/13] mm: vmscan: add per memcg shrinker nr_deferred

2021-03-08 Thread Shakeel Butt
On Mon, Mar 8, 2021 at 12:30 PM Yang Shi  wrote:
>
> On Mon, Mar 8, 2021 at 11:12 AM Shakeel Butt  wrote:
> >
> > On Tue, Feb 16, 2021 at 4:13 PM Yang Shi  wrote:
> > >
> > > Currently the number of deferred objects are per shrinker, but some 
> > > slabs, for example,
> > > vfs inode/dentry cache are per memcg, this would result in poor isolation 
> > > among memcgs.
> > >
> > > The deferred objects typically are generated by __GFP_NOFS allocations, 
> > > one memcg with
> > > excessive __GFP_NOFS allocations may blow up deferred objects, then other 
> > > innocent memcgs
> > > may suffer from over shrink, excessive reclaim latency, etc.
> > >
> > > For example, two workloads run in memcgA and memcgB respectively, 
> > > workload in B is vfs
> > > heavy workload.  Workload in A generates excessive deferred objects, then 
> > > B's vfs cache
> > > might be hit heavily (drop half of caches) by B's limit reclaim or global 
> > > reclaim.
> > >
> > > We observed this hit in our production environment which was running vfs 
> > > heavy workload
> > > shown as the below tracing log:
> > >
> > > <...>-409454 [016]  28286961.747146: mm_shrink_slab_start: 
> > > super_cache_scan+0x0/0x1a0 9a83046f3458:
> > > nid: 1 objects to shrink 3641681686040 gfp_flags 
> > > GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> > > cache items 246404277 delta 31345 total_scan 123202138
> > > <...>-409454 [022]  28287105.928018: mm_shrink_slab_end: 
> > > super_cache_scan+0x0/0x1a0 9a83046f3458:
> > > nid: 1 unused scan count 3641681686040 new scan count 3641798379189 
> > > total_scan 602
> > > last shrinker return val 123186855
> > >
> > > The vfs cache and page cache ratio was 10:1 on this machine, and half of 
> > > caches were dropped.
> > > This also resulted in significant amount of page caches were dropped due 
> > > to inodes eviction.
> > >
> > > Make nr_deferred per memcg for memcg aware shrinkers would solve the 
> > > unfairness and bring
> > > better isolation.
> > >
> > > When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the 
> > > shrinker's nr_deferred
> > > would be used.  And non memcg aware shrinkers use shrinker's nr_deferred 
> > > all the time.
> > >
> > > Signed-off-by: Yang Shi 
> > > ---
> > >  include/linux/memcontrol.h |  7 +++--
> > >  mm/vmscan.c| 60 ++
> > >  2 files changed, 46 insertions(+), 21 deletions(-)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index 4c9253896e25..c457fc7bc631 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -93,12 +93,13 @@ struct lruvec_stat {
> > >  };
> > >
> > >  /*
> > > - * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
> > > - * which have elements charged to this memcg.
> > > + * Bitmap and deferred work of shrinker::id corresponding to memcg-aware
> > > + * shrinkers, which have elements charged to this memcg.
> > >   */
> > >  struct shrinker_info {
> > > struct rcu_head rcu;
> > > -   unsigned long map[];
> > > +   atomic_long_t *nr_deferred;
> > > +   unsigned long *map;
> > >  };
> > >
> > >  /*
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index a1047ea60ecf..fcb399e18fc3 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -187,11 +187,17 @@ static DECLARE_RWSEM(shrinker_rwsem);
> > >  #ifdef CONFIG_MEMCG
> > >  static int shrinker_nr_max;
> > >
> > > +/* The shrinker_info is expanded in a batch of BITS_PER_LONG */
> > >  static inline int shrinker_map_size(int nr_items)
> > >  {
> > > return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned 
> > > long));
> > >  }
> > >
> > > +static inline int shrinker_defer_size(int nr_items)
> > > +{
> > > +   return (round_up(nr_items, BITS_PER_LONG) * 
> > > sizeof(atomic_long_t));
> > > +}
> > > +
> > >  static struct shrinker_info *shrinker_info_protected(struct mem_cgroup 
> > > *memcg,
> > >  int nid)
> > >

Re: [v8 PATCH 10/13] mm: vmscan: use per memcg nr_deferred of shrinker

2021-03-08 Thread Shakeel Butt
On Tue, Feb 16, 2021 at 4:13 PM Yang Shi  wrote:
>
> Use per memcg's nr_deferred for memcg aware shrinkers.  The shrinker's 
> nr_deferred
> will be used in the following cases:
> 1. Non memcg aware shrinkers
> 2. !CONFIG_MEMCG
> 3. memcg is disabled by boot parameter
>
> Signed-off-by: Yang Shi 

Reviewed-by: Shakeel Butt 


Re: [v8 PATCH 09/13] mm: vmscan: add per memcg shrinker nr_deferred

2021-03-08 Thread Shakeel Butt
On Tue, Feb 16, 2021 at 4:13 PM Yang Shi  wrote:
>
> Currently the number of deferred objects are per shrinker, but some slabs, 
> for example,
> vfs inode/dentry cache are per memcg, this would result in poor isolation 
> among memcgs.
>
> The deferred objects typically are generated by __GFP_NOFS allocations, one 
> memcg with
> excessive __GFP_NOFS allocations may blow up deferred objects, then other 
> innocent memcgs
> may suffer from over shrink, excessive reclaim latency, etc.
>
> For example, two workloads run in memcgA and memcgB respectively, workload in 
> B is vfs
> heavy workload.  Workload in A generates excessive deferred objects, then B's 
> vfs cache
> might be hit heavily (drop half of caches) by B's limit reclaim or global 
> reclaim.
>
> We observed this hit in our production environment which was running vfs 
> heavy workload
> shown as the below tracing log:
>
> <...>-409454 [016]  28286961.747146: mm_shrink_slab_start: 
> super_cache_scan+0x0/0x1a0 9a83046f3458:
> nid: 1 objects to shrink 3641681686040 gfp_flags 
> GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721
> cache items 246404277 delta 31345 total_scan 123202138
> <...>-409454 [022]  28287105.928018: mm_shrink_slab_end: 
> super_cache_scan+0x0/0x1a0 9a83046f3458:
> nid: 1 unused scan count 3641681686040 new scan count 3641798379189 
> total_scan 602
> last shrinker return val 123186855
>
> The vfs cache and page cache ratio was 10:1 on this machine, and half of 
> caches were dropped.
> This also resulted in significant amount of page caches were dropped due to 
> inodes eviction.
>
> Make nr_deferred per memcg for memcg aware shrinkers would solve the 
> unfairness and bring
> better isolation.
>
> When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's 
> nr_deferred
> would be used.  And non memcg aware shrinkers use shrinker's nr_deferred all 
> the time.
>
> Signed-off-by: Yang Shi 
> ---
>  include/linux/memcontrol.h |  7 +++--
>  mm/vmscan.c| 60 ++
>  2 files changed, 46 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 4c9253896e25..c457fc7bc631 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -93,12 +93,13 @@ struct lruvec_stat {
>  };
>
>  /*
> - * Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
> - * which have elements charged to this memcg.
> + * Bitmap and deferred work of shrinker::id corresponding to memcg-aware
> + * shrinkers, which have elements charged to this memcg.
>   */
>  struct shrinker_info {
> struct rcu_head rcu;
> -   unsigned long map[];
> +   atomic_long_t *nr_deferred;
> +   unsigned long *map;
>  };
>
>  /*
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a1047ea60ecf..fcb399e18fc3 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -187,11 +187,17 @@ static DECLARE_RWSEM(shrinker_rwsem);
>  #ifdef CONFIG_MEMCG
>  static int shrinker_nr_max;
>
> +/* The shrinker_info is expanded in a batch of BITS_PER_LONG */
>  static inline int shrinker_map_size(int nr_items)
>  {
> return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned 
> long));
>  }
>
> +static inline int shrinker_defer_size(int nr_items)
> +{
> +   return (round_up(nr_items, BITS_PER_LONG) * sizeof(atomic_long_t));
> +}
> +
>  static struct shrinker_info *shrinker_info_protected(struct mem_cgroup 
> *memcg,
>  int nid)
>  {
> @@ -200,10 +206,12 @@ static struct shrinker_info 
> *shrinker_info_protected(struct mem_cgroup *memcg,
>  }
>
>  static int expand_one_shrinker_info(struct mem_cgroup *memcg,
> -   int size, int old_size)
> +   int map_size, int defer_size,
> +   int old_map_size, int old_defer_size)
>  {
> struct shrinker_info *new, *old;
> int nid;
> +   int size = map_size + defer_size;
>
> for_each_node(nid) {
> old = shrinker_info_protected(memcg, nid);
> @@ -215,9 +223,16 @@ static int expand_one_shrinker_info(struct mem_cgroup 
> *memcg,
> if (!new)
> return -ENOMEM;
>
> -   /* Set all old bits, clear all new bits */
> -   memset(new->map, (int)0xff, old_size);
> -   memset((void *)new->map + old_size, 0, size - old_size);
> +   new->nr_deferred = (atomic_long_t *)(new + 1);
> +   new->map = (void *)new->nr_deferred + defer_size;
> +
> +   /* map: set all old bits, clear all new bits */
> +   memset(new->map, (int)0xff, old_map_size);
> +   memset((void *)new->map + old_map_size, 0, map_size - 
> old_map_size);
> +   /* nr_deferred: copy old values, clear all new values */
> +   memcpy(new->nr_deferred, old->nr_deferred, 

Re: [v8 PATCH 08/13] mm: vmscan: use a new flag to indicate shrinker is registered

2021-03-08 Thread Shakeel Butt
On Tue, Feb 16, 2021 at 4:13 PM Yang Shi  wrote:
>
> Currently registered shrinker is indicated by non-NULL shrinker->nr_deferred.
> This approach is fine with nr_deferred at the shrinker level, but the 
> following
> patches will move MEMCG_AWARE shrinkers' nr_deferred to memcg level, so their
> shrinker->nr_deferred would always be NULL.  This would prevent the shrinkers
> from unregistering correctly.
>
> Remove SHRINKER_REGISTERING since we could check if shrinker is registered
> successfully by the new flag.
>
> Acked-by: Kirill Tkhai 
> Acked-by: Vlastimil Babka 
> Signed-off-by: Yang Shi 

Reviewed-by: Shakeel Butt 


Re: [v8 PATCH 07/13] mm: vmscan: add shrinker_info_protected() helper

2021-03-07 Thread Shakeel Butt
On Tue, Feb 16, 2021 at 4:13 PM Yang Shi  wrote:
>
> The shrinker_info is dereferenced in a couple of places via 
> rcu_dereference_protected
> with different calling conventions, for example, using mem_cgroup_nodeinfo 
> helper
> or dereferencing memcg->nodeinfo[nid]->shrinker_info.  And the later patch
> will add more dereference places.
>
> So extract the dereference into a helper to make the code more readable.  No
> functional change.
>
> Acked-by: Roman Gushchin 
> Acked-by: Kirill Tkhai 
> Acked-by: Vlastimil Babka 
> Signed-off-by: Yang Shi 

Reviewed-by: Shakeel Butt 


Re: [v8 PATCH 06/13] mm: memcontrol: rename shrinker_map to shrinker_info

2021-03-07 Thread Shakeel Butt
On Tue, Feb 16, 2021 at 4:13 PM Yang Shi  wrote:
>
> The following patch is going to add nr_deferred into shrinker_map, the change 
> will
> make shrinker_map not only include map anymore, so rename it to 
> "memcg_shrinker_info".
> And this should make the patch adding nr_deferred cleaner and readable and 
> make
> review easier.  Also remove the "memcg_" prefix.
>
> Acked-by: Vlastimil Babka 
> Acked-by: Kirill Tkhai 
> Acked-by: Roman Gushchin 
> Signed-off-by: Yang Shi 

Reviewed-by: Shakeel Butt 


Re: [v8 PATCH 04/13] mm: vmscan: remove memcg_shrinker_map_size

2021-03-07 Thread Shakeel Butt
On Tue, Feb 16, 2021 at 4:13 PM Yang Shi  wrote:
>
> Both memcg_shrinker_map_size and shrinker_nr_max is maintained, but actually 
> the
> map size can be calculated via shrinker_nr_max, so it seems unnecessary to 
> keep both.
> Remove memcg_shrinker_map_size since shrinker_nr_max is also used by 
> iterating the
> bit map.
>
> Acked-by: Kirill Tkhai 
> Acked-by: Roman Gushchin 
> Acked-by: Vlastimil Babka 
> Signed-off-by: Yang Shi 

Reviewed-by: Shakeel Butt 


Re: [v8 PATCH 03/13] mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation

2021-03-07 Thread Shakeel Butt
On Tue, Feb 16, 2021 at 4:13 PM Yang Shi  wrote:
>
> Since memcg_shrinker_map_size just can be changed under holding shrinker_rwsem
> exclusively, the read side can be protected by holding read lock, so it sounds
> superfluous to have a dedicated mutex.
>
> Kirill Tkhai suggested use write lock since:
>
>   * We want the assignment to shrinker_maps is visible for 
> shrink_slab_memcg().
>   * The rcu_dereference_protected() dereferrencing in shrink_slab_memcg(), but
> in case of we use READ lock in alloc_shrinker_maps(), the dereferrencing
> is not actually protected.
>   * READ lock makes alloc_shrinker_info() racy against memory allocation fail.
> alloc_shrinker_info()->free_shrinker_info() may free memory right after
> shrink_slab_memcg() dereferenced it. You may say
> shrink_slab_memcg()->mem_cgroup_online() protects us from it? Yes, sure,
> but this is not the thing we want to remember in the future, since this
> spreads modularity.
>
> And a test with heavy paging workload didn't show write lock makes things 
> worse.
>
> Acked-by: Vlastimil Babka 
> Acked-by: Kirill Tkhai 
> Acked-by: Roman Gushchin 
> Signed-off-by: Yang Shi 

Reviewed-by: Shakeel Butt 


Re: [v8 PATCH 05/13] mm: vmscan: use kvfree_rcu instead of call_rcu

2021-03-07 Thread Shakeel Butt
On Tue, Feb 16, 2021 at 4:13 PM Yang Shi  wrote:
>
> Using kvfree_rcu() to free the old shrinker_maps instead of call_rcu().
> We don't have to define a dedicated callback for call_rcu() anymore.
>
> Signed-off-by: Yang Shi 
> ---
>  mm/vmscan.c | 7 +--
>  1 file changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2e753c2516fa..c2a309acd86b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -192,11 +192,6 @@ static inline int shrinker_map_size(int nr_items)
> return (DIV_ROUND_UP(nr_items, BITS_PER_LONG) * sizeof(unsigned 
> long));
>  }
>
> -static void free_shrinker_map_rcu(struct rcu_head *head)
> -{
> -   kvfree(container_of(head, struct memcg_shrinker_map, rcu));
> -}
> -
>  static int expand_one_shrinker_map(struct mem_cgroup *memcg,
>int size, int old_size)
>  {
> @@ -219,7 +214,7 @@ static int expand_one_shrinker_map(struct mem_cgroup 
> *memcg,
> memset((void *)new->map + old_size, 0, size - old_size);
>
> rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
> -   call_rcu(>rcu, free_shrinker_map_rcu);
> +   kvfree_rcu(old);

Please use kvfree_rcu(old, rcu) instead of kvfree_rcu(old). The
single-param form can end up calling synchronize_rcu().
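
To spell out the difference (a sketch, not part of the patch):

        /* two-argument form: uses the rcu_head embedded in the object,
         * queues the free from RCU callback context and does not sleep */
        kvfree_rcu(old, rcu);

        /* single-argument ("headless") form: has no rcu_head to use, so
         * it may fall back to synchronize_rcu() and therefore may sleep */
        kvfree_rcu(old);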


Re: [PATCH v4] memcg: charge before adding to swapcache on swapin

2021-03-05 Thread Shakeel Butt
On Fri, Mar 5, 2021 at 1:26 PM Shakeel Butt  wrote:
>
> Currently the kernel adds the page, allocated for swapin, to the
> swapcache before charging the page. This is fine but now we want a
> per-memcg swapcache stat which is essential for folks who wants to
> transparently migrate from cgroup v1's memsw to cgroup v2's memory and
> swap counters. In addition charging a page before exposing it to other
> parts of the kernel is a step in the right direction.
>
> To correctly maintain the per-memcg swapcache stat, this patch has
> adopted to charge the page before adding it to swapcache. One
> challenge in this option is the failure case of add_to_swap_cache() on
> which we need to undo the mem_cgroup_charge(). Specifically undoing
> mem_cgroup_uncharge_swap() is not simple.
>
> To resolve the issue, this patch introduces transaction like interface
> to charge a page for swapin. The function mem_cgroup_charge_swapin_page()
> initiates the charging of the page and mem_cgroup_finish_swapin_page()
> completes the charging process. So, the kernel starts the charging
> process of the page for swapin with mem_cgroup_charge_swapin_page(),
> adds the page to the swapcache and on success completes the charging
> process with mem_cgroup_finish_swapin_page().

And of course I forgot to update the commit message.

Andrew, please replace the third paragraph with the following para:

To resolve the issue, this patch decouples the charging for swapin pages from
mem_cgroup_charge(). Two new functions are introduced,
mem_cgroup_swapin_charge_page() for just charging the swapin page and
mem_cgroup_swapin_uncharge_swap() for uncharging the swap slot once the
page has been successfully added to the swapcache.


[PATCH v4] memcg: charge before adding to swapcache on swapin

2021-03-05 Thread Shakeel Butt
Currently the kernel adds the page, allocated for swapin, to the
swapcache before charging the page. This is fine but now we want a
per-memcg swapcache stat which is essential for folks who wants to
transparently migrate from cgroup v1's memsw to cgroup v2's memory and
swap counters. In addition charging a page before exposing it to other
parts of the kernel is a step in the right direction.

To correctly maintain the per-memcg swapcache stat, this patch has
adopted to charge the page before adding it to swapcache. One
challenge in this option is the failure case of add_to_swap_cache() on
which we need to undo the mem_cgroup_charge(). Specifically undoing
mem_cgroup_uncharge_swap() is not simple.

To resolve the issue, this patch introduces transaction like interface
to charge a page for swapin. The function mem_cgroup_charge_swapin_page()
initiates the charging of the page and mem_cgroup_finish_swapin_page()
completes the charging process. So, the kernel starts the charging
process of the page for swapin with mem_cgroup_charge_swapin_page(),
adds the page to the swapcache and on success completes the charging
process with mem_cgroup_finish_swapin_page().

Signed-off-by: Shakeel Butt 
Acked-by: Johannes Weiner 
Acked-by: Hugh Dickins 
---
Changes since v3:
- Updated the comments on introduced functions (Johannes)
- Rename the funcations to be more clear (Hugh & Johannes)

Changes since v2:
- fixed build for !CONFIG_MEMCG
- simplified failure path from add_to_swap_cache()

Changes since v1:
- Removes __GFP_NOFAIL and introduced transaction interface for charging
  (suggested by Johannes)
- Updated the commit message

 include/linux/memcontrol.h |  13 +
 mm/memcontrol.c| 117 +++--
 mm/memory.c|  14 ++---
 mm/swap_state.c|  13 ++---
 4 files changed, 97 insertions(+), 60 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e6dc793d587d..f522b09f2df7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -596,6 +596,9 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup 
*memcg)
 }
 
 int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask);
+int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm,
+ gfp_t gfp, swp_entry_t entry);
+void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
 
 void mem_cgroup_uncharge(struct page *page);
 void mem_cgroup_uncharge_list(struct list_head *page_list);
@@ -1141,6 +1144,16 @@ static inline int mem_cgroup_charge(struct page *page, 
struct mm_struct *mm,
return 0;
 }
 
+static inline int mem_cgroup_swapin_charge_page(struct page *page,
+   struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
+{
+   return 0;
+}
+
+static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
+{
+}
+
 static inline void mem_cgroup_uncharge(struct page *page)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2db2aeac8a9e..21c38c0b6e5a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6690,6 +6690,27 @@ void mem_cgroup_calculate_protection(struct mem_cgroup 
*root,
atomic_long_read(>memory.children_low_usage)));
 }
 
+static int __mem_cgroup_charge(struct page *page, struct mem_cgroup *memcg,
+  gfp_t gfp)
+{
+   unsigned int nr_pages = thp_nr_pages(page);
+   int ret;
+
+   ret = try_charge(memcg, gfp, nr_pages);
+   if (ret)
+   goto out;
+
+   css_get(>css);
+   commit_charge(page, memcg);
+
+   local_irq_disable();
+   mem_cgroup_charge_statistics(memcg, page, nr_pages);
+   memcg_check_events(memcg, page);
+   local_irq_enable();
+out:
+   return ret;
+}
+
 /**
  * mem_cgroup_charge - charge a newly allocated page to a cgroup
  * @page: page to charge
@@ -6699,55 +6720,71 @@ void mem_cgroup_calculate_protection(struct mem_cgroup 
*root,
  * Try to charge @page to the memcg that @mm belongs to, reclaiming
  * pages according to @gfp_mask if necessary.
  *
+ * Do not use this for pages allocated for swapin.
+ *
  * Returns 0 on success. Otherwise, an error code is returned.
  */
 int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
 {
-   unsigned int nr_pages = thp_nr_pages(page);
-   struct mem_cgroup *memcg = NULL;
-   int ret = 0;
+   struct mem_cgroup *memcg;
+   int ret;
 
if (mem_cgroup_disabled())
-   goto out;
+   return 0;
 
-   if (PageSwapCache(page)) {
-   swp_entry_t ent = { .val = page_private(page), };
-   unsigned short id;
+   memcg = get_mem_cgroup_from_mm(mm);
+   ret = __mem_cgroup_charge(page, memcg, gfp_mask);
+   css_put(>css);
 
-   /*
-* Every swap fault against a single page tries to charge the
-* page, bail as earl
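
For context, a rough sketch of the calling sequence the commit message
describes for the swapcache path (simplified; the real call sites are in the
mm/swap_state.c and mm/memory.c hunks of this patch, and error handling is
omitted):

        /* sketch only: helper names taken from this patch */
        page = alloc_page_vma(gfp_mask, vma, addr);

        /* start by charging the page to the mm, before it becomes visible */
        if (mem_cgroup_swapin_charge_page(page, mm, gfp_mask, entry))
                goto fail_put_page;

        if (add_to_swap_cache(page, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
                goto fail_put_page;     /* freeing the charged page uncharges it */

        /* success: now drop the charge held against the swap slot */
        mem_cgroup_swapin_uncharge_swap(entry);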

Re: [PATCH v3] memcg: charge before adding to swapcache on swapin

2021-03-05 Thread Shakeel Butt
On Fri, Mar 5, 2021 at 8:25 AM Johannes Weiner  wrote:
>
> On Fri, Mar 05, 2021 at 12:06:31AM -0800, Hugh Dickins wrote:
> > On Wed, 3 Mar 2021, Shakeel Butt wrote:
> >
> > > Currently the kernel adds the page, allocated for swapin, to the
> > > swapcache before charging the page. This is fine but now we want a
> > > per-memcg swapcache stat which is essential for folks who wants to
> > > transparently migrate from cgroup v1's memsw to cgroup v2's memory and
> > > swap counters. In addition charging a page before exposing it to other
> > > parts of the kernel is a step in the right direction.
> > >
> > > To correctly maintain the per-memcg swapcache stat, this patch has
> > > adopted to charge the page before adding it to swapcache. One
> > > challenge in this option is the failure case of add_to_swap_cache() on
> > > which we need to undo the mem_cgroup_charge(). Specifically undoing
> > > mem_cgroup_uncharge_swap() is not simple.
> > >
> > > To resolve the issue, this patch introduces transaction like interface
> > > to charge a page for swapin. The function mem_cgroup_charge_swapin_page()
> > > initiates the charging of the page and mem_cgroup_finish_swapin_page()
> > > completes the charging process. So, the kernel starts the charging
> > > process of the page for swapin with mem_cgroup_charge_swapin_page(),
> > > adds the page to the swapcache and on success completes the charging
> > > process with mem_cgroup_finish_swapin_page().
> > >
> > > Signed-off-by: Shakeel Butt 
> >
> > Quite apart from helping with the stat you want, what you've ended
> > up with here is a nice cleanup in several different ways (and I'm
> > glad Johannes talked you out of __GFP_NOFAIL: much better like this).
> > I'll say
> >
> > Acked-by: Hugh Dickins 
> >
> > but I am quite unhappy with the name mem_cgroup_finish_swapin_page():
> > it doesn't finish the swapin, it doesn't finish the page, and I'm
> > not persuaded by your paragraph above that there's any "transaction"
> > here (if there were, I'd suggest "commit" instead of "finish"'; and
> > I'd get worried by the css_put before it's called - but no, that's
> > fine, it's independent).
> >
> > How about complementing mem_cgroup_charge_swapin_page() with
> > mem_cgroup_uncharge_swapin_swap()?  I think that describes well
> > what it does, at least in the do_memsw_account() case, and I hope
> > we can overlook that it does nothing at all in the other case.
>
> Yes, that's better. The sequence is still somewhat mysterious for
> people not overly familiar with memcg swap internals, but it's much
> clearer for people who are.
>
> I mildly prefer swapping the swapin bit:
>
> mem_cgroup_swapin_charge_page()
> mem_cgroup_swapin_uncharge_swap()
>
> But either way works for me.
>

I will do a coin toss to select one.

> > And it really doesn't need a page argument: both places it's called
> > have just allocated an order-0 page, there's no chance of a THP here;
> > but you might have some idea of future expansion, or matching
> > put_swap_page() - I won't object if you prefer to pass in the page.
>
> Agreed.

Will remove the page parameter.

BTW, I will keep the mem_cgroup_disabled() check in the uncharge-swap
function, as I am thinking of removing the "swapaccount=0" functionality.


Re: [PATCH v3 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-03-05 Thread Shakeel Butt
On Fri, Mar 5, 2021 at 9:37 AM David Hildenbrand  wrote:
>
> On 04.03.21 01:03, Shakeel Butt wrote:
> > On Wed, Mar 3, 2021 at 3:34 PM Suren Baghdasaryan  wrote:
> >>
> >> On Wed, Mar 3, 2021 at 3:17 PM Shakeel Butt  wrote:
> >>>
> >>> On Wed, Mar 3, 2021 at 10:58 AM Suren Baghdasaryan  
> >>> wrote:
> >>>>
> >>>> process_madvise currently requires ptrace attach capability.
> >>>> PTRACE_MODE_ATTACH gives one process complete control over another
> >>>> process. It effectively removes the security boundary between the
> >>>> two processes (in one direction). Granting ptrace attach capability
> >>>> even to a system process is considered dangerous since it creates an
> >>>> attack surface. This severely limits the usage of this API.
> >>>> The operations process_madvise can perform do not affect the correctness
> >>>> of the operation of the target process; they only affect where the data
> >>>> is physically located (and therefore, how fast it can be accessed).
> >>>> What we want is the ability for one process to influence another process
> >>>> in order to optimize performance across the entire system while leaving
> >>>> the security boundary intact.
> >>>> Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
> >>>> and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata
> >>>> and CAP_SYS_NICE for influencing process performance.
> >>>>
> >>>> Cc: sta...@vger.kernel.org # 5.10+
> >>>> Signed-off-by: Suren Baghdasaryan 
> >>>> Reviewed-by: Kees Cook 
> >>>> Acked-by: Minchan Kim 
> >>>> Acked-by: David Rientjes 
> >>>> ---
> >>>> changes in v3
> >>>> - Added Reviewed-by: Kees Cook 
> >>>> - Created man page for process_madvise per Andrew's request: 
> >>>> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=a144f458bad476a3358e3a45023789cb7bb9f993
> >>>> - cc'ed sta...@vger.kernel.org # 5.10+ per Andrew's request
> >>>> - cc'ed linux-security-mod...@vger.kernel.org per James Morris's request
> >>>>
> >>>>   mm/madvise.c | 13 -
> >>>>   1 file changed, 12 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/mm/madvise.c b/mm/madvise.c
> >>>> index df692d2e35d4..01fef79ac761 100644
> >>>> --- a/mm/madvise.c
> >>>> +++ b/mm/madvise.c
> >>>> @@ -1198,12 +1198,22 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, 
> >>>> const struct iovec __user *, vec,
> >>>>  goto release_task;
> >>>>  }
> >>>>
> >>>> -   mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> >>>> +   /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */
> >>>> +   mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
> >>>>  if (IS_ERR_OR_NULL(mm)) {
> >>>>  ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> >>>>  goto release_task;
> >>>>  }
> >>>>
> >>>> +   /*
> >>>> +* Require CAP_SYS_NICE for influencing process performance. 
> >>>> Note that
> >>>> +* only non-destructive hints are currently supported.
> >>>
> >>> How is non-destructive defined? Is MADV_DONTNEED non-destructive?
> >>
> >> Non-destructive in this context means the data is not lost and can be
> >> recovered. I follow the logic described in
> >> https://lwn.net/Articles/794704/ where Minchan was introducing
> >> MADV_COLD and MADV_PAGEOUT as non-destructive versions of MADV_FREE
> >> and MADV_DONTNEED. Following that logic, MADV_FREE and MADV_DONTNEED
> >> would be considered destructive hints.
> >> Note that process_madvise_behavior_valid() allows only MADV_COLD and
> >> MADV_PAGEOUT at the moment, which are both non-destructive.
> >>
> >
> > There is a plan to support MADV_DONTNEED for this syscall. Do we need
> > to change these access checks again with that support?
>
> Eh, I absolutely don't think letting another process discard memory in
> another process' address space is a good idea. The target process can
> observe that easily and might even run into real issues.
>
> What's the use case?
>

Userspace oom reaper. Please look at
https://lore.kernel.org/linux-api/20201014183943.ga1489...@google.com/T/
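
To make the use case concrete, here is a minimal userspace sketch of what a
reaper-style daemon does once the checks are relaxed to PTRACE_MODE_READ
plus CAP_SYS_NICE: open a pidfd for the target and ask the kernel to page
out one of its ranges. The syscall numbers are the x86-64 ones, the
MADV_PAGEOUT fallback definition is for older headers, and the PID and
address range are placeholders taken from the command line.

/*
 * Sketch: page out [addr, addr+len) of another process with
 * process_madvise(MADV_PAGEOUT). Needs CAP_SYS_NICE and must pass the
 * PTRACE_MODE_READ check against the target process.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_process_madvise
#define __NR_process_madvise 440
#endif
#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

int main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s <pid> <addr> <len>\n", argv[0]);
		return 1;
	}

	pid_t pid = (pid_t)atoi(argv[1]);
	struct iovec iov = {
		.iov_base = (void *)strtoull(argv[2], NULL, 0),
		.iov_len = (size_t)strtoull(argv[3], NULL, 0),
	};

	int pidfd = syscall(__NR_pidfd_open, pid, 0);
	if (pidfd < 0) {
		perror("pidfd_open");
		return 1;
	}

	ssize_t ret = syscall(__NR_process_madvise, pidfd, &iov, 1,
			      MADV_PAGEOUT, 0);
	if (ret < 0)
		perror("process_madvise");
	else
		printf("advised %zd bytes\n", ret);

	close(pidfd);
	return ret < 0;
}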


Re: [PATCH v3] memcg: charge before adding to swapcache on swapin

2021-03-05 Thread Shakeel Butt
On Fri, Mar 5, 2021 at 8:25 AM Johannes Weiner  wrote:
>
[...]
> I'd also rename cgroup_memory_noswap to cgroup_swapaccount - to match
> the commandline and (hopefully) make a bit clearer what it effects.

Do we really need to keep supporting "swapaccount=0"? Is the swap
page_counter really a performance issue for systems with memcg and
swap? To me, deprecating "swapaccount=0" would simplify the already
complicated code.


Re: [PATCH v2 2/2] mm/memcg: set memcg when split page

2021-03-04 Thread Shakeel Butt
On Wed, Mar 3, 2021 at 11:57 PM Zhou Guanghui  wrote:
>
> As described in the split_page function comment, for the non-compound
> high order page, the sub-pages must be freed individually. If the
> memcg of the first page is valid, the tail pages cannot be uncharged
> when they are freed.
>
> For example, when alloc_pages_exact is used to allocate 1MB of contiguous
> physical memory, 2MB is charged (kmemcg is enabled and __GFP_ACCOUNT is
> set). When make_alloc_exact frees the unused 1MB and free_pages_exact
> frees the requested 1MB, only 4KB (one page) is actually uncharged.
>
> Therefore, the memcg of the tail pages needs to be set when the page is split.
>
> Signed-off-by: Zhou Guanghui 

Reviewed-by: Shakeel Butt 
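
To spell out the failure mode from the changelog, here is a sketch of the
triggering pattern; BUF_SIZE and the two wrapper names are chosen purely
for illustration, and it assumes the caller runs in a cgroup with kernel
memory accounting enabled.

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/sizes.h>

/*
 * alloc_pages_exact() rounds the request up to the next page order (here
 * 1.5MB -> a 2MB, order-9 block), charges the whole block to the current
 * memcg because of __GFP_ACCOUNT, then split_page()s it and frees the
 * unused tail pages. Without split_page_memcg(), only the head page keeps
 * the memcg binding, so freeing the tail pages - both at allocation time
 * and in free_pages_exact() below - never gives their charge back.
 */
#define BUF_SIZE	(SZ_1M + SZ_512K)

static void *alloc_charged_buffer(void)
{
	return alloc_pages_exact(BUF_SIZE, GFP_KERNEL | __GFP_ACCOUNT);
}

static void free_charged_buffer(void *buf)
{
	/* Pre-fix: only 4KB of the 2MB charge is returned here. */
	free_pages_exact(buf, BUF_SIZE);
}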


Re: [PATCH v2 1/2] mm/memcg: rename mem_cgroup_split_huge_fixup to split_page_memcg

2021-03-04 Thread Shakeel Butt
On Wed, Mar 3, 2021 at 11:55 PM Zhou Guanghui  wrote:
>
> Rename mem_cgroup_split_huge_fixup to split_page_memcg and explicitly
> pass in page number argument.
>
> In this way, the interface name is more common and can be used by
> potential users. In addition, the complete info (memcg pointer and flags)
> of the memcg needs to be set on the tail pages.
>
> Signed-off-by: Zhou Guanghui 

Reviewed-by: Shakeel Butt 
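
For readers not following the series, the renamed helper ends up looking
roughly like the sketch below, based on the description above and on the
accompanying 2/2 patch; the exact code that lands may differ. Callers now
pass the number of sub-pages explicitly, e.g. split_page() passes 1 << order
and the THP split path passes the number of sub-pages of the huge page.

/*
 * Copy the head page's memcg binding to every tail page so each sub-page
 * can be uncharged individually when it is freed later.
 */
void split_page_memcg(struct page *head, unsigned int nr)
{
	struct mem_cgroup *memcg = page_memcg(head);
	int i;

	if (mem_cgroup_disabled() || !memcg)
		return;

	for (i = 1; i < nr; i++)
		head[i].memcg_data = head->memcg_data;
	css_get_many(&memcg->css, nr - 1);
}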


Re: [PATCH v3] memcg: charge before adding to swapcache on swapin

2021-03-04 Thread Shakeel Butt
On Thu, Mar 4, 2021 at 7:48 AM Johannes Weiner  wrote:
>
> On Wed, Mar 03, 2021 at 05:42:29PM -0800, Shakeel Butt wrote:
> > Currently the kernel adds the page, allocated for swapin, to the
> > swapcache before charging the page. This is fine but now we want a
> > per-memcg swapcache stat which is essential for folks who want to
> > transparently migrate from cgroup v1's memsw to cgroup v2's memory and
> > swap counters. In addition charging a page before exposing it to other
> > parts of the kernel is a step in the right direction.
> >
> > To correctly maintain the per-memcg swapcache stat, this patch charges
> > the page before adding it to the swapcache. One challenge with this
> > approach is the failure case of add_to_swap_cache(), in which case we
> > need to undo the mem_cgroup_charge(). Specifically, undoing
> > mem_cgroup_uncharge_swap() is not simple.
> >
> > To resolve the issue, this patch introduces a transaction-like interface
> > to charge a page for swapin. The function mem_cgroup_charge_swapin_page()
> > initiates the charging of the page and mem_cgroup_finish_swapin_page()
> > completes the charging process. So, the kernel starts the charging
> > process of the page for swapin with mem_cgroup_charge_swapin_page(),
> > adds the page to the swapcache and on success completes the charging
> > process with mem_cgroup_finish_swapin_page().
> >
> > Signed-off-by: Shakeel Butt 
>
> The patch looks good to me, I have just a minor documentation nit
> below. But with that addressed, please add:
>
> Acked-by: Johannes Weiner 

Thanks.

>
[...]
>
> It's possible somebody later needs to change things around in the
> swapin path and it's not immediately obvious when exactly these two
> functions need to be called in the swapin sequence.
>
> Maybe add here and above that charge_swapin_page needs to be called
> before we try adding the page to the swapcache, and finish_swapin_page
> needs to be called when swapcache insertion has been successful?

I will update the comments and send v4 after a day or so to see if
someone else has any comments.


[PATCH v3] memcg: charge before adding to swapcache on swapin

2021-03-03 Thread Shakeel Butt
Currently the kernel adds the page, allocated for swapin, to the
swapcache before charging the page. This is fine but now we want a
per-memcg swapcache stat which is essential for folks who want to
transparently migrate from cgroup v1's memsw to cgroup v2's memory and
swap counters. In addition charging a page before exposing it to other
parts of the kernel is a step in the right direction.

To correctly maintain the per-memcg swapcache stat, this patch charges
the page before adding it to the swapcache. One challenge with this
approach is the failure case of add_to_swap_cache(), in which case we
need to undo the mem_cgroup_charge(). Specifically, undoing
mem_cgroup_uncharge_swap() is not simple.

To resolve the issue, this patch introduces a transaction-like interface
to charge a page for swapin. The function mem_cgroup_charge_swapin_page()
initiates the charging of the page and mem_cgroup_finish_swapin_page()
completes the charging process. So, the kernel starts the charging
process of the page for swapin with mem_cgroup_charge_swapin_page(),
adds the page to the swapcache and on success completes the charging
process with mem_cgroup_finish_swapin_page().

Signed-off-by: Shakeel Butt 
---
Changes since v2:
- fixed build for !CONFIG_MEMCG
- simplified failure path from add_to_swap_cache()

Changes since v1:
- Removes __GFP_NOFAIL and introduced transaction interface for charging
  (suggested by Johannes)
- Updated the commit message

 include/linux/memcontrol.h |  14 +
 mm/memcontrol.c| 116 +++--
 mm/memory.c|  14 ++---
 mm/swap_state.c|  13 ++---
 4 files changed, 97 insertions(+), 60 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e6dc793d587d..d31e6dca397f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -596,6 +596,9 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup 
*memcg)
 }
 
 int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask);
+int mem_cgroup_charge_swapin_page(struct page *page, struct mm_struct *mm,
+ gfp_t gfp, swp_entry_t entry);
+void mem_cgroup_finish_swapin_page(struct page *page, swp_entry_t entry);
 
 void mem_cgroup_uncharge(struct page *page);
 void mem_cgroup_uncharge_list(struct list_head *page_list);
@@ -1141,6 +1144,17 @@ static inline int mem_cgroup_charge(struct page *page, 
struct mm_struct *mm,
return 0;
 }
 
+static inline int mem_cgroup_charge_swapin_page(struct page *page,
+   struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
+{
+   return 0;
+}
+
+static inline void mem_cgroup_finish_swapin_page(struct page *page,
+swp_entry_t entry)
+{
+}
+
 static inline void mem_cgroup_uncharge(struct page *page)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2db2aeac8a9e..226b7bccb44c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6690,6 +6690,27 @@ void mem_cgroup_calculate_protection(struct mem_cgroup 
*root,
atomic_long_read(&parent->memory.children_low_usage)));
 }
 
+static int __mem_cgroup_charge(struct page *page, struct mem_cgroup *memcg,
+  gfp_t gfp)
+{
+   unsigned int nr_pages = thp_nr_pages(page);
+   int ret;
+
+   ret = try_charge(memcg, gfp, nr_pages);
+   if (ret)
+   goto out;
+
+   css_get(&memcg->css);
+   commit_charge(page, memcg);
+
+   local_irq_disable();
+   mem_cgroup_charge_statistics(memcg, page, nr_pages);
+   memcg_check_events(memcg, page);
+   local_irq_enable();
+out:
+   return ret;
+}
+
 /**
  * mem_cgroup_charge - charge a newly allocated page to a cgroup
  * @page: page to charge
@@ -6699,55 +6720,70 @@ void mem_cgroup_calculate_protection(struct mem_cgroup 
*root,
  * Try to charge @page to the memcg that @mm belongs to, reclaiming
  * pages according to @gfp_mask if necessary.
  *
+ * Do not use this for pages allocated for swapin.
+ *
  * Returns 0 on success. Otherwise, an error code is returned.
  */
 int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
 {
-   unsigned int nr_pages = thp_nr_pages(page);
-   struct mem_cgroup *memcg = NULL;
-   int ret = 0;
+   struct mem_cgroup *memcg;
+   int ret;
 
if (mem_cgroup_disabled())
-   goto out;
+   return 0;
 
-   if (PageSwapCache(page)) {
-   swp_entry_t ent = { .val = page_private(page), };
-   unsigned short id;
+   memcg = get_mem_cgroup_from_mm(mm);
+   ret = __mem_cgroup_charge(page, memcg, gfp_mask);
+   css_put(&memcg->css);
 
-   /*
-* Every swap fault against a single page tries to charge the
-* page, bail as early as possible.  shmem_unuse() encounters
-* already charged pages, too.  page and memc

Re: [PATCH v3 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-03-03 Thread Shakeel Butt
On Wed, Mar 3, 2021 at 3:34 PM Suren Baghdasaryan  wrote:
>
> On Wed, Mar 3, 2021 at 3:17 PM Shakeel Butt  wrote:
> >
> > On Wed, Mar 3, 2021 at 10:58 AM Suren Baghdasaryan  
> > wrote:
> > >
> > > process_madvise currently requires ptrace attach capability.
> > > PTRACE_MODE_ATTACH gives one process complete control over another
> > > process. It effectively removes the security boundary between the
> > > two processes (in one direction). Granting ptrace attach capability
> > > even to a system process is considered dangerous since it creates an
> > > attack surface. This severely limits the usage of this API.
> > > The operations process_madvise can perform do not affect the correctness
> > > of the operation of the target process; they only affect where the data
> > > is physically located (and therefore, how fast it can be accessed).
> > > What we want is the ability for one process to influence another process
> > > in order to optimize performance across the entire system while leaving
> > > the security boundary intact.
> > > Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
> > > and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata
> > > and CAP_SYS_NICE for influencing process performance.
> > >
> > > Cc: sta...@vger.kernel.org # 5.10+
> > > Signed-off-by: Suren Baghdasaryan 
> > > Reviewed-by: Kees Cook 
> > > Acked-by: Minchan Kim 
> > > Acked-by: David Rientjes 
> > > ---
> > > changes in v3
> > > - Added Reviewed-by: Kees Cook 
> > > - Created man page for process_madvise per Andrew's request: 
> > > https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=a144f458bad476a3358e3a45023789cb7bb9f993
> > > - cc'ed sta...@vger.kernel.org # 5.10+ per Andrew's request
> > > - cc'ed linux-security-mod...@vger.kernel.org per James Morris's request
> > >
> > >  mm/madvise.c | 13 -
> > >  1 file changed, 12 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index df692d2e35d4..01fef79ac761 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -1198,12 +1198,22 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, 
> > > const struct iovec __user *, vec,
> > > goto release_task;
> > > }
> > >
> > > -   mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> > > +   /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */
> > > +   mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
> > > if (IS_ERR_OR_NULL(mm)) {
> > > ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> > > goto release_task;
> > > }
> > >
> > > +   /*
> > > +* Require CAP_SYS_NICE for influencing process performance. Note 
> > > that
> > > +* only non-destructive hints are currently supported.
> >
> > How is non-destructive defined? Is MADV_DONTNEED non-destructive?
>
> Non-destructive in this context means the data is not lost and can be
> recovered. I follow the logic described in
> https://lwn.net/Articles/794704/ where Minchan was introducing
> MADV_COLD and MADV_PAGEOUT as non-destructive versions of MADV_FREE
> and MADV_DONTNEED. Following that logic, MADV_FREE and MADV_DONTNEED
> would be considered destructive hints.
> Note that process_madvise_behavior_valid() allows only MADV_COLD and
> MADV_PAGEOUT at the moment, which are both non-destructive.
>

There is a plan to support MADV_DONTNEED for this syscall. Do we need
to change these access checks again with that support?
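
For context, the allow-list being referred to is roughly the following at
the time of this discussion (sketch of mm/madvise.c; MADV_DONTNEED support
would mean extending this switch and, per the question above, possibly
revisiting the permission checks as well):

/*
 * Only non-destructive reclaim hints are currently accepted for
 * process_madvise(); everything else is rejected with -EINVAL.
 */
static bool process_madvise_behavior_valid(int behavior)
{
	switch (behavior) {
	case MADV_COLD:
	case MADV_PAGEOUT:
		return true;
	default:
		return false;
	}
}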


Re: [PATCH v3 1/1] mm/madvise: replace ptrace attach requirement for process_madvise

2021-03-03 Thread Shakeel Butt
On Wed, Mar 3, 2021 at 10:58 AM Suren Baghdasaryan  wrote:
>
> process_madvise currently requires ptrace attach capability.
> PTRACE_MODE_ATTACH gives one process complete control over another
> process. It effectively removes the security boundary between the
> two processes (in one direction). Granting ptrace attach capability
> even to a system process is considered dangerous since it creates an
> attack surface. This severely limits the usage of this API.
> The operations process_madvise can perform do not affect the correctness
> of the operation of the target process; they only affect where the data
> is physically located (and therefore, how fast it can be accessed).
> What we want is the ability for one process to influence another process
> in order to optimize performance across the entire system while leaving
> the security boundary intact.
> Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ
> and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata
> and CAP_SYS_NICE for influencing process performance.
>
> Cc: sta...@vger.kernel.org # 5.10+
> Signed-off-by: Suren Baghdasaryan 
> Reviewed-by: Kees Cook 
> Acked-by: Minchan Kim 
> Acked-by: David Rientjes 
> ---
> changes in v3
> - Added Reviewed-by: Kees Cook 
> - Created man page for process_madvise per Andrew's request: 
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/commit/?id=a144f458bad476a3358e3a45023789cb7bb9f993
> - cc'ed sta...@vger.kernel.org # 5.10+ per Andrew's request
> - cc'ed linux-security-mod...@vger.kernel.org per James Morris's request
>
>  mm/madvise.c | 13 -
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index df692d2e35d4..01fef79ac761 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1198,12 +1198,22 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const 
> struct iovec __user *, vec,
> goto release_task;
> }
>
> -   mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> +   /* Require PTRACE_MODE_READ to avoid leaking ASLR metadata. */
> +   mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
> if (IS_ERR_OR_NULL(mm)) {
> ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> goto release_task;
> }
>
> +   /*
> +* Require CAP_SYS_NICE for influencing process performance. Note that
> +* only non-destructive hints are currently supported.

How is non-destructive defined? Is MADV_DONTNEED non-destructive?

> +*/
> +   if (!capable(CAP_SYS_NICE)) {
> +   ret = -EPERM;
> +   goto release_mm;
> +   }
> +
> total_len = iov_iter_count(&iter);
>
> while (iov_iter_count(&iter)) {
> @@ -1218,6 +1228,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const 
> struct iovec __user *, vec,
> if (ret == 0)
> ret = total_len - iov_iter_count(&iter);
>
> +release_mm:
> mmput(mm);
>  release_task:
> put_task_struct(task);
> --
> 2.30.1.766.gb4fecdf3b7-goog
>

