Re: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func.

2020-10-07 Thread Michal Hocko
On Wed 07-10-20 00:25:29, Uladzislau Rezki wrote:
> On Mon, Oct 05, 2020 at 05:41:00PM +0200, Michal Hocko wrote:
> > On Mon 05-10-20 17:08:01, Uladzislau Rezki wrote:
> > > On Fri, Oct 02, 2020 at 11:05:07AM +0200, Michal Hocko wrote:
> > > > On Fri 02-10-20 09:50:14, Mel Gorman wrote:
> > > > > On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote:
> > > > > > On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote:
> > > > > > > > 
> > > > > > > > No, I meant going back to idea of new gfp flag, but adjust the
> > > > > > > > implementation in the allocator (different from what you posted
> > > > > > > > in previous version) so that it only looks at the flag after it
> > > > > > > > tries to allocate from pcplist and finds out it's empty. So, no
> > > > > > > > inventing of new page allocator entry points or checks such as
> > > > > > > > the one you wrote above, but adding the new gfp flag in a way
> > > > > > > > that it doesn't affect existing fast paths.
> > > > > > > >
> > > > > > > OK. Now i see. Please have a look below at the patch, so we fully
> > > > > > > understand each other. If that is something that is close to your
> > > > > > > view or not:
> > > > > > > 
> > > > > > > 
> > > > > > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > > > > > index c603237e006c..7e613560a502 100644
> > > > > > > --- a/include/linux/gfp.h
> > > > > > > +++ b/include/linux/gfp.h
> > > > > > > @@ -39,8 +39,9 @@ struct vm_area_struct;
> > > > > > >  #define ___GFP_HARDWALL		0x100000u
> > > > > > >  #define ___GFP_THISNODE		0x200000u
> > > > > > >  #define ___GFP_ACCOUNT		0x400000u
> > > > > > > +#define ___GFP_NO_LOCKS		0x800000u
> > > > > > 
> > > > > > Even if a new gfp flag gains a sufficient traction and support I am
> > > > > > _strongly_ opposed against consuming another flag for that. Bit space
> > > > > > is limited. 
> > > > > 
> > > > > That is definitely true. I'm not happy with the GFP flag at all, the
> > > > > comment is at best a damage limiting move. It still would be better
> > > > > for a memory pool to be reserved and sized for critical allocations.
> > > > 
> > > > Completely agreed. The only existing usecase is so special cased that a
> > > > dedicated pool is not only easier to maintain but it should be also much
> > > > better tuned for the specific workload. Something not really feasible
> > > > with the allocator.
> > > > 
> > > > > > Besides that we certainly do not want to allow craziness like
> > > > > > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we?
> > > > > 
> > > > > That would deserve to be taken to a dumpster and set on fire. The flag
> > > > > combination could be checked in the allocator but the allocator path
> > > > > fast paths are bad enough already.
> > > > 
> > > > If a new allocation/gfp mode is absolutely necessary then I believe that
> > > > the most reasonable way forward would be
> > > > #define GFP_NO_LOCK ((__force gfp_t)0)
> > > > 
> > > Agree. Even though i see that some code should be adjusted for it. There
> > > are a few users of the __get_free_page(0); So, need to double check it:
> > 
> > Yes, I believe I have pointed that out in the previous discussion.
> > 
> OK. I spent more time on it. A passed gfp_mask can be adjusted on the entry,
> that adjustment depends on the gfp_allowed_mask. It can be changed in
> run-time.
> 
> For example during early boot it excludes: __GFP_RECLAIM|__GFP_IO|__GFP_FS
> flags, what is GFP_KERNEL. So, GFP_KERNEL is adjusted on entry and becomes 0
> during early boot phase.
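
A quick user-space sketch of that collision (the bit values below are
illustrative placeholders rather than the real gfp.h ones; only the
relationship between them matters):

#include <stdio.h>

/* Stand-in flags: GFP_KERNEL is composed solely of bits that the
 * early-boot mask strips, so masking it yields 0 - the very value
 * proposed for GFP_NO_LOCK above. */
#define __GFP_RECLAIM	(1u << 0)
#define __GFP_IO	(1u << 1)
#define __GFP_FS	(1u << 2)
#define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_NO_LOCK	0u
#define GFP_BOOT_MASK	(~(__GFP_RECLAIM | __GFP_IO | __GFP_FS))

int main(void)
{
	/* the early-boot adjustment: gfp_mask &= gfp_allowed_mask */
	unsigned int masked = GFP_KERNEL & GFP_BOOT_MASK;

	printf("masked GFP_KERNEL = %#x, GFP_NO_LOCK = %#x\n",
	       masked, GFP_NO_LOCK);
	/* both print 0: a masked GFP_KERNEL request would be
	 * indistinguishable from an explicit GFP_NO_LOCK request */
	return 0;
}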

Honestly I am not sure how much GFP_BOOT_MASK is still needed. The
remaining user of gfp_allowed_mask

Re: [RFC][PATCH 00/12] mm: tweak page cache migration

2020-10-07 Thread Michal Hocko
Am I the only one missing patch 1-5? lore.k.o doesn't seem to link them
under this message id either.

On Tue 06-10-20 13:51:03, Dave Hansen wrote:
> First of all, I think this little slice of code is a bit
> under-documented.  Perhaps this will help clarify things.
> 
> I'm pretty confident the page_count() check in the first
> patch is right, which is why I removed it outright.  The
> xas_load() check is a bit murkier, so I just left a
> warning in for it.
> 
> Cc: Nicholas Piggin 
> Cc: Andrew Morton 
> Cc: Matthew Wilcox (Oracle) 
> Cc: Yang Shi 
> Cc: linux...@kvack.org
> Cc: linux-kernel@vger.kernel.org

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 3/5] mm/page_alloc: move pages to tail in move_to_free_list()

2020-10-06 Thread Michal Hocko
On Mon 05-10-20 14:15:32, David Hildenbrand wrote:
> Whenever we move pages between freelists via move_to_free_list()/
> move_freepages_block(), we don't actually touch the pages:
> 1. Page isolation doesn't actually touch the pages, it simply isolates
>pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
>When undoing isolation, we move the pages back to the target list.
> 2. Page stealing (steal_suitable_fallback()) moves free pages directly
>between lists without touching them.
> 3. reserve_highatomic_pageblock()/unreserve_highatomic_pageblock() moves
>free pages directly between freelists without touching them.
> 
> We already place pages to the tail of the freelists when undoing isolation
> via __putback_isolated_page(), let's do it in any case (e.g., if order <=
> pageblock_order) and document the behavior. To simplify, let's move the
> pages to the tail for all move_to_free_list()/move_freepages_block() users.
> 
> In 2., the target list is empty, so there should be no change. In 3.,
> we might observe a change, however, highatomic is more concerned about
> allocations succeeding than cache hotness - if we ever realize this
> change degrades a workload, we can special-case this instance and add a
> proper comment.
> 
> This change results in all pages getting onlined via online_pages() to
> be placed to the tail of the freelist.
> 
> Reviewed-by: Oscar Salvador 
> Acked-by: Pankaj Gupta 
> Reviewed-by: Wei Yang 
> Cc: Andrew Morton 
> Cc: Alexander Duyck 
> Cc: Mel Gorman 
> Cc: Michal Hocko 
> Cc: Dave Hansen 
> Cc: Vlastimil Babka 
> Cc: Wei Yang 
> Cc: Oscar Salvador 
> Cc: Mike Rapoport 
> Cc: Scott Cheloha 
> Cc: Michael Ellerman 
> Signed-off-by: David Hildenbrand 

Much simpler!
Acked-by: Michal Hocko 

Thanks!

> ---
>  mm/page_alloc.c | 10 +++---
>  mm/page_isolation.c |  5 +
>  2 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index df5ff0cd6df1..b187e46cf640 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -901,13 +901,17 @@ static inline void add_to_free_list_tail(struct page 
> *page, struct zone *zone,
>   area->nr_free++;
>  }
>  
> -/* Used for pages which are on another list */
> +/*
> + * Used for pages which are on another list. Move the pages to the tail
> + * of the list - so the moved pages won't immediately be considered for
> + * allocation again (e.g., optimization for memory onlining).
> + */
>  static inline void move_to_free_list(struct page *page, struct zone *zone,
>unsigned int order, int migratetype)
>  {
>   struct free_area *area = &zone->free_area[order];
>  
> - list_move(&page->lru, &area->free_list[migratetype]);
> + list_move_tail(&page->lru, &area->free_list[migratetype]);
>  }
>  
>  static inline void del_page_from_free_list(struct page *page, struct zone 
> *zone,
> @@ -2340,7 +2344,7 @@ static inline struct page 
> *__rmqueue_cma_fallback(struct zone *zone,
>  #endif
>  
>  /*
> - * Move the free pages in a range to the free lists of the requested type.
> + * Move the free pages in a range to the freelist tail of the requested type.
>   * Note that start_page and end_pages are not aligned on a pageblock
>   * boundary. If alignment is required, use move_freepages_block()
>   */
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index abfe26ad59fd..83692b937784 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -106,6 +106,11 @@ static void unset_migratetype_isolate(struct page *page, 
> unsigned migratetype)
>* If we isolate freepage with more than pageblock_order, there
>* should be no freepage in the range, so we could avoid costly
>* pageblock scanning for freepage moving.
> +  *
> +  * We didn't actually touch any of the isolated pages, so place them
> +  * to the tail of the freelist. This is an optimization for memory
> +  * onlining - just onlined memory won't immediately be considered for
> +  * allocation.
>*/
>   if (!isolated_page) {
>   nr_pages = move_freepages_block(zone, page, migratetype, NULL);
> -- 
> 2.26.2

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64

2020-10-06 Thread Michal Hocko
 does not hold for hugetlbfs.

In short it provides a guarantee.

Does the above clarify it a bit?


[1] this is not entirely true though because there is a non-trivial
admin interface around THP. Mostly because they turned out to be too
transparent and many people do care about internal fragmentation,
allocation latency, locality (a small page on a local node or a large one on
a slightly further one?) or simply follow a cargo cult - just have a look at
how many admin guides recommend disabling THPs. We got seriously burned
by 2MB THP because of the way they were enforced on users.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 9/9] mm, page_alloc: optionally disable pcplists during page isolation

2020-10-06 Thread Michal Hocko
On Tue 06-10-20 10:40:23, David Hildenbrand wrote:
> On 06.10.20 10:34, Michal Hocko wrote:
> > On Tue 22-09-20 16:37:12, Vlastimil Babka wrote:
> >> Page isolation can race with process freeing pages to pcplists in a way 
> >> that
> >> a page from isolated pageblock can end up on pcplist. This can be fixed by
> >> repeated draining of pcplists, as done by patch "mm/memory_hotplug: drain
> >> per-cpu pages again during memory offline" in [1].
> >>
> >> David and Michal would prefer that this race was closed in a way that 
> >> callers
> >> of page isolation who need stronger guarantees don't need to repeatedly 
> >> drain.
> >> David suggested disabling pcplists usage completely during page isolation,
> >> instead of repeatedly draining them.
> >>
> >> To achieve this without adding special cases in alloc/free fastpath, we 
> >> can use
> >> the same approach as boot pagesets - when pcp->high is 0, any pcplist 
> >> addition
> >> will be immediately flushed.
> >>
> >> The race can thus be closed by setting pcp->high to 0 and draining pcplists
> >> once, before calling start_isolate_page_range(). The draining will 
> >> serialize
> >> after processes that already disabled interrupts and read the old value of
> >> pcp->high in free_unref_page_commit(), and processes that have not yet 
> >> disabled
> >> interrupts, will observe pcp->high == 0 when they are rescheduled, and skip
> >> pcplists. This guarantees no stray pages on pcplists in zones where 
> >> isolation
> >> happens.
> >>
> >> This patch thus adds zone_pcplist_disable() and zone_pcplist_enable() 
> >> functions
> >> that page isolation users can call before start_isolate_page_range() and 
> >> after
> >> unisolating (or offlining) the isolated pages. A new zone->pcplist_disabled
> >> atomic variable makes sure we disable only pcplists once and don't enable
> >> them prematurely in case there are multiple users in parallel.
> >>
> >> We however have to avoid external updates to high and batch by taking
> >> pcp_batch_high_lock. To allow multiple isolations in parallel, change this 
> >> lock
> >> from mutex to rwsem.
> > 
> > The overall idea makes sense. I just suspect you are overcomplicating
> > the implementation a bit. Is there any reason that we cannot start with
> > a really dumb implementation first? The only user of this functionality
> > is the memory offlining and that is already strongly synchronized
> > (mem_hotplug_begin) so a lot of trickery can be dropped here. Should we
> > find a new user later on we can make the implementation finer grained
> > but now it will not serve any purpose. So can we simply update pcp->high
> > and drain all pcp in the given zone and wait for all remote pcp draining
> > in zone_pcplist_disable and revert all that in zone_pcplist_enable?
> > We can stick to the existing pcp_batch_high_lock.
> > 
> > What do you think?
> > 
> 
> My two cents, we might want to make use of this in some cases of
> alloc_contig_range() soon ("try hard mode"). So I'd love to see a
> synchronized mechanism. However, that can be factored out into a
> separate patch, so this patch gets significantly simpler.

Exactly. And the incremental patch can be added along with the a-c-r try
harder mode.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 9/9] mm, page_alloc: optionally disable pcplists during page isolation

2020-10-06 Thread Michal Hocko
On Tue 22-09-20 16:37:12, Vlastimil Babka wrote:
> Page isolation can race with process freeing pages to pcplists in a way that
> a page from isolated pageblock can end up on pcplist. This can be fixed by
> repeated draining of pcplists, as done by patch "mm/memory_hotplug: drain
> per-cpu pages again during memory offline" in [1].
> 
> David and Michal would prefer that this race was closed in a way that callers
> of page isolation who need stronger guarantees don't need to repeatedly drain.
> David suggested disabling pcplists usage completely during page isolation,
> instead of repeatedly draining them.
> 
> To achieve this without adding special cases in alloc/free fastpath, we can 
> use
> the same approach as boot pagesets - when pcp->high is 0, any pcplist addition
> will be immediately flushed.
> 
> The race can thus be closed by setting pcp->high to 0 and draining pcplists
> once, before calling start_isolate_page_range(). The draining will serialize
> after processes that already disabled interrupts and read the old value of
> pcp->high in free_unref_page_commit(), and processes that have not yet 
> disabled
> interrupts, will observe pcp->high == 0 when they are rescheduled, and skip
> pcplists. This guarantees no stray pages on pcplists in zones where isolation
> happens.
> 
> This patch thus adds zone_pcplist_disable() and zone_pcplist_enable() 
> functions
> that page isolation users can call before start_isolate_page_range() and after
> unisolating (or offlining) the isolated pages. A new zone->pcplist_disabled
> atomic variable makes sure we disable only pcplists once and don't enable
> them prematurely in case there are multiple users in parallel.
> 
> We however have to avoid external updates to high and batch by taking
> pcp_batch_high_lock. To allow multiple isolations in parallel, change this 
> lock
> from mutex to rwsem.

The overall idea makes sense. I just suspect you are overcomplicating
the implementation a bit. Is there any reason that we cannot start with
a really dumb implementation first? The only user of this functionality
is the memory offlining and that is already strongly synchronized
(mem_hotplug_begin) so a lot of trickery can be dropped here. Should we
find a new user later on we can make the implementation finer grained
but now it will not serve any purpose. So can we simply update pcp->high
and drain all pcp in the given zone and wait for all remote pcp draining
in zone_pcplist_disable and revert all that in zone_pcplist_enable?
We can stick to the existing pcp_batch_high_lock.

What do you think?

Btw. zone_pcplist_{enable,disable} would benefit from documentation
explaining the purpose and guarantees, and the side effect on
performance. I would even stick this interface into internal.h.
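
For illustration, a rough sketch of that "really dumb" variant - not the
posted patch, and the helper names and signatures
(zone_update_pageset_high_and_batch, zone_set_pageset_high_and_batch,
__drain_all_pages) are assumptions based on this series:

/* Serialize on the existing pcp_batch_high_lock; rely on the single,
 * already synchronized memory offlining user for everything else. */
void zone_pcplist_disable(struct zone *zone)
{
	mutex_lock(&pcp_batch_high_lock);
	/* high = 0, batch = 1: every freed page goes straight to the buddy */
	zone_update_pageset_high_and_batch(zone, 0, 1);
	/* spill what already sits on pcplists, including remote CPUs */
	__drain_all_pages(zone, true);
}

void zone_pcplist_enable(struct zone *zone)
{
	/* recalculate and restore the regular high/batch values */
	zone_set_pageset_high_and_batch(zone);
	mutex_unlock(&pcp_batch_high_lock);
}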

> Currently the only user of this functionality is offline_pages().
> 
> [1] 
> https://lore.kernel.org/linux-mm/20200903140032.380431-1-pasha.tatas...@soleen.com/
> 
> Suggested-by: David Hildenbrand 
> Suggested-by: Michal Hocko 
> Signed-off-by: Vlastimil Babka 
> ---
>  include/linux/mmzone.h |  2 ++
>  include/linux/page-isolation.h |  2 ++
>  mm/internal.h  |  4 
>  mm/memory_hotplug.c| 28 
>  mm/page_alloc.c| 29 ++---
>  mm/page_isolation.c| 22 +++---
>  6 files changed, 57 insertions(+), 30 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 7ad3f14dbe88..3c653d6348b1 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -536,6 +536,8 @@ struct zone {
>* of pageblock. Protected by zone->lock.
>*/
>   unsigned long   nr_isolate_pageblock;
> + /* Track requests for disabling pcplists */
> + atomic_t		pcplist_disabled;
>  #endif
>  
>  #ifdef CONFIG_MEMORY_HOTPLUG
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index 572458016331..1dec5d0f62a6 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -38,6 +38,8 @@ struct page *has_unmovable_pages(struct zone *zone, struct 
> page *page,
>  void set_pageblock_migratetype(struct page *page, int migratetype);
>  int move_freepages_block(struct zone *zone, struct page *page,
>   int migratetype, int *num_movable);
> +void zone_pcplist_disable(struct zone *zone);
> +void zone_pcplist_enable(struct zone *zone);
>  
>  /*
>   * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
> diff --git a/mm/internal.h b/mm/internal.h
> index ea66ef45da6c..154e38e702dd 100644
> --- a/mm/internal.h
> +++ b/mm/interna

Re: [RFC V2] mm/vmstat: Add events for HugeTLB migration

2020-10-06 Thread Michal Hocko
On Tue 06-10-20 08:26:35, Anshuman Khandual wrote:
> 
> 
> On 10/05/2020 11:35 AM, Michal Hocko wrote:
> > On Mon 05-10-20 07:59:12, Anshuman Khandual wrote:
> >>
> >>
> >> On 10/02/2020 05:34 PM, Michal Hocko wrote:
> >>> On Wed 30-09-20 11:30:49, Anshuman Khandual wrote:
> >>>> Add following new vmstat events which will track HugeTLB page migration.
> >>>>
> >>>> 1. HUGETLB_MIGRATION_SUCCESS
> >>>> 2. HUGETLB_MIGRATION_FAILURE
> >>>>
> >>>> It follows the existing semantics to accommodate HugeTLB subpages in 
> >>>> total
> >>>> page migration statistics. While here, this updates current trace event
> >>>> 'mm_migrate_pages' to accommodate now available HugeTLB based statistics.
> >>> What is the actual usecase? And how do you deal with the complexity
> >>> introduced by many different hugetlb page sizes. Really, what is the
> >>> point to having such a detailed view on hugetlb migration?
> >>>
> >>
> >> It helps differentiate various page migration events i.e normal, THP and
> >> HugeTLB and gives us more reliable and accurate measurement. Current stats
> >> as per PGMIGRATE_SUCCESS and PGMIGRATE_FAIL are misleading, as they contain
> >> both normal and HugeTLB pages as single entities, which is not accurate.
> > 
> > Yes this is true. But why does it really matter? Do you have a specific
> > example?
> 
> An example which demonstrates that mixing and misrepresentation in these
> stats create some problem ? Well, we could just create one scenario via
> an application with different VMA types and triggering some migrations.
> But the fact remains, that these stats are inaccurate and misleading
> which is very clear and apparent.

This doesn't sound like a proper justification to me.
 
> >> After this change, PGMIGRATE_SUCCESS and PGMIGRATE_FAIL will contain page
> >> migration statistics in terms of normal pages irrespective of whether any
> >> previous migrations until that point involved normal pages, THP or HugeTLB
> >> (any size) pages. At the least, this fixes existing misleading stats with
> >> PGMIGRATE_SUCCESS and PGMIGRATE_FAIL.
> >>
> >> Besides, it helps us understand HugeTLB migrations in more detail. Even
> >> though HugeTLB can be of various sizes on a given platform, these new
> >> stats HUGETLB_MIGRATION_SUCCESS and HUGETLB_MIGRATION_FAILURE give enough
> >> overall insight into HugeTLB migration events.
> > 
> > While true this all is way too vague to add yet another imprecise
> > counter.
> 
> Given that user knows about all HugeTLB mappings it has got, these counters
> are not really vague and could easily be related back.

How can you tell a difference between different hugetlb sizes?

> Moreover this change
> completes the migration stats restructuring which was started with adding
> THP counters. Otherwise it remains incomplete.

Our counters will always be incomplete. Please do realize they are a mere
aid rather than a comprehensive toolset to identify the system behavior
in every detail. We have way too many counters and these particular
ones are not adding much IMHO. At least not from your description, which
sounds to me like "Yeah, why not do that although there is no strong
reason, well except that THP already has it so let's do it for hugetlb
as well". I really do not want to deal with yet another report in a few
years that somebody wants to distinguish hugetlb migration of different
sizes.

Really, is there any _real_ practical use for these other than "nice,
let's just have it"?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 9/9] mm, page_alloc: optionally disable pcplists during page isolation

2020-10-05 Thread Michal Hocko
On Mon 05-10-20 16:22:46, Vlastimil Babka wrote:
> On 10/5/20 4:05 PM, Michal Hocko wrote:
> > On Fri 25-09-20 13:10:05, Vlastimil Babka wrote:
> >> On 9/25/20 12:54 PM, David Hildenbrand wrote:
> >>
> >> Hmm that temporary write lock would still block new callers until previous
> >> finish with the downgraded-to-read lock.
> >>
> >> But I guess something like this would work:
> >>
> >> retry:
> >>   if (atomic_read(...) == 0) {
> >> // zone_update... + drain
> >> atomic_inc(...);
> >>   else if (atomic_inc_return == 1)
> >> // atomic_cmpxchg from 0 to 1; if that fails, goto retry
> >>
> >> Tricky, but races could only lead to unnecessary duplicated updates +
> >> flushing but nothing worse?
> >>
> >> Or add another spinlock to cover this part instead of the temp write 
> >> lock...
> > 
> > Do you plan to post a new version or should I review this one?
> 
> I will, hopefully this week, but you could comment on other details and
> overall approach meanwhile? I don't think it will change significantly.

OK. I will have a look tomorrow.

-- 
Michal Hocko
SUSE Labs


Re: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func.

2020-10-05 Thread Michal Hocko
On Mon 05-10-20 17:08:01, Uladzislau Rezki wrote:
> On Fri, Oct 02, 2020 at 11:05:07AM +0200, Michal Hocko wrote:
> > On Fri 02-10-20 09:50:14, Mel Gorman wrote:
> > > On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote:
> > > > On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote:
> > > > > > 
> > > > > > No, I meant going back to idea of new gfp flag, but adjust the 
> > > > > > implementation in
> > > > > > the allocator (different from what you posted in previous version) 
> > > > > > so that it
> > > > > > only looks at the flag after it tries to allocate from pcplist and 
> > > > > > finds out
> > > > > > it's empty. So, no inventing of new page allocator entry points or 
> > > > > > checks such
> > > > > > as the one you wrote above, but adding the new gfp flag in a way 
> > > > > > that it doesn't
> > > > > > affect existing fast paths.
> > > > > >
> > > > > OK. Now i see. Please have a look below at the patch, so we fully 
> > > > > understand
> > > > > each other. If that is something that is close to your view or not:
> > > > > 
> > > > > 
> > > > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > > > index c603237e006c..7e613560a502 100644
> > > > > --- a/include/linux/gfp.h
> > > > > +++ b/include/linux/gfp.h
> > > > > @@ -39,8 +39,9 @@ struct vm_area_struct;
> > > > >  #define ___GFP_HARDWALL		0x100000u
> > > > >  #define ___GFP_THISNODE		0x200000u
> > > > >  #define ___GFP_ACCOUNT		0x400000u
> > > > > +#define ___GFP_NO_LOCKS		0x800000u
> > > > 
> > > > Even if a new gfp flag gains a sufficient traction and support I am
> > > > _strongly_ opposed against consuming another flag for that. Bit space is
> > > > limited. 
> > > 
> > > That is definitely true. I'm not happy with the GFP flag at all, the
> > > comment is at best a damage limiting move. It still would be better for
> > > a memory pool to be reserved and sized for critical allocations.
> > 
> > Completely agreed. The only existing usecase is so special cased that a
> > dedicated pool is not only easier to maintain but it should be also much
> > better tuned for the specific workload. Something not really feasible
> > with the allocator.
> > 
> > > > Besides that we certainly do not want to allow craziness like
> > > > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we?
> > > 
> > > That would deserve to be taken to a dumpster and set on fire. The flag
> > > combination could be checked in the allocator but the allocator path fast
> > > paths are bad enough already.
> > 
> > If a new allocation/gfp mode is absolutely necessary then I believe that
> > the most reasonable way forward would be
> > #define GFP_NO_LOCK ((__force gfp_t)0)
> > 
> Agree. Even though i see that some code should be adjusted for it. There are
> a few users of the __get_free_page(0); So, need to double check it:

Yes, I believe I have pointed that out in the previous discussion.

> 
> [0.650351] BUG: kernel NULL pointer dereference, address: 0000000000000010
> [0.651083] #PF: supervisor read access in kernel mode
> [0.651639] #PF: error_code(0x0000) - not-present page
> [0.652200] PGD 0 P4D 0
> [0.652523] Oops: 0000 [#1] SMP NOPTI
> [0.652668] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> 5.9.0-rc7-next-20200930+ #140
> [0.652668] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.12.0-1 04/01/2014
> [0.652668] RIP: 0010:__find_event_file+0x21/0x80
> 
> 
> Apart of that. There is a post_alloc_hook(), that gets called from the 
> prep_new_page().
> If "debug page alloc enabled", it maps a page for debug purposes invoking 
> kernel_map_pages().
> __kernel_map_pages() is ARCH specific. For example, powerpc variant uses 
> sleep-able locks
> what can be easily converted to raw variant. 

Yes, there are likely more surprises like that. I am not sure about
kasan, page owner (which depends on the stack unwinder) and others which
hook into this path.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 9/9] mm, page_alloc: optionally disable pcplists during page isolation

2020-10-05 Thread Michal Hocko
On Fri 25-09-20 13:10:05, Vlastimil Babka wrote:
> On 9/25/20 12:54 PM, David Hildenbrand wrote:
> >>> --- a/mm/page_isolation.c
> >>> +++ b/mm/page_isolation.c
> >>> @@ -15,6 +15,22 @@
> >>>  #define CREATE_TRACE_POINTS
> >>>  #include 
> >>>  
> >>> +void zone_pcplist_disable(struct zone *zone)
> >>> +{
> >>> + down_read(&pcp_batch_high_lock);
> >>> + if (atomic_inc_return(&zone->pcplist_disabled) == 1) {
> >>> + zone_update_pageset_high_and_batch(zone, 0, 1);
> >>> + __drain_all_pages(zone, true);
> >>> + }
> >> Hm, if one CPU is still inside the if-clause, the other one would
> >> continue, however pcp would not be disabled and zones not drained when
> >> returning.
> 
> Ah, well spotted, thanks!
> 
> >> (while we only allow a single Offline_pages() call, it will be different
> >> when we use the function in other context - especially,
> >> alloc_contig_range() for some users)
> >>
> >> Can't we use down_write() here? So it's serialized and everybody has to
> >> properly wait. (and we would not have to rely on an atomic_t)
> > Sorry, I meant down_write only temporarily in this code path. Not
> > keeping it locked in write when returning (I remember there is a way to
> > downgrade).
> 
> Hmm that temporary write lock would still block new callers until previous
> finish with the downgraded-to-read lock.
> 
> But I guess something like this would work:
> 
> retry:
>   if (atomic_read(...) == 0) {
> // zone_update... + drain
> atomic_inc(...);
>   else if (atomic_inc_return == 1)
> // atomic_cmpxchg from 0 to 1; if that fails, goto retry
> 
> Tricky, but races could only lead to unnecessary duplicated updates + flushing
> but nothing worse?
> 
> Or add another spinlock to cover this part instead of the temp write lock...

Do you plan to post a new version or should I review this one?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 8/9] mm, page_alloc: drain all pcplists during memory offline

2020-10-05 Thread Michal Hocko
On Fri 25-09-20 12:46:27, David Hildenbrand wrote:
> On 22.09.20 16:37, Vlastimil Babka wrote:
[...]
> > +/*
> > + * Spill all the per-cpu pages from all CPUs back into the buddy allocator.
> > + *
> > + * When zone parameter is non-NULL, spill just the single zone's pages.
> > + *
> > + * Note that this can be extremely slow as the draining happens in a 
> > workqueue.
> > + */
> > +void drain_all_pages(struct zone *zone)
> > +{
> > +   __drain_all_pages(zone, false);
> > +}
> > +
> >  #ifdef CONFIG_HIBERNATION
> >  
> >  /*
> > 
> 
> Interesting race. Instead of this ugly __drain_all_pages() with a
> boolean parameter, can we have two properly named functions to be used
> in !page_alloc.c code without scratching your head what the difference is?

I tend to agree here. I would even fold this into the next patch because
the disable/enable interface is much more manageable.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 7/9] mm, page_alloc: move draining pcplists to page isolation users

2020-10-05 Thread Michal Hocko
On Tue 22-09-20 16:37:10, Vlastimil Babka wrote:
> Currently, pcplists are drained during set_migratetype_isolate() which means
> once per pageblock processed start_isolate_page_range(). This is somewhat
> wasteful. Moreover, the callers might need different guarantees, and the
> draining is currently prone to races and does not guarantee that no page
> from isolated pageblock will end up on the pcplist after the drain.
> 
> Better guarantees are added by later patches and require explicit actions
> by page isolation users that need them. Thus it makes sense to move the
> current imperfect draining to the callers also as a preparation step.
> 
> Suggested-by: Pavel Tatashin 
> Signed-off-by: Vlastimil Babka 

Acked-by: Michal Hocko 

> ---
>  mm/memory_hotplug.c | 11 ++-
>  mm/page_alloc.c |  2 ++
>  mm/page_isolation.c | 10 +-
>  3 files changed, 13 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 9db80ee29caa..08f729922e18 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1524,6 +1524,8 @@ int __ref offline_pages(unsigned long start_pfn, 
> unsigned long nr_pages)
>   goto failed_removal;
>   }
>  
> + drain_all_pages(zone);
> +
>   arg.start_pfn = start_pfn;
>   arg.nr_pages = nr_pages;
>   node_states_check_changes_offline(nr_pages, zone, &arg);
> @@ -1574,11 +1576,10 @@ int __ref offline_pages(unsigned long start_pfn, 
> unsigned long nr_pages)
>   }
>  
>   /*
> -  * per-cpu pages are drained in start_isolate_page_range, but if
> -  * there are still pages that are not free, make sure that we
> -  * drain again, because when we isolated range we might
> -  * have raced with another thread that was adding pages to pcp
> -  * list.
> +  * per-cpu pages are drained after start_isolate_page_range, but
> +  * if there are still pages that are not free, make sure that we
> +  * drain again, because when we isolated range we might have
> +  * raced with another thread that was adding pages to pcp list.
>*
>* Forward progress should be still guaranteed because
>* pages on the pcp list can only belong to MOVABLE_ZONE
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 901907799bdc..4e37bc3f6077 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -8432,6 +8432,8 @@ int alloc_contig_range(unsigned long start, unsigned 
> long end,
>   if (ret)
>   return ret;
>  
> + drain_all_pages(cc.zone);
> +
>   /*
>* In case of -EBUSY, we'd like to know which page causes problem.
>* So, just fall through. test_pages_isolated() has a tracepoint
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index abfe26ad59fd..57d8bc1e7760 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -49,7 +49,6 @@ static int set_migratetype_isolate(struct page *page, int 
> migratetype, int isol_
>  
>   __mod_zone_freepage_state(zone, -nr_pages, mt);
>   spin_unlock_irqrestore(&zone->lock, flags);
> - drain_all_pages(zone);
>   return 0;
>   }
>  
> @@ -167,11 +166,12 @@ __first_valid_page(unsigned long pfn, unsigned long 
> nr_pages)
>   *
>   * Please note that there is no strong synchronization with the page 
> allocator
>   * either. Pages might be freed while their page blocks are marked ISOLATED.
> - * In some cases pages might still end up on pcp lists and that would allow
> + * A call to drain_all_pages() after isolation can flush most of them. 
> However
> + * in some cases pages might still end up on pcp lists and that would allow
>   * for their allocation even when they are in fact isolated already. 
> Depending
> - * on how strong of a guarantee the caller needs drain_all_pages might be 
> needed
> - * (e.g. __offline_pages will need to call it after check for isolated range 
> for
> - * a next retry).
> + * on how strong of a guarantee the caller needs, further drain_all_pages()
> + * might be needed (e.g. __offline_pages will need to call it after check for
> + * isolated range for a next retry).
>   *
>   * Return: 0 on success and -EBUSY if any part of range cannot be isolated.
>   */
> -- 
> 2.28.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 6/9] mm, page_alloc: cache pageset high and batch in struct zone

2020-10-05 Thread Michal Hocko
On Tue 22-09-20 16:37:09, Vlastimil Babka wrote:
> All per-cpu pagesets for a zone use the same high and batch values, that are
> duplicated there just for performance (locality) reasons. This patch adds the
> same variables also to struct zone as a shared copy.
> 
> This will be useful later for making possible to disable pcplists temporarily
> by setting high value to 0, while remembering the values for restoring them
> later. But we can also immediately benefit from not updating pagesets of all
> possible cpus in case the newly recalculated values (after sysctl change or
> memory online/offline) are actually unchanged from the previous ones.

The advantage of this patch is not really clear in isolation. Maybe
merge it with the patch which uses the duplicated state.

> 
> Signed-off-by: Vlastimil Babka 
> ---
>  include/linux/mmzone.h |  6 ++
>  mm/page_alloc.c| 16 ++--
>  2 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 90721f3156bc..7ad3f14dbe88 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -470,6 +470,12 @@ struct zone {
>  #endif
>   struct pglist_data  *zone_pgdat;
>   struct per_cpu_pageset __percpu *pageset;
> + /*
> +  * the high and batch values are copied to individual pagesets for
> +  * faster access
> +  */
> + int pageset_high;
> + int pageset_batch;
>  
>  #ifndef CONFIG_SPARSEMEM
>   /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index de3b48bda45c..901907799bdc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5824,6 +5824,8 @@ static void build_zonelists(pg_data_t *pgdat)
>   * Other parts of the kernel may not check if the zone is available.
>   */
>  static void pageset_init(struct per_cpu_pageset *p);
> +#define BOOT_PAGESET_HIGH	0
> +#define BOOT_PAGESET_BATCH	1
>  static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
>  static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>  
> @@ -6213,8 +6215,8 @@ static void pageset_init(struct per_cpu_pageset *p)
>* need to be as careful as pageset_update() as nobody can access the
>* pageset yet.
>*/
> - pcp->high = 0;
> - pcp->batch = 1;
> + pcp->high = BOOT_PAGESET_HIGH;
> + pcp->batch = BOOT_PAGESET_BATCH;
>  }
>  
>  /*
> @@ -6238,6 +6240,14 @@ static void zone_set_pageset_high_and_batch(struct 
> zone *zone)
>   new_batch = max(1UL, 1 * new_batch);
>   }
>  
> + if (zone->pageset_high != new_high ||
> + zone->pageset_batch != new_batch) {
> + zone->pageset_high = new_high;
> + zone->pageset_batch = new_batch;
> + } else {
> + return;
> + }
> +
>   for_each_possible_cpu(cpu) {
>   p = per_cpu_ptr(zone->pageset, cpu);
> + pageset_update(&p->pcp, new_high, new_batch);
> @@ -6300,6 +6310,8 @@ static __meminit void zone_pcp_init(struct zone *zone)
>* offset of a (static) per cpu variable into the per cpu area.
>*/
>   zone->pageset = &boot_pageset;
> + zone->pageset_high = BOOT_PAGESET_HIGH;
> + zone->pageset_batch = BOOT_PAGESET_BATCH;
>  
>   if (populated_zone(zone))
>   printk(KERN_DEBUG "  %s zone: %lu pages, LIFO batch:%u\n",
> -- 
> 2.28.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 5/9] mm, page_alloc: make per_cpu_pageset accessible only after init

2020-10-05 Thread Michal Hocko
On Tue 22-09-20 16:37:08, Vlastimil Babka wrote:
> setup_zone_pageset() replaces the boot_pageset by allocating and initializing 
> a
> proper percpu one. Currently it assigns zone->pageset with the newly allocated
> one before initializing it. That's currently not an issue, because the zone
> should not be in any zonelist, thus not visible to allocators at this point.
> 
> Memory ordering between the pcplist contents and its visibility is also not
> guaranteed here, but that also shouldn't be an issue because online_pages()
> does a spin_unlock(pgdat->node_size_lock) before building the zonelists.
> 
> However it's best that we don't silently rely on operations that can be 
> changed
> in the future. Make sure only properly initialized pcplists are visible, using
> smp_store_release(). The read side has a data dependency via the zone->pageset
> pointer instead of an explicit read barrier.

Heh, this looks like inventing a similar trap to the one the previous patch
was removing. But more seriously, considering that we need locking for the
whole setup, wouldn't it be better to simply document the locking
requirements rather than adding scary looking barriers that we ourselves
or somebody else will need to scratch heads about in the future? I am pretty
sure we don't do anything like that when initializing numa node or other data
structures that might be allocated during the memory hotadd.

> Signed-off-by: Vlastimil Babka 
> ---
>  mm/page_alloc.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 99b74c1c2b0a..de3b48bda45c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6246,15 +6246,17 @@ static void zone_set_pageset_high_and_batch(struct 
> zone *zone)
>  
>  void __meminit setup_zone_pageset(struct zone *zone)
>  {
> + struct per_cpu_pageset __percpu * new_pageset;
>   struct per_cpu_pageset *p;
>   int cpu;
>  
> - zone->pageset = alloc_percpu(struct per_cpu_pageset);
> + new_pageset = alloc_percpu(struct per_cpu_pageset);
>   for_each_possible_cpu(cpu) {
> - p = per_cpu_ptr(zone->pageset, cpu);
> + p = per_cpu_ptr(new_pageset, cpu);
>   pageset_init(p);
>   }
>  
> + smp_store_release(&zone->pageset, new_pageset);
>   zone_set_pageset_high_and_batch(zone);
>  }
>  
> -- 
> 2.28.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 4/9] mm, page_alloc: simplify pageset_update()

2020-10-05 Thread Michal Hocko
On Tue 22-09-20 16:37:07, Vlastimil Babka wrote:
> pageset_update() attempts to update pcplist's high and batch values in a way
> that readers don't observe batch > high. It uses smp_wmb() to order the 
> updates
> in a way to achieve this. However, without proper pairing read barriers in
> readers this guarantee doesn't hold, and there are no such barriers in
> e.g. free_unref_page_commit().
> 
> Commit 88e8ac11d2ea ("mm, page_alloc: fix core hung in free_pcppages_bulk()")
> already showed this is problematic, and solved this by ultimately only trusting
> pcp->count of the current cpu with interrupts disabled.
> 
> The update dance with unpaired write barriers thus makes no sense. Replace
> them with plain WRITE_ONCE to prevent store tearing, and document that the
> values can change asynchronously and should not be trusted for correctness.
> 
> All current readers appear to be OK after 88e8ac11d2ea. Convert them to
> READ_ONCE to prevent unnecessary read tearing, but mainly to alert anybody
> making future changes to the code that special care is needed.
> 
> Signed-off-by: Vlastimil Babka 

Yes, this should be safe AFAICS. I believe the original intention was
well minded but didn't go all the way to do the thing properly.

I have to admit I have stumbled over this weirdness a few times and never
found enough motivation to think that through.

Acked-by: Michal Hocko 

> ---
>  mm/page_alloc.c | 40 ++--
>  1 file changed, 18 insertions(+), 22 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 76c2b4578723..99b74c1c2b0a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1297,7 +1297,7 @@ static void free_pcppages_bulk(struct zone *zone, int 
> count,
>  {
>   int migratetype = 0;
>   int batch_free = 0;
> - int prefetch_nr = 0;
> + int prefetch_nr = READ_ONCE(pcp->batch);
>   bool isolated_pageblocks;
>   struct page *page, *tmp;
>   LIST_HEAD(head);
> @@ -1348,8 +1348,10 @@ static void free_pcppages_bulk(struct zone *zone, int 
> count,
>* avoid excessive prefetching due to large count, only
>* prefetch buddy for the first pcp->batch nr of pages.
>*/
> - if (prefetch_nr++ < pcp->batch)
> + if (prefetch_nr) {
>   prefetch_buddy(page);
> + prefetch_nr--;
> + }
>   } while (--count && --batch_free && !list_empty(list));
>   }
>  
> @@ -3131,10 +3133,8 @@ static void free_unref_page_commit(struct page *page, 
> unsigned long pfn)
>   pcp = &this_cpu_ptr(zone->pageset)->pcp;
>   list_add(&page->lru, &pcp->lists[migratetype]);
>   pcp->count++;
> - if (pcp->count >= pcp->high) {
> - unsigned long batch = READ_ONCE(pcp->batch);
> - free_pcppages_bulk(zone, batch, pcp);
> - }
> + if (pcp->count >= READ_ONCE(pcp->high))
> + free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp);
>  }
>  
>  /*
> @@ -3318,7 +3318,7 @@ static struct page *__rmqueue_pcplist(struct zone 
> *zone, int migratetype,
>   do {
>   if (list_empty(list)) {
>   pcp->count += rmqueue_bulk(zone, 0,
> - pcp->batch, list,
> + READ_ONCE(pcp->batch), list,
>   migratetype, alloc_flags);
>   if (unlikely(list_empty(list)))
>   return NULL;
> @@ -6174,13 +6174,16 @@ static int zone_batchsize(struct zone *zone)
>  }
>  
>  /*
> - * pcp->high and pcp->batch values are related and dependent on one another:
> - * ->batch must never be higher then ->high.
> - * The following function updates them in a safe manner without read side
> - * locking.
> + * pcp->high and pcp->batch values are related and generally batch is lower
> + * than high. They are also related to pcp->count such that count is lower
> + * than high, and as soon as it reaches high, the pcplist is flushed.
>   *
> - * Any new users of pcp->batch and pcp->high should ensure they can cope with
> - * those fields changing asynchronously (acording to the above rule).
> + * However, guaranteeing these relations at all times would require e.g. 
> write
> + * barriers here but also careful usage of read barriers at the read side, 
> and
> + * thus be prone to error and bad for performance. Thus the update only 
> prevents
> + * store tearing. Any new users of pcp->batch an

Re: [PATCH 2/9] mm, page_alloc: calculate pageset high and batch once per zone

2020-10-05 Thread Michal Hocko
On Tue 22-09-20 16:37:05, Vlastimil Babka wrote:
> We currently call pageset_set_high_and_batch() for each possible cpu, which
> repeats the same calculations of high and batch values.
> 
> Instead call the function just once per zone, and make it apply the calculated
> values to all per-cpu pagesets of the zone.
> 
> This also allows removing the zone_pageset_init() and __zone_pcp_update()
> wrappers.
> 
> No functional change.
> 
> Signed-off-by: Vlastimil Babka 
> Reviewed-by: Oscar Salvador 
> Reviewed-by: David Hildenbrand 

I like this. One question below
Acked-by: Michal Hocko 

> ---
>  mm/page_alloc.c | 42 ++
>  1 file changed, 18 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a163c5e561f2..26069c8d1b19 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6219,13 +6219,14 @@ static void setup_pageset(struct per_cpu_pageset *p)
>  }
>  
>  /*
> - * Calculate and set new high and batch values for given per-cpu pageset of a
> + * Calculate and set new high and batch values for all per-cpu pagesets of a
>   * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
>   */
> -static void pageset_set_high_and_batch(struct zone *zone,
> -struct per_cpu_pageset *p)
> +static void zone_set_pageset_high_and_batch(struct zone *zone)
>  {
>   unsigned long new_high, new_batch;
> + struct per_cpu_pageset *p;
> + int cpu;
>  
>   if (percpu_pagelist_fraction) {
>   new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
> @@ -6237,23 +6238,25 @@ static void pageset_set_high_and_batch(struct zone 
> *zone,
>   new_high = 6 * new_batch;
>   new_batch = max(1UL, 1 * new_batch);
>   }
> - pageset_update(&p->pcp, new_high, new_batch);
> -}
> -
> -static void __meminit zone_pageset_init(struct zone *zone, int cpu)
> -{
> - struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);
>  
> - pageset_init(pcp);
> - pageset_set_high_and_batch(zone, pcp);
> + for_each_possible_cpu(cpu) {
> + p = per_cpu_ptr(zone->pageset, cpu);
> + pageset_update(&p->pcp, new_high, new_batch);
> + }
>  }
>  
>  void __meminit setup_zone_pageset(struct zone *zone)
>  {
> + struct per_cpu_pageset *p;
>   int cpu;
> +
>   zone->pageset = alloc_percpu(struct per_cpu_pageset);
> - for_each_possible_cpu(cpu)
> - zone_pageset_init(zone, cpu);
> + for_each_possible_cpu(cpu) {
> + p = per_cpu_ptr(zone->pageset, cpu);
> + pageset_init(p);
> + }
> +
> +     zone_set_pageset_high_and_batch(zone);

I hope I am not misreading the diff but it seems that setup_zone_pageset
is calling pageset_init which is then done again by
zone_set_pageset_high_and_batch as a part of pageset_update

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 3/9] mm, page_alloc: remove setup_pageset()

2020-10-05 Thread Michal Hocko
On Tue 22-09-20 16:37:06, Vlastimil Babka wrote:
> We initialize boot-time pagesets with setup_pageset(), which sets high and
> batch values that effectively disable pcplists.
> 
> We can remove this wrapper if we just set these values for all pagesets in
> pageset_init(). Non-boot pagesets then subsequently update them to the proper
> values.
> 
> No functional change.
> 
> Signed-off-by: Vlastimil Babka 
> ---
>  mm/page_alloc.c | 17 ++---
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 26069c8d1b19..76c2b4578723 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5823,7 +5823,7 @@ static void build_zonelists(pg_data_t *pgdat)
>   * not check if the processor is online before following the pageset pointer.
>   * Other parts of the kernel may not check if the zone is available.
>   */
> -static void setup_pageset(struct per_cpu_pageset *p);
> +static void pageset_init(struct per_cpu_pageset *p);
>  static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
>  static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>  
> @@ -5891,7 +5891,7 @@ build_all_zonelists_init(void)
>* (a chicken-egg dilemma).
>*/
>   for_each_possible_cpu(cpu)
> - setup_pageset(&per_cpu(boot_pageset, cpu));
> + pageset_init(&per_cpu(boot_pageset, cpu));
>  
>   mminit_verify_zonelist();
>   cpuset_init_current_mems_allowed();
> @@ -6210,12 +6210,15 @@ static void pageset_init(struct per_cpu_pageset *p)
>   pcp = &p->pcp;
>   for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
>   INIT_LIST_HEAD(&pcp->lists[migratetype]);
> -}
>  
> -static void setup_pageset(struct per_cpu_pageset *p)
> -{
> - pageset_init(p);
> - pageset_update(&p->pcp, 0, 1);
> + /*
> +  * Set batch and high values safe for a boot pageset. A true percpu
> +  * pageset's initialization will update them subsequently. Here we don't
> +  * need to be as careful as pageset_update() as nobody can access the
> +  * pageset yet.

Isn't this slightly misleading? pageset_init is called from setup_zone_pageset
which is called from the memory hotplug as well. Isn't this more about
early zone initialization rather than boot pagesets? Or am I misreading
the patch?

> +  */
> + pcp->high = 0;
> + pcp->batch = 1;
>  }
>  
>  /*
> -- 
> 2.28.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/9] mm, page_alloc: clean up pageset high and batch update

2020-10-05 Thread Michal Hocko
On Tue 22-09-20 16:37:04, Vlastimil Babka wrote:
> The updates to pcplists' high and batch valued are handled by multiple
> functions that make the calculations hard to follow. Consolidate everything
> to pageset_set_high_and_batch() and remove pageset_set_batch() and
> pageset_set_high() wrappers.
> 
> The only special case using one of the removed wrappers was:
> build_all_zonelists_init()
>   setup_pageset()
> pageset_set_batch()
> which was hardcoding batch as 0, so we can just open-code a call to
> pageset_update() with constant parameters instead.
> 
> No functional change.
> 
> Signed-off-by: Vlastimil Babka 
> Reviewed-by: Oscar Salvador 

yes this looks better, the original code was really hard to follow.

Acked-by: Michal Hocko 

> ---
>  mm/page_alloc.c | 49 -
>  1 file changed, 20 insertions(+), 29 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 60a0e94645a6..a163c5e561f2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5823,7 +5823,7 @@ static void build_zonelists(pg_data_t *pgdat)
>   * not check if the processor is online before following the pageset pointer.
>   * Other parts of the kernel may not check if the zone is available.
>   */
> -static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
> +static void setup_pageset(struct per_cpu_pageset *p);
>  static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
>  static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>  
> @@ -5891,7 +5891,7 @@ build_all_zonelists_init(void)
>* (a chicken-egg dilemma).
>*/
>   for_each_possible_cpu(cpu)
> - setup_pageset(&per_cpu(boot_pageset, cpu), 0);
> + setup_pageset(&per_cpu(boot_pageset, cpu));
>  
>   mminit_verify_zonelist();
>   cpuset_init_current_mems_allowed();
> @@ -6200,12 +6200,6 @@ static void pageset_update(struct per_cpu_pages *pcp, 
> unsigned long high,
>   pcp->batch = batch;
>  }
>  
> -/* a companion to pageset_set_high() */
> -static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
> -{
> - pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
> -}
> -
>  static void pageset_init(struct per_cpu_pageset *p)
>  {
>   struct per_cpu_pages *pcp;
> @@ -6218,35 +6212,32 @@ static void pageset_init(struct per_cpu_pageset *p)
>   INIT_LIST_HEAD(>lists[migratetype]);
>  }
>  
> -static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
> +static void setup_pageset(struct per_cpu_pageset *p)
>  {
>   pageset_init(p);
> - pageset_set_batch(p, batch);
> + pageset_update(&p->pcp, 0, 1);
>  }
>  
>  /*
> - * pageset_set_high() sets the high water mark for hot per_cpu_pagelist
> - * to the value high for the pageset p.
> + * Calculate and set new high and batch values for given per-cpu pageset of a
> + * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
>   */
> -static void pageset_set_high(struct per_cpu_pageset *p,
> - unsigned long high)
> -{
> - unsigned long batch = max(1UL, high / 4);
> - if ((high / 4) > (PAGE_SHIFT * 8))
> - batch = PAGE_SHIFT * 8;
> -
> - pageset_update(&p->pcp, high, batch);
> -}
> -
>  static void pageset_set_high_and_batch(struct zone *zone,
> -struct per_cpu_pageset *pcp)
> +struct per_cpu_pageset *p)
>  {
> - if (percpu_pagelist_fraction)
> - pageset_set_high(pcp,
> - (zone_managed_pages(zone) /
> - percpu_pagelist_fraction));
> - else
> - pageset_set_batch(pcp, zone_batchsize(zone));
> + unsigned long new_high, new_batch;
> +
> + if (percpu_pagelist_fraction) {
> + new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
> + new_batch = max(1UL, new_high / 4);
> + if ((new_high / 4) > (PAGE_SHIFT * 8))
> + new_batch = PAGE_SHIFT * 8;
> + } else {
> + new_batch = zone_batchsize(zone);
> + new_high = 6 * new_batch;
> + new_batch = max(1UL, 1 * new_batch);
> + }
> + pageset_update(&p->pcp, new_high, new_batch);
>  }
>  
>  static void __meminit zone_pageset_init(struct zone *zone, int cpu)
> -- 
> 2.28.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: optionally disable brk()

2020-10-05 Thread Michal Hocko
On Mon 05-10-20 11:13:48, David Hildenbrand wrote:
> On 05.10.20 08:12, Michal Hocko wrote:
> > On Sat 03-10-20 00:44:09, Topi Miettinen wrote:
> >> On 2.10.2020 20.52, David Hildenbrand wrote:
> >>> On 02.10.20 19:19, Topi Miettinen wrote:
> >>>> The brk() system call allows to change data segment size (heap). This
> >>>> is mainly used by glibc for memory allocation, but it can use mmap()
> >>>> and that results in more randomized memory mappings since the heap is
> >>>> always located at fixed offset to program while mmap()ed memory is
> >>>> randomized.
> >>>
> >>> Want to take more Unix out of Linux?
> >>>
> >>> Honestly, why care about disabling? User space can happily use mmap() if
> >>> it prefers.
> >>
> >> brk() interface doesn't seem to be used much and glibc is happy to switch 
> >> to
> >> mmap() if brk() fails, so why not allow disabling it optionally? If you
> >> don't care to disable, don't do it and this is even the default.
> > 
I do not think we want to have a config option per syscall, do we? 
> 
> I do wonder if grouping would be a better option then (finding a proper
> level of abstraction ...).

I have a vague recollection that the kernel tinification project was
aiming in that direction. No idea what the current state is or whether
somebody is pursuing it.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH RFC v2] Opportunistic memory reclaim

2020-10-05 Thread Michal Hocko
A similar thing has been proposed recently by Shakeel
http://lkml.kernel.org/r/20200909215752.1725525-1-shake...@google.com
Please have a look at the follow up discussion.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: optionally disable brk()

2020-10-05 Thread Michal Hocko
On Mon 05-10-20 11:11:35, Topi Miettinen wrote:
[...]
> I think hardened, security oriented systems should disable brk() completely
> because it will increase the randomization of the process address space
> (ASLR). This wouldn't be a good option to enable for systems where maximum
> compatibility with legacy software is more important than any hardening.

I believe we already do have means to filter syscalls from userspace for
security hardened environments. Or is there any reason to duplicate
that and control this at configuration time?
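
For illustration, one such existing userspace mechanism - a minimal
seccomp filter via libseccomp (a sketch, not a recommendation; systemd's
SystemCallFilter= can express the same policy declaratively):

#include <errno.h>
#include <seccomp.h>

/* Make brk() fail with ENOMEM for this process; glibc's malloc then
 * falls back to mmap() as discussed above. */
static int deny_brk(void)
{
	int ret = -1;
	scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);

	if (!ctx)
		return -1;
	if (seccomp_rule_add(ctx, SCMP_ACT_ERRNO(ENOMEM), SCMP_SYS(brk), 0) == 0)
		ret = seccomp_load(ctx);
	seccomp_release(ctx);
	return ret;
}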
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/util.c: Add error logs for commitment overflow

2020-10-05 Thread Michal Hocko
On Fri 02-10-20 21:53:37, pi...@codeaurora.org wrote:
> On 2020-10-02 17:47, Michal Hocko wrote:
> 
> > > __vm_enough_memory: commitment overflow: ppid:150, pid:164,
> > > pages:62451
> > > fork failed[count:0]: Cannot allocate memory
> > 
> > While I understand that fork failing due to overcommit heuristic is non
> > intuitive and I have seen people scratching heads due to this in the
> > past I am not convinced this is a right approach to tackle the problem.
> 
> Dear Michal,
> First, thank you so much for your review and comments.
> I totally agree with you.
> 
> > First off, referencing pids is not really going to help much if process
> > is short lived.
> 
> Yes, I agree with you.
> But I think this is most important mainly for short lived processes itself.
> Because, when this situation occurs, no one knows who could be the culprit.

Pid will not tell you much for those processes, right?

> However, user keeps dumping "ps" or "top" in background to reproduce once
> again.

I do not think this would be an effective way to catch the problem.
Especially with _once reporting. My experience with these reports is
that a reporter notices a malfunctioning (usually more complex)
workload. In some cases ENOMEM from fork is reported into the log by the
userspace.

For others it is strace -f that tells us that fork is failing and a
test with OVERCOMMIT_ALWAYS usually confirms the theory that this is
the culprit. But a rule of thumb is that it is almost always overcommit
to blame. Why? An undocumented secret is that ENOMEM resulting from an
actual memory allocation in the fork/clone path is close to impossible
because the kernel does all it can to satisfy them (and even invokes the OOM
killer). There are exceptions (e.g. high order allocations) but
those should be very rare in that path.
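
A user-space sketch of that confirmation test (the 64 GiB size is an
assumption meant to exceed the commit limit of a typical test machine,
not a general recipe):

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>

/* If this fails with ENOMEM but succeeds after
 * `sysctl vm.overcommit_memory=1` (OVERCOMMIT_ALWAYS), the failure came
 * from the overcommit accounting, not from a real allocation failure. */
int main(void)
{
	size_t len = (size_t)64 << 30;	/* 64 GiB of anonymous memory */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	munmap(p, len);
	puts("mapping succeeded - overcommit is not the limiting factor");
	return 0;
}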

> At this time, we can easily match the pid, process-name (at least in most
> cases).

Maybe our definitions of short lived processes differ but in my book
those are pretty hard to catch in flight.

> > Secondly, __vm_enough_memory is about any address space
> > allocation. Why would you be interested in the parent when doing mmap?
> > 
> 
> Yes agree, we can remove ppid from here.
> I thought it might be useful at least in case of fork (or short lived
> process).

I suspect you have missed my point here. Let me clarify a bit more.
__vm_enough_memory is called from many more places than fork.
Essentially any mmap, brk etc. goes through this. This is where the
parent pid certainly doesn't make any sense. Fork is a different
case, because the forked process's pid on its own doesn't tell you much
as it is going to die very quickly anyway. This is when the parent is
likely the more important piece of information.

That being said, the content really depends on the specific path, and that
suggests that you are trying to log at the wrong layer.

Another question is whether we really need logging done by the kernel.
Ratelimiting would be tricky to get right and we do not want to allow an
easy way to swamp the logs either.
As already mentioned, ENOMEM here usually means an overcommit failure. Maybe
we want to be more explicit about this in the man page?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1 3/5] mm/page_alloc: always move pages to the tail of the freelist in unset_migratetype_isolate()

2020-10-05 Thread Michal Hocko
On Fri 02-10-20 17:20:09, David Hildenbrand wrote:
> On 02.10.20 15:24, Michal Hocko wrote:
> > On Mon 28-09-20 20:21:08, David Hildenbrand wrote:
> >> Page isolation doesn't actually touch the pages, it simply isolates
> >> pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
> >>
> >> We already place pages to the tail of the freelists when undoing
> >> isolation via __putback_isolated_page(), let's do it in any case
> >> (e.g., if order <= pageblock_order) and document the behavior.
> >>
> >> Add a "to_tail" parameter to move_freepages_block() but introduce a
> >> a new move_to_free_list_tail() - similar to add_to_free_list_tail().
> >>
> >> This change results in all pages getting onlined via online_pages() to
> >> be placed to the tail of the freelist.
> > 
> > Is there anything preventing to do this unconditionally? Or in other
> > words is any of the existing callers of move_freepages_block benefiting
> > from adding to the head?
> 
> 1. mm/page_isolation.c:set_migratetype_isolate()
> 
> We move stuff to the MIGRATE_ISOLATE list, we don't care about the order
> there.
> 
> 2. steal_suitable_fallback():
> 
> I don't think we care too much about the order when already stealing
> pageblocks ... and the freelist is empty I guess?
> 
> 3. reserve_highatomic_pageblock()/unreserve_highatomic_pageblock()
> 
> Not sure if we really care.

Honestly, I have no idea. I can imagine that some atomic high-order
workloads (e.g. in net) might benefit from cache-hot pages but I am
not sure this is really observable. If yes, it would likely be better to
have this documented than to rely on a wild guess. If we do not have any
evidence then I would vote for simplicity first and go with an
unconditional add_to_tail, which would simplify your patch a bit.

Maybe Vlastimil or Mel would have a better picture.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: optionally disable brk()

2020-10-05 Thread Michal Hocko
On Sat 03-10-20 00:44:09, Topi Miettinen wrote:
> On 2.10.2020 20.52, David Hildenbrand wrote:
> > On 02.10.20 19:19, Topi Miettinen wrote:
> > > The brk() system call allows to change data segment size (heap). This
> > > is mainly used by glibc for memory allocation, but it can use mmap()
> > > and that results in more randomized memory mappings since the heap is
> > > always located at fixed offset to program while mmap()ed memory is
> > > randomized.
> > 
> > Want to take more Unix out of Linux?
> > 
> > Honestly, why care about disabling? User space can happily use mmap() if
> > it prefers.
> 
> brk() interface doesn't seem to be used much and glibc is happy to switch to
> mmap() if brk() fails, so why not allow disabling it optionally? If you
> don't care to disable, don't do it and this is even the default.

I do not think we want to have a config option per syscall, do we? There are
many other syscalls which are rarely used. Your changelog is actually missing
the most important part. Why do we care so much about increasing the config
space and making the kernel even trickier for users to configure? How
do I know that something won't break? brk() is one of those syscalls
that has been here forever and a lot of userspace might depend on it.
I haven't checked, but the code size is very unlikely to shrink much
as this is mostly a tiny wrapper around the mmap code. We are not going to
get rid of any complexity.

So what is the point?
-- 
Michal Hocko
SUSE Labs


Re: [RFC V2] mm/vmstat: Add events for HugeTLB migration

2020-10-05 Thread Michal Hocko
On Mon 05-10-20 07:59:12, Anshuman Khandual wrote:
> 
> 
> On 10/02/2020 05:34 PM, Michal Hocko wrote:
> > On Wed 30-09-20 11:30:49, Anshuman Khandual wrote:
> >> Add following new vmstat events which will track HugeTLB page migration.
> >>
> >> 1. HUGETLB_MIGRATION_SUCCESS
> >> 2. HUGETLB_MIGRATION_FAILURE
> >>
> >> It follows the existing semantics to accommodate HugeTLB subpages in total
> >> page migration statistics. While here, this updates current trace event
> >> 'mm_migrate_pages' to accommodate now available HugeTLB based statistics.
> > What is the actual usecase? And how do you deal with the complexity
> > introduced by the many different hugetlb page sizes? Really, what is the
> > point of having such a detailed view of hugetlb migration?
> >
> 
> It helps differentiate various page migration events i.e normal, THP and
> HugeTLB and gives us more reliable and accurate measurement. Current stats
> as per PGMIGRATE_SUCCESS and PGMIGRATE_FAIL are misleading, as they contain
> both normal and HugeTLB pages as single entities, which is not accurate.

Yes this is true. But why does it really matter? Do you have a specific
example?

> After this change, PGMIGRATE_SUCCESS and PGMIGRATE_FAIL will contain page
> migration statistics in terms of normal pages irrespective of whether any
> previous migrations until that point involved normal pages, THP or HugeTLB
> (any size) pages. At the least, this fixes existing misleading stats with
> PGMIGRATE_SUCCESS and PGMIGRATE_FAIL.
> 
> Besides, it helps us understand HugeTLB migrations in more detail. Even
> though HugeTLB can be of various sizes on a given platform, these new
> stats HUGETLB_MIGRATION_SUCCESS and HUGETLB_MIGRATION_FAILURE give enough
> overall insight into HugeTLB migration events.

While true, this is all way too vague to justify adding yet another
imprecise counter.

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2020-10-02 Thread Michal Hocko
On Fri 02-10-20 09:53:05, Rik van Riel wrote:
> On Fri, 2020-10-02 at 09:03 +0200, Michal Hocko wrote:
> > On Thu 01-10-20 18:18:10, Sebastiaan Meijer wrote:
> > > (Apologies for messing up the mailing list thread, Gmail had fooled
> > > me into
> > > believing that it properly picked up the thread)
> > > 
> > > On Thu, 1 Oct 2020 at 14:30, Michal Hocko  wrote:
> > > > On Wed 30-09-20 21:27:12, Sebastiaan Meijer wrote:
> > > > > > yes it shows the bottleneck but it is quite artificial. Read
> > > > > > data is
> > > > > > usually processed and/or written back and that changes the
> > > > > > picture a
> > > > > > lot.
> > > > > Apologies for reviving an ancient thread (and apologies in
> > > > > advance for my lack
> > > > > of knowledge on how mailing lists work), but I'd like to offer
> > > > > up another
> > > > > reason why merging this might be a good idea.
> > > > > 
> > > > > From what I understand, zswap runs its compression on the same
> > > > > kswapd thread,
> > > > > limiting it to a single thread for compression. Given enough
> > > > > processing power,
> > > > > zswap can get great throughput using heavier compression
> > > > > algorithms like zstd,
> > > > > but this is currently greatly limited by the lack of threading.
> > > > 
> > > > Isn't this a problem of the zswap implementation rather than
> > > > general
> > > > kswapd reclaim? Why zswap doesn't do the same as normal swap out
> > > > in a
> > > > context outside of the reclaim?
> 
> On systems with lots of very fast IO devices, we have
> also seen kswapd take 100% CPU time without any zswap
> in use.

Do you have more details? Does the saturated kswapd lead to premature
direct reclaim? What is the saturated number of reclaimed pages per unit
of time? Have you tried to play with this to see whether an additional
worker would help?
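
For the record, the kind of numbers I am asking about can be collected
from /proc/vmstat - a quick sketch (counter names as exported by recent
kernels):

/* sample reclaim counters periodically; the interesting signal is
 * pgscan_direct/allocstall* growing while kswapd is already at 100% */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void dump_reclaim_counters(void)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "pgscan_kswapd", 13) ||
		    !strncmp(line, "pgscan_direct", 13) ||
		    !strncmp(line, "pgsteal_kswapd", 14) ||
		    !strncmp(line, "pgsteal_direct", 14) ||
		    !strncmp(line, "allocstall", 10))
			fputs(line, stdout);
	}
	fclose(f);
}

int main(void)
{
	for (;;) {
		dump_reclaim_counters();
		puts("---");
		sleep(5);
	}
}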

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1 5/5] mm/memory_hotplug: update comment regarding zone shuffling

2020-10-02 Thread Michal Hocko
On Mon 28-09-20 20:21:10, David Hildenbrand wrote:
> As we no longer shuffle via generic_online_page() and when undoing
> isolation, we can simplify the comment.
> 
> We now effectively shuffle only once (properly) when onlining new
> memory.
> 
> Cc: Andrew Morton 
> Cc: Alexander Duyck 
> Cc: Mel Gorman 
> Cc: Michal Hocko 
> Cc: Dave Hansen 
> Cc: Vlastimil Babka 
> Cc: Wei Yang 
> Cc: Oscar Salvador 
> Cc: Mike Rapoport 
> Signed-off-by: David Hildenbrand 

Acked-by: Michal Hocko 

> ---
>  mm/memory_hotplug.c | 11 ---
>  1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 9db80ee29caa..c589bd8801bb 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -859,13 +859,10 @@ int __ref online_pages(unsigned long pfn, unsigned long 
> nr_pages,
>   undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
>  
>   /*
> -  * When exposing larger, physically contiguous memory areas to the
> -  * buddy, shuffling in the buddy (when freeing onlined pages, putting
> -  * them either to the head or the tail of the freelist) is only helpful
> -  * for maintaining the shuffle, but not for creating the initial
> -  * shuffle. Shuffle the whole zone to make sure the just onlined pages
> -  * are properly distributed across the whole freelist. Make sure to
> -  * shuffle once pageblocks are no longer isolated.
> +  * Freshly onlined pages aren't shuffled (e.g., all pages are placed to
> +  * the tail of the freelist when undoing isolation). Shuffle the whole
> +  * zone to make sure the just onlined pages are properly distributed
> +  * across the whole freelist - to create an initial shuffle.
>*/
>   shuffle_zone(zone);
>  
> -- 
> 2.26.2

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1 4/5] mm/page_alloc: place pages to tail in __free_pages_core()

2020-10-02 Thread Michal Hocko
On Mon 28-09-20 20:21:09, David Hildenbrand wrote:
> __free_pages_core() is used when exposing fresh memory to the buddy
> during system boot and when onlining memory in generic_online_page().
> 
> generic_online_page() is used in two cases:
> 
> 1. Direct memory onlining in online_pages().
> 2. Deferred memory onlining in memory-ballooning-like mechanisms (HyperV
>balloon and virtio-mem), when parts of a section are kept
>fake-offline to be fake-onlined later on.
> 
> In 1, we already place pages to the tail of the freelist. Pages will be
> freed to MIGRATE_ISOLATE lists first and moved to the tail of the freelists
> via undo_isolate_page_range().
> 
> In 2, we currently don't implement a proper rule. In case of virtio-mem,
> where we currently always online MAX_ORDER - 1 pages, the pages will be
> placed to the HEAD of the freelist - undesirable. While the hyper-v
> balloon calls generic_online_page() with single pages, usually it will
> call it on successive single pages in a larger block.
> 
> The pages are fresh, so place them to the tail of the freelists and avoid
> the PCP. In __free_pages_core(), remove the now superfluous call to
> set_page_refcounted() and add a comment regarding page initialization and
> the refcount.
> 
> Note: In 2. we currently don't shuffle. If ever relevant (page shuffling
> is usually of limited use in virtualized environments), we might want to
> shuffle after a sequence of generic_online_page() calls in the
> relevant callers.

It took some time to get through all the freeing paths with their subtle
differences, but this looks reasonable. You mention that this
influences the boot-time free memory ordering as well, but only very
briefly. I do not expect this to make a huge difference, but who knows.
It makes some sense to add pages in the order they show up in physical
address space.

> Reviewed-by: Vlastimil Babka 
> Reviewed-by: Oscar Salvador 
> Cc: Andrew Morton 
> Cc: Alexander Duyck 
> Cc: Mel Gorman 
> Cc: Michal Hocko 
> Cc: Dave Hansen 
> Cc: Vlastimil Babka 
> Cc: Wei Yang 
> Cc: Oscar Salvador 
> Cc: Mike Rapoport 
> Cc: "K. Y. Srinivasan" 
> Cc: Haiyang Zhang 
> Cc: Stephen Hemminger 
> Cc: Wei Liu 
> Signed-off-by: David Hildenbrand 

That being said I do not see any fundamental problems.
Acked-by: Michal Hocko 

> ---
>  mm/page_alloc.c | 37 -
>  1 file changed, 24 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d5a5f528b8ca..8a2134fe9947 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -270,7 +270,8 @@ bool pm_suspended_storage(void)
>  unsigned int pageblock_order __read_mostly;
>  #endif
>  
> -static void __free_pages_ok(struct page *page, unsigned int order);
> +static void __free_pages_ok(struct page *page, unsigned int order,
> + fop_t fop_flags);
>  
>  /*
>   * results with 256, 32 in the lowmem_reserve sysctl:
> @@ -682,7 +683,7 @@ static void bad_page(struct page *page, const char 
> *reason)
>  void free_compound_page(struct page *page)
>  {
>   mem_cgroup_uncharge(page);
> - __free_pages_ok(page, compound_order(page));
> + __free_pages_ok(page, compound_order(page), FOP_NONE);
>  }
>  
>  void prep_compound_page(struct page *page, unsigned int order)
> @@ -1419,17 +1420,15 @@ static void free_pcppages_bulk(struct zone *zone, int 
> count,
>   spin_unlock(&zone->lock);
>  }
>  
> -static void free_one_page(struct zone *zone,
> - struct page *page, unsigned long pfn,
> - unsigned int order,
> - int migratetype)
> +static void free_one_page(struct zone *zone, struct page *page, unsigned 
> long pfn,
> +   unsigned int order, int migratetype, fop_t fop_flags)
>  {
>   spin_lock(&zone->lock);
>   if (unlikely(has_isolate_pageblock(zone) ||
>   is_migrate_isolate(migratetype))) {
>   migratetype = get_pfnblock_migratetype(page, pfn);
>   }
> - __free_one_page(page, pfn, zone, order, migratetype, FOP_NONE);
> + __free_one_page(page, pfn, zone, order, migratetype, fop_flags);
>   spin_unlock(&zone->lock);
>  }
>  
> @@ -1507,7 +1506,8 @@ void __meminit reserve_bootmem_region(phys_addr_t 
> start, phys_addr_t end)
>   }
>  }
>  
> -static void __free_pages_ok(struct page *page, unsigned int order)
> +static void __free_pages_ok(struct page *page, unsigned int order,
> + fop_t fop_flags)
>  {
>   unsigned long flags;
>   int migratetype;
> @@ -1519,7 +1519,8 @@ static void __free_pages_ok(struct p

Re: [PATCH v1 3/5] mm/page_alloc: always move pages to the tail of the freelist in unset_migratetype_isolate()

2020-10-02 Thread Michal Hocko
On Mon 28-09-20 20:21:08, David Hildenbrand wrote:
> Page isolation doesn't actually touch the pages, it simply isolates
> pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
> 
> We already place pages to the tail of the freelists when undoing
> isolation via __putback_isolated_page(), let's do it in any case
> (e.g., if order <= pageblock_order) and document the behavior.
> 
> Add a "to_tail" parameter to move_freepages_block() but introduce a
> a new move_to_free_list_tail() - similar to add_to_free_list_tail().
> 
> This change results in all pages getting onlined via online_pages() to
> be placed to the tail of the freelist.

Is there anything preventing to do this unconditionally? Or in other
words is any of the existing callers of move_freepages_block benefiting
from adding to the head?

> Reviewed-by: Oscar Salvador 
> Cc: Andrew Morton 
> Cc: Alexander Duyck 
> Cc: Mel Gorman 
> Cc: Michal Hocko 
> Cc: Dave Hansen 
> Cc: Vlastimil Babka 
> Cc: Wei Yang 
> Cc: Oscar Salvador 
> Cc: Mike Rapoport 
> Cc: Scott Cheloha 
> Cc: Michael Ellerman 
> Signed-off-by: David Hildenbrand 
> ---
>  include/linux/page-isolation.h |  4 ++--
>  mm/page_alloc.c| 35 +++---
>  mm/page_isolation.c| 12 +---
>  3 files changed, 35 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index 572458016331..3eca9b3c5305 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -36,8 +36,8 @@ static inline bool is_migrate_isolate(int migratetype)
>  struct page *has_unmovable_pages(struct zone *zone, struct page *page,
>int migratetype, int flags);
>  void set_pageblock_migratetype(struct page *page, int migratetype);
> -int move_freepages_block(struct zone *zone, struct page *page,
> - int migratetype, int *num_movable);
> +int move_freepages_block(struct zone *zone, struct page *page, int 
> migratetype,
> +  bool to_tail, int *num_movable);
>  
>  /*
>   * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9e3ed4a6f69a..d5a5f528b8ca 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -905,6 +905,15 @@ static inline void move_to_free_list(struct page *page, 
> struct zone *zone,
>   list_move(&page->lru, &area->free_list[migratetype]);
>  }
>  
> +/* Used for pages which are on another list */
> +static inline void move_to_free_list_tail(struct page *page, struct zone 
> *zone,
> +   unsigned int order, int migratetype)
> +{
> + struct free_area *area = &zone->free_area[order];
> +
> + list_move_tail(&page->lru, &area->free_list[migratetype]);
> +}
> +
>  static inline void del_page_from_free_list(struct page *page, struct zone 
> *zone,
>  unsigned int order)
>  {
> @@ -2338,9 +2347,9 @@ static inline struct page 
> *__rmqueue_cma_fallback(struct zone *zone,
>   * Note that start_page and end_pages are not aligned on a pageblock
>   * boundary. If alignment is required, use move_freepages_block()
>   */
> -static int move_freepages(struct zone *zone,
> -   struct page *start_page, struct page *end_page,
> -   int migratetype, int *num_movable)
> +static int move_freepages(struct zone *zone, struct page *start_page,
> +   struct page *end_page, int migratetype,
> +   bool to_tail, int *num_movable)
>  {
>   struct page *page;
>   unsigned int order;
> @@ -2371,7 +2380,10 @@ static int move_freepages(struct zone *zone,
>   VM_BUG_ON_PAGE(page_zone(page) != zone, page);
>  
>   order = page_order(page);
> - move_to_free_list(page, zone, order, migratetype);
> + if (to_tail)
> + move_to_free_list_tail(page, zone, order, migratetype);
> + else
> + move_to_free_list(page, zone, order, migratetype);
>   page += 1 << order;
>   pages_moved += 1 << order;
>   }
> @@ -2379,8 +2391,8 @@ static int move_freepages(struct zone *zone,
>   return pages_moved;
>  }
>  
> -int move_freepages_block(struct zone *zone, struct page *page,
> - int migratetype, int *num_movable)
> +int move_freepages_block(struct zone *zone, struct page *page, int 
> migratetype,
> +  bool to_tail, int *num_movable)
>  {
>   unsigned long start_pfn

Re: [PATCH v1 2/5] mm/page_alloc: place pages to tail in __putback_isolated_page()

2020-10-02 Thread Michal Hocko
On Mon 28-09-20 20:21:07, David Hildenbrand wrote:
> __putback_isolated_page() already documents that pages will be placed to
> the tail of the freelist - this is, however, not the case for
> "order >= MAX_ORDER - 2" (see buddy_merge_likely()) - which should be
> the case for all existing users.
> 
> This change affects two users:
> - free page reporting
> - page isolation, when undoing the isolation (including memory onlining).
> 
> This behavior is desirable for pages that haven't really been touched
> lately, so exactly the two users that don't actually read/write page
> content, but rather move untouched pages.
> 
> The new behavior is especially desirable for memory onlining, where we
> allow allocation of newly onlined pages via undo_isolate_page_range()
> in online_pages(). Right now, we always place them to the head of the
> free list, resulting in undesirable behavior: Assume we add
> individual memory chunks via add_memory() and online them right away to
> the NORMAL zone. We create a dependency chain of unmovable allocations
> e.g., via the memmap. The memmap of the next chunk will be placed onto
> previous chunks - if the last block cannot get offlined+removed, all
> dependent ones cannot get offlined+removed. While this can already be
> observed with individual DIMMs, it's more of an issue for virtio-mem
> (and I suspect also ppc DLPAR).
> 
> Document that this should only be used for optimizations, and no code
> should rely on this for correctness (if the order of freepage lists
> ever changes).
> 
> We won't care about page shuffling: memory onlining already properly
> shuffles after onlining. free page reporting doesn't care about
> physically contiguous ranges, and there are already cases where page
> isolation will simply move (physically close) free pages to (currently)
> the head of the freelists via move_freepages_block() instead of
> shuffling. If this becomes ever relevant, we should shuffle the whole
> zone when undoing isolation of larger ranges, and after
> free_contig_range().
> 
> Reviewed-by: Alexander Duyck 
> Reviewed-by: Oscar Salvador 
> Cc: Andrew Morton 
> Cc: Alexander Duyck 
> Cc: Mel Gorman 
> Cc: Michal Hocko 
> Cc: Dave Hansen 
> Cc: Vlastimil Babka 
> Cc: Wei Yang 
> Cc: Oscar Salvador 
> Cc: Mike Rapoport 
> Cc: Scott Cheloha 
> Cc: Michael Ellerman 
> Signed-off-by: David Hildenbrand 

Acked-by: Michal Hocko 

> ---
>  mm/page_alloc.c | 18 --
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index daab90e960fe..9e3ed4a6f69a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -89,6 +89,18 @@ typedef int __bitwise fop_t;
>   */
>  #define FOP_SKIP_REPORT_NOTIFY   ((__force fop_t)BIT(0))
>  
> +/*
> + * Place the (possibly merged) page to the tail of the freelist. Will ignore
> + * page shuffling (relevant code - e.g., memory onlining - is expected to
> + * shuffle the whole zone).
> + *
> + * Note: No code should rely onto this flag for correctness - it's purely
> + *   to allow for optimizations when handing back either fresh pages
> + *   (memory onlining) or untouched pages (page isolation, free page
> + *   reporting).
> + */
> +#define FOP_TO_TAIL  ((__force fop_t)BIT(1))
> +
>  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>  static DEFINE_MUTEX(pcp_batch_high_lock);
>  #define MIN_PERCPU_PAGELIST_FRACTION (8)
> @@ -1038,7 +1050,9 @@ static inline void __free_one_page(struct page *page, 
> unsigned long pfn,
>  done_merging:
>   set_page_order(page, order);
>  
> - if (is_shuffle_order(order))
> + if (fop_flags & FOP_TO_TAIL)
> + to_tail = true;
> + else if (is_shuffle_order(order))
>   to_tail = shuffle_pick_tail();
>   else
>   to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
> @@ -3300,7 +3314,7 @@ void __putback_isolated_page(struct page *page, 
> unsigned int order, int mt)
>  
>   /* Return isolated page to tail of freelist. */
>   __free_one_page(page, page_to_pfn(page), zone, order, mt,
> - FOP_SKIP_REPORT_NOTIFY);
> + FOP_SKIP_REPORT_NOTIFY | FOP_TO_TAIL);
>  }
>  
>  /*
> -- 
> 2.26.2

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v1 1/5] mm/page_alloc: convert "report" flag of __free_one_page() to a proper flag

2020-10-02 Thread Michal Hocko
On Mon 28-09-20 20:21:06, David Hildenbrand wrote:
> Let's prepare for additional flags and avoid long parameter lists of bools.
> Follow-up patches will also make use of the flags in __free_pages_ok(),
> however, I wasn't able to come up with a better name for the type - should
> be good enough for internal purposes.
> 
> Reviewed-by: Alexander Duyck 
> Reviewed-by: Vlastimil Babka 
> Reviewed-by: Oscar Salvador 
> Cc: Andrew Morton 
> Cc: Alexander Duyck 
> Cc: Mel Gorman 
> Cc: Michal Hocko 
> Cc: Dave Hansen 
> Cc: Vlastimil Babka 
> Cc: Wei Yang 
> Cc: Oscar Salvador 
> Cc: Mike Rapoport 
> Signed-off-by: David Hildenbrand 

Hopefully this will not wreck the generated code. But considering we
would need another parameter there is not much choice left.

Acked-by: Michal Hocko 

> ---
>  mm/page_alloc.c | 28 
>  1 file changed, 20 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index df90e3654f97..daab90e960fe 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -77,6 +77,18 @@
>  #include "shuffle.h"
>  #include "page_reporting.h"
>  
> +/* Free One Page flags: for internal, non-pcp variants of free_pages(). */
> +typedef int __bitwise fop_t;
> +
> +/* No special request */
> +#define FOP_NONE ((__force fop_t)0)
> +
> +/*
> + * Skip free page reporting notification for the (possibly merged) page. 
> (will
> + * *not* mark the page reported, only skip the notification).
> + */
> +#define FOP_SKIP_REPORT_NOTIFY   ((__force fop_t)BIT(0))
> +
>  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>  static DEFINE_MUTEX(pcp_batch_high_lock);
>  #define MIN_PERCPU_PAGELIST_FRACTION (8)
> @@ -948,10 +960,9 @@ buddy_merge_likely(unsigned long pfn, unsigned long 
> buddy_pfn,
>   * -- nyc
>   */
>  
> -static inline void __free_one_page(struct page *page,
> - unsigned long pfn,
> - struct zone *zone, unsigned int order,
> - int migratetype, bool report)
> +static inline void __free_one_page(struct page *page, unsigned long pfn,
> +struct zone *zone, unsigned int order,
> +int migratetype, fop_t fop_flags)
>  {
>   struct capture_control *capc = task_capc(zone);
>   unsigned long buddy_pfn;
> @@ -1038,7 +1049,7 @@ static inline void __free_one_page(struct page *page,
>   add_to_free_list(page, zone, order, migratetype);
>  
>   /* Notify page reporting subsystem of freed page */
> - if (report)
> + if (!(fop_flags & FOP_SKIP_REPORT_NOTIFY))
>   page_reporting_notify_free(order);
>  }
>  
> @@ -1379,7 +1390,7 @@ static void free_pcppages_bulk(struct zone *zone, int 
> count,
>   if (unlikely(isolated_pageblocks))
>   mt = get_pageblock_migratetype(page);
>  
> - __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
> + __free_one_page(page, page_to_pfn(page), zone, 0, mt, FOP_NONE);
>   trace_mm_page_pcpu_drain(page, 0, mt);
>   }
>   spin_unlock(&zone->lock);
> @@ -1395,7 +1406,7 @@ static void free_one_page(struct zone *zone,
>   is_migrate_isolate(migratetype))) {
>   migratetype = get_pfnblock_migratetype(page, pfn);
>   }
> - __free_one_page(page, pfn, zone, order, migratetype, true);
> + __free_one_page(page, pfn, zone, order, migratetype, FOP_NONE);
>   spin_unlock(&zone->lock);
>  }
>  
> @@ -3288,7 +3299,8 @@ void __putback_isolated_page(struct page *page, 
> unsigned int order, int mt)
>   lockdep_assert_held(&zone->lock);
>  
>   /* Return isolated page to tail of freelist. */
> - __free_one_page(page, page_to_pfn(page), zone, order, mt, false);
> + __free_one_page(page, page_to_pfn(page), zone, order, mt,
> + FOP_SKIP_REPORT_NOTIFY);
>  }
>  
>  /*
> -- 
> 2.26.2

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/util.c: Add error logs for commitment overflow

2020-10-02 Thread Michal Hocko
On Fri 02-10-20 17:27:41, Pintu Kumar wrote:
> The headless embedded devices often come with very limited amount
> of RAM such as: 256MB or even lesser.
> These types of systems often rely on a command line interface which can
> execute system commands in the background using the fork/exec combination.
> There could even be many child tasks invoked internally to handle multiple
> requests.
> In this scenario, if the parent task keeps committing large amount of
> memory, there are chances that this commitment can easily overflow the
> total RAM available in the system. Now if the parent process invokes fork
> or system commands (using system() call) and the commitment ratio is at
> 50%, the request fails with the following, even though there are large
> amount of free memory available in the system:
> fork failed: Cannot allocate memory
> 
> If there are too many 3rd party tasks calling fork, it becomes difficult to
> identify exactly which parent process is overcommitting memory.
> Since free memory is also available, this "Cannot allocate memory" from
> fork confuses the application developer.
> 
> Thus, I found that this simple print message (even once) is helping in
> quickly identifying the culprit.
> 
> This is the output we can see on a 256MB system and with a simple malloc
> and fork program.
> 
> [root@ ~]# cat /proc/meminfo
> MemTotal: 249520 kB   ==> 243MB
> MemFree:  179100 kB
> 
> PPID   PID   USER   RSS    VSZ      STAT   ARGS
>  150   164   root   1440   250580   S      ./consume-and-fork.out 243
> 
> __vm_enough_memory: commitment overflow: ppid:150, pid:164, pages:62451
> fork failed[count:0]: Cannot allocate memory

While I understand that fork failing due to the overcommit heuristic is
non-intuitive, and I have seen people scratching their heads over this in
the past, I am not convinced this is the right approach to tackle the problem.
First off, referencing pids is not really going to help much if the process
is short-lived. Secondly, __vm_enough_memory is about any address space
allocation. Why would you be interested in the parent when doing mmap?

Last but not least, _once is questionable as well. The first instance
might happen early in the system lifetime and you will not learn
about future failures, so the overall point of debuggability is seriously
inhibited.

Maybe what you want is to report higher up the call chain (fork?) and
have it ratelimited rather than _once? Or maybe just try to live with
the confusing situation?
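
To illustrate the direction - a rough sketch only, the exact call site
(somewhere in copy_process() after the address space duplication) and
the wording are of course up for discussion:

	/* sketch, not a patch: report where the context is known and
	 * ratelimit instead of _once */
	retval = copy_mm(clone_flags, p);
	if (retval == -ENOMEM)
		pr_warn_ratelimited("%s[%d]: fork failed: vm commitment check\n",
				    current->comm, task_pid_nr(current));
	/* normal error unwinding continues as before */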

> Signed-off-by: Pintu Kumar 
> ---
>  mm/util.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/util.c b/mm/util.c
> index 5ef378a..9431ce7a 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -895,6 +895,9 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, 
> int cap_sys_admin)
>  error:
>   vm_unacct_memory(pages);
>  
> + pr_err_once("%s: commitment overflow: ppid:%d, pid:%d, pages:%ld\n",
> + __func__, current->parent->pid, current->pid, pages);
> +
>   return -ENOMEM;
>  }
>  
> -- 
> Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.,
> is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.

-- 
Michal Hocko
SUSE Labs


Re: [RFC V2] mm/vmstat: Add events for HugeTLB migration

2020-10-02 Thread Michal Hocko
   break;
> + }
>   nr_failed++;
>   break;
>   }
>   }
>   }
> - nr_failed += retry + thp_retry;
> + nr_failed += retry + thp_retry + hugetlb_retry;
>   nr_thp_failed += thp_retry;
> + nr_hugetlb_failed += hugetlb_retry;
>   rc = nr_failed;
>  out:
>   count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
> @@ -1524,8 +1557,11 @@ int migrate_pages(struct list_head *from, new_page_t 
> get_new_page,
>   count_vm_events(THP_MIGRATION_SUCCESS, nr_thp_succeeded);
>   count_vm_events(THP_MIGRATION_FAIL, nr_thp_failed);
>   count_vm_events(THP_MIGRATION_SPLIT, nr_thp_split);
> + count_vm_events(HUGETLB_MIGRATION_SUCCESS, nr_hugetlb_succeeded);
> + count_vm_events(HUGETLB_MIGRATION_FAIL, nr_hugetlb_failed);
>   trace_mm_migrate_pages(nr_succeeded, nr_failed, nr_thp_succeeded,
> -nr_thp_failed, nr_thp_split, mode, reason);
> +nr_thp_failed, nr_thp_split, 
> nr_hugetlb_succeeded,
> +nr_hugetlb_failed, mode, reason);
>  
>   if (!swapwrite)
>   current->flags &= ~PF_SWAPWRITE;
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 79e5cd0abd0e..12fd35ba135f 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1286,6 +1286,8 @@ const char * const vmstat_text[] = {
>   "thp_migration_success",
>   "thp_migration_fail",
>   "thp_migration_split",
> + "hugetlb_migration_success",
> + "hugetlb_migration_fail",
>  #endif
>  #ifdef CONFIG_COMPACTION
>   "compact_migrate_scanned",
> -- 
> 2.20.1
> 

-- 
Michal Hocko
SUSE Labs


Re: [v5] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged

2020-10-02 Thread Michal Hocko
On Wed 30-09-20 15:03:11, Mike Kravetz wrote:
> On 9/30/20 1:47 PM, Vijay Balakrishna wrote:
> > On 9/30/2020 11:20 AM, Mike Kravetz wrote:
> >> On 9/29/20 9:49 AM, Vijay Balakrishna wrote:
> >>
> >> Sorry for jumping in so late.  Should we use this as an opportunity to
> >> also fix up the messages logged when (re)calculating mfk?  They are wrong
> >> and could be quite confusing.
> > 
> > 
> > Sure.  Please share your thoughts regarding appropriate message.  Here is 
> > what I'm thinking
> > 
> > pr_warn("min_free_kbytes is not updated to %d because current value %d is 
> > preferred\n", new_min_free_kbytes, min_free_kbytes);
> > 
> > If above message is reasonable I can post a new revision (v6).
> 
> Just considering the below example,
> 
> >> For example consider the following sequence
> >> of operations and corresponding log messages produced.
> >>
> >> Freshly booted VM with 2 nodes and 8GB memory:
> >> # cat /proc/sys/vm/min_free_kbytes
> >> 90112
> >> # echo 9 > /proc/sys/vm/min_free_kbytes
> >> # cat /proc/sys/vm/min_free_kbytes
> >> 9
> >> # echo 0 > /sys/devices/system/node/node1/memory56/online
> >> [  135.099947] Offlined Pages 32768
> >> [  135.102362] min_free_kbytes is not updated to 11241 because user 
> >> defined value 9 is preferred
> 
> I am not sure if there is any value in printing the above line.  Especially
> in this context as it becomes obsolete with the printing of the next line.

The original intention was to make it explicit that auto-tuning is
influenced by the user provided configuration.

> >> [  135.109070] khugepaged: raising min_free_kbytes from 9 to 90112 to 
> >> help t
> >> ransparent hugepage allocations
> 
> IMO, the above line is the only one that should be output as a result of the
> recalculation.

Well, but khugepaged could be disabled and then the above might not get
printed. Sure the code could get reorganized and all that but is this
really worth that?

> I guess that brings up the question of 'should we continue to track the user
> defined value if we overwrite it?".  If we quit tracking it may help with the
> next message.

Auto-tuning combined with a user-provided override is quite tricky to get
right, especially in the case here. The admin has provided an override, but
has potential memory hotplug been considered? Or, to make it even more
complicated, consider that the hotplug happens without admin involvement
- e.g. memory gets hotremoved due to HW problems. Is the admin-provided
value still meaningful? To be honest I do not have a good answer and I
am not sure we should care all that much until we see practical
problems.
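
For context, the interplay the two messages are trying to describe is
roughly the following (heavily simplified sketch, not the actual code;
clamping and error handling omitted):

	/* sketch: the auto-tuned value only wins over an admin override
	 * when it is larger; user_min_free_kbytes stays at -1 unless the
	 * admin has written to the sysctl */
	int new_min_free_kbytes = int_sqrt(nr_free_buffer_pages() *
					   (PAGE_SIZE >> 10) * 16);

	if (new_min_free_kbytes > user_min_free_kbytes)
		min_free_kbytes = new_min_free_kbytes;
	else
		pr_warn("min_free_kbytes is not updated to %d because user defined value %d is preferred\n",
			new_min_free_kbytes, user_min_free_kbytes);

	/* khugepaged then applies its own minimum on top of this and
	 * prints the "raising min_free_kbytes ..." message */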
-- 
Michal Hocko
SUSE Labs


Re: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func.

2020-10-02 Thread Michal Hocko
On Fri 02-10-20 09:50:14, Mel Gorman wrote:
> On Fri, Oct 02, 2020 at 09:11:23AM +0200, Michal Hocko wrote:
> > On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote:
> > > > 
> > > > No, I meant going back to idea of new gfp flag, but adjust the 
> > > > implementation in
> > > > the allocator (different from what you posted in previous version) so 
> > > > that it
> > > > only looks at the flag after it tries to allocate from pcplist and 
> > > > finds out
> > > > it's empty. So, no inventing of new page allocator entry points or 
> > > > checks such
> > > > as the one you wrote above, but adding the new gfp flag in a way that 
> > > > it doesn't
> > > > affect existing fast paths.
> > > >
> > > OK. Now i see. Please have a look below at the patch, so we fully 
> > > understand
> > > each other. If that is something that is close to your view or not:
> > > 
> > > 
> > > t a/include/linux/gfp.h b/include/linux/gfp.h
> > > index c603237e006c..7e613560a502 100644
> > > --- a/include/linux/gfp.h
> > > +++ b/include/linux/gfp.h
> > > @@ -39,8 +39,9 @@ struct vm_area_struct;
> > >  #define ___GFP_HARDWALL0x10u
> > >  #define ___GFP_THISNODE0x20u
> > >  #define ___GFP_ACCOUNT 0x40u
> > > +#define ___GFP_NO_LOCKS0x80u
> > 
> > Even if a new gfp flag gains a sufficient traction and support I am
> > _strongly_ opposed against consuming another flag for that. Bit space is
> > limited. 
> 
> That is definitely true. I'm not happy with the GFP flag at all, the
> comment is at best a damage limiting move. It still would be better for
> a memory pool to be reserved and sized for critical allocations.

Completely agreed. The only existing usecase is so special cased that a
dedicated pool is not only easier to maintain but it should be also much
better tuned for the specific workload. Something not really feasible
with the allocator.

> > Besides that we certainly do not want to allow craziness like
> > __GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we?
> 
> That would deserve to be taken to a dumpster and set on fire. The flag
> combination could be checked in the allocator but the allocator path fast
> paths are bad enough already.

If a new allocation/gfp mode is absolutely necessary then I believe that
the most reasonable way forward would be
#define GFP_NO_LOCK ((__force gfp_t)0)

and explicitly document it as a final flag to use without any further
modifiers. Yeah, there are some that could potentially be combined with it
- e.g. zone specifiers, __GFP_ZERO and likely a few others. But support for
those can be added when there is an actual and reasonable demand.

I would also strongly argue against an implementation allowing to fully
consume the pcp free pages.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64

2020-10-02 Thread Michal Hocko
On Fri 02-10-20 09:50:02, David Hildenbrand wrote:
> >>> - huge page sizes controllable by the userspace?
> >>
> >> It might be good to allow advanced users to choose the page sizes, so they
> >> have better control of their applications.
> > 
> > Could you elaborate more? Those advanced users can use hugetlb, right?
> > They get a very good control over page size and pool preallocation etc.
> > So they can get what they need - assuming there is enough memory.
> > 
> 
> I am still not convinced that 1G THP (TGP :) ) are really what we want
> to support. I can understand that there are some use cases that might
> benefit from it, especially:

Well, I would say that internal support for larger huge pages (e.g. 1GB)
that can transparently split under memory pressure is a useful
functionality. I cannot really judge how complex that would be
considering that 2MB THP has turned out to be quite a pain, but the
situation has settled over time. Maybe our current code base is much
better prepared for that.

Exposing that interface to userspace is a different story of course.
I do agree that we likely do not want to be very explicit about that.
E.g. an interface for address space defragmentation without any more
specifics sounds like a useful feature to me. It will be up to the
kernel to decide which huge pages to use.
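
Purely as an illustration of what I mean by "without any more specifics"
(the flag name and value below are made up, this is not a proposal for a
concrete ABI):

#include <sys/mman.h>

#ifndef MADV_COLLAPSE_RANGE
#define MADV_COLLAPSE_RANGE	26	/* hypothetical, for the sketch only */
#endif

/* userspace only asks for the populated range to be defragmented into
 * huge mappings; the kernel picks the page sizes and how hard to try */
static int defrag_range(void *start, size_t len)
{
	return madvise(start, len, MADV_COLLAPSE_RANGE);
}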
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64

2020-10-02 Thread Michal Hocko
On Thu 01-10-20 11:14:14, Zi Yan wrote:
> On 30 Sep 2020, at 7:55, Michal Hocko wrote:
> 
> > On Mon 28-09-20 13:53:58, Zi Yan wrote:
> >> From: Zi Yan 
> >>
> >> Hi all,
> >>
> >> This patchset adds support for 1GB PUD THP on x86_64. It is on top of
> >> v5.9-rc5-mmots-2020-09-18-21-23. It is also available at:
> >> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.9-rc5-mmots-2020-09-18-21-23
> >>
> >> Other than PUD THP, we had some discussion on generating THPs and 
> >> contiguous
> >> physical memory via a synchronous system call [0]. I am planning to send 
> >> out a
> >> separate patchset on it later, since I feel that it can be done 
> >> independently of
> >> PUD THP support.
> >
> > While the technical challenges for the kernel implementation can be
> > discussed before the user API is decided I believe we cannot simply add
> > something now and then decide about a proper interface. I have raised
> > few basic questions we should should find answers for before the any
> > interface is added. Let me copy them here for easier reference
> Sure. Thank you for doing this.
> 
> For this new interface, I think it should generate THPs out of populated
> memory regions synchronously. It would be a complement to khugepaged, which
> generates THPs asynchronously in the background.
> 
> > - THP allocation time - #PF and/or madvise context
> I am not sure this is relevant, since the new interface is supposed to
> operate on populated memory regions. For THP allocation, madvise and
> the options from /sys/kernel/mm/transparent_hugepage/defrag should give
> enough choices to users.

OK, so no #PF, this makes things easier.

> > - lazy/sync instantiation
> 
> I would say the new interface only does sync instantiation. madvise has
> provided the lazy instantiation option by adding MADV_HUGEPAGE to populated
> memory regions and letting khugepaged generate THPs from them.

OK

> > - huge page sizes controllable by the userspace?
> 
> It might be good to allow advanced users to choose the page sizes, so they
> have better control of their applications.

Could you elaborate more? Those advanced users can use hugetlb, right?
They get very good control over page size, pool preallocation etc.
So they can get what they need - assuming there is enough memory.

> For normal users, we can provide
> best-effort service. Different options can be provided for these two cases.

Do we really need two sync mechanisms to compact physical memory? This
adds API complexity because it has to cover all possible huge page
sizes, and that can be a large set. We already have that choice in the
hugetlb mmap interface, but there it is needed to cover all existing setups.
I would argue this doesn't make the API particularly easy to use.

> The new interface might want to inform user how many THPs are generated
> after the call for them to decide what to do with the memory region.

Why would that be useful? /proc/<pid>/smaps should give a good picture
already, right?
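
E.g. something as trivial as summing up AnonHugePages over the mappings
tells how much of the address space ended up THP backed - a sketch (huge
pages of other sizes would need their own counters):

#include <stdio.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	unsigned long kb, total = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/smaps",
		 argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "AnonHugePages: %lu kB", &kb) == 1)
			total += kb;
	fclose(f);
	printf("AnonHugePages: %lu kB in total\n", total);
	return 0;
}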

> > - aggressiveness - how hard to try
> 
> The new interface would try as hard as it can, since I assume users really
> want THPs when they use this interface.
> 
> > - internal fragmentation - allow to create THPs on sparsely or unpopulated
> >   ranges
> 
> The new interface would only operate on populated memory regions. MAP_POPULATE
> like option can be added if necessary.

OK, so initially you do not want to populate more memory. How do you
envision a future extension to provide such functionality? A different
API, or a modification of the existing one?

> > - do we need some sort of access control or privilege check as some THPs
> >   would be a really scarce (like those that require pre-reservation).
> 
> It seems too much to me. I suppose if we provide page size options to users
> when generating THPs, user apps could coordinate themselves. BTW, do we have
> access control for hugetlb pages? If yes, we could borrow their method.

We do not. Well, there is a hugetlb cgroup controller but I am not sure
this is the right method. The lack of hugetlb access control is a serious
shortcoming which has turned this interface into a "first class
citizens only" feature requiring very close coordination with an admin.
-- 
Michal Hocko
SUSE Labs


Re: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func.

2020-10-02 Thread Michal Hocko
On Thu 01-10-20 21:26:26, Uladzislau Rezki wrote:
> > 
> > No, I meant going back to idea of new gfp flag, but adjust the 
> > implementation in
> > the allocator (different from what you posted in previous version) so that 
> > it
> > only looks at the flag after it tries to allocate from pcplist and finds out
> > it's empty. So, no inventing of new page allocator entry points or checks 
> > such
> > as the one you wrote above, but adding the new gfp flag in a way that it 
> > doesn't
> > affect existing fast paths.
> >
> OK. Now i see. Please have a look below at the patch, so we fully understand
> each other. If that is something that is close to your view or not:
> 
> 
> t a/include/linux/gfp.h b/include/linux/gfp.h
> index c603237e006c..7e613560a502 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -39,8 +39,9 @@ struct vm_area_struct;
>  #define ___GFP_HARDWALL0x10u
>  #define ___GFP_THISNODE0x20u
>  #define ___GFP_ACCOUNT 0x40u
> +#define ___GFP_NO_LOCKS0x80u

Even if a new gfp flag gains a sufficient traction and support I am
_strongly_ opposed against consuming another flag for that. Bit space is
limited.  Besides that we certainly do not want to allow craziness like
__GFP_NO_LOCK | __GFP_RECLAIM (and similar), do we?
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2020-10-02 Thread Michal Hocko
On Thu 01-10-20 18:18:10, Sebastiaan Meijer wrote:
> (Apologies for messing up the mailing list thread, Gmail had fooled me into
> believing that it properly picked up the thread)
> 
> On Thu, 1 Oct 2020 at 14:30, Michal Hocko  wrote:
> >
> > On Wed 30-09-20 21:27:12, Sebastiaan Meijer wrote:
> > > > yes it shows the bottleneck but it is quite artificial. Read data is
> > > > usually processed and/or written back and that changes the picture a
> > > > lot.
> > > Apologies for reviving an ancient thread (and apologies in advance for my 
> > > lack
> > > of knowledge on how mailing lists work), but I'd like to offer up another
> > > reason why merging this might be a good idea.
> > >
> > > From what I understand, zswap runs its compression on the same kswapd 
> > > thread,
> > > limiting it to a single thread for compression. Given enough processing 
> > > power,
> > > zswap can get great throughput using heavier compression algorithms like 
> > > zstd,
> > > but this is currently greatly limited by the lack of threading.
> >
> > Isn't this a problem of the zswap implementation rather than general
> > kswapd reclaim? Why zswap doesn't do the same as normal swap out in a
> > context outside of the reclaim?
> 
> I wouldn't be able to tell you, the documentation on zswap is fairly limited
> from what I've found.

I would recommend talking to the zswap maintainers: describe your
problem and suggest offloading the heavy lifting into a separate
context, like the standard swap IO does. You are not the only one to hit
this problem
http://lkml.kernel.org/r/CALvZod43VXKZ3StaGXK_EZG_fKcW3v3=ceyowfwp4hnjpoo...@mail.gmail.com.
Ccing Shakeel on such an email might help to gather more usecases.

> > My recollection of the particular patch is dim but I do remember it
> > tried to add more kswapd threads, which would just paper over the problem
> > you are seeing rather than solve it.
> 
> Yeah, that's exactly what it does, just adding more kswap threads.

Which is far from trivial, because it has side effects on the overall
system balance. See my reply to the original request and the follow-up
discussion. I am not saying this is impossible to achieve and tune
properly, but it is certainly non-trivial and it would require a really
strong justification.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH tip/core/rcu 14/15] rcu/tree: Allocate a page when caller is preemptible

2020-10-02 Thread Michal Hocko
On Thu 01-10-20 09:27:09, Paul E. McKenney wrote:
[...]
> commit ea5c19d21233b5e8d3d06c0d4ecd6be9f2829dc3
> Author: Paul E. McKenney 
> Date:   Thu Oct 1 09:24:40 2020 -0700
> 
> kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument kvfree_rcu()
> 
> This commit applies the __GFP_NOMEMALLOC gfp flag to memory allocations
> carried out by the single-argument variant of kvfree_rcu(), thus preventing
> this can-sleep code path from dipping into the emergency reserves.
>     
> Suggested-by: Michal Hocko 
> Signed-off-by: Paul E. McKenney 

LGTM. At least for this one I feel competent to give you
Acked-by: Michal Hocko 

> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 242f0f0..6132452 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3364,7 +3364,8 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
>  {
>   struct kvfree_rcu_bulk_data *bnode;
>   bool can_alloc_page = preemptible();
> - gfp_t gfp = (can_sleep ? GFP_KERNEL | __GFP_RETRY_MAYFAIL : GFP_ATOMIC) 
> | __GFP_NOWARN;
> + gfp_t gfp = (can_sleep ? GFP_KERNEL | __GFP_RETRY_MAYFAIL | 
> __GFP_NOMEMALLOC
> +: GFP_ATOMIC) | __GFP_NOWARN;
>   int idx;
>  
>   *krcp = krc_this_cpu_lock(flags);

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

2020-10-01 Thread Michal Hocko
On Wed 30-09-20 21:27:12, Sebastiaan Meijer wrote:
> > yes it shows the bottleneck but it is quite artificial. Read data is
> > usually processed and/or written back and that changes the picture a
> > lot.
> Apologies for reviving an ancient thread (and apologies in advance for my lack
> of knowledge on how mailing lists work), but I'd like to offer up another
> reason why merging this might be a good idea.
> 
> From what I understand, zswap runs its compression on the same kswapd thread,
> limiting it to a single thread for compression. Given enough processing power,
> zswap can get great throughput using heavier compression algorithms like zstd,
> but this is currently greatly limited by the lack of threading.

Isn't this a problem of the zswap implementation rather than general
kswapd reclaim? Why doesn't zswap do the same as normal swap out, in a
context outside of the reclaim?

My recollection of the particular patch is dim, but I do remember it
tried to add more kswapd threads, which would just paper over the problem
you are seeing rather than solve it.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH tip/core/rcu 14/15] rcu/tree: Allocate a page when caller is preemptible

2020-10-01 Thread Michal Hocko
On Wed 30-09-20 16:21:54, Paul E. McKenney wrote:
> On Wed, Sep 30, 2020 at 10:41:39AM +0200, Michal Hocko wrote:
> > On Tue 29-09-20 18:53:27, Paul E. McKenney wrote:
[...]
> > > No argument on it being confusing, and I hope that the added header
> > > comment helps.  But specifically, can_sleep==true is a promise by the
> > > caller to be schedulable and not to be holding any lock/mutex/whatever
> > > that might possibly be acquired by the memory allocator or by anything
> > > else that the memory allocator might invoke, to your point, including
> > > for but one example the reclaim logic.
> > > 
> > > The only way that can_sleep==true is if this function was invoked due
> > > to a call to single-argument kvfree_rcu(), which must be schedulable
> > > because its fallback is to invoke synchronize_rcu().
> > 
> > OK. I have to say that it is still not clear to me whether this call
> > path can be called from the memory reclaim context. If yes then you need
> > __GFP_NOMEMALLOC as well.
> 
> Right now the restriction is that single-argument (AKA can_sleep==true)
> kvfree_rcu() cannot be invoked from memory reclaim context.
> 
> But would adding __GFP_NOMEMALLOC to the can_sleep==true GFP_ flags
> allow us to remove this restriction?  If so, I will queue a separate
> patch making this change.  The improved ease of use would be well
> worth it, if I understand correctly (ha!!!).

It would be quite daring to claim it will be OK, but it will certainly be
less problematic. Adding the flag will not hurt in any case. As this is
a shared path that might be called from many contexts, I think it will
be safer to have it there. The justification is that it will prevent
consumption of memory reserves from MEMALLOC contexts.

> 
> > [...]
> > 
> > > > What is the point of calling kmalloc  for a PAGE_SIZE object? Wouldn't
> > > > using the page allocator directly be better?
> > > 
> > > Well, you guys gave me considerable heat about abusing internal allocator
> > > interfaces, and kmalloc() and kfree() seem to be about as non-internal
> > > as you can get and still be invoking the allocator.  ;-)
> > 
> > alloc_pages resp. __get_free_pages is a normal page allocator interface
> > to use for page size granular allocations. kmalloc is for more fine
> > grained allocations.
> 
> OK, in the short term, both work, but I have queued a separate patch
> making this change and recording the tradeoffs.  This is not yet a
> promise to push this patch, but it is a promise not to lose this part
> of the picture.  Please see below.

It doesn't matter all that much. Both allocators will work. It is just a
matter of using the optimal tool for the specific purpose.

> You mentioned alloc_pages().  I reverted to __get_free_pages(), but
> alloc_pages() of course looks nicer.  What are the tradeoffs between
> __get_free_pages() and alloc_pages()?

alloc_pages will return a struct page but you need a kernel pointer. That
is what __get_free_pages will give you (or you can call page_address
on the struct page directly).
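
In other words the two are interchangeable here - a sketch (gfp stands
for whatever mask the call site ends up using):

	/* struct page based, needs an explicit page_address() */
	struct page *page = alloc_pages(gfp, 0);
	void *buf = page ? page_address(page) : NULL;

	/* shortcut which hands back the kernel virtual address directly */
	void *buf2 = (void *)__get_free_page(gfp);

	/* freed with __free_pages(page, 0) resp. free_page((unsigned long)buf2) */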

>   Thanx, Paul
> 
> 
> 
> commit 490b638d7c241ac06cee168ccf8688bb8b872478
> Author: Paul E. McKenney 
> Date:   Wed Sep 30 16:16:39 2020 -0700
> 
> kvfree_rcu(): Switch from kmalloc/kfree to __get_free_page/free_page.
> 
> The advantages of using kmalloc() and kfree() are a possible small speedup
> on CONFIG_SLAB=y systems, avoiding the allocation-side cast, and use of
> more-familiar API members.  The advantages of using __get_free_page()
> and free_page() are a possible reduction in fragmentation and direct
> access to the buddy allocator.
> 
> To help settle the question as to which to use, this commit switches
> from kmalloc() and kfree() to __get_free_page() and free_page().
> 
> Suggested-by: Michal Hocko 
> Suggested-by: "Uladzislau Rezki (Sony)" 
> Signed-off-by: Paul E. McKenney 

Yes, looks good to me. I am not entirely sure about the fragmentation
argument. It really depends on the SL.B allocator internals. The same
applies to the potential speed up. I would even be surprised if the
SLAB allocator was faster on average, considering it has to use the page
allocator as well. So to me the primary motivation would be "use the right
tool for the purpose".

> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 2886e81..242f0f0 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3225,7 +3225,8 @@ static void kfree_rcu_work(struct work_struct *work)
>   bkvhead[i] 

Re: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func.

2020-09-30 Thread Michal Hocko
On Wed 30-09-20 13:03:29, Joel Fernandes wrote:
> On Wed, Sep 30, 2020 at 12:48 PM Michal Hocko  wrote:
> >
> > On Wed 30-09-20 11:25:17, Joel Fernandes wrote:
> > > On Fri, Sep 25, 2020 at 05:47:41PM +0200, Michal Hocko wrote:
> > > > On Fri 25-09-20 17:31:29, Uladzislau Rezki wrote:
> > > > > > > > >
> > > > > > > > > All good points!
> > > > > > > > >
> > > > > > > > > On the other hand, duplicating a portion of the allocator 
> > > > > > > > > functionality
> > > > > > > > > within RCU increases the amount of reserved memory, and 
> > > > > > > > > needlessly most
> > > > > > > > > of the time.
> > > > > > > > >
> > > > > > > >
> > > > > > > > But it's very similar to what mempools are for.
> > > > > > > >
> > > > > > > As for dynamic caching or mempools. It requires extra logic on 
> > > > > > > top of RCU
> > > > > > > to move things forward and it might be not efficient way. As a 
> > > > > > > side
> > > > > > > effect, maintaining of the bulk arrays in the separate worker 
> > > > > > > thread
> > > > > > > will introduce other drawbacks:
> > > > > >
> > > > > > This is true but it is also true that it is RCU to require this 
> > > > > > special
> > > > > > logic and we can expect that we might need to fine tune this logic
> > > > > > depending on the RCU usage. We definitely do not want to tune the
> > > > > > generic page allocator for a very specific usecase, do we?
> > > > > >
> > > > > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> > > > > to provide a memory service for contexts which are not allowed to
> > > > > sleep, RCU is part of them. Both flags used to provide such ability
> > > > > before but not anymore.
> > > > >
> > > > > Do you agree with it?
> > > >
> > > > Yes this sucks. But this is something that we likely really want to live
> > > > with. We have to explicitly _document_ that really atomic contexts in RT
> > > > cannot use the allocator. From the past discussions we've had this is
> > > > likely the most reasonable way forward because we do not really want to
> > > > encourage anybody to do something like that and there should be ways
> > > > around that. The same is btw. true also for !RT. The allocator is not
> > > > NMI safe and while we should be able to make it compatible I am not
> > > > convinced we really want to.
> > > >
> > > > Would something like this be helpful wrt documentation?
> > > >
> > > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > > index 67a0774e080b..9fcd47606493 100644
> > > > --- a/include/linux/gfp.h
> > > > +++ b/include/linux/gfp.h
> > > > @@ -238,7 +238,9 @@ struct vm_area_struct;
> > > >   * %__GFP_FOO flags as necessary.
> > > >   *
> > > >   * %GFP_ATOMIC users can not sleep and need the allocation to succeed. 
> > > > A lower
> > > > - * watermark is applied to allow access to "atomic reserves"
> > > > + * watermark is applied to allow access to "atomic reserves".
> > > > + * The current implementation doesn't support NMI and other 
> > > > non-preemptive context
> > > > + * (e.g. raw_spin_lock).
> > >
> > > I think documenting is useful.
> > >
> > > Could it be more explicit in what the issue is? Something like:
> > >
> > > * Even with GFP_ATOMIC, calls to the allocator can sleep on PREEMPT_RT
> > > systems. Therefore, the current low-level allocator implementation does 
> > > not
> > > support being called from special contexts that are atomic on RT - such as
> > > NMI and raw_spin_lock. Due to these constraints and considering calling 
> > > code
> > > usually has no control over the PREEMPT_RT configuration, callers of the
> > > allocator should avoid calling the allocator from these contexts even in
> > > non-RT systems.
> >
> > I do not mind documenting RT specific behavior but as mentioned in other
> > reply, this should likely go via RT tree for now. There is likely more
> > to clarify about atomicity for PREEMPT_RT.
> 
> I am sorry, I did not understand what you meant by something missing
> in Linus Tree. CONFIG_PREEMPT_RT is there.

OK, I was not aware we already have CONFIG_PREEMPT_RT in the tree. There is
still a lot from the RT patchset, including sleeping spin locks, that makes
a real difference. Or maybe I haven't checked properly.

> Could you clarify? Also the CONFIG_PREEMPT_RT is the only thing
> driving this requirement for GFP_ATOMIC right? Or are there non-RT
> reasons why GFP_ATOMIC allocation cannot be done from
> NMI/raw_spin_lock ?

I have already sent a clarification patch [1]. NMI is not supported
regardless of the preemption mode. Hope this clarifies it a bit.

[1] http://lkml.kernel.org/r/20200929123010.5137-1-mho...@kernel.org

-- 
Michal Hocko
SUSE Labs


Re: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func.

2020-09-30 Thread Michal Hocko
On Wed 30-09-20 11:25:17, Joel Fernandes wrote:
> On Fri, Sep 25, 2020 at 05:47:41PM +0200, Michal Hocko wrote:
> > On Fri 25-09-20 17:31:29, Uladzislau Rezki wrote:
> > > > > > > 
> > > > > > > All good points!
> > > > > > > 
> > > > > > > On the other hand, duplicating a portion of the allocator 
> > > > > > > functionality
> > > > > > > within RCU increases the amount of reserved memory, and 
> > > > > > > needlessly most
> > > > > > > of the time.
> > > > > > > 
> > > > > > 
> > > > > > But it's very similar to what mempools are for.
> > > > > > 
> > > > > As for dynamic caching or mempools. It requires extra logic on top of 
> > > > > RCU
> > > > > to move things forward and it might not be an efficient way. As a side
> > > > > effect, maintaining of the bulk arrays in the separate worker thread
> > > > > will introduce other drawbacks:
> > > > 
> > > > This is true but it is also true that it is RCU to require this special
> > > > logic and we can expect that we might need to fine tune this logic
> > > > depending on the RCU usage. We definitely do not want to tune the
> > > > generic page allocator for a very specific usecase, do we?
> > > > 
> > > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> > > to provide a memory service for contexts which are not allowed to
> > > sleep, RCU is part of them. Both flags used to provide such ability
> > > before but not anymore.
> > > 
> > > Do you agree with it?
> > 
> > Yes this sucks. But this is something that we likely really want to live
> > with. We have to explicitly _document_ that really atomic contexts in RT
> > cannot use the allocator. From the past discussions we've had this is
> > likely the most reasonable way forward because we do not really want to
> > encourage anybody to do something like that and there should be ways
> > around that. The same is btw. true also for !RT. The allocator is not
> > NMI safe and while we should be able to make it compatible I am not
> > convinced we really want to.
> > 
> > Would something like this be helpful wrt documentation?
> > 
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 67a0774e080b..9fcd47606493 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -238,7 +238,9 @@ struct vm_area_struct;
> >   * %__GFP_FOO flags as necessary.
> >   *
> >   * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A 
> > lower
> > - * watermark is applied to allow access to "atomic reserves"
> > + * watermark is applied to allow access to "atomic reserves".
> > + * The current implementation doesn't support NMI and other non-preemptive 
> > context
> > + * (e.g. raw_spin_lock).
> 
> I think documenting is useful.
> 
> Could it be more explicit in what the issue is? Something like:
> 
> * Even with GFP_ATOMIC, calls to the allocator can sleep on PREEMPT_RT
> systems. Therefore, the current low-level allocator implementation does not
> support being called from special contexts that are atomic on RT - such as
> NMI and raw_spin_lock. Due to these constraints and considering calling code
> usually has no control over the PREEMPT_RT configuration, callers of the
> allocator should avoid calling the allocator from these contexts even in
> non-RT systems.

I do not mind documenting RT specific behavior but as mentioned in other
reply, this should likely go via RT tree for now. There is likely more
to clarify about atomicity for PREEMPT_RT.
-- 
Michal Hocko
SUSE Labs


Re: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func.

2020-09-30 Thread Michal Hocko
On Wed 30-09-20 15:39:54, Uladzislau Rezki wrote:
> On Wed, Sep 30, 2020 at 02:44:13PM +0200, Michal Hocko wrote:
> > On Wed 30-09-20 14:35:35, Uladzislau Rezki wrote:
> > > On Wed, Sep 30, 2020 at 11:27:32AM +0200, Michal Hocko wrote:
> > > > On Tue 29-09-20 18:25:14, Uladzislau Rezki wrote:
> > > > > > > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. 
> > > > > > > inability
> > > > > > > to provide a memory service for contexts which are not allowed to
> > > > > > > sleep, RCU is part of them. Both flags used to provide such 
> > > > > > > ability
> > > > > > > before but not anymore.
> > > > > > > 
> > > > > > > Do you agree with it?
> > > > > > 
> > > > > > Yes this sucks. But this is something that we likely really want to 
> > > > > > live
> > > > > > with. We have to explicitly _document_ that really atomic contexts 
> > > > > > in RT
> > > > > > cannot use the allocator. From the past discussions we've had this 
> > > > > > is
> > > > > > likely the most reasonable way forward because we do not really 
> > > > > > want to
> > > > > > encourage anybody to do something like that and there should be ways
> > > > > > around that. The same is btw. true also for !RT. The allocator is 
> > > > > > not
> > > > > > NMI safe and while we should be able to make it compatible I am not
> > > > > > convinced we really want to.
> > > > > > 
> > > > > > Would something like this be helpful wrt documentation?
> > > > > > 
> > > > > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > > > > index 67a0774e080b..9fcd47606493 100644
> > > > > > --- a/include/linux/gfp.h
> > > > > > +++ b/include/linux/gfp.h
> > > > > > @@ -238,7 +238,9 @@ struct vm_area_struct;
> > > > > >   * %__GFP_FOO flags as necessary.
> > > > > >   *
> > > > > >   * %GFP_ATOMIC users can not sleep and need the allocation to 
> > > > > > succeed. A lower
> > > > > > - * watermark is applied to allow access to "atomic reserves"
> > > > > > + * watermark is applied to allow access to "atomic reserves".
> > > > > > + * The current implementation doesn't support NMI and other 
> > > > > > non-preemptive context
> > > > > > + * (e.g. raw_spin_lock).
> > > > > >   *
> > > > > >   * %GFP_KERNEL is typical for kernel-internal allocations. The 
> > > > > > caller requires
> > > > > >   * %ZONE_NORMAL or a lower zone for direct access but can direct 
> > > > > > reclaim.
> > > > > > 
> > > > > To me it is clear. But also above conflicting statement:
> > > > > 
> > > > > 
> > > > > %GFP_ATOMIC users can not sleep and need the allocation to succeed. A 
> > > > > %lower
> > > > > 
> > > > > 
> > > > > should be rephrased, IMHO.
> > > > 
> > > > Any suggestions? Or more specifics about which part is conflicting? It
> > > > tries to say that there is a higher demand to succeed even though the
> > > > context cannot sleep to take active measures to achieve that. So the
> > > > only way to achieve that is to break the watermarks to a certain degree
> > > > which makes them "higher class" than other allocations.
> > > > 
> > > Michal, i had only one concern about it. It says that %GFP_ATOMIC users
> > > can not sleep, i.e. callers know that they are in atomic context, thus no
> > > sleeping, but the choice they make will force them to sleep.
> > 
> > I am not sure I follow you here. Do you mean they will be forced to
> > sleep with PREEMPT_RT?
> > 
> Exactly :)

We can make that more specific once the RT patchset is merged. As of now
this is not yet the case in the Linus tree. I believe there will be more to
clarify about atomic contexts in the RT tree as they mean something else
than what people are used to.

-- 
Michal Hocko
SUSE Labs


Re: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func.

2020-09-30 Thread Michal Hocko
On Wed 30-09-20 14:35:35, Uladzislau Rezki wrote:
> On Wed, Sep 30, 2020 at 11:27:32AM +0200, Michal Hocko wrote:
> > On Tue 29-09-20 18:25:14, Uladzislau Rezki wrote:
> > > > > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> > > > > to provide a memory service for contexts which are not allowed to
> > > > > sleep, RCU is part of them. Both flags used to provide such ability
> > > > > before but not anymore.
> > > > > 
> > > > > Do you agree with it?
> > > > 
> > > > Yes this sucks. But this is something that we likely really want to live
> > > > with. We have to explicitly _document_ that really atomic contexts in RT
> > > > cannot use the allocator. From the past discussions we've had this is
> > > > likely the most reasonable way forward because we do not really want to
> > > > encourage anybody to do something like that and there should be ways
> > > > around that. The same is btw. true also for !RT. The allocator is not
> > > > NMI safe and while we should be able to make it compatible I am not
> > > > convinced we really want to.
> > > > 
> > > > Would something like this be helpful wrt documentation?
> > > > 
> > > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > > index 67a0774e080b..9fcd47606493 100644
> > > > --- a/include/linux/gfp.h
> > > > +++ b/include/linux/gfp.h
> > > > @@ -238,7 +238,9 @@ struct vm_area_struct;
> > > >   * %__GFP_FOO flags as necessary.
> > > >   *
> > > >   * %GFP_ATOMIC users can not sleep and need the allocation to succeed. 
> > > > A lower
> > > > - * watermark is applied to allow access to "atomic reserves"
> > > > + * watermark is applied to allow access to "atomic reserves".
> > > > + * The current implementation doesn't support NMI and other 
> > > > non-preemptive context
> > > > + * (e.g. raw_spin_lock).
> > > >   *
> > > >   * %GFP_KERNEL is typical for kernel-internal allocations. The caller 
> > > > requires
> > > >   * %ZONE_NORMAL or a lower zone for direct access but can direct 
> > > > reclaim.
> > > > 
> > > To me it is clear. But also above conflicting statement:
> > > 
> > > 
> > > %GFP_ATOMIC users can not sleep and need the allocation to succeed. A 
> > > %lower
> > > 
> > > 
> > > should be rephrased, IMHO.
> > 
> > Any suggestions? Or more specifics about which part is conflicting? It
> > tries to say that there is a higher demand to succeed even though the
> > context cannot sleep to take active measures to achieve that. So the
> > only way to achieve that is to break the watermarks to a certain degree
> > which makes them "higher class" than other allocations.
> > 
> Michal, i had only one concern about it. It says that %GFP_ATOMIC users
> can not sleep, i.e. callers know that they are in atomic context, thus no
> sleeping, but the choice they make will force them to sleep.

I am not sure I follow you here. Do you mean they will be forced to
sleep with PREEMPT_RT?

-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v2 00/30] 1GB PUD THP support on x86_64

2020-09-30 Thread Michal Hocko
On Mon 28-09-20 13:53:58, Zi Yan wrote:
> From: Zi Yan 
> 
> Hi all,
> 
> This patchset adds support for 1GB PUD THP on x86_64. It is on top of
> v5.9-rc5-mmots-2020-09-18-21-23. It is also available at:
> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.9-rc5-mmots-2020-09-18-21-23
> 
> Other than PUD THP, we had some discussion on generating THPs and contiguous
> physical memory via a synchronous system call [0]. I am planning to send out a
> separate patchset on it later, since I feel that it can be done independently 
> of
> PUD THP support.

While the technical challenges for the kernel implementation can be
discussed before the user API is decided I believe we cannot simply add
something now and then decide about a proper interface. I have raised
few basic questions we should should find answers for before the any
interface is added. Let me copy them here for easier reference
- THP allocation time - #PF and/or madvise context
- lazy/sync instantiation
- huge page sizes controllable by the userspace?
- aggressiveness - how hard to try
- internal fragmentation - allow creating THPs on sparsely populated or
  unpopulated ranges
- do we need some sort of access control or privilege check as some THPs
  would be a really scarce resource (like those that require pre-reservation)?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v3] mm: memcontrol: reword obsolete comment of mem_cgroup_unmark_under_oom()

2020-09-30 Thread Michal Hocko
On Wed 30-09-20 05:53:36, Miaohe Lin wrote:
> Since commit 79dfdaccd1d5 ("memcg: make oom_lock 0 and 1 based rather than
> counter"), the mem_cgroup_unmark_under_oom() is added and the comment of
> the mem_cgroup_oom_unlock() is moved here.  But this comment makes no sense
> here because mem_cgroup_oom_lock() does not operate on the under_oom field. So
> we reword the comment to make it more helpful.
> [Thanks Michal Hocko for rewording this comment.]
> 
> Signed-off-by: Miaohe Lin 
> Cc: Johannes Weiner 
> Cc: Michal Hocko 
> Cc: Vladimir Davydov 

Acked-by: Michal Hocko 

Thanks!

> ---
>  mm/memcontrol.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6877c765b8d0..4f0c14cb8690 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1817,8 +1817,8 @@ static void mem_cgroup_unmark_under_oom(struct 
> mem_cgroup *memcg)
>   struct mem_cgroup *iter;
>  
>   /*
> -  * When a new child is created while the hierarchy is under oom,
> -  * mem_cgroup_oom_lock() may not be called. Watch for underflow.
> +  * Be careful about under_oom underflows because a child memcg
> +  * could have been added after mem_cgroup_mark_under_oom.
>*/
>   spin_lock(&memcg_oom_lock);
>   for_each_mem_cgroup_tree(iter, memcg)
> -- 
> 2.19.1

-- 
Michal Hocko
SUSE Labs


Re: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func.

2020-09-30 Thread Michal Hocko
On Wed 30-09-20 00:07:42, Uladzislau Rezki wrote:
[...]
> 
> bool is_pcp_cache_empty(gfp_t gfp)
> {
> struct per_cpu_pages *pcp;
> struct zoneref *ref;
> unsigned long flags;
> bool empty;
> 
> ref = first_zones_zonelist(node_zonelist(
> numa_node_id(), gfp), gfp_zone(gfp), NULL);
> if (!ref->zone)
> return true;
> 
> local_irq_save(flags);
> pcp = &this_cpu_ptr(ref->zone->pageset)->pcp;
> empty = list_empty(&pcp->lists[gfp_migratetype(gfp)]);
> local_irq_restore(flags);
> 
> return empty;
> }
> 
> disable_irq();
> if (!is_pcp_cache_empty(GFP_NOWAIT))
> __get_free_page(GFP_NOWAIT);
> enable_irq();
> 
> 
> Do you mean to have something like above? I mean some extra API
> function that returns true or false if fast-fast allocation can
> either occur or not. Above code works just fine and never touches
> main zone->lock.

The above code works with the _current_ implementation and it restricts
its implementation to some degree. Future changes might get harder to
implement with a pattern like this. I do not think we want users to be
aware of internal implementation details like pcp caches, migrate types
or others. While pcp caches are here for years and unlikely to change in
a foreseeable future many details are changing on regular basis.
-- 
Michal Hocko
SUSE Labs


Re: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func.

2020-09-30 Thread Michal Hocko
On Tue 29-09-20 18:25:14, Uladzislau Rezki wrote:
> > > I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> > > to provide a memory service for contexts which are not allowed to
> > > sleep, RCU is part of them. Both flags used to provide such ability
> > > before but not anymore.
> > > 
> > > Do you agree with it?
> > 
> > Yes this sucks. But this is something that we likely really want to live
> > with. We have to explicitly _document_ that really atomic contexts in RT
> > cannot use the allocator. From the past discussions we've had this is
> > likely the most reasonable way forward because we do not really want to
> > encourage anybody to do something like that and there should be ways
> > around that. The same is btw. true also for !RT. The allocator is not
> > NMI safe and while we should be able to make it compatible I am not
> > convinced we really want to.
> > 
> > Would something like this be helpful wrt documentation?
> > 
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 67a0774e080b..9fcd47606493 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -238,7 +238,9 @@ struct vm_area_struct;
> >   * %__GFP_FOO flags as necessary.
> >   *
> >   * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A 
> > lower
> > - * watermark is applied to allow access to "atomic reserves"
> > + * watermark is applied to allow access to "atomic reserves".
> > + * The current implementation doesn't support NMI and other non-preemptive 
> > context
> > + * (e.g. raw_spin_lock).
> >   *
> >   * %GFP_KERNEL is typical for kernel-internal allocations. The caller 
> > requires
> >   * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
> > 
> To me it is clear. But also above conflicting statement:
> 
> 
> %GFP_ATOMIC users can not sleep and need the allocation to succeed. A %lower
> 
> 
> should be rephrased, IMHO.

Any suggestions? Or more specifics about which part is conflicting? It
tries to say that there is a higher demand to succeed even though the
context cannot sleep to take active measures to achieve that. So the
only way to achieve that is to break the watermarks to a certain degree
which makes them "higher class" than other allocations.
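
To illustrate the "lower watermark" point, a much simplified sketch (this is
not the kernel's actual __zone_watermark_ok(), and the fraction below is
purely illustrative):

static bool watermark_ok(unsigned long free_pages, unsigned long min_wmark,
			 bool atomic_request)
{
	/*
	 * Sketch only: an atomic request may dip below the normal min
	 * watermark, i.e. it is allowed to consume part of the "atomic
	 * reserves" that other allocations must not touch.
	 */
	if (atomic_request)
		min_wmark -= min_wmark / 4;

	return free_pages > min_wmark;
}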

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] mm: memcontrol: remove obsolete comment of mem_cgroup_unmark_under_oom()

2020-09-30 Thread Michal Hocko
On Wed 30-09-20 01:34:25, linmiaohe wrote:
> Michal Hocko  wrote:
> > On Thu 17-09-20 06:59:00, Miaohe Lin wrote:
> >> Since commit 79dfdaccd1d5 ("memcg: make oom_lock 0 and 1 based rather 
> >> than counter"), the mem_cgroup_unmark_under_oom() is added and the 
> >> comment of the mem_cgroup_oom_unlock() is moved here. But this comment 
> >> makes no sense here because mem_cgroup_oom_lock() does not operate on 
> >> the under_oom field.
> >
> >OK, so I've looked into this more deeply and I finally remember why we have 
> >this comment here. The point is that under_oom shouldn't underflow and that 
> >we have to explicitly check for > 0 because a new child memcg could have 
> >been added between mem_cgroup_mark_under_oom and mem_cgroup_unmark_under_oom.
> >
> >So the comment makes sense although it is not as helpful as it could be.
> >I think that changing it to the following will be more useful
> >
> > /*
> >  * Be careful about under_oom underflows becase a child memcg
> >  * could have neem added after mem_cgroup_mark_under_oom
> 
> Should it be s/neem/been/ ?

yep, fat fingers...

> 
> >  */
> 
> Many thanks for detailed explanation. Will fix it in v2. Thanks again.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH tip/core/rcu 14/15] rcu/tree: Allocate a page when caller is preemptible

2020-09-30 Thread Michal Hocko
On Tue 29-09-20 18:53:27, Paul E. McKenney wrote:
> On Tue, Sep 29, 2020 at 02:07:56PM +0200, Michal Hocko wrote:
> > On Mon 28-09-20 16:31:01, paul...@kernel.org wrote:
> > [...]
> 
> Apologies for the delay, but today has not been boring.
> 
> > > This commit therefore uses preemptible() to determine whether allocation
> > > is possible at all for double-argument kvfree_rcu().
> > 
> > This deserves a comment. Because GFP_ATOMIC is possible for many
> > !preemptible() contexts. It is the raw_spin_lock, NMIs and likely few
> > others that are a problem. You are taking a conservative approach which
> > is fine but it would be good to articulate that explicitly.
> 
> Good point, and so I have added the following as a header comment to
> the add_ptr_to_bulk_krc_lock() function:
> 
> // Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock()
> // state specified by flags.  If can_sleep is true, the caller must
> // be schedulable and not be holding any locks or mutexes that might be
> // acquired by the memory allocator or anything that it might invoke.
> // If !can_sleep, then if !preemptible() no allocation will be undertaken,
> // otherwise the allocation will use GFP_ATOMIC to avoid the remainder of
> // the aforementioned deadlock possibilities.  Returns true iff ptr was
> // successfully recorded, else the caller must use a fallback.

OK, not trivial to follow but at least verbose enough to understand the
intention after some mulling. Definitely an improvement, thanks!

[...]
> > > -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > > +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> > > + unsigned long *flags, void *ptr, bool can_sleep)
> > >  {
> > >   struct kvfree_rcu_bulk_data *bnode;
> > > + bool can_alloc_page = preemptible();
> > > + gfp_t gfp = (can_sleep ? GFP_KERNEL | __GFP_RETRY_MAYFAIL : GFP_ATOMIC) 
> > > | __GFP_NOWARN;
> > 
> > This is quite confusing IMHO. At least without a further explanation.
> > can_sleep is not as much about sleeping as it is about the reclaim
> > recursion AFAIU your changelog, right?
> 
> No argument on it being confusing, and I hope that the added header
> comment helps.  But specifically, can_sleep==true is a promise by the
> caller to be schedulable and not to be holding any lock/mutex/whatever
> that might possibly be acquired by the memory allocator or by anything
> else that the memory allocator might invoke, to your point, including
> for but one example the reclaim logic.
> 
> The only way that can_sleep==true is if this function was invoked due
> to a call to single-argument kvfree_rcu(), which must be schedulable
> because its fallback is to invoke synchronize_rcu().

OK. I have to say that it is still not clear to me whether this call
path can be called from the memory reclaim context. If yes then you need
__GFP_NOMEMALLOC as well.

[...]

> > What is the point of calling kmalloc  for a PAGE_SIZE object? Wouldn't
> > using the page allocator directly be better?
> 
> Well, you guys gave me considerable heat about abusing internal allocator
> interfaces, and kmalloc() and kfree() seem to be about as non-internal
> as you can get and still be invoking the allocator.  ;-)

alloc_pages resp. __get_free_pages is a normal page allocator interface
to use for page size granular allocations. kmalloc is for more fine
grained allocations.
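
For illustration, a minimal sketch of the distinction (hypothetical helpers,
kernel context assumed):

#include <linux/gfp.h>
#include <linux/slab.h>

/* Page size granular buffer: go straight to the page allocator. */
static void *alloc_page_sized_buffer(void)
{
	return (void *)__get_free_page(GFP_KERNEL | __GFP_NOWARN);
}

/* Fine grained object: kmalloc() is the right tool. */
static void *alloc_small_object(void)
{
	return kmalloc(64, GFP_KERNEL);
}

static void free_buffers(void *page_buf, void *obj)
{
	free_page((unsigned long)page_buf);
	kfree(obj);
}
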
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memcg: introduce per-memcg reclaim interface

2020-09-29 Thread Michal Hocko
On Mon 28-09-20 17:02:16, Johannes Weiner wrote:
[...]
> My take is that a proactive reclaim feature, whose goal is never to
> thrash or punish but to keep the LRUs warm and the workingset trimmed,
> would ideally have:
> 
> - a pressure or size target specified by userspace but with
>   enforcement driven inside the kernel from the allocation path
> 
> - the enforcement work NOT be done synchronously by the workload
>   (something I'd argue we want for *all* memory limits)
> 
> - the enforcement work ACCOUNTED to the cgroup, though, since it's the
>   cgroup's memory allocations causing the work (again something I'd
>   argue we want in general)
> 
> - a delegatable knob that is independent of setting the maximum size
>   of a container, as that expresses a different type of policy
> 
> - if size target, self-limiting (ha) enforcement on a pressure
>   threshold or stop enforcement when the userspace component dies
> 
> Thoughts?

Agreed with above points. What do you think about
http://lkml.kernel.org/r/20200922190859.gh12...@dhcp22.suse.cz. I assume
that you do not want to override memory.high to implement this because
that tends to be tricky from the configuration POV as you mentioned
above. But a new limit (memory.middle for a lack of a better name) to
define the background reclaim sounds like a good fit with above points.

-- 
Michal Hocko
SUSE Labs


Re: [v4] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged

2020-09-29 Thread Michal Hocko
On Mon 28-09-20 17:07:27, Vijay Balakrishna wrote:
> When memory is hotplug added or removed the min_free_kbytes must be

s@must@should@

> recalculated based on what is expected by khugepaged.  Currently
> after hotplug, min_free_kbytes will be set to a lower default and the higher
> default set when THP is enabled is lost.  This change restores min_free_kbytes
> as expected for THP consumers.
> 
> Fixes: f000565adb77 ("thp: set recommended min free kbytes")
> 
> Signed-off-by: Vijay Balakrishna 
> Cc: sta...@vger.kernel.org
> Reviewed-by: Pavel Tatashin 
> Acked-by: Michal Hocko 
> ---
> v3 -> v4
> - made changes to move khugepaged_min_free_kbytes_update into
>   init_per_zone_wmark_min and rested changes
>   [suggestion from Michal Hocko]
> 
> [v2 1/2]
> - removed symptoms references from changelog
> 
> [v2 2/2]
> - addressed following issues Michal Hocko raised:
> . nr_free_buffer_pages can overflow an int on very large machines
>   . min_free_kbytes can decrease the size theoretically
> 
> v1 -> v2
> 
> - addressed issue Kirill A. Shutemov raised:
>   . changes would override min_free_kbytes set by user
> 
>  include/linux/khugepaged.h |  5 +
>  mm/khugepaged.c| 13 +++--
>  mm/page_alloc.c|  3 +++
>  3 files changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index bc45ea1efbf7..c941b7377321 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -15,6 +15,7 @@ extern int __khugepaged_enter(struct mm_struct *mm);
>  extern void __khugepaged_exit(struct mm_struct *mm);
>  extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
> unsigned long vm_flags);
> +extern void khugepaged_min_free_kbytes_update(void);
>  #ifdef CONFIG_SHMEM
>  extern void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long 
> addr);
>  #else
> @@ -85,6 +86,10 @@ static inline void collapse_pte_mapped_thp(struct 
> mm_struct *mm,
>  unsigned long addr)
>  {
>  }
> +
> +static inline void khugepaged_min_free_kbytes_update(void)
> +{
> +}
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
>  #endif /* _LINUX_KHUGEPAGED_H */
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index cfa0dba5fd3b..4f7107476a6f 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -56,6 +56,9 @@ enum scan_result {
>  #define CREATE_TRACE_POINTS
>  #include 
>  
> +static struct task_struct *khugepaged_thread __read_mostly;
> +static DEFINE_MUTEX(khugepaged_mutex);
> +
>  /* default scan 8*512 pte (or vmas) every 30 second */
>  static unsigned int khugepaged_pages_to_scan __read_mostly;
>  static unsigned int khugepaged_pages_collapsed;
> @@ -2292,8 +2295,6 @@ static void set_recommended_min_free_kbytes(void)
>  
>  int start_stop_khugepaged(void)
>  {
> - static struct task_struct *khugepaged_thread __read_mostly;
> - static DEFINE_MUTEX(khugepaged_mutex);
>   int err = 0;
>  
>   mutex_lock(&khugepaged_mutex);
> @@ -2320,3 +2321,11 @@ int start_stop_khugepaged(void)
>   mutex_unlock(&khugepaged_mutex);
>   return err;
>  }
> +
> +void khugepaged_min_free_kbytes_update(void)
> +{
> + mutex_lock(&khugepaged_mutex);
> + if (khugepaged_enabled() && khugepaged_thread)
> + set_recommended_min_free_kbytes();
> + mutex_unlock(&khugepaged_mutex);
> +}
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fab5e97dc9ca..ac25d3526fa5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -69,6 +69,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -7891,6 +7892,8 @@ int __meminit init_per_zone_wmark_min(void)
>   setup_min_slab_ratio();
>  #endif
>  
> + khugepaged_min_free_kbytes_update();
> +
>   return 0;
>  }
>  postcore_initcall(init_per_zone_wmark_min)
> -- 
> 2.28.0

-- 
Michal Hocko
SUSE Labs


Re: [patch 00/13] preempt: Make preempt count unconditional

2020-09-29 Thread Michal Hocko
On Tue 29-09-20 11:00:03, Daniel Vetter wrote:
> On Tue, Sep 29, 2020 at 10:19:38AM +0200, Michal Hocko wrote:
> > On Wed 16-09-20 23:43:02, Daniel Vetter wrote:
> > > I can
> > > then figure out whether it's better to risk not spotting issues with
> > > call_rcu vs slapping a memalloc_noio_save/restore around all these
> > > critical section which force-degrades any allocation to GFP_ATOMIC at
> > 
> > did you mean memalloc_noreclaim_* here?
> 
> Yeah I picked the wrong one of that family of functions.
> 
> > > most, but has the risk that we run into code that assumes "GFP_KERNEL
> > > never fails for small stuff" and has a decidedly less tested fallback
> > > path than rcu code.
> > 
> > Even if the above then please note that memalloc_noreclaim_* or
> > PF_MEMALLOC should be used with an extreme care. Essentially only for
> > internal memory reclaimers. It grants access to _all_ the available
> > memory so any abuse can be detrimental to the overall system operation.
> > Allocation failure in this mode means that we are out of memory and any
> > code relying on such an allocation has to carefully consider failure.
> > This is not a random allocation mode.
> 
> Agreed, that's why I don't like having these kinds of automagic critical
> sections. It's a bit of a shotgun approach. Paul said that the code would
> handle failures, but the problem is that it applies everywhere.

Ohh, in the ideal world we wouldn't need anything like that. But then
reality kicks in:
* PF_MEMALLOC (resp memalloc_noreclaim_* for that matter) is primarily used
  to make sure that allocations from inside the memory reclaim - yeah that
  happens - will not recurse.
* PF_MEMALLOC_NO{FS,IO} (resp memalloc_no{fs,io}*) are used to mark no
  fs/io reclaim recursion critical sections because controling that for
  each allocation inside fs transaction (or other sensitive) or IO
  contexts turned out to be unmaintainable and people simply fallen into
  using NOFS/NOIO unconditionally which is causing reclaim imbalance
  problems.
* PF_MEMALLOC_NOCMA (resp memalloc_nocma*) is used for long term pinning
  when CMA pages cannot be pinned because that would break the CMA
  guarantees. Communicating this to all potential allocations during
  pinning is simply unfeasible.

So you are absolutely right that these critical sections with side
effects on all allocations are far from ideal from the API point of view
but they are mostly mirroring a demand for functionality which is
_practically_ impossible to achieve with our current code base. Not that
we couldn't go back to the drawing board and come up with a saner thing and
rework the world...
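
For reference, a minimal sketch of how one of these scoped APIs is used (a
hypothetical filesystem transaction helper, not taken from any particular
subsystem):

#include <linux/sched/mm.h>
#include <linux/slab.h>

static void *alloc_inside_fs_transaction(size_t size)
{
	unsigned int nofs_flags = memalloc_nofs_save();
	/* Any allocation in this section is implicitly degraded to GFP_NOFS. */
	void *p = kmalloc(size, GFP_KERNEL);

	memalloc_nofs_restore(nofs_flags);
	return p;
}
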
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2] mm: memcontrol: remove obsolete comment of mem_cgroup_unmark_under_oom()

2020-09-29 Thread Michal Hocko
On Thu 17-09-20 06:59:00, Miaohe Lin wrote:
> Since commit 79dfdaccd1d5 ("memcg: make oom_lock 0 and 1 based rather than
> counter"), the mem_cgroup_unmark_under_oom() is added and the comment of
> the mem_cgroup_oom_unlock() is moved here. But this comment makes no sense
> here because mem_cgroup_oom_lock() does not operate on the under_oom field.

OK, so I've looked into this more deeply and I finally remember why we
have this comment here. The point is that under_oom shouldn't underflow
and that we have to explicitly check for > 0 because a new child memcg
could have been added between mem_cgroup_mark_under_oom and
mem_cgroup_unmark_under_oom.

So the comment makes sense although it is not as helpful as it could be.
I think that changing it to the following will be more useful

/*
 * Be careful about under_oom underflows becase a child memcg
 * could have neem added after mem_cgroup_mark_under_oom
 */
> 
> Signed-off-by: Miaohe Lin 
> ---
>  mm/memcontrol.c | 4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cd5f83de9a6f..e44f5afaf78b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1848,10 +1848,6 @@ static void mem_cgroup_unmark_under_oom(struct 
> mem_cgroup *memcg)
>  {
>   struct mem_cgroup *iter;
>  
> - /*
> -  * When a new child is created while the hierarchy is under oom,
> -  * mem_cgroup_oom_lock() may not be called. Watch for underflow.
> -  */
>   spin_lock(&memcg_oom_lock);
>   for_each_mem_cgroup_tree(iter, memcg)
>   if (iter->under_oom > 0)
> -- 
> 2.19.1

-- 
Michal Hocko
SUSE Labs


[PATCH] mm: clarify usage of GFP_ATOMIC in !preemptible contexts

2020-09-29 Thread Michal Hocko
From: Michal Hocko 

There is a general understanding that GFP_ATOMIC/GFP_NOWAIT are
to be used from atomic contexts. E.g. from within a spin lock or from
the IRQ context. This is correct but there are some atomic contexts
where the above doesn't hold. One of them would be an NMI context.
The page allocator has never supported that and the general fear of this
context didn't let anybody actually even try to use the allocator
there. Good, but let's be more specific about that.

Another such context, and that is where people seem to be more daring,
is raw_spin_lock. Mostly because it simply resembles regular spin lock
which is supported by the allocator and there is not any implementation
difference with !RT kernels in the first place. Be explicit that such
a context is not supported by the allocator. The underlying reason is
that zone->lock would have to become raw_spin_lock as well and that has
turned out to be a problem for RT
(http://lkml.kernel.org/r/87mu305c1w@nanos.tec.linutronix.de).

Signed-off-by: Michal Hocko 
---
 include/linux/gfp.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 67a0774e080b..2e8370cf60c7 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -238,7 +238,9 @@ struct vm_area_struct;
  * %__GFP_FOO flags as necessary.
  *
  * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
- * watermark is applied to allow access to "atomic reserves"
+ * watermark is applied to allow access to "atomic reserves".
+ * The current implementation doesn't support NMI and few other strict
+ * non-preemptive contexts (e.g. raw_spin_lock). The same applies to 
%GFP_NOWAIT.
  *
  * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires
  * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
-- 
2.28.0



Re: [PATCH tip/core/rcu 14/15] rcu/tree: Allocate a page when caller is preemptible

2020-09-29 Thread Michal Hocko
On Mon 28-09-20 16:31:01, paul...@kernel.org wrote:
[...]
> This commit therefore uses preemptible() to determine whether allocation
> is possible at all for double-argument kvfree_rcu().

This deserves a comment. Because GFP_ATOMIC is possible for many
!preemptible() contexts. It is the raw_spin_lock, NMIs and likely few
others that are a problem. You are taking a conservative approach which
is fine but it would be good to articulate that explicitly.

> If !preemptible(),
> then allocation is not possible, and kvfree_rcu() falls back to using
> the less cache-friendly rcu_head approach.  Even when preemptible(),
> the caller might be involved in reclaim, so the GFP_ flags used by
> double-argument kvfree_rcu() must avoid invoking reclaim processing.

Could you be more specific? Is this about being called directly in the
reclaim context and you want to prevent a recursion? If that is the
case, do you really need to special case this in any way? Any memory
reclaim will set PF_MEMALLOC so allocations called from that context
will not perform reclaim. So if you are called from the reclaim directly
then you might want to do GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN.
That should handle both from-the-reclaim and outside-of-reclaim contexts
just fine (assuming you don't allocate from a !preemptible() context).
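
In other words, something along these lines should cover both cases (a sketch
of the suggested flags only, not the actual kvfree_rcu() patch):

/*
 * Sketch: __GFP_NOMEMALLOC prevents dipping into the memory reserves even
 * when the caller itself runs with PF_MEMALLOC (i.e. from within reclaim),
 * and __GFP_NOWARN keeps the failure path quiet because a fallback exists.
 */
static void *alloc_bulk_page_sleepable(void)
{
	return (void *)__get_free_page(GFP_KERNEL | __GFP_NOMEMALLOC |
				       __GFP_NOWARN);
}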

> Note that single-argument kvfree_rcu() must be invoked in sleepable
> contexts, and that its fallback is the relatively high latency
> synchronize_rcu().  Single-argument kvfree_rcu() therefore uses
> GFP_KERNEL|__GFP_RETRY_MAYFAIL to allow limited sleeping within the
> memory allocator.
[...]
>  static inline bool
> -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
> + unsigned long *flags, void *ptr, bool can_sleep)
>  {
>   struct kvfree_rcu_bulk_data *bnode;
> + bool can_alloc_page = preemptible();
> + gfp_t gfp = (can_sleep ? GFP_KERNEL | __GFP_RETRY_MAYFAIL : GFP_ATOMIC) 
> | __GFP_NOWARN;

This is quite confusing IMHO. At least without a further explanation.
can_sleep is not as much about sleeping as it is about the reclaim
recursion AFAIU your changelog, right?

>   int idx;
>  
> - if (unlikely(!krcp->initialized))
> + *krcp = krc_this_cpu_lock(flags);
> + if (unlikely(!(*krcp)->initialized))
>   return false;
>  
> - lockdep_assert_held(>lock);
>   idx = !!is_vmalloc_addr(ptr);
>  
>   /* Check if a new block is required. */
> - if (!krcp->bkvhead[idx] ||
> - krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) 
> {
> - bnode = get_cached_bnode(krcp);
> - if (!bnode) {
> - /*
> -  * To keep this path working on raw non-preemptible
> -  * sections, prevent the optional entry into the
> -  * allocator as it uses sleeping locks. In fact, even
> -  * if the caller of kfree_rcu() is preemptible, this
> -  * path still is not, as krcp->lock is a raw spinlock.
> -  * With additional page pre-allocation in the works,
> -  * hitting this return is going to be much less likely.
> -  */
> - if (IS_ENABLED(CONFIG_PREEMPT_RT))
> - return false;
> -
> - /*
> -  * NOTE: For one argument of kvfree_rcu() we can
> -  * drop the lock and get the page in sleepable
> -  * context. That would allow to maintain an array
> -  * for the CONFIG_PREEMPT_RT as well if no cached
> -  * pages are available.
> -  */
> - bnode = (struct kvfree_rcu_bulk_data *)
> - __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> + if (!(*krcp)->bkvhead[idx] ||
> + (*krcp)->bkvhead[idx]->nr_records == 
> KVFREE_BULK_MAX_ENTR) {
> + bnode = get_cached_bnode(*krcp);
> + if (!bnode && can_alloc_page) {
> +     krc_this_cpu_unlock(*krcp, *flags);
> + bnode = kmalloc(PAGE_SIZE, gfp);

What is the point of calling kmalloc  for a PAGE_SIZE object? Wouldn't
using the page allocator directly be better?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 for v5.9] mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs

2020-09-29 Thread Michal Hocko
On Tue 29-09-20 17:38:43, Joonsoo Kim wrote:
> On Tue, Sep 29, 2020 at 5:08 PM, Michal Hocko wrote:
> >
> > On Mon 28-09-20 17:50:46, Joonsoo Kim wrote:
> > > From: Joonsoo Kim 
> > >
> > > memalloc_nocma_{save/restore} APIs can be used to skip page allocation
> > > on CMA area, but, there is a missing case and the page on CMA area could
> > > be allocated even if APIs are used. This patch handles this case to fix
> > > the potential issue.
> > >
> > > Missing case is an allocation from the pcplist. MIGRATE_MOVABLE pcplist
> > > could have the pages on CMA area so we need to skip it if ALLOC_CMA isn't
> > > specified.
> > >
> > > Fixes: 8510e69c8efe (mm/page_alloc: fix memalloc_nocma_{save/restore} 
> > > APIs)
> > > Signed-off-by: Joonsoo Kim 
> > > ---
> > >  mm/page_alloc.c | 13 ++---
> > >  1 file changed, 10 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index fab5e97..104d2e1 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -3367,9 +3367,16 @@ struct page *rmqueue(struct zone *preferred_zone,
> > >   struct page *page;
> > >
> > >   if (likely(order == 0)) {
> > > - page = rmqueue_pcplist(preferred_zone, zone, gfp_flags,
> > > + /*
> > > +  * MIGRATE_MOVABLE pcplist could have the pages on CMA area 
> > > and
> > > +  * we need to skip it when CMA area isn't allowed.
> > > +  */
> > > + if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
> > > + migratetype != MIGRATE_MOVABLE) {
> > > + page = rmqueue_pcplist(preferred_zone, zone, 
> > > gfp_flags,
> > >   migratetype, alloc_flags);
> > > - goto out;
> > > + goto out;
> > > + }
> > >   }
> >
> > This approach looks definitely better than the previous version.
> 
> Thanks!
> 
> > >
> > >   /*
> > > @@ -3381,7 +3388,7 @@ struct page *rmqueue(struct zone *preferred_zone,
> > >
> > >   do {
> > >   page = NULL;
> > > - if (alloc_flags & ALLOC_HARDER) {
> > > + if (order > 0 && alloc_flags & ALLOC_HARDER) {
> > >   page = __rmqueue_smallest(zone, order, 
> > > MIGRATE_HIGHATOMIC);
> > >   if (page)
> > >   trace_mm_page_alloc_zone_locked(page, 
> > > order, migratetype);
> >
> > But this condition is not clear to me. __rmqueue_smallest doesn't access
> > pcp lists. Maybe I have missed the point in the original discussion but
> > this deserves a comment at least.
> 
> Before the pcplist skipping is applied, an order-0 request cannot reach here.
> But, now, an order-0 request can reach here. Free memory on
> MIGRATE_HIGHATOMIC is reserved for high-order atomic allocation
> so an order-0 request should skip it.

OK, I see. Thanks for the clarification.
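
Just to spell the resulting flow out, a simplified sketch of the rmqueue()
buddy fallback with the proposed guard (a fragment based on the hunk quoted
above, using the local variables as they exist in rmqueue()):

	do {
		page = NULL;
		/*
		 * An order-0 request that skipped the pcplist must not take
		 * pages from MIGRATE_HIGHATOMIC, which is reserved for
		 * high-order atomic allocations.
		 */
		if (order > 0 && (alloc_flags & ALLOC_HARDER))
			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
		if (!page)
			page = __rmqueue(zone, order, migratetype, alloc_flags);
	} while (page && check_new_pages(page, order));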

> I will add a code comment on the next version.

Thanks, that would be indeed helpful. With that, feel free to add
Acked-by: Michal Hocko 

-- 
Michal Hocko
SUSE Labs


Re: [patch 00/13] preempt: Make preempt count unconditional

2020-09-29 Thread Michal Hocko
On Wed 16-09-20 23:43:02, Daniel Vetter wrote:
> I can
> then figure out whether it's better to risk not spotting issues with
> call_rcu vs slapping a memalloc_noio_save/restore around all these
> critical section which force-degrades any allocation to GFP_ATOMIC at

did you mean memalloc_noreclaim_* here?

> most, but has the risk that we run into code that assumes "GFP_KERNEL
> never fails for small stuff" and has a decidedly less tested fallback
> path than rcu code.

Even if the above then please note that memalloc_noreclaim_* or
PF_MEMALLOC should be used with an extreme care. Essentially only for
internal memory reclaimers. It grants access to _all_ the available
memory so any abuse can be detrimental to the overall system operation.
Allocation failure in this mode means that we are out of memory and any
code relying on such an allocation has to carefully consider failure.
This is not a random allocation mode.
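
For completeness, a minimal sketch of how the scoped variant is meant to be
used from within a reclaimer, with the failure handled (hypothetical helper):

#include <linux/errno.h>
#include <linux/sched/mm.h>
#include <linux/slab.h>

static int reclaimer_do_step(void)
{
	unsigned int noreclaim_flags = memalloc_noreclaim_save();
	/* PF_MEMALLOC semantics apply here; the allocation may still fail. */
	void *tmp = kmalloc(256, GFP_KERNEL);

	memalloc_noreclaim_restore(noreclaim_flags);
	if (!tmp)
		return -ENOMEM;

	kfree(tmp);
	return 0;
}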

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 for v5.9] mm/page_alloc: handle a missing case for memalloc_nocma_{save/restore} APIs

2020-09-29 Thread Michal Hocko
On Mon 28-09-20 17:50:46, Joonsoo Kim wrote:
> From: Joonsoo Kim 
> 
> memalloc_nocma_{save/restore} APIs can be used to skip page allocation
> on CMA area, but, there is a missing case and the page on CMA area could
> be allocated even if APIs are used. This patch handles this case to fix
> the potential issue.
> 
> Missing case is an allocation from the pcplist. MIGRATE_MOVABLE pcplist
> could have the pages on CMA area so we need to skip it if ALLOC_CMA isn't
> specified.
> 
> Fixes: 8510e69c8efe (mm/page_alloc: fix memalloc_nocma_{save/restore} APIs)
> Signed-off-by: Joonsoo Kim 
> ---
>  mm/page_alloc.c | 13 ++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index fab5e97..104d2e1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3367,9 +3367,16 @@ struct page *rmqueue(struct zone *preferred_zone,
>   struct page *page;
>  
>   if (likely(order == 0)) {
> - page = rmqueue_pcplist(preferred_zone, zone, gfp_flags,
> + /*
> +  * MIGRATE_MOVABLE pcplist could have the pages on CMA area and
> +  * we need to skip it when CMA area isn't allowed.
> +  */
> + if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
> + migratetype != MIGRATE_MOVABLE) {
> + page = rmqueue_pcplist(preferred_zone, zone, gfp_flags,
>   migratetype, alloc_flags);
> - goto out;
> + goto out;
> + }
>   }

This approach looks definitely better than the previous version.

>  
>   /*
> @@ -3381,7 +3388,7 @@ struct page *rmqueue(struct zone *preferred_zone,
>  
>   do {
>   page = NULL;
> - if (alloc_flags & ALLOC_HARDER) {
> + if (order > 0 && alloc_flags & ALLOC_HARDER) {
>   page = __rmqueue_smallest(zone, order, 
> MIGRATE_HIGHATOMIC);
>   if (page)
>   trace_mm_page_alloc_zone_locked(page, order, 
> migratetype);

But this condition is not clear to me. __rmqueue_smallest doesn't access
pcp lists. Maybe I have missed the point in the original discussion but
this deserves a comment at least.

> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


Re: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func.

2020-09-25 Thread Michal Hocko
On Fri 25-09-20 17:31:29, Uladzislau Rezki wrote:
> > > > > 
> > > > > All good points!
> > > > > 
> > > > > On the other hand, duplicating a portion of the allocator 
> > > > > functionality
> > > > > within RCU increases the amount of reserved memory, and needlessly 
> > > > > most
> > > > > of the time.
> > > > > 
> > > > 
> > > > But it's very similar to what mempools are for.
> > > > 
> > > As for dynamic caching or mempools. It requires extra logic on top of RCU
> > > to move things forward and it might not be an efficient way. As a side
> > > effect, maintaining of the bulk arrays in the separate worker thread
> > > will introduce other drawbacks:
> > 
> > This is true but it is also true that it is RCU to require this special
> > logic and we can expect that we might need to fine tune this logic
> > depending on the RCU usage. We definitely do not want to tune the
> > generic page allocator for a very specific usecase, do we?
> > 
> I look at it in scope of GFP_ATOMIC/GFP_NOWAIT issues, i.e. inability
> to provide a memory service for contexts which are not allowed to
> sleep, RCU is part of them. Both flags used to provide such ability
> before but not anymore.
> 
> Do you agree with it?

Yes this sucks. But this is something that we likely really want to live
with. We have to explicitly _document_ that really atomic contexts in RT
cannot use the allocator. From the past discussions we've had this is
likely the most reasonable way forward because we do not really want to
encourage anybody to do something like that and there should be ways
around that. The same is btw. true also for !RT. The allocator is not
NMI safe and while we should be able to make it compatible I am not
convinced we really want to.

Would something like this be helpful wrt documentation?

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 67a0774e080b..9fcd47606493 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -238,7 +238,9 @@ struct vm_area_struct;
  * %__GFP_FOO flags as necessary.
  *
  * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
- * watermark is applied to allow access to "atomic reserves"
+ * watermark is applied to allow access to "atomic reserves".
+ * The current implementation doesn't support NMI and other non-preemptive 
context
+ * (e.g. raw_spin_lock).
  *
  * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires
  * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.

[...]
-- 
Michal Hocko
SUSE Labs


Re: Ways to deprecate /sys/devices/system/memory/memoryX/phys_device ?

2020-09-25 Thread Michal Hocko
On Fri 25-09-20 16:49:28, David Hildenbrand wrote:
> >> There were once RFC patches to make use of it in ACPI, but it could be
> >> solved using different interfaces [1].
> >>
> >>
> >> While I'd love to rip it out completely, I think it would break old
> >> lsmem/chmem completely - and I assume that's not acceptable. I was
> >> wondering what would be considered safe to do now/in the future:
> >>
> >> 1. Make it always return 0 (just as if "sclp.rzm" would be set to 0 on
> >> s390x). This will make old lsmem/chmem behave differently after
> >> switching to a new kernel, like if sclp.rzm would not be set by HW -
> >> AFAIU, it will assume all memory is in a single memory increment. Do we
> >> care?
> > 
> > No, at least not until that kernel change would be backported to some
> > old distribution level where we still use lsmem/chmem from s390-tools.
> > Given that this is just some clean-up w/o any functional benefit, and
> > hopefully w/o any negative impact, I think we can safely assume that no
> > distributor will do that "just for fun".
> > 
> > Even if there would be good reasons for backports, then I guess we also
> > have good reasons for backporting / switching to the util-linux version
> > of lsmem / chmem for such distribution levels. Alternatively, adjust the
> > s390-tools lsmem / chmem there.
> > 
> > But I would rather "rip it out completely" than just return 0. You'd
> > need some lsmem / chmem changes anyway, at least in case this would
> > ever be backported.
> 
> Thanks for your input Gerald.
> 
> So unless people would be running shiny new kernels on older
> distributions it shouldn't be a problem (and I don't think we care too
> much about something like that). I don't expect something like that to
> get backported - there is absolutely no reason to do so IMHO.

Ohh, there are many people running current Linus tree on an older
distribution. Including me.

> >> 2. Restrict it to s390x only. It always returned 0 on other
> >> architectures, I was not able to find any user.
> >>
> >> I think 2 should be safe to do (never used on other archs). I do wonder
> >> what the feelings are about 1.
> > 
> > Please don't add any s390-specific workarounds here, that does not
> > really sound like a clean-up, rather the opposite.
> 
> People seem to have different opinions here. I'm happy as long as we can
> get rid of it (either now, or in the future with a new model).
> 
> > 
> > That being said, I do not really see the benefit of this change at
> > all. As Michal mentioned, there really should be some more fundamental
> > change. And from the rest of this thread, it also seems that phys_device
> > usage might not be the biggest issue here.
> > 
> 
> As I already expressed, I am more of a friend of small, incremental
> changes than having a single big world switch where everything will be
> shiny and perfect.
> 
> (Deprecating it now - in any way - stops any new users from appearing -
> both, in the kernel and from user space - eventually making the big
> world switch later a little easier because there is one thing less that
> vanished)

Realistically people do not care about deprecation all that much. They
simply use whatever they can find or whatever somebody shows them. Really,
deprecation has never really worked. The only thing that worked was to
remove the functionality and then wait for somebody to complain and
revert or somehow allow the functionality without the necessity to alter
userspace.

As much as I would like to remove as much crud as possible I strongly
suspect that the existing hotplug interface is just a lost cause and it
doesn't make for the best use of time to put lipstick on a pig. Even if
we remove this particular interface we are not going to get rid of a lot
of code, nor will we gain a more sensible semantics, right?
-- 
Michal Hocko
SUSE Labs


Re: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func.

2020-09-25 Thread Michal Hocko
On Thu 24-09-20 10:16:14, Uladzislau Rezki wrote:
> > On Wed, Sep 23, 2020 at 08:41:05AM -0700, Paul E. McKenney wrote:
> > > > Fundamentally, this is simply shifting the problem from RCU to the page
> > > > allocator because of the locking arrangements and hazard of acquiring 
> > > > zone
> > > > lock is a raw spinlock is held on RT. It does not even make the timing
> > > > predictable as an empty PCU list (for example, a full drain in low 
> > > > memory
> > > > situations) may mean the emergency path is hit anyway. About all it 
> > > > changes
> > > > is the timing of when the emergency path is hit in some circumstances --
> > > > it's not fixing the problem, it's simply changing the shape.
> > > 
> > > All good points!
> > > 
> > > On the other hand, duplicating a portion of the allocator functionality
> > > within RCU increases the amount of reserved memory, and needlessly most
> > > of the time.
> > > 
> > 
> > But it's very similar to what mempools are for.
> > 
> As for dynamic caching or mempools. It requires extra logic on top of RCU
> to move things forward and it might not be an efficient way. As a side
> effect, maintaining of the bulk arrays in the separate worker thread
> will introduce other drawbacks:

This is true but it is also true that it is RCU to require this special
logic and we can expect that we might need to fine tune this logic
depending on the RCU usage. We definitely do not want to tune the
generic page allocator for a very specific usecase, do we?

>  a) There is an extra latency window, a time during which a fallback
> mechanism is used until pages are obtained via the special
> worker for further pointer collection over the arrays.

This will always be the case for async refilling. More importantly, this
will be a bigger problem at the page allocator level, which has users
other than RCU, so more users would be suffering...

>  b) It is impossible to predict how many pages will be required to
> cover a demand that is controlled by different workloads on
> various systems. It would require a rough value.

I have asked for some actual numbers for real life scenarios this work
is meant to cover. There was nothing presented so far. We can hand wave
forever but this will not move us forward much. As I've said in another
email, a few dozen pages per CPU by default will hardly be noticeable.
You have a trivial initial implementation and can build on top of it in
incremental steps. You can kick a background allocator to start new
allocations when the pool hits a watermark and aggressively remove
cached pages when they are not used. You will have quite a lot of freedom to
fine tune the scheme which is much harder for the page allocator because
there are many other consumers.
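
To make that more concrete, a very rough sketch of such a dedicated pool
built on top of the page allocator (hypothetical names, locking and CPU
hotplug handling omitted, nothing taken from the kvfree_rcu() code):

#include <linux/gfp.h>
#include <linux/workqueue.h>

#define POOL_LOW_WMARK	2
#define POOL_MAX	16

/* Hypothetical per-CPU cache of pages, refilled from sleepable context. */
struct page_pool_pcpu {
	void *pages[POOL_MAX];
	int nr;
	struct work_struct refill_work;
};

static void pool_refill_fn(struct work_struct *work)
{
	struct page_pool_pcpu *pool =
		container_of(work, struct page_pool_pcpu, refill_work);

	while (pool->nr < POOL_MAX) {
		void *p = (void *)__get_free_page(GFP_KERNEL | __GFP_NOWARN);

		if (!p)
			break;
		pool->pages[pool->nr++] = p;
	}
}

/* Atomic-context consumer: take a cached page, kick the refill when low. */
static void *pool_get_page(struct page_pool_pcpu *pool)
{
	void *p = pool->nr ? pool->pages[--pool->nr] : NULL;

	if (pool->nr < POOL_LOW_WMARK)
		schedule_work(&pool->refill_work);

	return p;
}

The refill_work would be set up with INIT_WORK(&pool->refill_work,
pool_refill_fn) at init time; real code would also need proper locking and
a path to aggressively drop unused pages again, as argued above.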

Anyway, I am afraid that we are going in circles here. We do not have
any meaningful numbers to claim memory footprint problems. There is a
clear opposition to hook into page allocator for reasons already
mentioned. You are looking for a dedicated memory pool and it should be
quite trivial to develop one and fine tune it for your specific usecase.
All that on top of page allocator. Unless this is seen as completely
unfeasible based on some solid arguments then we can start talking about
the page allocator itself.
-- 
Michal Hocko
SUSE Labs


Re: [v3 1/2] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged

2020-09-25 Thread Michal Hocko
On Thu 24-09-20 19:51:03, Andrew Morton wrote:
> On Wed, 23 Sep 2020 14:27:30 -0700 Vijay Balakrishna 
>  wrote:
> 
> > Can this patch be included?  As Kirill is ok with patch now.
> 
> He is?  I can't immediately find that email.

http://lkml.kernel.org/r/20200922070726.dlw24lf3wd3p2...@black.fi.intel.com
-- 
Michal Hocko
SUSE Labs


Re: [v3 1/2] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged

2020-09-25 Thread Michal Hocko
On Wed 16-09-20 18:21:48, Vijay Balakrishna wrote:
> When memory is hotplug added or removed the min_free_kbytes must be
> recalculated based on what is expected by khugepaged.  Currently
> after hotplug, min_free_kbytes will be set to a lower default and the higher
> default set when THP is enabled is lost.  This change restores min_free_kbytes
> as expected for THP consumers.
> 
> Fixes: f000565adb77 ("thp: set recommended min free kbytes")
> 
> Signed-off-by: Vijay Balakrishna 
> Cc: sta...@vger.kernel.org
> Reviewed-by: Pavel Tatashin 

I am ok with this patch. I am not sure this is worth backporting to
stable trees because this is not a functional bug. Surprising behavior,
yes, but not much more than that.

Acked-by: Michal Hocko 

One minor comment below
[...]
> @@ -857,6 +858,7 @@ int __ref online_pages(unsigned long pfn, unsigned long 
> nr_pages,
>   zone_pcp_update(zone);
>  
>   init_per_zone_wmark_min();
> + khugepaged_min_free_kbytes_update();
>  
>   kswapd_run(nid);
>   kcompactd_run(nid);
> @@ -1600,6 +1602,7 @@ static int __ref __offline_pages(unsigned long 
> start_pfn,
>   pgdat_resize_unlock(zone->zone_pgdat, &flags);
>  
>   init_per_zone_wmark_min();
> + khugepaged_min_free_kbytes_update();
>  
>   if (!populated_zone(zone)) {
>   zone_pcp_reset(zone);

Can we move khugepaged_min_free_kbytes_update into
init_per_zone_wmark_min? If it stays external we might hit the same
problem when somebody else needs to modify min_free_kbytes. The early
init call will likely be too early for khugepaged but that shouldn't
matter AFAICS because it will call khugepaged_min_free_kbytes_update on
its own.
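
Roughly what I have in mind (just a sketch, the existing body of
init_per_zone_wmark_min is elided):

int __meminit init_per_zone_wmark_min(void)
{
	/* ... existing min_free_kbytes auto-tuning ... */
	setup_per_zone_wmarks();
	refresh_zone_stat_thresholds();
	setup_per_zone_lowmem_reserve();
	/* keep the THP expectation in sync for every caller, not only hotplug */
	khugepaged_min_free_kbytes_update();

	return 0;
}

The two explicit calls added to the hotplug paths above would then become
unnecessary.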
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/2] vmalloc: Free pages as a batch

2020-09-23 Thread Michal Hocko
On Mon 21-09-20 23:46:27, Matthew Wilcox wrote:
> Use release_pages() to free the pages allocated by vmalloc().  This is
> slightly more efficient in terms of disabling and enabling IRQs once
> per batch instead of once per page.

Hmm, does this really lead to runtime improvements? Batching the IRQ
disabling is certainly nice, but release_pages is much more heavyweight
and its additional checks are simply always false for vmalloc pages, so
they are pointless.

Maybe storing those pages on a linked list and using
free_unref_page_list instead would achieve what you want?
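
Something along these lines (untested sketch, assuming
free_unref_page_list from mm/internal.h can be used here):

	if (deallocate_pages) {
		LIST_HEAD(pages_to_free);
		int i;

		for (i = 0; i < area->nr_pages; i++) {
			struct page *page = area->pages[i];

			BUG_ON(!page);
			/* page->lru is unused for vmalloc backing pages */
			list_add(&page->lru, &pages_to_free);
		}
		/* one IRQ-disabled section, no LRU/compound checks on the way */
		free_unref_page_list(&pages_to_free);

		atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
		kvfree(area->pages);
	}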

> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  mm/vmalloc.c | 10 +-
>  1 file changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index be4724b916b3..3893fc8915c4 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2263,16 +2263,8 @@ static void __vunmap(const void *addr, int 
> deallocate_pages)
>   vm_remove_mappings(area, deallocate_pages);
>  
>   if (deallocate_pages) {
> - int i;
> -
> - for (i = 0; i < area->nr_pages; i++) {
> - struct page *page = area->pages[i];
> -
> - BUG_ON(!page);
> - __free_pages(page, 0);
> - }
> + release_pages(area->pages, area->nr_pages);
>   atomic_long_sub(area->nr_pages, &nr_vmalloc_pages);
> -
>   kvfree(area->pages);
>   }
>  
> -- 
> 2.28.0

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/4] mm: Trial do_wp_page() simplification

2020-09-23 Thread Michal Hocko
On Mon 21-09-20 18:06:44, Michal Hocko wrote:
[...]
> Thanks a lot for this clarification! So I believe the only existing bug
> is in the documentation, which should be explicit that read access to the
> cgroup fd is not sufficient because it also requires write access to
> cgroup.procs in the same directory at the time of fork. I will send
> a patch if I find some time for that.

I have reread the man page and concluded that the current wording is
not buggy. It is referring to cgroups(7), which has all the information,
but it takes quite some time to drill down to the important point. On
the other hand there are many details (like delegation, namespaces)
which make it quite complex to be concise in clone(2), so it is very
likely better to leave it as it is.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memcg: introduce per-memcg reclaim interface

2020-09-22 Thread Michal Hocko
On Tue 22-09-20 11:10:17, Shakeel Butt wrote:
> On Tue, Sep 22, 2020 at 9:55 AM Michal Hocko  wrote:
[...]
> > Last but not least the memcg
> > background reclaim is something that should be possible without a new
> > interface.
> 
> So, it comes down to adding more functionality/semantics to
> memory.high or introducing a new simple interface. I am fine with
> either of one but IMO convoluted memory.high might have a higher
> maintenance cost.

One idea would be to schedule a background worker (which works on behalf
of the memcg) to do the high limit reclaim, with the high limit as the
target, as soon as the high limit is reached. There would be one work
item for each memcg. Userspace would recheck the high limit on return to
userspace and do the reclaim if the excess is larger than a threshold,
and sleep as the fallback.

Excessive consumers would get throttled if the background work cannot
keep up with the charge pace, and most of them would return without
doing any reclaim because there is somebody working on their behalf and
being accounted for it.

The semantic of the high limit would be preserved IMHO because the high
limit still actively throttles. Where that work is done shouldn't matter
as long as it is accounted properly and the memcg cannot outsource all
the work to the rest of the system.

Would something like that (with many details to be sorted out of course)
be feasible?
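
To make the idea a bit more concrete, an untested sketch with made-up
names (the real memcg code already has reclaim_high() and the high work
machinery, so treat this purely as an illustration of the proposed split
of work):

/* One work item per memcg, kicked from the charge path once usage
 * crosses the high limit; it reclaims on behalf of the memcg and its
 * CPU time is accounted to that memcg. */
static void memcg_high_bg_work(struct work_struct *work)
{
	struct mem_cgroup *memcg = container_of(work, struct mem_cgroup,
						high_bg_work);	/* made-up field */

	while (page_counter_read(&memcg->memory) > READ_ONCE(memcg->memory.high))
		if (!try_to_free_mem_cgroup_pages(memcg, SWAP_CLUSTER_MAX,
						  GFP_KERNEL, true))
			break;
}

/* The return-to-userspace path would then only reclaim and/or sleep when
 * the worker cannot keep up, i.e. when the excess over high exceeds some
 * threshold. */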

If we do not want to change the existing semantic of high and want a new
API, then I think having another limit for the background reclaim would
make more sense to me. It would resemble the global reclaim and kswapd
model and be something that is easier to reason about, compared to
echo $N > reclaim, which might mean reclaiming any number of pages
around N.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memcg: introduce per-memcg reclaim interface

2020-09-22 Thread Michal Hocko
On Tue 22-09-20 11:10:17, Shakeel Butt wrote:
> On Tue, Sep 22, 2020 at 9:55 AM Michal Hocko  wrote:
[...]
> > So far I have learned that you are primarily working around an
> > implementation detail in the zswap which is doing the swapout path
> > directly in the pageout path.
> 
> Wait how did you reach this conclusion? I have explicitly said that we
> are not using uswapd like functionality in production. We are using
> this interface for proactive reclaim and proactive reclaim is not a
> workaround for implementation detail in the zswap.

Hmm, I must have missed the distinction between the two you have
mentioned. Correct me if I am wrong, but the "latency sensitive" workload
is the one that cannot use the high limit, right? For some reason I
thought that your pro-active reclaim usecase is also not compatible with
the throttling imposed by the high limit. Hence my conclusion above.
-- 
Michal Hocko
SUSE Labs


Re: Machine lockups on extreme memory pressure

2020-09-22 Thread Michal Hocko
On Tue 22-09-20 09:51:30, Shakeel Butt wrote:
> On Tue, Sep 22, 2020 at 9:34 AM Michal Hocko  wrote:
> >
> > On Tue 22-09-20 09:29:48, Shakeel Butt wrote:
[...]
> > > Anyways, what do you think of the in-kernel PSI based
> > > oom-kill trigger. I think Johannes had a prototype as well.
> >
> > We have talked about something like that in the past and established
> > that auto tuning for oom killer based on PSI is almost impossible to get
> > right for all potential workloads and so this belongs to userspace.
> > The kernel's oom killer is there as a last resort when the system gets close
> > to meltdown.
> 
> The system is already in meltdown state from the users perspective. I
> still think allowing the users to optionally set the oom-kill trigger
> based on PSI makes sense. Something like 'if all processes on the
> system are stuck for 60 sec, trigger oom-killer'.

We already have watchdogs for that, no? If you cannot really schedule
anything then the soft lockup detector should fire. In a meltdown state
like that a reboot is likely the best way forward anyway.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memcg: introduce per-memcg reclaim interface

2020-09-22 Thread Michal Hocko
On Tue 22-09-20 08:54:25, Shakeel Butt wrote:
> On Tue, Sep 22, 2020 at 4:49 AM Michal Hocko  wrote:
> >
> > On Mon 21-09-20 10:50:14, Shakeel Butt wrote:
[...]
> > > Let me add one more point. Even if the high limit reclaim is swift, it
> > > can still take 100s of usecs. Most of our jobs are anon-only and we
> > > use zswap. Compressing a page can take a couple usec, so 100s of usecs
> > > in limit reclaim is normal. For latency sensitive jobs, this amount of
> > > hiccups do matters.
> >
> > Understood. But isn't this an implementation detail of zswap? Can it
> > offload some of the heavy lifting to a different context and reduce the
> > general overhead?
> >
> 
> Are you saying doing the compression asynchronously? Similar to how
> the disk-based swap triggers the writeback and puts the page back to
> LRU, so the next time reclaim sees it, it will be instantly reclaimed?
> Or send the batch of pages to be compressed to a different CPU and
> wait for the completion?

Yes.

[...]

> > You are right that misconfigured limits can result in problems. But such
> > a configuration should be quite easy to spot which is not the case for
> > targetted reclaim calls which do not leave any footprints behind.
> > Existing interfaces are trying to not expose internal implementation
> > details as much as well. You are proposing a very targeted interface to
> > fine control the memory reclaim. There is a risk that userspace will
> > start depending on a specific reclaim implementation/behavior and future
> > changes would be prone to regressions in workloads relying on that. So
> > effectively, any user space memory reclaimer would need to be tuned to a
> > specific implementation of the memory reclaim.
> 
> I don't see the exposure of internal memory reclaim implementation.
> The interface is very simple. Reclaim a given amount of memory. Either
> the kernel will reclaim less memory or it will over reclaim. In case
> of reclaiming less memory, the user space can retry given there is
> enough reclaimable memory. For the over reclaim case, the user space
> will backoff for a longer time. How are the internal reclaim
> implementation details exposed?

In an ideal world, yes. A feedback mechanism would be independent of the
particular implementation. But reality tends to disagree quite often.
Once we provide a tool there will be users using it to the best of their
knowledge, very often as a hammer. This is what the history of kernel
regressions and "we have to revert an obvious fix because userspace
depends on an undocumented behavior which happened to work for some
time" has taught us the hard way.

I really do not want to deal with reports where a new heuristic in the
memory reclaim will break something just because the reclaim takes
slightly longer or over/under reclaims differently so the existing
assumptions break and the overall balancing from userspace breaks.

This might be a shiny exception of course. And please note that I am not
saying that the interface is completely wrong or unacceptable. I just
want to be absolutely sure we cannot move forward with the existing API
space that we have.

So far I have learned that you are primarily working around an
implementation detail in the zswap which is doing the swapout path
directly in the pageout path. That sounds like a very bad reason to add
a new interface. You are right that there are likely other usecases for
this new interface - mostly to emulate drop_caches - but I believe
those are quite misguided as well and we should work harder to help
them use the existing APIs. Last but not least, the memcg
background reclaim is something that should be possible without a new
interface.
-- 
Michal Hocko
SUSE Labs


Re: Machine lockups on extreme memory pressure

2020-09-22 Thread Michal Hocko
On Tue 22-09-20 09:29:48, Shakeel Butt wrote:
> On Tue, Sep 22, 2020 at 8:16 AM Michal Hocko  wrote:
> >
> > On Tue 22-09-20 06:37:02, Shakeel Butt wrote:
[...]
> > > I talked about this problem with Johannes at LPC 2019 and I think we
> > > talked about two potential solutions. First was to somehow give memory
> > > reserves to oomd and second was in-kernel PSI based oom-killer. I am
> > > not sure the first one will work in this situation but the second one
> > > might help.
> >
> > Why does your oomd depend on memory allocation?
> >
> 
> It does not but I think my concern was the potential allocations
> during syscalls.

So what is the problem then? Why can't your oomd kill anything?

> Anyways, what do you think of the in-kernel PSI based
> oom-kill trigger. I think Johannes had a prototype as well.

We have talked about something like that in the past and established
that auto tuning for oom killer based on PSI is almost impossible to get
right for all potential workloads and so this belongs to userspace.
The kernel's oom killer is there as a last resort when the system gets close
to meltdown.
-- 
Michal Hocko
SUSE Labs


Re: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func.

2020-09-22 Thread Michal Hocko
On Tue 22-09-20 15:12:57, Uladzislau Rezki wrote:
[...]
> > Mimicking a similar implementation shouldn't be all that hard
> > and you will get your own pool which doesn't affect other page allocator
> > users as much, as a bonus.
> > 
> I see your point Michal. As i mentioned before, it is important to avoid of
> having such own pools, because the aim is not to waste memory resources. A
> page will be returned back to "page allocator" as soon as a scheduler place  
> our reclaim thread on a CPU and grace period is passed. So, the resource
> can be used for other needs. What is important.
> 
> Otherwise a memory footprint is increased what is bad for low memory
> conditions when OOM is involved. Just in case, it is a big issue for
> mobile devices.

Really, how much memory are we talking about here? Do you have any
estimation? How many pointers do you need to store at once? 10k (that
would be about 20 pages per CPU)? Doesn't sound too big to me. But again,
I do not know the scale here. Also, if you really care, you can fine tune
this pool based on demand. All that is not rocket science and it can be
tuned outside of the page allocator rather than the other way around.
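
For reference, the back of the envelope behind the 20 pages figure,
assuming 8 byte pointers and 4 KiB pages:

	10,000 pointers * 8 bytes = ~80 KB
	80 KB / 4 KiB per page    = ~20 pages per CPU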

We will not move forward without any specific numbers here I am afraid.

[...]

> > Would a scaling similar to the page allocator's be feasible? Really, I
> > mostly care about the shared nature of the pcp allocator list, which one
> > user can easily monopolize with this API.
> > 
> I see your concern. pcplist can be monopolized by already existing API:
> 
> while (i < 100)
> __get_free_page(GFP_NOWAIT | __GFP_NOWARN);

They usually will not, because even non-sleeping allocations will refill
them unless memory is scarce and memory reclaim is needed. As I replied
to Paul in another email, this is not a question of correctness. It is a
matter of shifting the overhead around.

> > > Single-argument details is here: https://lkml.org/lkml/2020/4/28/1626
> > 
> > Error 501
> > 
> Could you please elaborate? Do not want to speculate :)

It threw a 501 at me. lkml.org is quite unreliable. It works now. I will
read through that. Please use lore or lkml.kernel.org/r/$msg in the future.

-- 
Michal Hocko
SUSE Labs


Re: Machine lockups on extreme memory pressure

2020-09-22 Thread Michal Hocko
On Tue 22-09-20 06:37:02, Shakeel Butt wrote:
[...]
> > I would recommend focusing on tracking down who is blocking
> > further progress.
> 
> I was able to find the CPU next in line for the list_lock from the
> dump. I don't think anyone is blocking the progress as such but more
> like the spinlock in the irq context is starving the spinlock in the
> process context. This is a high traffic machine and there are tens of
> thousands of potential network ACKs on the queue.

So there is forward progress but it is too slow for userspace to make
any reasonable progress?

> I talked about this problem with Johannes at LPC 2019 and I think we
> talked about two potential solutions. First was to somehow give memory
> reserves to oomd and second was in-kernel PSI based oom-killer. I am
> not sure the first one will work in this situation but the second one
> might help.

Why does your oomd depend on memory allocation?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memcg: introduce per-memcg reclaim interface

2020-09-22 Thread Michal Hocko
On Mon 21-09-20 10:50:14, Shakeel Butt wrote:
> On Mon, Sep 21, 2020 at 9:30 AM Michal Hocko  wrote:
> >
> > On Wed 09-09-20 14:57:52, Shakeel Butt wrote:
> > > Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> > >
> > > Use cases:
> > > --
> > >
> > > 1) Per-memcg uswapd:
> > >
> > > Usually applications consists of combination of latency sensitive and
> > > latency tolerant tasks. For example, tasks serving user requests vs
> > > tasks doing data backup for a database application. At the moment the
> > > kernel does not differentiate between such tasks when the application
> > > hits the memcg limits. So, potentially a latency sensitive user facing
> > > task can get stuck in high reclaim and be throttled by the kernel.
> > >
> > > Similarly there are cases of single process applications having two set
> > > of thread pools where threads from one pool have high scheduling
> > > priority and low latency requirement. One concrete example from our
> > > production is the VMM which have high priority low latency thread pool
> > > for the VCPUs while separate thread pool for stats reporting, I/O
> > > emulation, health checks and other managerial operations. The kernel
> > > memory reclaim does not differentiate between VCPU thread or a
> > > non-latency sensitive thread and a VCPU thread can get stuck in high
> > > reclaim.
> >
> > As those are presumably in the same cgroup, what prevents them from getting
> > stuck behind shared resources taken during the reclaim performed by
> > somebody else? I mean, memory reclaim might drop memory used by the high
> > priority task. Or they might simply stumble over the same locks.
> >
> 
> Yes there are a lot of challenges in providing isolation between
> latency sensitive and latency tolerant jobs/threads. This proposal
> aims to solve one specific challenge memcg limit reclaim.

I am fully aware that complete isolation is hard to achieve. I am just
trying to evaluate whether this specific usecase is worth a new interface
that we will have to maintain forever. Especially when I suspect that the
interface will likely only paper over the immediate problems rather than
offer a long term maintainable solution for them.

> > I am also more interested in actual numbers here. The high limit reclaim
> > is normally swift and should be mostly unnoticeable. If the reclaim gets
> > more expensive then it can get really noticeable for sure. But for the
> > latter the same can happen with an external pro-active reclaimer as
> 
> I think you meant 'uswapd' here instead of pro-active reclaimer.
>
> > well, right? So there is no real "guarantee". Do you have any numbers
> > from your workloads where you can demonstrate that the external reclaim
> > has saved you this amount of effective cpu time of the sensitive
> > workload? (Essentially measure how much time it has to consume in the
> > high limit reclaim)
> >
> 
> What we actually use in our production is the 'proactive reclaim'
> which I have explained in the original message but I will add a couple
> more sentences below.
> 
> For the uswapd use-case, let me point to the previous discussions and
> feature requests by others [1, 2]. One of the limiting factors of
> these previous proposals was the lack of CPU accounting of the
> background reclaimer which the current proposal solves by enabling the
> user space solution.
> 
> [1] https://lwn.net/Articles/753162/
> [2] http://lkml.kernel.org/r/20200219181219.54356-1-han...@cmpxchg.org

I remember those. My understanding was that the only problem is to
properly account for CPU spent on behalf of the reclaimed cgroup, and
that work has been in progress.

Outsourcing all of that to userspace surely sounds like an attractive
option but it comes with the usual user API price. More on that later.

> Let me add one more point. Even if the high limit reclaim is swift, it
> can still take 100s of usecs. Most of our jobs are anon-only and we
> use zswap. Compressing a page can take a couple usec, so 100s of usecs
> in limit reclaim is normal. For latency sensitive jobs, this amount of
> hiccups do matters.

Understood. But isn't this an implementation detail of zswap? Can it
offload some of the heavy lifting to a different context and reduce the
general overhead?

> For the proactive reclaim, based on the refault medium, we define
> tolerable refault rate of the applications. Then we proactively
> reclaim memory from the applications and monitor the refault rates.
> Based on the refault rates, the memory overcommit manager controls the
> aggressiveness of 

Re: Machine lockups on extreme memory pressure

2020-09-22 Thread Michal Hocko
On Mon 21-09-20 11:35:35, Shakeel Butt wrote:
> Hi all,
> 
> We are seeing machine lockups due extreme memory pressure where the
> free pages on all the zones are way below the min watermarks. The stack
> of the stuck CPU looks like the following (I had to crash the machine to
> get the info).

sysrq+l didn't report anything?
 
>  #0 [ ] crash_nmi_callback
>  #1 [ ] nmi_handle
>  #2 [ ] default_do_nmi
>  #3 [ ] do_nmi
>  #4 [ ] end_repeat_nmi
> ---  ---
>  #5 [ ] queued_spin_lock_slowpath
>  #6 [ ] _raw_spin_lock
>  #7 [ ] cache_alloc_node
>  #8 [ ] fallback_alloc
>  #9 [ ] __kmalloc_node_track_caller
> #10 [ ] __alloc_skb
> #11 [ ] tcp_send_ack
> #12 [ ] tcp_delack_timer
> #13 [ ] run_timer_softirq
> #14 [ ] irq_exit
> #15 [ ] smp_apic_timer_interrupt
> #16 [ ] apic_timer_interrupt
> ---  ---
> #17 [ ] apic_timer_interrupt
> #18 [ ] _raw_spin_lock
> #19 [ ] vmpressure
> #20 [ ] shrink_node
> #21 [ ] do_try_to_free_pages
> #22 [ ] try_to_free_pages
> #23 [ ] __alloc_pages_direct_reclaim
> #24 [ ] __alloc_pages_nodemask
> #25 [ ] cache_grow_begin
> #26 [ ] fallback_alloc
> #27 [ ] __kmalloc_node_track_caller
> #28 [ ] __alloc_skb
> #29 [ ] tcp_sendmsg_locked
> #30 [ ] tcp_sendmsg
> #31 [ ] inet6_sendmsg
> #32 [ ] ___sys_sendmsg
> #33 [ ] sys_sendmsg
> #34 [ ] do_syscall_64
> 
> These are high traffic machines. Almost all the CPUs are stuck on the
> root memcg's vmpressure sr_lock and almost half of the CPUs are stuck
> on kmem cache node's list_lock in the IRQ.

Are you able to track down the lock holder?

> Note that the vmpressure sr_lock is irq-unsafe.

Which is ok because this is only triggered from the memory reclaim and
that cannot ever happen from the interrupt context for obvious reasons.

> Couple of months back, we observed a similar
> situation with swap locks which forces us to disable swap on global
> pressure. Since we do proactive reclaim disabling swap on global reclaim
> was not an issue. However now we have started seeing the same situation
> with other irq-unsafe locks like vmpressure sr_lock and almost all the
> slab shrinkers have irq-unsafe spinlocks. One of way to mitigate this
> is by converting all such locks (which can be taken in reclaim path)
> to be irq-safe but it does not seem like a maintainable solution.

This doesn't make much sense to be honest. We are not disabling IRQs
unless it is absolutely necessary.

> Please note that we are running user space oom-killer which is more
> aggressive than oomd/PSI but even that got stuck under this much memory
> pressure.
> 
> I am wondering if anyone else has seen a similar situation in production
> and if there is a recommended way to resolve this situation.

I would recommend focusing on tracking down who is blocking
further progress.
-- 
Michal Hocko
SUSE Labs


Re: [v4] mm: khugepaged: avoid overriding min_free_kbytes set by user

2020-09-22 Thread Michal Hocko
On Tue 22-09-20 10:07:26, Kirill A. Shutemov wrote:
> On Mon, Sep 21, 2020 at 12:07:23PM -0700, Vijay Balakrishna wrote:
> > > 
> > > I would recommend reposting the patch which adds heuristic for THP (if
> > > THP is enabled) into the hotplug path, arguing with the consistency and
> > > surprising results when adding memory decreases the value.
> > 
> > I hope my reposted patch
> > ([v3 1/2] mm: khugepaged: recalculate min_free_kbytes after memory hotplug
> > as expected by khugepaged)
> > change log is ok:
> > 
> > When memory is hotplug added or removed the min_free_kbytes must be
> > recalculated based on what is expected by khugepaged.  Currently
> > after hotplug, min_free_kbytes will be set to a lower default and higher
> > default set when THP enabled is lost.  This change restores min_free_kbytes
> > as expected for THP consumers.
> 
> Any scenario when hotremove would result in changing min_free_kbytes?

init_per_zone_wmark_min is called from both the online and offline
paths. But I believe the problem is not in the offlining path. A decrease
wrt the previous auto tuned value is to be expected. The primary problem
is that hotadding memory after boot (without any user-configured value)
will effectively decrease the value because the khugepaged tuning
(set_recommended_min_free_kbytes) is not called.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/memcontrol: Add the drop_cache interface for cgroup v2

2020-09-22 Thread Michal Hocko
On Tue 22-09-20 16:06:31, Yafang Shao wrote:
> On Tue, Sep 22, 2020 at 3:27 PM Michal Hocko  wrote:
[...]
> > What is the latency triggered by the memory reclaim? It should be mostly
> > clean page cache, right, as drop_caches only drops clean pages. Or is
> > this more about the [id]cache? Do you have any profiles of where the time
> > is spent?
> >
> 
> Yes, we have analyzed the issues in the direct reclaim, but that is
> not the point.

Are those fixed?

> The point is that each case may take us several days to analyze, while
> the user can't wait, so they will use drop_caches to workaround it
> until we find the solution.

As I've said there are several options to achieve an immediate action.
Careful resource domains configuration will certainly help with that.
-- 
Michal Hocko
SUSE Labs


Re: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func.

2020-09-22 Thread Michal Hocko
On Mon 21-09-20 20:35:53, Paul E. McKenney wrote:
> On Mon, Sep 21, 2020 at 06:03:18PM +0200, Michal Hocko wrote:
> > On Mon 21-09-20 08:45:58, Paul E. McKenney wrote:
> > > On Mon, Sep 21, 2020 at 09:47:16AM +0200, Michal Hocko wrote:
> > > > On Fri 18-09-20 21:48:15, Uladzislau Rezki (Sony) wrote:
> > > > [...]
> > > > > Proposal
> > > > > 
> > > > > Introduce a lock-free function that obtain a page from the 
> > > > > per-cpu-lists
> > > > > on current CPU. It returns NULL rather than acquiring any non-raw 
> > > > > spinlock.
> > > > 
> > > > I was not happy about this solution when we have discussed this
> > > > last time and I have to say I am still not happy. This is exposing
> > > > an internal allocator optimization and allows a hard to estimate
> > > > consumption of pcp free pages. IIUC this run on the pcp cache can be
> > > > controlled directly from userspace (close(open) loop IIRC) which makes it
> > > > an even bigger no-no.
> > > 
> > > Yes, I do well remember that you are unhappy with this approach.
> > > Unfortunately, thus far, there is no solution that makes all developers
> > > happy.  You might be glad to hear that we are also looking into other
> > > solutions, each of which makes some other developers unhappy.  So we
> > > are at least not picking on you alone.  :-/
> > 
> > No worries I do not feel like a whipping boy here. But do expect me to
> > argue against the approach. I would also appreciate it if there was some
> > more information on other attempts, why they have failed. E.g. why
> > pre-allocation is not an option that works well enough in most
> > reasonable workloads. I would also appreciate some more thoughts why we
> > need to optimize for heavy abusers of RCU (like close(open) extremes).
> 
> Not optimizing for them, but rather defending against them.  Uladzislau
> gave the example of low-memory phones.  And we have quite the array
> of defenses against other userspace bugs including MMUs, the "limit"
> command, and so on.
> 
> There have been quite a few successful attempts, starting from the
> introduction of blimit and RCU-bh more than 15 years ago, continuing
> through making call_rcu() jump-start grace periods, IPIing reluctant
> CPUs, tuning RCU callback offloading, and many others.  But these prior
> approaches have only taken us so far.

Thank you, this is useful. It gives me some idea about the problem but I
still cannot picture how serious the problem really is. Is this a
DoS-like scenario where an unprivileged task is able to monopolize the
system/CPU to do all the RCU heavy lifting? Are other RCU users deprived
of doing their portion? How much does expediting help? Does it really
solve the problem or does it merely shift it around, since you will have
a hard time keeping up with more and more tasks hitting the same path
while the pre-allocated memory is a very finite resource.

> Other approaches under consideration include making CONFIG_PREEMPT_COUNT
> unconditional and thus allowing call_rcu() and kvfree_rcu() to determine
> whether direct calls to the allocator are safe (some guy named Linus
> doesn't like this one),

I assume that the primary argument is the overhead, right? Do you happen
to have any reference?

> preallocation (Uladzislau covered this, and
> the amount that would need to be preallocated is excessive), deferring
> allocation to RCU_SOFTIRQ (this would also need CONFIG_PREEMPT_COUNT),
> and deferring to some clean context (which is the best we can do within
> the confines of RCU, but which can have issues with delay).

I have already replied to that so let's not split the discussion into
several fronts.

> So it is not the need to address this general problem that is new.
> Far from it!  What is new is the need for changes outside of RCU.
> 
> > > > I strongly agree with Thomas 
> > > > http://lkml.kernel.org/r/87tux4kefm@nanos.tec.linutronix.de
> > > > that this optimization is not aiming at reasonable workloads. Really, go
> > > > with pre-allocated buffer and fallback to whatever slow path you have
> > > > already. Exposing more internals of the allocator is not going to do any
> > > > good for long term maintainability.
> > > 
> > > I suggest that you carefully re-read the thread following that email.
> > 
> > I clearly remember Thomas not being particularly happy that you optimize
> > for a corner case. I do not remember there being a consensus that this
> > is the right approach. There was some consensus that this is better than
> > a 

Re: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func.

2020-09-22 Thread Michal Hocko
[Cc Mel - the thread starts 
http://lkml.kernel.org/r/20200918194817.48921-1-ure...@gmail.com]

On Mon 21-09-20 21:48:19, Uladzislau Rezki wrote:
> Hello, Michal.
> 
> > >
> > > Yes, I do well remember that you are unhappy with this approach.
> > > Unfortunately, thus far, there is no solution that makes all developers
> > > happy.  You might be glad to hear that we are also looking into other
> > > solutions, each of which makes some other developers unhappy.  So we
> > > are at least not picking on you alone.  :-/
> > 
> > No worries I do not feel like a whipping boy here. But do expect me to
> > argue against the approach. I would also appreciate it if there was some
> > more information on other attempts, why they have failed. E.g. why
> > pre-allocation is not an option that works well enough in most
> > reasonable workloads.
> Pre-allocating has some drawbacks:
> 
> a) It is impossible to predict how many pages will be required to
>cover a demand that is controlled by different workloads on
>various systems.

Yes, this is not trivial but it is not rocket science either. Remember
that you are relying on a very dumb watermark based pcp pool from the
allocator. Mimicking a similar implementation shouldn't be all that hard
and you will get your own pool which doesn't affect other page allocator
users as much, as a bonus.

> b) Memory overhead since we do not know how much pages should be
>preloaded: 100, 200 or 300

Does anybody who really needs this optimization actually care about 300
pages?

> As for memory overhead, it is important to reduce it because of
> embedded devices like phones, where a low memory condition is a
> big issue. In that sense pre-allocating is something that we strongly
> would like to avoid.

How big are the "machines" we are talking about here? I would expect
that really tiny machines would have a hard time filling up thousands of
pages with pointers to free...

Would a scaling similar to the page allocator's be feasible? Really, I
mostly care about the shared nature of the pcp allocator list, which one
user can easily monopolize with this API.

> > I would also appreciate some more thoughts why we
> > need to optimize for heavy abusers of RCU (like close(open) extremes).
> > 
> I think here is a small misunderstanding. Please note, that is not only
> about performance and corner cases. There is a single argument support
> of the kvfree_rcu(ptr), where maintaining an array in time is needed.
> The fallback of the single argument case is extremely slow.

This should be part of the changelog.
> 
> Single-argument details is here: https://lkml.org/lkml/2020/4/28/1626

Error 501

> > > > I strongly agree with Thomas 
> > > > http://lkml.kernel.org/r/87tux4kefm@nanos.tec.linutronix.de
> > > > that this optimization is not aiming at reasonable workloads. Really, go
> > > > with pre-allocated buffer and fallback to whatever slow path you have
> > > > already. Exposing more internals of the allocator is not going to do any
> > > > good for long term maintainability.
> > > 
> > > I suggest that you carefully re-read the thread following that email.
> > 
> > I clearly remember Thomas not being particularly happy that you optimize
> > for a corner case. I do not remember there being a consensus that this
> > is the right approach. There was some consensus that this is better than
> > a gfp flag. Still quite bad though if you ask me.
> > 
> > > Given a choice between making users unhappy and making developers
> > > unhappy, I will side with the users each and every time.
> > 
> > Well, let me rephrase. It is not only about me (as a developer) being
> > unhappy but also all the side effects this would have for users when
> > performance of their favorite workload declines for no apparent reason
> > just because pcp caches are depleted by an unrelated process.
> >
> If depleted, we have a special worker that charge it. From the other hand,
> the pcplist can be depleted by its nature, what _is_ not wrong. But just
> in case we secure it since you had a concern about it.

pcp free lists should only ever get empty when we run out of memory and
need to reclaim. Otherwise they are constantly refilled/rebalanced on
demand. The fact that you are refilling them from the outside just
suggests that you are operating on the wrong layer. Really, create your
own pool of pages and rebalance them based on the workload.

> Could you please specify a real test case or workload you are talking about?

I am not a performance expert but essentially any memory allocator heavy
workload might notice. I am pretty sure Mel would tell you more.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/memcontrol: Add the drop_cache interface for cgroup v2

2020-09-22 Thread Michal Hocko
On Tue 22-09-20 12:20:52, Yafang Shao wrote:
> On Mon, Sep 21, 2020 at 7:36 PM Michal Hocko  wrote:
> >
> > On Mon 21-09-20 19:23:01, Yafang Shao wrote:
> > > On Mon, Sep 21, 2020 at 7:05 PM Michal Hocko  wrote:
> > > >
> > > > On Mon 21-09-20 18:55:40, Yafang Shao wrote:
> > > > > On Mon, Sep 21, 2020 at 4:12 PM Michal Hocko  wrote:
> > > > > >
> > > > > > On Mon 21-09-20 16:02:55, zangchun...@bytedance.com wrote:
> > > > > > > From: Chunxin Zang 
> > > > > > >
> > > > > > > In cgroup v1, we have the 'force_empty' interface. This is very
> > > > > > > useful for userspace to actively release memory. But the cgroup
> > > > > > > v2 does not.
> > > > > > >
> > > > > > > This patch reuses cgroup v1's function, but with a new name for
> > > > > > > the interface, because I think 'drop_cache' may be easier to
> > > > > > > understand :)
> > > > > >
> > > > > > This should really explain a usecase. Global drop_caches is a 
> > > > > > terrible
> > > > > > interface and it has caused many problems in the past. People have
> > > > > > learned to use it as a remedy to any problem they might see and 
> > > > > > cause
> > > > > > other problems without realizing that. This is the reason why we 
> > > > > > even
> > > > > > log each attempt to drop caches.
> > > > > >
> > > > > > I would rather not repeat the same mistake on the memcg level unless
> > > > > > there is a very strong reason for it.
> > > > > >
> > > > >
> > > > > I think we'd better add these comments above the function
> > > > > mem_cgroup_force_empty() to explain why we don't want to expose this
> > > > > interface in cgroup2, otherwise people will continue to send this
> > > > > proposal without any strong reason.
> > > >
> > > > I do not mind people sending this proposal.  "V1 used to have an
> > > > interface, we need it in v2 as well" is not really viable without
> > > > providing more reasoning on the specific usecase.
> > > >
> > > > _Any_ patch should have a proper justification. This is nothing really
> > > > new to the process and I am wondering why this is coming as a surprise.
> > > >
> > >
> > > Container users always want to drop cache in a specific container,
> > > because they used to use drop_caches to fix memory pressure issues.
> >
> > This is exactly the kind of problem we have seen in the past. There
> > should be zero reason to address potential reclaim problems by dropping
> > page cache on the floor. There is a huge cargo cult around this
> > procedure and I have seen numerous reports where people complained about
> > performance afterwards just to learn that the dropped page cache was one
> > of the reasons for that.
> >
> > > Although drop_caches can cause some unexpected issues, it could also
> > > fix some issues.
> >
> > "Some issues" is way too general. We really want to learn about those
> > issues and address them properly.
> >
> 
> One use case in our production environment is that some of our tasks
> become very latency sensitive from 7am to 10am, so before these tasks
> become active we will use drop_caches to drop page caches generated by
> other tasks at night to avoid these tasks triggering direct reclaim.
>
> The best way to do it is to fix the latency in direct reclaim, but it
> will take great effort.

What is the latency triggered by the memory reclaim? It should be mostly
clean page cache, right, as drop_caches only drops clean pages. Or is
this more about the [id]cache? Do you have any profiles of where the time
is spent?

> while drop_caches give us an easier way to achieve the same goal.

It papers over real problems, which is what my earlier comment was about.

> IOW, drop_caches give the users an option to achieve their goal before
> they find a better solution.

You can achieve the same with a different configuration already. You can
isolate your page-cache-hungry overnight workload (likely a backup) into
its own memcg. You can either use an aggressive high limit during the
run or simply reduce the high/max limit after the work is done.

If you cannot isolate that easily then you can lower high limit
temporarily before your peak workload.

Really, throwing all the page cache away just to prepare for a peak
workload sounds like a bad idea to me. You are potentially throwing away
data that you will have to spend time refaulting again.
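
For illustration only (the cgroup name and the values are made up), the
clamp-and-restore approach boils down to a couple of writes to the backup
job's memory.high:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write a value such as "2G" or "max" to <cgroup>/memory.high. */
static int set_memory_high(const char *cgroup, const char *value)
{
	char path[256];
	int fd, ret = 0;

	snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/memory.high", cgroup);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	if (write(fd, value, strlen(value)) < 0)
		ret = -1;
	close(fd);
	return ret;
}

/*
 * set_memory_high("backup", "2G");    before the overnight run
 * set_memory_high("backup", "max");   once it has finished
 */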
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] memcg: introduce per-memcg reclaim interface

2020-09-21 Thread Michal Hocko
On Wed 09-09-20 14:57:52, Shakeel Butt wrote:
> Introduce an memcg interface to trigger memory reclaim on a memory cgroup.
> 
> Use cases:
> --
> 
> 1) Per-memcg uswapd:
> 
> Usually applications consists of combination of latency sensitive and
> latency tolerant tasks. For example, tasks serving user requests vs
> tasks doing data backup for a database application. At the moment the
> kernel does not differentiate between such tasks when the application
> hits the memcg limits. So, potentially a latency sensitive user facing
> task can get stuck in high reclaim and be throttled by the kernel.
> 
> Similarly there are cases of single process applications having two set
> of thread pools where threads from one pool have high scheduling
> priority and low latency requirement. One concrete example from our
> production is the VMM which have high priority low latency thread pool
> for the VCPUs while separate thread pool for stats reporting, I/O
> emulation, health checks and other managerial operations. The kernel
> memory reclaim does not differentiate between VCPU thread or a
> non-latency sensitive thread and a VCPU thread can get stuck in high
> reclaim.

As those are presumably in the same cgroup, what prevents them from
getting stuck behind shared resources taken during the reclaim performed
by somebody else? I mean, memory reclaim might drop memory used by the
high priority task. Or they might simply stumble over the same locks.

I am also more interested in actual numbers here. The high limit reclaim
is normally swift and should be mostly unnoticeable. If the reclaim gets
more expensive then it can get really noticeable for sure. But for the
latter the same can happen with an external pro-active reclaimer as
well, right? So there is no real "guarantee". Do you have any numbers
from your workloads where you can demonstrate that the external reclaim
has saved you this amount of effective cpu time of the sensitive
workload? (Essentially measure how much time it has to consume in the
high limit reclaim)

To the feature itself, I am not yet convinced we want to have a feature
like that. It surely sounds easy to use and attractive for better user
space control. It is also much better defined than drop_caches/force_empty
because it is not all or nothing. But it also sounds like something too
easy to use incorrectly (remember drop_caches). I am also a bit worried
about corner cases which would be easier to hit - e.g. fill up the swap
limit and turn anonymous memory into unreclaimable and who knows what
else.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/4] mm: Trial do_wp_page() simplification

2020-09-21 Thread Michal Hocko
On Mon 21-09-20 17:04:50, Christian Brauner wrote:
> On Mon, Sep 21, 2020 at 04:55:37PM +0200, Michal Hocko wrote:
> > On Mon 21-09-20 16:43:55, Christian Brauner wrote:
> > > On Mon, Sep 21, 2020 at 10:38:47AM -0400, Tejun Heo wrote:
> > > > Hello,
> > > > 
> > > > On Mon, Sep 21, 2020 at 04:28:34PM +0200, Michal Hocko wrote:
> > > > > Fundamentally CLONE_INTO_CGROUP is similar to regular fork + move to 
> > > > > the
> > > > > target cgroup after the child gets executed. So in principle there
> > > > > shouldn't be any big difference. Except that the move has to be 
> > > > > explicit
> > > > > and the child has to have enough privileges to move itself. I am 
> > > > > not
> > > > 
> > > > Yeap, they're supposed to be the same operations. We've never clearly
> > > > defined how the accounting gets split across moves because 1. it's
> > > > inherently blurry and difficult 2. doesn't make any practical 
> > > > difference for
> > > > the recommended and vast majority usage pattern which uses migration to 
> > > > seed
> > > > the new cgroup. CLONE_INTO_CGROUP doesn't change any of that.
> > > > 
> > > > > completely sure about CLONE_INTO_CGROUP model though. According to man
> > > > > clone(2) it seems that O_RDONLY for the target cgroup directory is
> > > > > sufficient. That seems much more relaxed IIUC and it would allow to 
> > > > > fork
> > > > > into a different cgroup while keeping a lot of resources in the 
> > > > > parent's
> > > > > proper.
> > > > 
> > > > If the man page is documenting that, it's wrong. cgroup_css_set_fork() 
> > > > has
> > > > an explicit cgroup_may_write() test on the destination cgroup.
> > > > CLONE_INTO_CGROUP should follow exactly the same rules as regular
> > > > migrations.
> > > 
> > > Indeed!
> > > The O_RDONLY mention on the manpage doesn't make sense but it is
> > > explained that the semantics are exactly the same for moving via the
> > > filesystem:
> > 
> > OK, if the semantic is the same as for the task migration then I do not
> > see any (new) problems. Care to point me where the actual check is
> > enforced? For the migration you need a write access to cgroup.procs but
> > if the API expects directory fd then I am not sure how that would expose
> > the same behavior.
> 
> kernel/cgroup/cgroup.c:cgroup_csset_fork()
> 
> So there's this, which is the first check for inode_permission() essentially:
> 
>   /*
>* Verify that we the target cgroup is writable for us. This is
>* usually done by the vfs layer but since we're not going through
>* the vfs layer here we need to do it "manually".
>*/
>   ret = cgroup_may_write(dst_cgrp, sb);
>   if (ret)
>   goto err;
> 
> and what you're referring to is checked right after in:
> 
>   ret = cgroup_attach_permissions(cset->dfl_cgrp, dst_cgrp, sb,
>   !(kargs->flags & CLONE_THREAD));
>   if (ret)
>   goto err;
> 
> which calls:
> 
>   ret = cgroup_procs_write_permission(src_cgrp, dst_cgrp, sb);
>   if (ret)
>   return ret;
> 
> That should be what you're looking for. I've also added selftests as
> always that verify this behavior under:
> 
> tools/testing/selftests/cgroup/
> 
> as soon as CLONE_INTO_CGROUP is detected on the kernel than all the
> usual tests are exercised using CLONE_INTO_CGROUP so we should've seen
> any regression hopefully.

Thanks a lot for this clarification! So I believe the only existing bug
is in the documentation, which should be explicit that read access to the
cgroup fd is not sufficient because it also requires write access to
cgroup.procs in the same directory at the time of fork. I will send
a patch if I find some time for that.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func.

2020-09-21 Thread Michal Hocko
On Mon 21-09-20 08:45:58, Paul E. McKenney wrote:
> On Mon, Sep 21, 2020 at 09:47:16AM +0200, Michal Hocko wrote:
> > On Fri 18-09-20 21:48:15, Uladzislau Rezki (Sony) wrote:
> > [...]
> > > Proposal
> > > 
> > > Introduce a lock-free function that obtain a page from the per-cpu-lists
> > > on current CPU. It returns NULL rather than acquiring any non-raw 
> > > spinlock.
> > 
> > I was not happy about this solution when we have discussed this
> > last time and I have to say I am still not happy. This is exposing
> > an internal allocator optimization and allows a hard to estimate
consumption of pcp free pages. IIUC this run on the pcp cache can be
controlled directly from userspace (close(open) loop IIRC) which makes it
an even bigger no-no.
> 
> Yes, I do well remember that you are unhappy with this approach.
> Unfortunately, thus far, there is no solution that makes all developers
> happy.  You might be glad to hear that we are also looking into other
> solutions, each of which makes some other developers unhappy.  So we
> are at least not picking on you alone.  :-/

No worries I do not feel like a whipping boy here. But do expect me to
argue against the approach. I would also appreciate it if there was some
more information on other attempts, why they have failed. E.g. why
pre-allocation is not an option that works well enough in most
reasonable workloads. I would also appreciate some more thoughts why we
need to optimize for heavy abusers of RCU (like close(open) extremes).

> > I strongly agree with Thomas 
> > http://lkml.kernel.org/r/87tux4kefm@nanos.tec.linutronix.de
> > that this optimization is not aiming at reasonable workloads. Really, go
> > with pre-allocated buffer and fallback to whatever slow path you have
> > already. Exposing more internals of the allocator is not going to do any
> > good for long term maintainability.
> 
> I suggest that you carefully re-read the thread following that email.

I clearly remember Thomas not being particularly happy that you optimize
for a corner case. I do not remember there being a consensus that this
is the right approach. There was some consensus that this is better than
a gfp flag. Still quite bad though if you ask me.

> Given a choice between making users unhappy and making developers
> unhappy, I will side with the users each and every time.

Well, let me rephrase. It is not only about me (as a developer) being
unhappy but also all the side effects this would have for users when
performance of their favorite workload declines for no apparent reason
just because pcp caches are depleted by an unrelated process.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/4] mm: Trial do_wp_page() simplification

2020-09-21 Thread Michal Hocko
On Mon 21-09-20 16:41:34, Christian Brauner wrote:
> On Mon, Sep 21, 2020 at 03:42:00PM +0200, Michal Hocko wrote:
> > [Cc Tejun and Christian - this is a part of a larger discussion which is
> >  not directly related to this particular question so let me trim the
> >  original email to the bare minimum.]
> > 
> > On Fri 18-09-20 12:40:32, Peter Xu wrote:
> > [...]
> > > One issue is when we charge for cgroup we probably can't do that onto the 
> > > new
> > > mm/task, since copy_namespaces() is called after copy_mm().  I don't know
> > > enough about cgroup, I thought the child will inherit the parent's, but 
> > > I'm not
> > > sure.  Or, can we change that order of copy_namespaces() && copy_mm()?  I 
> > > don't
> > > see a problem so far but I'd like to ask first..
> > 
> > I suspect you are referring to CLONE_INTO_CGROUP, right? I have only now
> > learned about this feature so I am not deeply familiar with all the
> > details and I might be easily wrong. Normally all the cgroup aware
> > resources are accounted to the parent's cgroup. For memcg that includes
> > all the page tables, early CoW and other allocations with __GFP_ACCOUNT.
> > IIUC CLONE_INTO_CGROUP properly then this hasn't changed as the child is
> > associated to its new cgroup (and memcg) only in cgroup_post_fork. If
> > that is correct then we might have quite a lot of resources bound to
> > child's lifetime but accounted to the parent's memcg which can lead to
> > all sorts of interesting problems (e.g. unreclaimable memory - even by
> > the oom killer).
> > 
> > Christian, Tejun is this the expected semantic or I am just misreading
> > the code?
> 
> Hey Michal,
> 
> Thanks for the Cc!
> 
> If I understand your question correctly, then you are correct. The logic
> is split in three simple parts:
> 1. Child gets created and doesn't live in any cset
>- This should mean that resources are still charged against the
>  parent's memcg which is what you're asking afiu.
> 2. cgroup_can_fork()
>- create new or find existing matching cset for the child
> 3. cgroup_post_fork()
>- move/attach child to the new or found cset
> 
> _Purely from a CLONE_INTO_CGROUP perspective_ you should be ok to
> reverse the order of copy_mm() and copy_namespaces().

Switching the order wouldn't make much of a difference, right? At least
not for memcg where all the accounted allocations will still go to
current's memcg.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/4] mm: Trial do_wp_page() simplification

2020-09-21 Thread Michal Hocko
On Mon 21-09-20 16:43:55, Christian Brauner wrote:
> On Mon, Sep 21, 2020 at 10:38:47AM -0400, Tejun Heo wrote:
> > Hello,
> > 
> > On Mon, Sep 21, 2020 at 04:28:34PM +0200, Michal Hocko wrote:
> > > > Fundamentally CLONE_INTO_CGROUP is similar to regular fork + move to the
> > > target cgroup after the child gets executed. So in principle there
> > > shouldn't be any big difference. Except that the move has to be explicit
> > > > and the child has to have enough privileges to move itself. I am not
> > 
> > Yeap, they're supposed to be the same operations. We've never clearly
> > defined how the accounting gets split across moves because 1. it's
> > inherently blurry and difficult 2. doesn't make any practical difference for
> > the recommended and vast majority usage pattern which uses migration to seed
> > the new cgroup. CLONE_INTO_CGROUP doesn't change any of that.
> > 
> > > completely sure about CLONE_INTO_CGROUP model though. According to man
> > > clone(2) it seems that O_RDONLY for the target cgroup directory is
> > > sufficient. That seems much more relaxed IIUC and it would allow to fork
> > > into a different cgroup while keeping a lot of resources in the parent's
> > > proper.
> > 
> > If the man page is documenting that, it's wrong. cgroup_css_set_fork() has
> > an explicit cgroup_may_write() test on the destination cgroup.
> > CLONE_INTO_CGROUP should follow exactly the same rules as regular
> > migrations.
> 
> Indeed!
> The O_RDONLY mention on the manpage doesn't make sense but it is
> explained that the semantics are exactly the same for moving via the
> filesystem:

OK, if the semantic is the same as for the task migration then I do not
see any (new) problems. Care to point me where the actual check is
enforced? For the migration you need a write access to cgroup.procs but
if the API expects directory fd then I am not sure how that would expose
the same behavior.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/4] mm: Trial do_wp_page() simplification

2020-09-21 Thread Michal Hocko
On Mon 21-09-20 10:18:30, Peter Xu wrote:
> Hi, Michal,
> 
> On Mon, Sep 21, 2020 at 03:42:00PM +0200, Michal Hocko wrote:
[...]
> > I have only now
> > learned about this feature so I am not deeply familiar with all the
> > details and I might be easily wrong. Normally all the cgroup aware
> > resources are accounted to the parent's cgroup. For memcg that includes
> > all the page tables, early CoW and other allocations with __GFP_ACCOUNT.
> > IIUC CLONE_INTO_CGROUP properly then this hasn't changed as the child is
> > associated to its new cgroup (and memcg) only in cgroup_post_fork. If
> > that is correct then we might have quite a lot of resources bound to
> > child's lifetime but accounted to the parent's memcg which can lead to
> > all sorts of interesting problems (e.g. unreclaimable memory - even by
> > the oom killer).
> 
> Right that's one of the things that I'm confused too, on that if we always
> account to the parent, then when the child quits whether we uncharge them or
> not, and how..  Not sure whether the accounting of the parent could steadily
> grow as it continues the fork()s.
> 
> So is it by design that we account all these to the parents?

Let me try to clarify a bit further my concern.  Without CLONE_INTO_CGROUP
this makes some sense. Because both parent and child will live in
the same cgroup. All the charges are reference counted so they will
be released when the respective resource gets freed (e.g. page table
released or the backing page dropped) irrespective of the current cgroup
the owner is living in.

Fundamentally CLONE_INTO_CGROUP is similar to regular fork + move to the
target cgroup after the child gets executed. So in principle there
shouldn't be any big difference. Except that the move has to be explicit
and the child has to have enough privileges to move itself. I am not
completely sure about CLONE_INTO_CGROUP model though. According to man
clone(2) it seems that O_RDONLY for the target cgroup directory is
sufficient. That seems much more relaxed IIUC and it would allow to fork
into a different cgroup while keeping a lot of resources in the parent's
proper.
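
For context, the userspace side looks roughly like this (untested sketch,
error handling trimmed; it assumes a kernel and headers that provide
clone3 and CLONE_INTO_CGROUP):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/sched.h>	/* struct clone_args, CLONE_INTO_CGROUP */
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fork a child directly into the cgroup given by a directory fd. */
static pid_t fork_into_cgroup(const char *cgrp_path)
{
	int cgrp_fd = open(cgrp_path, O_DIRECTORY | O_CLOEXEC);
	struct clone_args args = {
		.flags		= CLONE_INTO_CGROUP,
		.exit_signal	= SIGCHLD,
		.cgroup		= (unsigned long long)cgrp_fd,
	};

	if (cgrp_fd < 0)
		return -1;
	return syscall(__NR_clone3, &args, sizeof(args));
}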
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/4] mm: Trial do_wp_page() simplification

2020-09-21 Thread Michal Hocko
[Cc Tejun and Christian - this is a part of a larger discussion which is
 not directly related to this particular question so let me trim the
 original email to the bare minimum.]

On Fri 18-09-20 12:40:32, Peter Xu wrote:
[...]
> One issue is when we charge for cgroup we probably can't do that onto the new
> mm/task, since copy_namespaces() is called after copy_mm().  I don't know
> enough about cgroup, I thought the child will inherit the parent's, but I'm 
> not
> sure.  Or, can we change that order of copy_namespaces() && copy_mm()?  I 
> don't
> see a problem so far but I'd like to ask first..

I suspect you are referring to CLONE_INTO_CGROUP, right? I have only now
learned about this feature so I am not deeply familiar with all the
details and I might be easily wrong. Normally all the cgroup aware
resources are accounted to the parent's cgroup. For memcg that includes
all the page tables, early CoW and other allocations with __GFP_ACCOUNT.
IIUC CLONE_INTO_CGROUP properly then this hasn't changed as the child is
associated to its new cgroup (and memcg) only in cgroup_post_fork. If
that is correct then we might have quite a lot of resources bound to
child's lifetime but accounted to the parent's memcg which can lead to
all sorts of interesting problems (e.g. unreclaimable memory - even by
the oom killer).

Christian, Tejun, is this the expected semantic or am I just misreading
the code?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/memcontrol: Add the drop_cache interface for cgroup v2

2020-09-21 Thread Michal Hocko
On Mon 21-09-20 19:23:01, Yafang Shao wrote:
> On Mon, Sep 21, 2020 at 7:05 PM Michal Hocko  wrote:
> >
> > On Mon 21-09-20 18:55:40, Yafang Shao wrote:
> > > On Mon, Sep 21, 2020 at 4:12 PM Michal Hocko  wrote:
> > > >
> > > > On Mon 21-09-20 16:02:55, zangchun...@bytedance.com wrote:
> > > > > From: Chunxin Zang 
> > > > >
> > > > > In the cgroup v1, we have the 'force_empty' interface. This is very
> > > > > useful for userspace to actively release memory. But cgroup
> > > > > v2 does not have it.
> > > > >
> > > > > This patch reuses cgroup v1's function, but gives the interface a new
> > > > > name, because I think 'drop_cache' is easier to
> > > > > understand :)
> > > >
> > > > This should really explain a usecase. Global drop_caches is a terrible
> > > > interface and it has caused many problems in the past. People have
> > > > learned to use it as a remedy to any problem they might see and cause
> > > > other problems without realizing that. This is the reason why we even
> > > > log each attempt to drop caches.
> > > >
> > > > I would rather not repeat the same mistake on the memcg level unless
> > > > there is a very strong reason for it.
> > > >
> > >
> > > I think we'd better add these comments above the function
> > > mem_cgroup_force_empty() to explain why we don't want to expose this
> > > interface in cgroup2, otherwise people will continue to send this
> > > proposal without any strong reason.
> >
> > I do not mind people sending this proposal.  "V1 used to have an
> > interface, we need it in v2 as well" is not really viable without
> > providing more reasoning on the specific usecase.
> >
> > _Any_ patch should have a proper justification. This is nothing really
> > new to the process and I am wondering why this is coming as a surprise.
> >
> 
> Container users always want to drop cache in a specific container,
> because they used to use drop_caches to fix memory pressure issues.

This is exactly the kind of problem we have seen in the past. There
should be zero reason to address potential reclaim problems by dropping
page cache on the floor. There is a huge cargo cult about this
procedure and I have seen numerous reports where people complained about
performance afterwards just to learn that the dropped page cache was one
of the reasons for that.

> Although drop_caches can cause some unexpected issues, it could also
> fix some issues.

"Some issues" is way too general. We really want to learn about those
issues and address them properly.

> So container users want to use it in containers as
> well.
> If this feature is not implemented in cgroup, they will ask you why
> but there is no explanation in the kernel.

There is no usecase that would really require it so far.

> Regarding memory.high, it is not perfect either, because you have
> to set it to 0 to drop caches, and the processes in the container
> have to reclaim pages themselves because they hit memory.high, while
> memory.force_empty won't make other processes reclaim.

Since 536d3bf261a2 ("mm: memcontrol: avoid workload stalls when lowering
memory.high") the limit is set after the reclaim so the race window when
somebody would be pushed to high limit reclaim is reduced. But I do
agree this is just a workaround.
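
For completeness, a minimal sketch of that existing v2 mechanism driven from
userspace (the cgroup path and the 100M clamp are made-up examples, not a
recommendation):

#include <stdio.h>

static int write_knob(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	const char *high = "/sys/fs/cgroup/mygrp/memory.high";

	/* Clamp memory.high to push the group into reclaim, then restore it.
	 * Assumes the group was previously unlimited ("max"). */
	write_knob(high, "104857600");
	write_knob(high, "max");
	return 0;
}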

> That doesn't mean I agree to add this interface, while I really mean
> that if we discard one feature we'd better explain why.

We need to understand why somebody wants an interface because once it is
added it will have to be maintained for ever.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 02/13] mm: use page_off_lru()

2020-09-21 Thread Michal Hocko
On Fri 18-09-20 12:53:58, Yu Zhao wrote:
> On Fri, Sep 18, 2020 at 01:09:14PM +0200, Michal Hocko wrote:
> > On Fri 18-09-20 04:27:13, Yu Zhao wrote:
> > > On Fri, Sep 18, 2020 at 09:37:00AM +0200, Michal Hocko wrote:
> > > > On Thu 17-09-20 21:00:40, Yu Zhao wrote:
> > > > > This patch replaces the only open-coded __ClearPageActive() with
> > > > > page_off_lru(). There is no open-coded __ClearPageUnevictable()s.
> > > > > 
> > > > > Before this patch, we have:
> > > > >   __ClearPageActive()
> > > > >   add_page_to_lru_list()
> > > > > 
> > > > > After this patch, we have:
> > > > >   page_off_lru()
> > > > >   if PageUnevictable()
> > > > >   __ClearPageUnevictable()
> > > > >   else if PageActive()
> > > > >   __ClearPageActive()
> > > > >   add_page_to_lru_list()
> > > > > 
> > > > > Checking PageUnevictable() shouldn't be a problem because these two
> > > > > flags are mutually exclusive. Leaking either will trigger bad_page().
> > > > 
> > > > I am sorry but the changelog is really hard to grasp. What are you
> > > > trying to achieve, why and why it is safe. This should be a general
> > > > outline for any patch. I have already commented on the previous patch
> > > > and asked you for the explanation why removing __ClearPageActive from
> > > > this path is desirable and safe. I have specifically asked to clarify
> > > > the compound page situation as that is using its own destructor in the
> > > > freeing path and that might result in page_off_lru to be not called.
> > > 
> > > Haven't I explained we are NOT removing __ClearPageActive()? Is my
> > > notion of the code structure above confusing you? Or 'open-coded'
> > > could mean different things?
> > 
> > Please read through my reply carefully. I am not saying what you are
> > doing is wrong. I am expressing a lack of justification which is the
> > case throughout this patch series. You do not explain why we need it and
> > why reviewers should spend time on this. Because the review is not as
> > trivial as looking at the diff.
> 
> I appreciate your time. But if you are looking for some grand
> justification, I'm afraid I won't be able to give one, because, as it's
> titled, this is just a series of cleanup patches.

You likely had some reason to do that clean up, right? What was that? An
inconsistency in handling some of the page flags when a page is moved around
LRU lists? Was the code too hard to reason about? Was it error prone?

Please do not take me wrong, I am not trying to discourage you from
clean up work. There is a lot of code that would benefit from clean ups.
But it certainly helps to outline your motivation and the goal you would
like to achieve. Without that it would boil down to guessing what you
might have thought or simply moving things around without a very good
long term reason.

Thanks!
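
For readers following along, the page_off_lru() helper under discussion looks
roughly like this before the series (paraphrased from include/linux/mm_inline.h
around v5.9; treat it as a sketch rather than a verbatim quote):

static __always_inline enum lru_list page_off_lru(struct page *page)
{
	enum lru_list lru;

	if (PageUnevictable(page)) {
		__ClearPageUnevictable(page);
		lru = LRU_UNEVICTABLE;
	} else {
		lru = page_lru_base_type(page);
		if (PageActive(page)) {
			__ClearPageActive(page);
			lru += LRU_ACTIVE;
		}
	}
	return lru;
}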
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/memcontrol: Add the drop_cache interface for cgroup v2

2020-09-21 Thread Michal Hocko
On Mon 21-09-20 18:55:40, Yafang Shao wrote:
> On Mon, Sep 21, 2020 at 4:12 PM Michal Hocko  wrote:
> >
> > On Mon 21-09-20 16:02:55, zangchun...@bytedance.com wrote:
> > > From: Chunxin Zang 
> > >
> > > In the cgroup v1, we have the 'force_empty' interface. This is very
> > > useful for userspace to actively release memory. But cgroup
> > > v2 does not have it.
> > >
> > > This patch reuses cgroup v1's function, but gives the interface a new
> > > name, because I think 'drop_cache' is easier to
> > > understand :)
> >
> > This should really explain a usecase. Global drop_caches is a terrible
> > interface and it has caused many problems in the past. People have
> > learned to use it as a remedy to any problem they might see and cause
> > other problems without realizing that. This is the reason why we even
> > log each attempt to drop caches.
> >
> > I would rather not repeat the same mistake on the memcg level unless
> > there is a very strong reason for it.
> >
> 
> I think we'd better add these comments above the function
> mem_cgroup_force_empty() to explain why we don't want to expose this
> interface in cgroup2, otherwise people will continue to send this
> proposal without any strong reason.

I do not mind people sending this proposal.  "V1 used to have an
interface, we need it in v2 as well" is not really viable without
providing more reasoning on the specific usecase.

_Any_ patch should have a proper justification. This is nothing really
new to the process and I am wondering why this is coming as a surprise.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/memcontrol: Add the drop_cache interface for cgroup v2

2020-09-21 Thread Michal Hocko
On Mon 21-09-20 16:02:55, zangchun...@bytedance.com wrote:
> From: Chunxin Zang 
> 
> In the cgroup v1, we have the 'force_empty' interface. This is very
> useful for userspace to actively release memory. But cgroup
> v2 does not have it.
> 
> This patch reuses cgroup v1's function, but gives the interface a new
> name, because I think 'drop_cache' is easier to
> understand :)

This should really explain a usecase. Global drop_caches is a terrible
interface and it has caused many problems in the past. People have
learned to use it as a remedy to any problem they might see and cause
other problems without realizing that. This is the reason why we even
log each attempt to drop caches.

I would rather not repeat the same mistake on the memcg level unless
there is a very strong reason for it.

> Signed-off-by: Chunxin Zang 
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 11 +++
>  mm/memcontrol.c |  5 +
>  2 files changed, 16 insertions(+)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst 
> b/Documentation/admin-guide/cgroup-v2.rst
> index ce3e05e41724..fbff959c8116 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1181,6 +1181,17 @@ PAGE_SIZE multiple when read back.
>   high limit is used and monitored properly, this limit's
>   utility is limited to providing the final safety net.
>  
> +  memory.drop_cache
> +A write-only single value file which exists on non-root
> +cgroups.
> +
> +Provides a mechanism for users to actively trigger memory
> +reclaim. As many pages as possible will be reclaimed from
> +the cgroup.
> +
> +It will break the low boundary, because it tries to reclaim
> +the memory repeatedly until usage drops to a certain level.
> +
>memory.oom.group
>   A read-write single value file which exists on non-root
>   cgroups.  The default value is "0".
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0b38b6ad547d..98646484efff 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6226,6 +6226,11 @@ static struct cftype memory_files[] = {
>   .write = memory_max_write,
>   },
>   {
> + .name = "drop_cache",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .write = mem_cgroup_force_empty_write,
> + },
> + {
>   .name = "events",
>   .flags = CFTYPE_NOT_ON_ROOT,
>   .file_offset = offsetof(struct mem_cgroup, events_file),
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs


Re: [RFC-PATCH 2/4] mm: Add __rcu_alloc_page_lockless() func.

2020-09-21 Thread Michal Hocko
On Fri 18-09-20 21:48:15, Uladzislau Rezki (Sony) wrote:
[...]
> Proposal
> 
> Introduce a lock-free function that obtains a page from the per-cpu-lists
> on the current CPU. It returns NULL rather than acquiring any non-raw spinlock.

I was not happy about this solution when we discussed it
last time and I have to say I am still not happy. This is exposing
an internal allocator optimization and allows a hard-to-estimate
consumption of pcp free pages. IIUC this run on the pcp cache can be
controlled directly from userspace (close/open loop IIRC) which makes it
an even bigger no-no.

I strongly agree with Thomas 
http://lkml.kernel.org/r/87tux4kefm@nanos.tec.linutronix.de
that this optimization is not aiming at reasonable workloads. Really, go
with pre-allocated buffer and fallback to whatever slow path you have
already. Exposing more internals of the allocator is not going to do any
good for long term maintainability.
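
To make that direction concrete, here is a minimal sketch of a pre-allocated
per-CPU page stash with a lockless consumer and a sleepable refill path. All
names (krc_page_cache, krc_refill_page_cache, krc_get_cached_page) are made up
for illustration; this is not the actual kvfree_rcu code:

#include <linux/atomic.h>
#include <linux/cpumask.h>
#include <linux/gfp.h>
#include <linux/percpu.h>
#include <linux/workqueue.h>

static DEFINE_PER_CPU(struct page *, krc_page_cache);

/* Refill from sleepable context where GFP_KERNEL is allowed. */
static void krc_refill_page_cache(struct work_struct *unused)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		struct page *page;

		if (per_cpu(krc_page_cache, cpu))
			continue;
		page = alloc_page(GFP_KERNEL);
		if (!page)
			break;
		/* Publish unless somebody else refilled in the meantime. */
		if (cmpxchg(&per_cpu(krc_page_cache, cpu), NULL, page))
			__free_page(page);
	}
}
static DECLARE_WORK(krc_refill_work, krc_refill_page_cache);

/* Atomic-context consumer: take the stashed page or return NULL and let
 * the caller fall back to its existing slow path. */
static struct page *krc_get_cached_page(void)
{
	struct page *page = xchg(this_cpu_ptr(&krc_page_cache), NULL);

	if (!page)
		schedule_work(&krc_refill_work);
	return page;
}

The consumer never touches zone locks, and the stash size and refill policy are
exactly the kind of per-workload tuning mentioned above.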
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v9 3/3] mm/madvise: introduce process_madvise() syscall: an external memory hinting API

2020-09-21 Thread Michal Hocko
On Mon 21-09-20 07:56:33, Christoph Hellwig wrote:
> On Mon, Aug 31, 2020 at 05:06:33PM -0700, Minchan Kim wrote:
> > There is a usecase where System Management Software (SMS) wants to give a
> > memory hint like MADV_[COLD|PAGEOUT] to other processes and in the
> > case of Android, it is the ActivityManagerService.
> > 
> > The information required to make the reclaim decision is not known to
> > the app.  Instead, it is known to the centralized userspace
> > daemon(ActivityManagerService), and that daemon must be able to
> > initiate reclaim on its own without any app involvement.
> > 
> > To solve the issue, this patch introduces a new syscall process_madvise(2).
> > It uses the pidfd of an external process to give the hint. It also supports
> > a vector of address ranges because an Android app has thousands of vmas due
> > to zygote, so it's a total waste of CPU and power if we have to call the
> > syscall one by one for each vma. (Testing a 2000-vma syscall vs a
> > 1-vector syscall showed a 15% performance improvement.  I think it
> > would be bigger in real practice because the testing ran in a very
> > cache-friendly environment.)
> 
> I'm really not sure this syscall is a good idea.  If you want central
> control you should implement an IPC mechanisms that allows your
> supervisor daemon to tell the application to perform the madvise
> instead of forcing the behavior on it.

Even though I am not entirely happy about the interface [1], as it seems
I am in the minority with my concern, I backed off and decided not to block
this work because I do not see a problem with the functionality itself. And
I find it very useful for userspace driven memory management, which people
have been requesting for a long time.

This functionality shouldn't be much different from the standard memory
reclaim. It has some limitations (e.g. it can only handle mapped memory)
but it allows proactively swapping out or reclaiming disk based memory
based on specific knowledge of the workload. The kernel is not able to do
the same.

[1] http://lkml.kernel.org/r/20200117115225.gv19...@dhcp22.suse.cz
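
As a usage illustration, a hedged sketch of how such a daemon might drive the
proposed syscall. The syscall numbers, the MADV_COLD value and the address
range are assumptions (a real caller would parse /proc/<pid>/maps and use the
final ABI from its kernel headers):

#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_process_madvise
#define __NR_process_madvise 440	/* assumed; check your kernel headers */
#endif
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

int main(int argc, char **argv)
{
	if (argc < 2)
		return 1;

	pid_t target = (pid_t)atoi(argv[1]);
	int pidfd = (int)syscall(__NR_pidfd_open, target, 0);

	/* One iovec per address range to hint; thousands of ranges can be
	 * batched into a single call, which is the point of the vectored
	 * interface. The range below is a placeholder. */
	struct iovec ranges[] = {
		{ .iov_base = (void *)0x7f0000000000ULL, .iov_len = 1UL << 20 },
	};
	ssize_t ret = syscall(__NR_process_madvise, pidfd, ranges,
			      sizeof(ranges) / sizeof(ranges[0]), MADV_COLD, 0);

	if (ret < 0)
		perror("process_madvise");
	return ret < 0;
}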
-- 
Michal Hocko
SUSE Labs


Re: [[PATCH]] mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged

2020-09-21 Thread Michal Hocko
On Fri 18-09-20 08:32:13, Vijay Balakrishna wrote:
> 
> 
> On 9/17/2020 10:45 PM, Michal Hocko wrote:
> > On Thu 17-09-20 11:03:56, Vijay Balakrishna wrote:
> > [...]
> > > > > The auto tuned value is incorrect post hotplug memory operation; in
> > > > > our use case memory hot add occurs very early during boot.
> > > > Define incorrect. What are the actual values? Have you tried to increase
> > > > the value manually after the hotplug?
> > > 
> > > In our case SoC with 8GB memory, system tuned min_free_kbytes
> > > - first to 22528
> > > - we perform memory hot add very early in boot
> > 
> > What was the original and after-the-hotplug size of memory and layout?
> > I suspect that all the hotplugged memory is in Movable zone, right?
> 
> Yes, added ~1.92GB as Movable type, booting with 6GB at start.
> 
> > 
> > > - now min_free_kbytes is 8703
> > > 
> > > Before looking at code, first I manually restored min_free_kbytes soon 
> > > after
> > > boot, reran stress and didn't notice symptoms I mentioned in change log.
> > 
> > This is really surprising and I strongly suspect that an earlier reclaim
> > just changed the timing enough so that the workload has spread the memory
> > pressure over a longer time and that might have been enough to recycle
> > some of the unreclaimable memory due to its natural life time. But this
> > is a pure speculation. Much more data would be needed to analyze this.
> > 
> > In any case your stress test is overprovisioning your Normal zone and
> > increased min_free_kbytes just papers over the sizing problem.
> > 
> 
> It is a synthetic workload, likely not sized properly; I need to check.  I feel
> having a higher min_free_kbytes made GFP_ATOMIC allocations not fail.

Yes, a higher min_free_kbytes will help GFP_ATOMIC, but only to some
degree. Nobody should depend on an atomic allocation for
correctness. It is just way too easy to fail under higher memory
pressure.
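
For illustration, the usual pattern in such rx paths is to tolerate the
failure and drop the frame rather than depend on a large reserve. This is a
hypothetical driver snippet, not taken from the actual report:

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical rx handler: the atomic allocation may fail under pressure,
 * so count the drop and move on instead of retrying in atomic context. */
static void example_rx_frame(struct net_device *dev, const void *data,
			     unsigned int len)
{
	struct sk_buff *skb = netdev_alloc_skb_ip_align(dev, len);

	if (unlikely(!skb)) {
		dev->stats.rx_dropped++;
		return;
	}
	skb_put_data(skb, data, len);
	skb->protocol = eth_type_trans(skb, dev);
	netif_rx(skb);
}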

> I have seen a
> NETDEV WATCHDOG timeout with a stacktrace trying to allocate memory, looping
> in the net rx receive path.

You should talk to net folks.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 02/13] mm: use page_off_lru()

2020-09-18 Thread Michal Hocko
On Fri 18-09-20 04:27:13, Yu Zhao wrote:
> On Fri, Sep 18, 2020 at 09:37:00AM +0200, Michal Hocko wrote:
[...]
> And I have asked this before: why does 'the compound page situation'
> even matter here? Perhaps if you could give a concrete example related
> to the code change and help me understand your concern?

Forgot to answer this part. The compound page concern is a misreading of
the patch on my end. I have missed that you are using page_off_lru in this
(move_pages_to_lru) path and that you rely on release_pages to do the
clean up for you. I still find it rather awkward that page_off_lru has
such side effects, but I am not deeply familiar with the reasoning
behind it so I will rather shut up now.

[...]
> > > @@ -1860,14 +1859,11 @@ static unsigned noinline_for_stack 
> > > move_pages_to_lru(struct lruvec *lruvec,
> > >   lruvec = mem_cgroup_page_lruvec(page, pgdat);
> > >  
> > >   SetPageLRU(page);
> > > - lru = page_lru(page);
> > > -
> > >   add_page_to_lru_list(page, lruvec, lru);
> > >  
> > >   if (put_page_testzero(page)) {
> > >   __ClearPageLRU(page);
> > > - __ClearPageActive(page);
> > > - del_page_from_lru_list(page, lruvec, lru);
> > > + del_page_from_lru_list(page, lruvec, 
> > > page_off_lru(page));
> > >  
> > >       if (unlikely(PageCompound(page))) {
> > >   spin_unlock_irq(>lru_lock);
> > > -- 
> > > 2.28.0.681.g6f77f65b4e-goog
> > 
> > -- 
> > Michal Hocko
> > SUSE Labs

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 02/13] mm: use page_off_lru()

2020-09-18 Thread Michal Hocko
On Fri 18-09-20 04:27:13, Yu Zhao wrote:
> On Fri, Sep 18, 2020 at 09:37:00AM +0200, Michal Hocko wrote:
> > On Thu 17-09-20 21:00:40, Yu Zhao wrote:
> > > This patch replaces the only open-coded __ClearPageActive() with
> > > page_off_lru(). There is no open-coded __ClearPageUnevictable()s.
> > > 
> > > Before this patch, we have:
> > >   __ClearPageActive()
> > >   add_page_to_lru_list()
> > > 
> > > After this patch, we have:
> > >   page_off_lru()
> > >   if PageUnevictable()
> > >   __ClearPageUnevictable()
> > >   else if PageActive()
> > >   __ClearPageActive()
> > >   add_page_to_lru_list()
> > > 
> > > Checking PageUnevictable() shouldn't be a problem because these two
> > > flags are mutually exclusive. Leaking either will trigger bad_page().
> > 
> > I am sorry but the changelog is really hard to grasp. What are you
> > trying to achieve, why and why it is safe. This should be a general
> > outline for any patch. I have already commented on the previous patch
> > and asked you for the explanation why removing __ClearPageActive from
> > this path is desirable and safe. I have specifically asked to clarify
> > > the compound page situation as that is using its own destructor in the
> > freeing path and that might result in page_off_lru to be not called.
> 
> Haven't I explained we are NOT removing __ClearPageActive()? Is my
> notion of the code structure above confusing you? Or 'open-coded'
> could mean different things?

Please read through my reply carefully. I am not saying what you are
doing is wrong. I am expressing a lack of justification which is the
case throughout this patch series. You do not explain why we need it and
why reviewers should spend time on this. Because the review is not as
trivial as looking at the diff.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 00/13] mm: clean up some lru related pieces

2020-09-18 Thread Michal Hocko
On Thu 17-09-20 21:00:38, Yu Zhao wrote:
> Hi Andrew,
> 
> I see you have taken this:
>   mm: use add_page_to_lru_list()/page_lru()/page_off_lru()
> Do you mind dropping it?
> 
> Michal asked to do a bit of additional work. So I thought I probably
> should create a series to do more cleanups I've been meaning to.
> 
> This series contains the change in the patch above and goes a few
> more steps farther. It's intended to improve readability and should
> not have any performance impacts. There are minor behavior changes in
> terms of debugging and error reporting, which I have all highlighted
> in the individual patches. All patches were properly tested on 5.8
> running Chrome OS, with various debug options turned on.
> 
> Michal,
> 
> Do you mind taking a look at the entire series?

I have stopped at patch 3 as all patches until then are really missing
any justification. What is the point of doing all this? The code
is far from trivial and just shifting it around sounds like a risk. You are
removing ~50 LOC which is always nice but I am not sure the resulting
code is more maintainable or easier to read and understand. Just
consider the patch moving __ClearPageLRU into page_off_lru. What is the
additional value of moving the flag clearing and burying it in a function
that gains even more side effects? I found the current way, with
__ClearPageLRU sitting right next to the removal from the LRU, easier to
follow. This is likely subjective and others might think differently, but
as it is not clear what your actual goal is here it is hard to judge the
pros and cons.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 03/13] mm: move __ClearPageLRU() into page_off_lru()

2020-09-18 Thread Michal Hocko
On Thu 17-09-20 21:00:41, Yu Zhao wrote:
> Now we have a total of three places that free lru pages when their
> references become zero (after we drop the reference from isolation).
> 
> Before this patch, they all do:
>   __ClearPageLRU()
>   page_off_lru()
>   del_page_from_lru_list()
> 
> After this patch, they become:
>   page_off_lru()
>   __ClearPageLRU()
>   del_page_from_lru_list()
> 
> This change should have no side effects.

Again, why is this desirable?

> Signed-off-by: Yu Zhao 
> ---
>  include/linux/mm_inline.h | 1 +
>  mm/swap.c | 2 --
>  mm/vmscan.c   | 1 -
>  3 files changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 8fc71e9d7bb0..be9418425e41 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -92,6 +92,7 @@ static __always_inline enum lru_list page_off_lru(struct 
> page *page)
>  {
>   enum lru_list lru;
>  
> + __ClearPageLRU(page);
>   if (PageUnevictable(page)) {
>   __ClearPageUnevictable(page);
>   lru = LRU_UNEVICTABLE;
> diff --git a/mm/swap.c b/mm/swap.c
> index 40bf20a75278..8362083f00c9 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -86,7 +86,6 @@ static void __page_cache_release(struct page *page)
>   spin_lock_irqsave(&pgdat->lru_lock, flags);
>   lruvec = mem_cgroup_page_lruvec(page, pgdat);
>   VM_BUG_ON_PAGE(!PageLRU(page), page);
> - __ClearPageLRU(page);
>   del_page_from_lru_list(page, lruvec, page_off_lru(page));
>   spin_unlock_irqrestore(&pgdat->lru_lock, flags);
>   }
> @@ -895,7 +894,6 @@ void release_pages(struct page **pages, int nr)
>  
>   lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
>   VM_BUG_ON_PAGE(!PageLRU(page), page);
> - __ClearPageLRU(page);
>   del_page_from_lru_list(page, lruvec, 
> page_off_lru(page));
>   }
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f257d2f61574..f9a186a96410 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1862,7 +1862,6 @@ static unsigned noinline_for_stack 
> move_pages_to_lru(struct lruvec *lruvec,
>   add_page_to_lru_list(page, lruvec, lru);
>  
>   if (put_page_testzero(page)) {
> - __ClearPageLRU(page);
>   del_page_from_lru_list(page, lruvec, 
> page_off_lru(page));
>  
>   if (unlikely(PageCompound(page))) {
> -- 
> 2.28.0.681.g6f77f65b4e-goog

-- 
Michal Hocko
SUSE Labs

