Re: [PATCH v2 4/6] mm: replace vma->vm_flags indirect modification in ksm_madvise

2023-01-25 Thread Michal Hocko
On Wed 25-01-23 08:57:48, Suren Baghdasaryan wrote:
> On Wed, Jan 25, 2023 at 1:38 AM 'Michal Hocko' via kernel-team
>  wrote:
> >
> > On Wed 25-01-23 00:38:49, Suren Baghdasaryan wrote:
> > > Replace indirect modifications to vma->vm_flags with calls to modifier
> > > functions to be able to track flag changes and to keep vma locking
> > > correctness. Add a BUG_ON check in ksm_madvise() to catch indirect
> > > vm_flags modification attempts.
> >
Those BUG_ONs scream too much IMHO. KSM is MM-internal code so I
guess we should be willing to trust it.
> 
> Yes, but I really want to prevent an indirect misuse since it was not
> easy to find these. If you feel strongly about it I will remove them
> or if you have a better suggestion I'm all for it.

You can avoid that by making the flags inaccessible to direct modification, right?
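
For illustration, one way to enforce that at compile time is to expose the
field read-only and give the modifier helpers a private writable alias. A
minimal sketch of the idea (the anonymous-union trick and the __vm_flags
name are assumptions of this sketch, not part of the posted series):

	/* in struct vm_area_struct: */
	union {
		const unsigned long vm_flags;	/* readers; direct writes no longer compile */
		unsigned long __vm_flags;	/* writers; touched only by the helpers */
	};

	/* the helpers from patch 1/6 would then operate on the alias: */
	static inline void set_vm_flags(struct vm_area_struct *vma,
					unsigned long flags)
	{
		mmap_assert_write_locked(vma->vm_mm);
		vma->__vm_flags |= flags;
	}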

-- 
Michal Hocko
SUSE Labs



Re: [PATCH v2 6/6] mm: export dump_mm()

2023-01-25 Thread Michal Hocko
On Wed 25-01-23 00:38:51, Suren Baghdasaryan wrote:
> mmap_assert_write_locked() is used in vm_flags modifiers. Because
> mmap_assert_write_locked() uses dump_mm() and vm_flags are sometimes
> modified from inside a module, it's necessary to export the
> dump_mm() function.
> 
> Signed-off-by: Suren Baghdasaryan 

Acked-by: Michal Hocko 

> ---
>  mm/debug.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/debug.c b/mm/debug.c
> index 9d3d893dc7f4..96d594e16292 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -215,6 +215,7 @@ void dump_mm(const struct mm_struct *mm)
>   mm->def_flags, &mm->def_flags
>   );
>  }
> +EXPORT_SYMBOL(dump_mm);
>  
>  static bool page_init_poisoning __read_mostly = true;
>  
> -- 
> 2.39.1

-- 
Michal Hocko
SUSE Labs



Re: [PATCH v2 5/6] mm: introduce mod_vm_flags_nolock and use it in untrack_pfn

2023-01-25 Thread Michal Hocko
> @@ -2391,7 +2391,11 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct 
> vm_area_struct *vma,
>   mmap_write_downgrade(mm);
>   }
>  
> - unmap_region(mm, &mt_detach, vma, prev, next, start, end);
> + /*
> +  * We can free page tables without write-locking mmap_lock because VMAs
> +  * were isolated before we downgraded mmap_lock.
> +  */
> + unmap_region(mm, &mt_detach, vma, prev, next, start, end, !downgrade);
>   /* Statistics and freeing VMAs */
>   mas_set(&mas_detach, start);
>   remove_mt(mm, &mas_detach);
> @@ -2704,7 +2708,7 @@ unsigned long mmap_region(struct file *file, unsigned 
> long addr,
>  
>   /* Undo any partial mapping done by a device driver. */
>   unmap_region(mm, &mm->mm_mt, vma, prev, next, vma->vm_start,
> -  vma->vm_end);
> +  vma->vm_end, true);
>   }
>   if (file && (vm_flags & VM_SHARED))
>   mapping_unmap_writable(file->f_mapping);
> @@ -3031,7 +3035,7 @@ void exit_mmap(struct mm_struct *mm)
>   tlb_gather_mmu_fullmm(&tlb, mm);
>   /* update_hiwater_rss(mm) here? but nobody should be looking */
>   /* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
> - unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
> + unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX, false);
>   mmap_read_unlock(mm);
>  
>   /*
> -- 
> 2.39.1

-- 
Michal Hocko
SUSE Labs



Re: [PATCH v2 4/6] mm: replace vma->vm_flags indirect modification in ksm_madvise

2023-01-25 Thread Michal Hocko
On Wed 25-01-23 00:38:49, Suren Baghdasaryan wrote:
> Replace indirect modifications to vma->vm_flags with calls to modifier
> functions to be able to track flag changes and to keep vma locking
> correctness. Add a BUG_ON check in ksm_madvise() to catch indirect
> vm_flags modification attempts.

Those BUG_ONs scream too much IMHO. KSM is MM-internal code so I
guess we should be willing to trust it.

> Signed-off-by: Suren Baghdasaryan 

Acked-by: Michal Hocko 
-- 
Michal Hocko
SUSE Labs



Re: [PATCH v2 3/6] mm: replace vma->vm_flags direct modifications with modifier calls

2023-01-25 Thread Michal Hocko
On Wed 25-01-23 00:38:48, Suren Baghdasaryan wrote:
> Replace direct modifications to vma->vm_flags with calls to modifier
> functions to be able to track flag changes and to keep vma locking
> correctness.

Is this manual (git grep based) work, or have you used Coccinelle for
the patch generation?

My potentially incomplete check
$ git grep ">[[:space:]]*vm_flags[[:space:]]*[&|^]="

shows that nothing should be left after this. There are still quite a lot
of direct checks of the flags (more than 600). Maybe it would be good to
make the flags accessible only via accessors, which would also prevent any
future direct setting of those flags in an uncontrolled way.

Anyway
Acked-by: Michal Hocko 
-- 
Michal Hocko
SUSE Labs



Re: [PATCH v2 1/6] mm: introduce vma->vm_flags modifier functions

2023-01-25 Thread Michal Hocko
On Wed 25-01-23 00:38:46, Suren Baghdasaryan wrote:
> vm_flags are among VMA attributes which affect decisions like VMA merging
> and splitting. Therefore all vm_flags modifications are performed after
> taking exclusive mmap_lock to prevent vm_flags updates racing with such
> operations. Introduce modifier functions for vm_flags to be used whenever
> flags are updated. This way we can better check and control correct
> locking behavior during these updates.
> 
> Signed-off-by: Suren Baghdasaryan 

Acked-by: Michal Hocko 

> ---
>  include/linux/mm.h   | 37 +
>  include/linux/mm_types.h |  8 +++-
>  2 files changed, 44 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c2f62bdce134..b71f2809caac 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -627,6 +627,43 @@ static inline void vma_init(struct vm_area_struct *vma, 
> struct mm_struct *mm)
>   INIT_LIST_HEAD(&vma->anon_vma_chain);
>  }
>  
> +/* Use when VMA is not part of the VMA tree and needs no locking */
> +static inline void init_vm_flags(struct vm_area_struct *vma,
> +  unsigned long flags)
> +{
> + vma->vm_flags = flags;
> +}
> +
> +/* Use when VMA is part of the VMA tree and modifications need coordination 
> */
> +static inline void reset_vm_flags(struct vm_area_struct *vma,
> +   unsigned long flags)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + init_vm_flags(vma, flags);
> +}
> +
> +static inline void set_vm_flags(struct vm_area_struct *vma,
> + unsigned long flags)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + vma->vm_flags |= flags;
> +}
> +
> +static inline void clear_vm_flags(struct vm_area_struct *vma,
> +   unsigned long flags)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + vma->vm_flags &= ~flags;
> +}
> +
> +static inline void mod_vm_flags(struct vm_area_struct *vma,
> + unsigned long set, unsigned long clear)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + vma->vm_flags |= set;
> + vma->vm_flags &= ~clear;
> +}
> +
>  static inline void vma_set_anonymous(struct vm_area_struct *vma)
>  {
>   vma->vm_ops = NULL;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 2d6d790d9bed..6c7c70bf50dd 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -491,7 +491,13 @@ struct vm_area_struct {
>* See vmf_insert_mixed_prot() for discussion.
>*/
>   pgprot_t vm_page_prot;
> - unsigned long vm_flags; /* Flags, see mm.h. */
> +
> + /*
> +  * Flags, see mm.h.
> +  * WARNING! Do not modify directly.
> +  * Use {init|reset|set|clear|mod}_vm_flags() functions instead.
> +  */
> + unsigned long vm_flags;
>  
>   /*
>* For areas with an address space and backing store,
> -- 
> 2.39.1
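
To make the conversion pattern concrete, a caller that used to flip the
bits directly would change roughly like this (illustrative snippet, not
taken from the series):

	/* before: direct modification, invisible to any locking checks */
	vma->vm_flags |= VM_LOCKED;
	vma->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);

	/* after: tracked modification which asserts mmap_lock is write-locked */
	set_vm_flags(vma, VM_LOCKED);
	clear_vm_flags(vma, VM_LOCKED | VM_LOCKONFAULT);

	/* set and clear in a single step */
	mod_vm_flags(vma, VM_LOCKED, VM_LOCKONFAULT);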

-- 
Michal Hocko
SUSE Labs



Re: [PATCH v2 2/6] mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK

2023-01-25 Thread Michal Hocko
On Wed 25-01-23 00:38:47, Suren Baghdasaryan wrote:
> To simplify the usage of VM_LOCKED_CLEAR_MASK in clear_vm_flags(),
> replace it with the VM_LOCKED_MASK bitmask and convert all users.
>
> Signed-off-by: Suren Baghdasaryan 

Acked-by: Michal Hocko 

> ---
>  include/linux/mm.h | 4 ++--
>  kernel/fork.c  | 2 +-
>  mm/hugetlb.c   | 4 ++--
>  mm/mlock.c | 6 +++---
>  mm/mmap.c  | 6 +++---
>  mm/mremap.c| 2 +-
>  6 files changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b71f2809caac..da62bdd627bf 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -421,8 +421,8 @@ extern unsigned int kobjsize(const void *objp);
>  /* This mask defines which mm->def_flags a process can inherit its parent */
>  #define VM_INIT_DEF_MASK VM_NOHUGEPAGE
>  
> -/* This mask is used to clear all the VMA flags used by mlock */
> -#define VM_LOCKED_CLEAR_MASK (~(VM_LOCKED | VM_LOCKONFAULT))
> +/* This mask represents all the VMA flag bits used by mlock */
> +#define VM_LOCKED_MASK   (VM_LOCKED | VM_LOCKONFAULT)
>  
>  /* Arch-specific flags to clear when updating VM flags on protection change 
> */
>  #ifndef VM_ARCH_CLEAR
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 6683c1b0f460..03d472051236 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -669,7 +669,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>   tmp->anon_vma = NULL;
>   } else if (anon_vma_fork(tmp, mpnt))
>   goto fail_nomem_anon_vma_fork;
> - tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
> + clear_vm_flags(tmp, VM_LOCKED_MASK);
>   file = tmp->vm_file;
>   if (file) {
>   struct address_space *mapping = file->f_mapping;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d20c8b09890e..4ecdbad9a451 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6973,8 +6973,8 @@ static unsigned long page_table_shareable(struct 
> vm_area_struct *svma,
>   unsigned long s_end = sbase + PUD_SIZE;
>  
>   /* Allow segments to share if only one is marked locked */
> - unsigned long vm_flags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
> - unsigned long svm_flags = svma->vm_flags & VM_LOCKED_CLEAR_MASK;
> + unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED_MASK;
> + unsigned long svm_flags = svma->vm_flags & ~VM_LOCKED_MASK;
>  
>   /*
>* match the virtual addresses, permission and the alignment of the
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 0336f52e03d7..5c4fff93cd6b 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -497,7 +497,7 @@ static int apply_vma_lock_flags(unsigned long start, 
> size_t len,
>   if (vma->vm_start != tmp)
>   return -ENOMEM;
>  
> - newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
> + newflags = vma->vm_flags & ~VM_LOCKED_MASK;
>   newflags |= flags;
>   /* Here we know that  vma->vm_start <= nstart < vma->vm_end. */
>   tmp = vma->vm_end;
> @@ -661,7 +661,7 @@ static int apply_mlockall_flags(int flags)
>   struct vm_area_struct *vma, *prev = NULL;
>   vm_flags_t to_add = 0;
>  
> - current->mm->def_flags &= VM_LOCKED_CLEAR_MASK;
> + current->mm->def_flags &= ~VM_LOCKED_MASK;
>   if (flags & MCL_FUTURE) {
>   current->mm->def_flags |= VM_LOCKED;
>  
> @@ -681,7 +681,7 @@ static int apply_mlockall_flags(int flags)
>   for_each_vma(vmi, vma) {
>   vm_flags_t newflags;
>  
> - newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
> + newflags = vma->vm_flags & ~VM_LOCKED_MASK;
>   newflags |= to_add;
>  
>   /* Ignore errors */
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d4abc6feced1..323bd253b25a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2671,7 +2671,7 @@ unsigned long mmap_region(struct file *file, unsigned 
> long addr,
>   if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
>   is_vm_hugetlb_page(vma) ||
>   vma == get_gate_vma(current->mm))
> - vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
> + clear_vm_flags(vma, VM_LOCKED_MASK);
>   else
>   mm->locked_vm += (len >> PAGE_SHIFT);
>   }
> @@ -3340,8 +3340,8 @@ static struct vm_area_struct *__install_special_mapping(

Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-08 Thread Michal Hocko
On Thu 08-09-22 03:29:50, Kent Overstreet wrote:
> On Thu, Sep 08, 2022 at 09:12:45AM +0200, Michal Hocko wrote:
> > Then you have probably missed a huge part of my emails. Please
> > re-read. If those arguments are not clear, feel free to ask for
> > clarification. Reducing the whole of my reasoning and objections to the
> > sentence above and calling that vapid and lazy is not only unfair but
> > also disrespectful.
> 
> What, where you complained about slab's page allocations showing up in the
> profile instead of slab, and I pointed out to you that actually each and every
> slab call is instrumented, and you're just seeing some double counting (that we
> will no doubt fix?)
> 
> Or when you complained about allocation sites where it should actually be the
> caller that should be instrumented, and I pointed out that it'd be quite easy to
> simply change that code to use _kmalloc() and slab_tag_add() directly, if it
> becomes an issue.
> 
> Of course, if we got that far, we'd have this code to thank for telling us where
> to look!
> 
> Did I miss anything?

Feel free to respond to specific arguments as I wrote them. I won't
repeat them again. Sure, we can discuss how important/relevant those
are. And that _can_ be a productive discussion.

-- 
Michal Hocko
SUSE Labs



Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-08 Thread Michal Hocko
On Thu 08-09-22 02:35:48, Kent Overstreet wrote:
> On Wed, Sep 07, 2022 at 09:45:18AM -0400, Steven Rostedt wrote:
> > On Wed, 7 Sep 2022 09:04:28 -0400
> > Kent Overstreet  wrote:
> > 
> > > On Wed, Sep 07, 2022 at 01:00:09PM +0200, Michal Hocko wrote:
> > > > Hmm, it seems that further discussion doesn't really make much sense
> > > > here. I know how to use my time better.  
> > > 
> > > Just a thought, but I generally find it more productive to propose ideas
> > > than to just be disparaging.
> > > 
> > 
> > But it's not Michal's job to do so. He's just telling you that the given
> > feature is not worth the burden. He's telling you the issues that he has
> > with the patch set. It's the submitter's job to address those concerns and
> > not the maintainer's to tell you how to make it better.
> > 
> > When Linus tells us that a submission is crap, we don't ask him how to make
> > it less crap, we listen to why he called it crap, and then rewrite to be
> > not so crappy. If we cannot figure it out, it doesn't get in.
> 
> When Linus tells someone a submission is crap, he _always_ has a sound, and
> _specific_ technical justification for doing so.
> 
> "This code is going to be a considerable maintenance burden" is vapid, and 
> lazy.
> It's the kind of feedback made by someone who has looked at the number of 
> lines
> of code a patch touches and not much more.

Then you have probably missed a huge part of my emails. Please
re-read. If those arguments are not clear, feel free to ask for
clarification. Reducing the whole of my reasoning and objections to the
sentence above and calling that vapid and lazy is not only unfair but
also disrespectful.

-- 
Michal Hocko
SUSE Labs



Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-07 Thread Michal Hocko
On Tue 06-09-22 14:20:58, Kent Overstreet wrote:
[...]
> Otherwise, saying "code has to be maintained" is a little bit like saying water
> is wet, and we're all engineers here, I think we know that :)

Hmm, it seems that further discussion doesn't really make much sense
here. I know how to use my time better.
-- 
Michal Hocko
SUSE Labs



Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-06 Thread Michal Hocko
On Mon 05-09-22 11:03:35, Suren Baghdasaryan wrote:
> On Mon, Sep 5, 2022 at 1:12 AM Michal Hocko  wrote:
> >
> > On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote:
> > > On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko  wrote:
> > [...]
> > > > Yes, tracking back the call trace would be really needed. The question
> > > > is whether this is really prohibitively expensive. How much overhead are
> > > > we talking about? There is no free lunch here, really.  You either have
> > > > the overhead during runtime when the feature is used or on the source
> > > > code level for all the future development (with a maze of macros and
> > > > wrappers).
> > >
> > > As promised, I profiled a simple test that repeatedly makes 10
> > > allocations/frees in a loop and measured overheads of code tagging,
> > > call stack capturing and tracing+BPF for page and slab allocations.
> > > Summary:
> > >
> > > Page allocations (overheads are compared to get_free_pages() duration):
> > > 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> > > 8.8% lookup_page_ext
> > > 1237% call stack capture
> > > 139% tracepoint with attached empty BPF program
> >
> > Yes, I am not surprised that the call stack capturing is really
> > expensive compared to the allocator fast path (which is really highly
> > optimized, and I suspect that with a 10 allocation/free loop you mostly get
> > your memory from the pcp lists). Is this overhead still _that_ visible
> > for somewhat less micro-optimized workloads which have to take slow paths
> > as well?
> 
> Correct, it's a comparison with the allocation fast path, so in a
> sense represents the worst case scenario. However at the same time the
> measurements are fair because they measure the overheads against the
> same meaningful baseline, therefore can be used for comparison.

Yes, I am not saying it is an unfair comparison. It is just not a
particularly practical one for real life situations. So I am not sure
you can draw many conclusions from that. Or let me put it differently:
there is no real point comparing the code tagging and stack unwinding
approaches, because the latter is simply more expensive as it collects
more state. The main question is whether that additional state
collection is too expensive to be practically used.
 
> > Also what kind of stack unwinder is configured (I guess ORC)? This is
> > not my area but from what I remember the unwinder overhead varies
> > between ORC and FP.
> 
> I used whatever is default and didn't try other mechanisms. Don't
> think the difference would be orders of magnitude better though.
> 
> >
> > And just to make it clear: I do realize that an overhead from the stack
> > unwinding is unavoidable. And code tagging would logically have lower
> > overhead as it performs much less work. But the main point is whether
> > our existing stack unwinding approach is really prohibitively expensive
> > to be used for debugging purposes on production systems. I might
> > misremember, but I recall people having bigger concerns with page_owner's
> > memory footprint than the actual stack unwinder overhead.
> 
> That's one of those questions which are very difficult to answer (if
> even possible) because that would depend on the use scenario. If the
> workload allocates frequently then adding the overhead will likely
> affect it, otherwise might not be even noticeable. In general, in
> pre-production testing we try to minimize the difference in
> performance and memory profiles between the software we are testing
> and the production one. From that point of view, the smaller the
> overhead, the better. I know it's kinda obvious but unfortunately I
> have no better answer to that question.

This is clear but it doesn't really tell whether the existing tooling is
unusable for _your_ or any specific scenarios. Because when we are
talking about adding quite a lot of code and making our allocator APIs
more complicated in order to track the state, then we should carefully
weigh the benefit and the cost. As replied in the other email, I am really
skeptical this patchset is at the final stage, and the more allocators get
covered the more code we have to maintain. So there must be a very strong
reason to add it.

> For the memory overhead, in my early internal proposal with assumption
> of 1 instrumented allocation call sites, I've made some
> calculations for an 8GB 8-core system (quite typical for Android) and
> ended up with the following:
> 
> per-cpu counters  atomic counters
> page_ext references 16MB  

Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-06 Thread Michal Hocko
On Mon 05-09-22 19:46:49, Kent Overstreet wrote:
> On Mon, Sep 05, 2022 at 10:49:38AM +0200, Michal Hocko wrote:
> > This is really my main concern about this whole work. Not only does it add a
> > considerable maintenance burden to the core MM because
> 
> [citation needed]

I thought this was clear from the email content (the part you haven't
quoted here). But let me be explicit one more time for you.

I hope we can agree that in order for this kind of tracking to be useful
you need to cover _callers_ of the allocator or in the ideal world
the users/owner of the tracked memory (the later is sometimes much
harder/impossible to track when the memory is handed over from one peer
to another).

It is not particularly useful IMO to learn that a large portion of the
memory has been allocated by, say, vmalloc or kvmalloc, right? Yet
neither of the two is handled by the proposed tracking, and covering
them would require additional code to be added and _maintained_. Even
that would still be far from complete: we have the bulk allocator,
mempools etc.

If that was not enough, some of those allocators are used by library code
like seq_file, networking pools, the module loader and whatnot. So this
grows and effectively doubles the API space for many allocators, as they
need both the normal API and one which can pass the tracking context
down the path to prevent double tracking. Right?
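
To illustrate the doubling with a sketch (all names below are hypothetical,
invented for this illustration only):

	#include <stddef.h>

	struct alloc_tag;	/* per-callsite accounting object */

	/* the plain entry point accounts the allocation to its own callsite... */
	void *my_kvmalloc(size_t size);

	/* ...so every wrapper grows a second variant that forwards the caller's
	 * tag; otherwise library helpers (seq_file, mempools, ...) get credited
	 * instead of the users which actually caused the allocation */
	void *my_kvmalloc_tagged(struct alloc_tag *tag, size_t size);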

This in my book is a considerable maintenance burden. And especially for
the MM subsystem this means additional burden because we have a very
rich allocators APIs.

You are absolutely right that processing stack traces is a PITA, but it
allows one to see the actual callers irrespective of how many layers of
indirection or library code the allocation goes through.

> > it adds on top of
> > our existing allocator layers complexity but it would need to spread beyond
> > MM to be useful because it is usually outside of MM where leaks happen.
> 
> If you want the tracking to happen at a different level of the call stack, 
> just
> call _kmalloc() directly and call alloc_tag_add()/sub() yourself.

As pointed out above, this just scales poorly and adds to the API space. Not
to mention that direct use of alloc_tag_add can just confuse layers
below which rely on the same thing.

Hope this makes it clearer.
-- 
Michal Hocko
SUSE Labs



Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-05 Thread Michal Hocko
On Thu 01-09-22 16:15:02, Kent Overstreet wrote:
> On Thu, Sep 01, 2022 at 12:39:11PM -0700, Suren Baghdasaryan wrote:
> > kmemleak is known to be slow and it's even documented [1], so I hope I
> > can skip that part. For page_owner to provide the comparable
> > information we would have to capture the call stacks for all page
> > allocations unlike our proposal which allows to do that selectively
> > for specific call sites. I'll post the overhead numbers of call stack
> > capturing once I'm finished with profiling the latest code, hopefully
> > sometime tomorrow, in the worst case after the long weekend.
> 
> To expand on this further: we're stashing a pointer to the alloc_tag, which is
> defined at the allocation callsite. That's how we're able to decrement the
> proper counter on free, and why this beats any tracing based approach - with
> tracing you'd instead have to correlate allocate/free events. Ouch.
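
The general shape of that trick, in a self-contained user-space sketch
(names hypothetical; the actual RFC uses per-cpu counters and kernel
infrastructure, and the GCC statement-expression syntax is assumed):

	#include <stdio.h>
	#include <stdlib.h>

	struct alloc_tag {
		const char *file;
		int line;
		long bytes;
	};

	struct hdr { struct alloc_tag *tag; };

	/* one static tag per callsite; a pointer to it is stashed in a
	 * header in front of the allocation */
	#define tagged_alloc(sz) ({						\
		static struct alloc_tag _tag = { __FILE__, __LINE__, 0 };	\
		struct hdr *_h = malloc(sizeof(*_h) + (sz));			\
		if (_h) { _h->tag = &_tag; _tag.bytes += (sz); }		\
		_h ? (void *)(_h + 1) : NULL;					\
	})

	/* free() walks back to the same tag - no event correlation needed */
	static void tagged_free(void *p, size_t sz)
	{
		struct hdr *h = (struct hdr *)p - 1;

		h->tag->bytes -= sz;
		free(h);
	}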
> 
> > > Yes, tracking back the call trace would be really needed. The question
> > > is whether this is really prohibitively expensive. How much overhead are
> > > we talking about? There is no free lunch here, really.  You either have
> > > the overhead during runtime when the feature is used or on the source
> > > code level for all the future development (with a maze of macros and
> > > wrappers).
> 
> The full call stack is really not what you want in most applications - that's
> what people think they want at first, and why page_owner works the way it does,
> but it turns out that combining all the different but related stack traces
> _sucks_ (so why were you saving them in the first place?), and then you have to
> do a separate memory allocation for each stack trace, which destroys performance.

I do agree that the full stack trace is likely not what you need. But
the portion of the stack that you need is not really clear because the
relevant part might be on a different level of the calltrace depending
on the allocation site. Take this as an example:
{traverse, seq_read_iter, single_open_size} -> seq_buf_alloc -> kvmalloc -> kmalloc

This whole part of the stack is not really all that interesting, and you
would have to hook in pretty high at the API layer to catch something
useful. And please remember that the seq_file interface is heavily used
throughout the kernel. I wouldn't suspect seq_file itself to be buggy,
that is well exercised code, but its users can botch things and that is
where the leak would happen. There are many other examples like that
where the allocation is done at a lib/infrastructure layer (the sysfs
framework, mempools, network pool allocators and whatnot). We do care
about those users, really. Ad-hoc pool allocators built on top of the
core MM allocators are not really uncommon. And I am really skeptical we
really want to add all the tagging source code level changes to each and
every one of them.

This is really my main concern about this whole work. Not only does it add a
considerable maintenance burden to the core MM, because it adds complexity on
top of our existing allocator layers, but it would need to spread beyond
MM to be useful, because it is usually outside of MM where leaks happen.
-- 
Michal Hocko
SUSE Labs



Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-05 Thread Michal Hocko
On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote:
> On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko  wrote:
[...]
> > Yes, tracking back the call trace would be really needed. The question
> > is whether this is really prohibitively expensive. How much overhead are
> > we talking about? There is no free lunch here, really.  You either have
> > the overhead during runtime when the feature is used or on the source
> > code level for all the future development (with a maze of macros and
> > wrappers).
> 
> As promised, I profiled a simple test that repeatedly makes 10
> allocations/frees in a loop and measured overheads of code tagging,
> call stack capturing and tracing+BPF for page and slab allocations.
> Summary:
> 
> Page allocations (overheads are compared to get_free_pages() duration):
> 6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
> 8.8% lookup_page_ext
> 1237% call stack capture
> 139% tracepoint with attached empty BPF program

Yes, I am not surprised that the call stack capturing is really
expensive compared to the allocator fast path (which is really highly
optimized, and I suspect that with a 10 allocation/free loop you mostly get
your memory from the pcp lists). Is this overhead still _that_ visible
for somewhat less micro-optimized workloads which have to take slow paths
as well?

Also what kind of stack unwinder is configured (I guess ORC)? This is
not my area but from what I remember the unwinder overhead varies
between ORC and FP.

And just to make it clear: I do realize that an overhead from the stack
unwinding is unavoidable. And code tagging would logically have lower
overhead as it performs much less work. But the main point is whether
our existing stack unwinding approach is really prohibitively expensive
to be used for debugging purposes on production systems. I might
misremember, but I recall people having bigger concerns with page_owner's
memory footprint than the actual stack unwinder overhead.
-- 
Michal Hocko
SUSE Labs



Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-01 Thread Michal Hocko
On Thu 01-09-22 08:33:19, Suren Baghdasaryan wrote:
> On Thu, Sep 1, 2022 at 12:18 AM Michal Hocko  wrote:
[...]
> > So I find Peter's question completely appropriate while your response to
> > that not so much! Maybe ftrace is not the right tool for the intended
> > job. Maybe there are other ways and it would be really great to show
> > that those have been evaluated and they are not suitable for a), b) and
> > c) reasons.
> 
> That's fair.
> For memory tracking I looked into using kmemleak and page_owner which
> can't match the required functionality at an overhead acceptable for
> production and pre-production testing environments.

Being more specific would be really helpful, especially when your cover
letter suggests that you rely on page_owner/memcg metadata as well to
match allocations with their freeing parts.

> Traces + BPF I
> haven't evaluated myself, but heard from other members of my team who
> tried using that in a production environment with poor results. I'll try
> to get more specific information on that.

That would be helpful as well.

> > E.g. Oscar has been working on extending page_ext to track number of
> > allocations for a specific calltrace[1]. Is this a 1:1 replacement? No! But
> > it can help in environments where page_ext can be enabled and it is
> > completely non-intrusive to the MM code.
> 
> Thanks for pointing out this work. I'll need to review and maybe
> profile it before making any claims.
> 
> >
> > If the page_ext overhead is not desirable/acceptable then I am sure
> > there are other options. E.g. kprobes/LivePatching framework can hook
> > into functions and alter their behavior. So why not use that for data
> > collection? Has this been evaluated at all?
> 
> I'm not sure how I can hook into say alloc_pages() to find out where
> it was called from without capturing the call stack (which would
> introduce an overhead at every allocation). Would love to discuss this
> or other alternatives if they can be done with low enough overhead.

Yes, tracking back the call trace would be really needed. The question
is whether this is really prohibitively expensive. How much overhead are
we talking about? There is no free lunch here, really.  You either have
the overhead during runtime when the feature is used or on the source
code level for all the future development (with a maze of macros and
wrappers).

Thanks!
-- 
Michal Hocko
SUSE Labs



Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-09-01 Thread Michal Hocko
On Wed 31-08-22 15:01:54, Kent Overstreet wrote:
> On Wed, Aug 31, 2022 at 12:47:32PM +0200, Michal Hocko wrote:
> > On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> > Whatever the case, asking for an explanation as to why equivalent functionality
> > cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.
> > 
> > Fully agreed and this is especially true for a change this size
> > 77 files changed, 3406 insertions(+), 703 deletions(-)
> 
> In the case of memory allocation accounting, you flat cannot do this with 
> ftrace
> - you could maybe do a janky version that isn't fully accurate, much slower,
> more complicated for the developer to understand and debug and more 
> complicated
> for the end user.
> 
> But please, I invite anyone who's actually been doing this with ftrace to
> demonstrate otherwise.
> 
> Ftrace just isn't the right tool for the job here - we're talking about adding
> per callsite accounting to some of the fastest fast paths in the kernel.
> 
> And the size of the changes for memory allocation accounting is much more
> reasonable:
>  33 files changed, 623 insertions(+), 99 deletions(-)
> 
> The code tagging library should exist anyway; it's been open coded half a
> dozen times in the kernel already.
> 
> And once we've got that, the time stats code is _also_ far simpler than doing it
> with ftrace would be. If anyone here has successfully debugged latency issues
> with ftrace, I'd really like to hear it. Again, for debugging latency issues you
> want something that can always be on, and that's not cheap with ftrace - and
> never mind the hassle of correlating start and end wait trace events, building
> up histograms, etc. - that's all handled here.
> 
> Cheap, simple, easy to use. What more could you want?

A big ad on a banner. But more seriously.

This patchset is _huge_ and touches a lot of different areas. It will
not only be hard to review but even harder to maintain long term. So
it is completely reasonable to ask for potential alternatives with a
smaller code footprint. I am pretty sure you are aware of that workflow.

So I find Peter's question completely appropriate while your response to
that not so much! Maybe ftrace is not the right tool for the intended
job. Maybe there are other ways and it would be really great to show
that those have been evaluated and they are not suitable for a), b) and
c) reasons.

E.g. Oscar has been working on extending page_ext to track number of
allocations for a specific calltrace[1]. Is this a 1:1 replacement? No! But
it can help in environments where page_ext can be enabled and it is
completely non-intrusive to the MM code.

If the page_ext overhead is not desirable/acceptable then I am sure
there are other options. E.g. kprobes/LivePatching framework can hook
into functions and alter their behavior. So why not use that for data
collection? Has this been evaluated at all?
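
For concreteness, a minimal sketch of the kind of hook being suggested,
counting entries into the page allocator from a kprobe (the symbol name
varies across kernel versions, and attributing the *caller* would still
require unwinding the stack from the handler):

	#include <linux/atomic.h>
	#include <linux/kprobes.h>
	#include <linux/module.h>

	static atomic64_t hits = ATOMIC64_INIT(0);

	static int pre(struct kprobe *p, struct pt_regs *regs)
	{
		atomic64_inc(&hits);	/* cheap; no caller attribution here */
		return 0;
	}

	static struct kprobe kp = {
		.symbol_name = "__alloc_pages",
		.pre_handler = pre,
	};

	static int __init probe_init(void)
	{
		return register_kprobe(&kp);
	}

	static void __exit probe_exit(void)
	{
		unregister_kprobe(&kp);
		pr_info("allocator entries: %lld\n",
			(long long)atomic64_read(&hits));
	}

	module_init(probe_init);
	module_exit(probe_exit);
	MODULE_LICENSE("GPL");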

And please note that I am not claiming the presented work is approaching
the problem from a wrong direction. It might very well solve multiple
problems in a single go, _but_ the long term code maintenance burden
really has to be carefully evaluated, and if we can achieve a
reasonable subset of the functionality with an existing infrastructure
then I would be inclined to sacrifice some portions for a considerably
smaller code footprint.

[1] http://lkml.kernel.org/r/20220901044249.4624-1-osalva...@suse.de

-- 
Michal Hocko
SUSE Labs



Re: [RFC PATCH 00/30] Code tagging framework and applications

2022-08-31 Thread Michal Hocko
On Wed 31-08-22 11:19:48, Mel Gorman wrote:
> On Wed, Aug 31, 2022 at 04:42:30AM -0400, Kent Overstreet wrote:
> > On Wed, Aug 31, 2022 at 09:38:27AM +0200, Peter Zijlstra wrote:
> > > On Tue, Aug 30, 2022 at 02:48:49PM -0700, Suren Baghdasaryan wrote:
> > > > ===
> > > > Code tagging framework
> > > > ===
> > > > Code tag is a structure identifying a specific location in the source code
> > > > which is generated at compile time and can be embedded in an
> > > > application-specific structure. Several applications of code tagging are
> > > > included in this RFC, such as memory allocation tracking, dynamic fault
> > > > injection, latency tracking and improved error code reporting.
> > > > Basically, it takes the old trick of "define a special elf section for
> > > > objects of a given type so that we can iterate over them at runtime" and
> > > > creates a proper library for it.
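
For readers unfamiliar with the trick just described, a self-contained
user-space sketch of the ELF-section pattern (GCC/ELF specific; the linker
synthesizes the __start_/__stop_ symbols for C-identifier section names):

	#include <stdio.h>

	struct tag { const char *name; };

	/* each use drops one object into the "tags" section */
	#define DEFINE_TAG(n) \
		static const struct tag n __attribute__((used, section("tags"))) = { #n }

	/* provided automatically by the linker */
	extern const struct tag __start_tags[], __stop_tags[];

	DEFINE_TAG(first_site);
	DEFINE_TAG(second_site);

	int main(void)
	{
		for (const struct tag *t = __start_tags; t < __stop_tags; t++)
			printf("tag: %s\n", t->name);
		return 0;
	}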
> > > 
> > > I might be super dense this morning, but what!? I've skimmed through the
> > > set and I don't think I get it.
> > > 
> > > What does this provide that ftrace/kprobes don't already allow?
> > 
> > You're kidding, right?
> 
> It's a valid question. From the description, its main addition that would
> be hard to do with ftrace or probes is catching where an error code is
> returned. A secondary addition would be catching all historical state and
> not just state since the tracing started.
> 
> It's also unclear *who* would enable this. It looks like it would mostly
> have value during the development stage of an embedded platform to track
> kernel memory usage on a per-application basis in an environment where it
> may be difficult to set up tracing and tracking. Would it ever be enabled
> in production? Would a distribution ever enable this? If it's enabled, the
> overhead cannot be disabled/enabled at run or boot time, so anyone enabling
> this would carry the cost without necessarily ever consuming the data.
> 
> It might be an ease-of-use thing. Gathering the information from traces
> is tricky and would need combining multiple different elements and that
> is development effort but not impossible.
> 
> Whatever the case, asking for an explanation as to why equivalent functionality
> cannot be created from ftrace/kprobe/eBPF/whatever is reasonable.

Fully agreed and this is especially true for a change this size
77 files changed, 3406 insertions(+), 703 deletions(-)

-- 
Michal Hocko
SUSE Labs



Re: [PATCH] mm, page_alloc: fix build_zonerefs_node()

2022-04-07 Thread Michal Hocko
On Thu 07-04-22 14:12:38, David Hildenbrand wrote:
> On 07.04.22 14:04, Michal Hocko wrote:
> > On Thu 07-04-22 13:58:44, David Hildenbrand wrote:
> > [...]
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index 3589febc6d31..130a2feceddc 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -6112,10 +6112,8 @@ static int build_zonerefs_node(pg_data_t *pgdat, 
> >>> struct zoneref *zonerefs)
> >>>   do {
> >>>   zone_type--;
> >>>   zone = pgdat->node_zones + zone_type;
> >>> - if (managed_zone(zone)) {
> >>> - zoneref_set_zone(zone, &zonerefs[nr_zones++]);
> >>> - check_highest_zone(zone_type);
> >>> - }
> >>> + zoneref_set_zone(zone, &zonerefs[nr_zones++]);
> >>> + check_highest_zone(zone_type);
> >>>   } while (zone_type);
> >>>  
> >>>   return nr_zones;
> >>
> >> I don't think having !populated zones in the zonelist is a particularly
> >> good idea. Populated vs !populated changes only during page
> >> onlining/offlining.
> >>
> >> If I'm not wrong, with your patch we'd even include ZONE_DEVICE here ...
> > 
> > What kind of problem would that cause? The allocator wouldn't see any
> > pages at all so it would fall back to the next one. Maybe kswapd would
> > need some tweak to have a bail-out condition but, as mentioned in the
> > thread already, !populated or !managed zones are not all that
> > much different from completely depleted zones. The fact that we are
> > making that distinction has led to some bugs and I suspect it makes the
> > code more complex without a very good reason.
> 
> I assume performance problems. Assume you have an ordinary system with
> multiple NUMA nodes and no MOVABLE memory. Most nodes will only have
> ZONE_NORMAL. Yet, you'd include ZONE_DMA* and ZONE_MOVABLE that will
> always remain empty to be traversed on each and every allocation
> fallback. Of course, we could measure, but IMHO at least *that* part of
> memory onlining/offlining is not the complicated part :D

You've got a good point here. I guess there will be use cases that really
benefit from every single CPU cycle spared in the allocator hot path,
which really depends on the zonelist traversal.

I have very briefly had a look at our remaining usage of managed_zone()
and there are not that many left. We have 2 in page_alloc.c via
has_managed_dma(). I guess we could drop that one and use __GFP_NOWARN
in dma_atomic_pool_init but this is nothing really earth shattering.
The remaining occurrences are in vmscan.c:
- should_continue_reclaim, pgdat_balanced - required because
  they iterate all zones within the zone index and need to
  decide whether they are balanced or not. We can make empty
  zones a special case and make them always balanced
- kswapd_shrink_node - required because we would be increasing
  the reclaim target for empty zones
- update_reclaim_active - required because we do not want to
  alter zone state if it is not a subject of the reclaim, which
  empty zones are not by definition.
- balance_pgdat - the first check is likely a micro-optimization,
  reclaim_idx is needed to have a populated zone there
- wakeup_kswapd - I dunno
- shrink_node, allow_direct_reclaim, lruvec_lru_size - micro-optimizations
- pgdat_watermark_boosted - a micro-optimization I suspect, as an empty
  zone shouldn't ever get watermark_boost
- pgdat_balanced - functionally dd

So we can get rid of quite some but we will still need some of them.
-- 
Michal Hocko
SUSE Labs



Re: [PATCH v2] mm, page_alloc: fix build_zonerefs_node()

2022-04-07 Thread Michal Hocko
[CC Mel]

On Thu 07-04-22 14:06:37, Juergen Gross wrote:
> Since commit 6aa303defb74 ("mm, vmscan: only allocate and reclaim from
> zones with pages managed by the buddy allocator") only zones with free
> memory are included in a built zonelist. This is problematic when e.g.
> all memory of a zone has been ballooned out when zonelists are being
> rebuilt.
> 
> The decision whether to rebuild the zonelists when onlining new memory
> is done based on populated_zone() returning 0 for the zone the memory
> will be added to. The new zone is added to the zonelists only, if it
> has free memory pages (managed_zone() returns a non-zero value) after
> the memory has been onlined. This implies, that onlining memory will
> always free the added pages to the allocator immediately, but this is
> not true in all cases: when e.g. running as a Xen guest the onlined
> new memory will be added only to the ballooned memory list, it will be
> freed only when the guest is being ballooned up afterwards.

Thanks, this is much clearer!
 
> Another problem with using managed_zone() for the decision whether a
> zone is being added to the zonelists is that a zone with all memory
> used will in fact be removed from all zonelists in case the zonelists
> happen to be rebuilt.
> 
> Use populated_zone() when building a zonelist as it has been done
> before that commit.
> 
> Cc: sta...@vger.kernel.org
> Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with 
> pages managed by the buddy allocator")
> Reported-by: Marek Marczykowski-Górecki 
> Signed-off-by: Juergen Gross 
> Acked-by: Michal Hocko 
> ---
> V2:
> - updated commit message (Michal Hocko)
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bdc8f60ae462..3d0662af3289 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6128,7 +6128,7 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct 
> zoneref *zonerefs)
>   do {
>   zone_type--;
>   zone = pgdat->node_zones + zone_type;
> - if (managed_zone(zone)) {
> + if (populated_zone(zone)) {
>   zoneref_set_zone(zone, [nr_zones++]);
>   check_highest_zone(zone_type);
>   }
> -- 
> 2.34.1

-- 
Michal Hocko
SUSE Labs



Re: [PATCH] mm, page_alloc: fix build_zonerefs_node()

2022-04-07 Thread Michal Hocko
On Thu 07-04-22 13:58:44, David Hildenbrand wrote:
[...]
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3589febc6d31..130a2feceddc 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6112,10 +6112,8 @@ static int build_zonerefs_node(pg_data_t *pgdat, 
> > struct zoneref *zonerefs)
> > do {
> > zone_type--;
> > zone = pgdat->node_zones + zone_type;
> > -   if (managed_zone(zone)) {
> > -   zoneref_set_zone(zone, &zonerefs[nr_zones++]);
> > -   check_highest_zone(zone_type);
> > -   }
> > +   zoneref_set_zone(zone, &zonerefs[nr_zones++]);
> > +   check_highest_zone(zone_type);
> > } while (zone_type);
> >  
> > return nr_zones;
> 
> I don't think having !populated zones in the zonelist is a particularly
> good idea. Populated vs !populated changes only during page
> onlining/offlining.
> 
> If I'm not wrong, with your patch we'd even include ZONE_DEVICE here ...

What kind of problem would that cause? The allocator wouldn't see any
pages at all so it would fall back to the next one. Maybe kswapd would
need some tweak to have a bail-out condition but, as mentioned in the
thread already, !populated or !managed zones are not all that
much different from completely depleted zones. The fact that we are
making that distinction has led to some bugs and I suspect it makes the
code more complex without a very good reason.

> I'd vote for going with the simple fix first, which should be good
> enough AFAIKT.

yes, see the other reply

-- 
Michal Hocko
SUSE Labs



Re: [PATCH] mm, page_alloc: fix build_zonerefs_node()

2022-04-07 Thread Michal Hocko
On Thu 07-04-22 13:40:25, Michal Hocko wrote:
[...]
> Now to your patch. I suspect this is not sufficient for the full hotplug
> situation. Consider a new NUMA node to be hotadded. hotadd_new_pgdat
> will call build_all_zonelists but the zone is not populated yet at that
> moment unless I am missing something. We do rely on online_pages to
> rebuild once pages are onlined - which usually means they are freed to
> the page allocator.

OK, I've managed to get lost in the code and misread the onlining part.
After re-reading the code I have concluded that the patch is good as is.
online_pages relies on populated_zone() so it will pass, and zonelists will
be regenerated even without any pages freed to the allocator. Sorry for
the confusion. But I guess this still proves my other point that the
code is really subtle and messy, so I guess the less rebuilding we do the
better. There are two ways: go with your patch and do the cleanup on
top, or merge the two.

That being said
Acked-by: Michal Hocko 
to your patch with an improved changelog to be more specific about the
underlying problem.

Thanks and sorry for the confusion.
-- 
Michal Hocko
SUSE Labs



Re: [PATCH] mm, page_alloc: fix build_zonerefs_node()

2022-04-07 Thread Michal Hocko
On Thu 07-04-22 13:17:19, Juergen Gross wrote:
> On 07.04.22 13:07, Michal Hocko wrote:
> > On Thu 07-04-22 12:45:41, Juergen Gross wrote:
> > > On 07.04.22 12:34, Michal Hocko wrote:
> > > > Ccing Mel
> > > > 
> > > > On Thu 07-04-22 11:32:21, Juergen Gross wrote:
> > > > > Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
> > > > > initialization") only zones with free memory are included in a built
> > > > > zonelist. This is problematic when e.g. all memory of a zone has been
> > > > > ballooned out.
> > > > 
> > > > What is the actual problem there?
> > > 
> > > When running as a Xen guest, new hotplugged memory will not be onlined
> > > automatically, but only on special request. This is done in order to
> > > support adding e.g. the possibility to use another GB of memory, while
> > > adding only a part of that memory initially.
> > > 
> > > In case adding that memory is populating a new zone, the page allocator
> > > won't be able to use this memory when it is onlined, as the zone wasn't
> > > added to the zonelist, due to managed_zone() returning 0.
> > 
> > How is that memory onlined? Because "regular" onlining (online_pages())
> > does rebuild zonelists if their zone hasn't been populated before.
> 
> The Xen balloon driver has an own callback for onlining pages. The pages
> are just added to the ballooned-out page list without handing them to the
> allocator. This is done only when the guest is ballooned up.

OK, I see. Let me just rephrase to see whether we are on the same page.
Xen is overriding the online_page_callback with xen_online_page, which
doesn't free pages to the page allocator, which means that a zone might
remain unpopulated after onlining. This means that the default zonelist
rebuild is not done, and later on, when those pages are finally
released to the allocator, there is no build_all_zonelists happening, so
those freed pages are not really visible to the allocator via zonelist
fallback allocation.
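
Roughly, the callback in question looks like this (a simplified paraphrase
of drivers/xen/balloon.c; details vary by kernel version):

	/* registered via set_online_page_callback(&xen_online_page) */
	static void xen_online_page(struct page *page, unsigned int order)
	{
		unsigned long i, nr = 1UL << order;

		mutex_lock(&balloon_mutex);
		for (i = 0; i < nr; i++)
			balloon_append(page + i);	/* parked, not freed to the buddy */
		mutex_unlock(&balloon_mutex);
	}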

Now to your patch. I suspect this is not sufficient for the full hotplug
situation. Consider a new NUMA node to be hotadded. hotadd_new_pgdat
will call build_all_zonelists but the zone is not populated yet at that
moment unless I am missing something. We do rely on online_pages to
rebuild once pages are onlined - which usually means they are freed to
the page allocator.

The zonelist building is kinda messy TBH. I have to say that I am not
really clear on Mel's 6aa303defb74 ("mm, vmscan: only allocate and
reclaim from zones with pages managed by the buddy allocator") because,
as you have said, an unpopulated zone is not (or shouldn't be) really all
that different from a depleted zone.
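
For reference, the two predicates being contrasted are defined in
include/linux/mmzone.h roughly as follows (paraphrased; the exact form
varies between kernel versions):

	static inline bool managed_zone(struct zone *zone)
	{
		/* pages handed to the buddy allocator */
		return zone_managed_pages(zone);
	}

	static inline bool populated_zone(struct zone *zone)
	{
		/* pages physically present, whether or not the buddy sees them */
		return zone->present_pages;
	}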

I think a better and more complete fix would be the following. In other
words the zonelists will be built for all present zones. Not sure
whether that is going to break 6aa303defb74 though.

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2a9627dc784c..880c455e2557 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1062,7 +1062,6 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
   struct zone *zone, struct memory_group *group)
 {
unsigned long flags;
-   int need_zonelists_rebuild = 0;
const int nid = zone_to_nid(zone);
int ret;
struct memory_notify arg;
@@ -1106,17 +1105,13 @@ int __ref online_pages(unsigned long pfn, unsigned long 
nr_pages,
 * This means the page allocator ignores this zone.
 * So, zonelist must be updated after online.
 */
-   if (!populated_zone(zone)) {
-   need_zonelists_rebuild = 1;
+   if (!populated_zone(zone))
setup_zone_pageset(zone);
-   }
 
online_pages_range(pfn, nr_pages);
adjust_present_page_count(pfn_to_page(pfn), group, nr_pages);
 
node_states_set_node(nid, &arg);
-   if (need_zonelists_rebuild)
-   build_all_zonelists(NULL);
 
/* Basic onlining is complete, allow allocation of onlined pages. */
undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
@@ -1985,10 +1980,8 @@ int __ref offline_pages(unsigned long start_pfn, 
unsigned long nr_pages,
/* reinitialise watermarks and update pcp limits */
init_per_zone_wmark_min();
 
-   if (!populated_zone(zone)) {
+   if (!populated_zone(zone))
zone_pcp_reset(zone);
-   build_all_zonelists(NULL);
-   }
 
node_states_clear_node(node, &arg);
if (arg.status_change_nid >= 0) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31..130a2feceddc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6112,10

Re: [PATCH] mm, page_alloc: fix build_zonerefs_node()

2022-04-07 Thread Michal Hocko
On Thu 07-04-22 12:45:41, Juergen Gross wrote:
> On 07.04.22 12:34, Michal Hocko wrote:
> > Ccing Mel
> > 
> > On Thu 07-04-22 11:32:21, Juergen Gross wrote:
> > > Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
> > > initialization") only zones with free memory are included in a built
> > > zonelist. This is problematic when e.g. all memory of a zone has been
> > > ballooned out.
> > 
> > What is the actual problem there?
> 
> When running as a Xen guest, new hotplugged memory will not be onlined
> automatically, but only on special request. This is done in order to
> support adding e.g. the possibility to use another GB of memory, while
> adding only a part of that memory initially.
> 
> In case adding that memory is populating a new zone, the page allocator
> won't be able to use this memory when it is onlined, as the zone wasn't
> added to the zonelist, due to managed_zone() returning 0.

How is that memory onlined? Because "regular" onlining (online_pages())
does rebuild zonelists if their zone hasn't been populated before.

-- 
Michal Hocko
SUSE Labs



Re: [PATCH] mm, page_alloc: fix build_zonerefs_node()

2022-04-07 Thread Michal Hocko
Ccing Mel

On Thu 07-04-22 11:32:21, Juergen Gross wrote:
> Since commit 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist
> initialization") only zones with free memory are included in a built
> zonelist. This is problematic when e.g. all memory of a zone has been
> ballooned out.

What is the actual problem there?

> Use populated_zone() when building a zonelist as it has been done
> before that commit.
> 
> Cc: sta...@vger.kernel.org
> Fixes: 9d3be21bf9c0 ("mm, page_alloc: simplify zonelist initialization")

Did you mean to refer to 
6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with
pages managed by the buddy allocator")?

> Reported-by: Marek Marczykowski-Górecki 
> Signed-off-by: Juergen Gross 
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bdc8f60ae462..3d0662af3289 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6128,7 +6128,7 @@ static int build_zonerefs_node(pg_data_t *pgdat, struct 
> zoneref *zonerefs)
>   do {
>   zone_type--;
>   zone = pgdat->node_zones + zone_type;
> - if (managed_zone(zone)) {
> + if (populated_zone(zone)) {
>   zoneref_set_zone(zone, [nr_zones++]);
>   check_highest_zone(zone_type);
>   }
> -- 
> 2.34.1

-- 
Michal Hocko
SUSE Labs



Re: blocking vs. non-blocking mmu notifiers

2022-03-24 Thread Michal Hocko
On Wed 23-03-22 14:04:04, Jason Gunthorpe wrote:
> On Wed, Mar 23, 2022 at 05:49:43PM +0100, Michal Hocko wrote:
> > > The bug here is that, prior to commit a81461b0546c ("xen/gntdev: update
> > > to new mmu_notifier semantic"), mn_invl_range_start(), which takes a
> > > mutex, was wired to invalidate_page, which is defined to run in atomic
> > > context.
> > 
> > Yeah, we have already identified that but quickly realized that the
> > whole mmu notifier overhaul which this fix depends on would be a no-no for
> > backporting to our older code base. So we are trying to find our way
> > around that.
> 
> IMHO you don't need everything, just commit 369ea8242c0f ("mm/rmap:
> update to new mmu_notifier semantic v2") which adds the missing
> start/end outside the lock for the page callbacks.
> 
> Then you can take safely a8146 into gntdev.

Thanks Jason!

-- 
Michal Hocko
SUSE Labs



Re: blocking vs. non-blocking mmu notifiers

2022-03-23 Thread Michal Hocko
On Wed 23-03-22 13:31:46, Jason Gunthorpe wrote:
> On Wed, Mar 23, 2022 at 10:45:30AM +0100, Michal Hocko wrote:
> > [Let me add more people to the CC list - I am not really sure who is the
> >  most familiar with all the tricks that mmu notifiers might do]
> > 
> > On Wed 23-03-22 09:43:59, Juergen Gross wrote:
> > > Hi,
> > > 
> > > during analysis of a customer's problem on a 4.12 based kernel
> > > (deadlock due to a blocking mmu notifier in a Xen driver) I came
> > > across upstream patches 93065ac753e4 ("mm, oom: distinguish
> > > blockable mode for mmu notifiers") et al.
> > > 
> > > The backtrace of the blocked tasks was typically something like:
> > > 
> > >  #0 [c9004222f228] __schedule at 817223e2
> > >  #1 [c9004222f2b8] schedule at 81722a02
> > >  #2 [c9004222f2c8] schedule_preempt_disabled at 81722d0a
> > >  #3 [c9004222f2d0] __mutex_lock at 81724104
> > >  #4 [c9004222f360] mn_invl_range_start at c01fd398 
> > > [xen_gntdev]
> > >  #5 [c9004222f398] __mmu_notifier_invalidate_page at 8123375a
> > >  #6 [c9004222f3c0] try_to_unmap_one at 812112cb
> > >  #7 [c9004222f478] rmap_walk_file at 812105cd
> > >  #8 [c9004222f4d0] try_to_unmap at 81212450
> > >  #9 [c9004222f508] shrink_page_list at 811e0755
> > > #10 [c9004222f5c8] shrink_inactive_list at 811e13cf
> > > #11 [c9004222f6a8] shrink_node_memcg at 811e241f
> > > #12 [c9004222f790] shrink_node at 811e29c5
> > > #13 [c9004222f808] do_try_to_free_pages at 811e2ee1
> > > #14 [c9004222f868] try_to_free_pages at 811e3248
> > > #15 [c9004222f8e8] __alloc_pages_slowpath at 81262c37
> > > #16 [c9004222f9f0] __alloc_pages_nodemask at 8121afc1
> > > #17 [c9004222fa48] alloc_pages_current at 8122f350
> > > #18 [c9004222fa78] __get_free_pages at 8121685a
> > > #19 [c9004222fa80] __pollwait at 8127e795
> > > #20 [c9004222faa8] evtchn_poll at c00e802b [xen_evtchn]
> > > #21 [c9004222fab8] do_sys_poll at 8127f953
> > > #22 [c9004222fec8] sys_ppoll at 81280478
> > > #23 [c9004222ff30] do_syscall_64 at 81004954
> > > #24 [c9004222ff50] entry_SYSCALL_64_after_hwframe at 818000b6
> > > 
> > > It was found that the notifier of the Xen gntdev driver was using a
> > > mutex resulting in the deadlock.
> 
> The bug here is that, prior to commit a81461b0546c ("xen/gntdev: update
> to new mmu_notifier semantic"), mn_invl_range_start(), which takes a
> mutex, was wired to invalidate_page, which is defined to run in atomic
> context.

Yeah, we have already identified that but quickly realized that the
whole mmu notifier overhaul which this fix depends on would be a no-no for
backporting to our older code base. So we are trying to find our way
around that.

> > > Michal Hocko suggested that backporting above mentioned patch might
> > > help, as the mmu notifier call is happening in atomic context.
> 
> IIRC "blocking" was not supposed to be about atomic context or not, but
> more about time to completion/guaranteed completion as it is used from
> a memory reclaim path.

Yeah, when I was introducing that I was mostly concerned about the oom path.
But it is essentially impossible to tell what existing
assumptions different mmu notifiers operate under.
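
For context, the pattern that 93065ac753e4 and its follow-ups enable looks
roughly like this against the current upstream API (a sketch; struct my_ctx
and its lock are hypothetical):

	static int mn_invalidate_range_start(struct mmu_notifier *mn,
					     const struct mmu_notifier_range *range)
	{
		struct my_ctx *ctx = container_of(mn, struct my_ctx, mn);

		if (mmu_notifier_range_blockable(range))
			mutex_lock(&ctx->lock);
		else if (!mutex_trylock(&ctx->lock))
			return -EAGAIN;	/* non-blocking caller (e.g. OOM reaper) backs off */

		/* ... tear down mappings covered by [range->start, range->end) ... */

		mutex_unlock(&ctx->lock);
		return 0;
	}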

> 
> > Just to be more explicit. The current upstream code calls
> > mmu_notifier_invalidate_range while the page table locks are held.
> > Are there any notifiers which could sleep in those? 
> 
> There should not be, that would be a bug that lockdep would find.

OK, this is reassuring. Thanks!
-- 
Michal Hocko
SUSE Labs



Re: blocking vs. non-blocking mmu notifiers

2022-03-23 Thread Michal Hocko
[Let me add more people to the CC list - I am not really sure who is the
 most familiar with all the tricks that mmu notifiers might do]

On Wed 23-03-22 09:43:59, Juergen Gross wrote:
> Hi,
> 
> during analysis of a customer's problem on a 4.12 based kernel
> (deadlock due to a blocking mmu notifier in a Xen driver) I came
> across upstream patches 93065ac753e4 ("mm, oom: distinguish
> blockable mode for mmu notifiers") et al.
> 
> The backtrace of the blocked tasks was typically something like:
> 
>  #0 [c9004222f228] __schedule at 817223e2
>  #1 [c9004222f2b8] schedule at 81722a02
>  #2 [c9004222f2c8] schedule_preempt_disabled at 81722d0a
>  #3 [c9004222f2d0] __mutex_lock at 81724104
>  #4 [c9004222f360] mn_invl_range_start at c01fd398 [xen_gntdev]
>  #5 [c9004222f398] __mmu_notifier_invalidate_page at 8123375a
>  #6 [c9004222f3c0] try_to_unmap_one at 812112cb
>  #7 [c9004222f478] rmap_walk_file at 812105cd
>  #8 [c9004222f4d0] try_to_unmap at 81212450
>  #9 [c9004222f508] shrink_page_list at 811e0755
> #10 [c9004222f5c8] shrink_inactive_list at 811e13cf
> #11 [c9004222f6a8] shrink_node_memcg at 811e241f
> #12 [c9004222f790] shrink_node at 811e29c5
> #13 [c9004222f808] do_try_to_free_pages at 811e2ee1
> #14 [c9004222f868] try_to_free_pages at 811e3248
> #15 [c9004222f8e8] __alloc_pages_slowpath at 81262c37
> #16 [c9004222f9f0] __alloc_pages_nodemask at 8121afc1
> #17 [c9004222fa48] alloc_pages_current at 8122f350
> #18 [c9004222fa78] __get_free_pages at 8121685a
> #19 [c9004222fa80] __pollwait at 8127e795
> #20 [c9004222faa8] evtchn_poll at c00e802b [xen_evtchn]
> #21 [c9004222fab8] do_sys_poll at 8127f953
> #22 [c9004222fec8] sys_ppoll at 81280478
> #23 [c9004222ff30] do_syscall_64 at 81004954
> #24 [c9004222ff50] entry_SYSCALL_64_after_hwframe at 818000b6
> 
> It was found that the notifier of the Xen gntdev driver was using a
> mutex resulting in the deadlock.
> 
> Michal Hocko suggested that backporting the above mentioned patch might
> help, as the mmu notifier call is happening in atomic context.
> 
> Looking into that I was wondering whether try_to_unmap_one() shouldn't
> call mmu_notifier_invalidate_range_start_nonblock() instead of
> mmu_notifier_invalidate_range_start() if this is true. Otherwise I
> can't see how this deadlock could be avoided.

Just to be more explicit. The current upstream code calls
mmu_notifier_invalidate_range while the page table locks are held.
Are there any notifiers which could sleep in those? In other words
should we use the nonblock start/stop in try_to_unmap?
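To make that last question concrete, here is a minimal sketch - my
reconstruction on top of the current range based notifier API, not the
4.12 code base discussed above - of how a sleeping notifier like the
gntdev one is expected to honor the non-blocking mode:

static int mn_invl_range_start(struct mmu_notifier *mn,
			       const struct mmu_notifier_range *range)
{
	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);

	/* Only sleep when the caller told us blocking is allowed. */
	if (mmu_notifier_range_blockable(range))
		mutex_lock(&priv->lock);
	else if (!mutex_trylock(&priv->lock))
		return -EAGAIN;	/* non-blocking caller has to retry */

	/* ... invalidate the affected grant mappings ... */

	mutex_unlock(&priv->lock);
	return 0;
}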

> Any thoughts?
> 
> 
> Juergen






-- 
Michal Hocko
SUSE Labs



Re: [PATCH v2 3/5] mm/page_alloc: move pages to tail in move_to_free_list()

2020-10-06 Thread Michal Hocko
On Mon 05-10-20 14:15:32, David Hildenbrand wrote:
> Whenever we move pages between freelists via move_to_free_list()/
> move_freepages_block(), we don't actually touch the pages:
> 1. Page isolation doesn't actually touch the pages, it simply isolates
>pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
>When undoing isolation, we move the pages back to the target list.
> 2. Page stealing (steal_suitable_fallback()) moves free pages directly
>between lists without touching them.
> 3. reserve_highatomic_pageblock()/unreserve_highatomic_pageblock() moves
>free pages directly between freelists without touching them.
> 
> We already place pages to the tail of the freelists when undoing isolation
> via __putback_isolated_page(), let's do it in any case (e.g., if order <=
> pageblock_order) and document the behavior. To simplify, let's move the
> pages to the tail for all move_to_free_list()/move_freepages_block() users.
> 
> In 2., the target list is empty, so there should be no change. In 3.,
> we might observe a change, however, highatomic is more concerned about
> allocations succeeding than cache hotness - if we ever realize this
> change degrades a workload, we can special-case this instance and add a
> proper comment.
> 
> This change results in all pages getting onlined via online_pages() to
> be placed to the tail of the freelist.
> 
> Reviewed-by: Oscar Salvador 
> Acked-by: Pankaj Gupta 
> Reviewed-by: Wei Yang 
> Cc: Andrew Morton 
> Cc: Alexander Duyck 
> Cc: Mel Gorman 
> Cc: Michal Hocko 
> Cc: Dave Hansen 
> Cc: Vlastimil Babka 
> Cc: Wei Yang 
> Cc: Oscar Salvador 
> Cc: Mike Rapoport 
> Cc: Scott Cheloha 
> Cc: Michael Ellerman 
> Signed-off-by: David Hildenbrand 

Much simpler!
Acked-by: Michal Hocko 

Thanks!

> ---
>  mm/page_alloc.c | 10 +++---
>  mm/page_isolation.c |  5 +
>  2 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index df5ff0cd6df1..b187e46cf640 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -901,13 +901,17 @@ static inline void add_to_free_list_tail(struct page 
> *page, struct zone *zone,
>   area->nr_free++;
>  }
>  
> -/* Used for pages which are on another list */
> +/*
> + * Used for pages which are on another list. Move the pages to the tail
> + * of the list - so the moved pages won't immediately be considered for
> + * allocation again (e.g., optimization for memory onlining).
> + */
>  static inline void move_to_free_list(struct page *page, struct zone *zone,
>unsigned int order, int migratetype)
>  {
>   struct free_area *area = &zone->free_area[order];
>  
> - list_move(&page->lru, &area->free_list[migratetype]);
> + list_move_tail(&page->lru, &area->free_list[migratetype]);
>  }
>  
>  static inline void del_page_from_free_list(struct page *page, struct zone 
> *zone,
> @@ -2340,7 +2344,7 @@ static inline struct page 
> *__rmqueue_cma_fallback(struct zone *zone,
>  #endif
>  
>  /*
> - * Move the free pages in a range to the free lists of the requested type.
> + * Move the free pages in a range to the freelist tail of the requested type.
>   * Note that start_page and end_pages are not aligned on a pageblock
>   * boundary. If alignment is required, use move_freepages_block()
>   */
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index abfe26ad59fd..83692b937784 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -106,6 +106,11 @@ static void unset_migratetype_isolate(struct page *page, 
> unsigned migratetype)
>* If we isolate freepage with more than pageblock_order, there
>* should be no freepage in the range, so we could avoid costly
>* pageblock scanning for freepage moving.
> +  *
> +  * We didn't actually touch any of the isolated pages, so place them
> +  * to the tail of the freelist. This is an optimization for memory
> +  * onlining - just onlined memory won't immediately be considered for
> +  * allocation.
>*/
>   if (!isolated_page) {
>   nr_pages = move_freepages_block(zone, page, migratetype, NULL);
> -- 
> 2.26.2

-- 
Michal Hocko
SUSE Labs



Re: [PATCH v1 3/5] mm/page_alloc: always move pages to the tail of the freelist in unset_migratetype_isolate()

2020-10-05 Thread Michal Hocko
On Fri 02-10-20 17:20:09, David Hildenbrand wrote:
> On 02.10.20 15:24, Michal Hocko wrote:
> > On Mon 28-09-20 20:21:08, David Hildenbrand wrote:
> >> Page isolation doesn't actually touch the pages, it simply isolates
> >> pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
> >>
> >> We already place pages to the tail of the freelists when undoing
> >> isolation via __putback_isolated_page(), let's do it in any case
> >> (e.g., if order <= pageblock_order) and document the behavior.
> >>
> >> Add a "to_tail" parameter to move_freepages_block() but introduce
> >> a new move_to_free_list_tail() - similar to add_to_free_list_tail().
> >>
> >> This change results in all pages getting onlined via online_pages() to
> >> be placed to the tail of the freelist.
> > 
> > Is there anything preventing us from doing this unconditionally? Or in
> > other words, does any of the existing callers of move_freepages_block
> > benefit from adding to the head?
> 
> 1. mm/page_isolation.c:set_migratetype_isolate()
> 
> We move stuff to the MIGRATE_ISOLATE list, we don't care about the order
> there.
> 
> 2. steal_suitable_fallback():
> 
> I don't think we care too much about the order when already stealing
> pageblocks ... and the freelist is empty I guess?
> 
> 3. reserve_highatomic_pageblock()/unreserve_highatomic_pageblock()
> 
> Not sure if we really care.

Honestly, I have no idea. I can imagine that some atomic high order
workloads (e.g. in net) might benefit from cache line hot pages but I am
not sure this is really observable. If yes it would likely be better to
have this documented than relying on a wild guess. If we do not have any
evidence then I would vote for simplicity first and go with an
unconditional add_to_tail, which would simplify your patch a bit.

Maybe Vlastimil or Mel would have a better picture.

-- 
Michal Hocko
SUSE Labs



Re: [PATCH v1 5/5] mm/memory_hotplug: update comment regarding zone shuffling

2020-10-02 Thread Michal Hocko
On Mon 28-09-20 20:21:10, David Hildenbrand wrote:
> As we no longer shuffle via generic_online_page() and when undoing
> isolation, we can simplify the comment.
> 
> We now effectively shuffle only once (properly) when onlining new
> memory.
> 
> Cc: Andrew Morton 
> Cc: Alexander Duyck 
> Cc: Mel Gorman 
> Cc: Michal Hocko 
> Cc: Dave Hansen 
> Cc: Vlastimil Babka 
> Cc: Wei Yang 
> Cc: Oscar Salvador 
> Cc: Mike Rapoport 
> Signed-off-by: David Hildenbrand 

Acked-by: Michal Hocko 

> ---
>  mm/memory_hotplug.c | 11 ---
>  1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 9db80ee29caa..c589bd8801bb 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -859,13 +859,10 @@ int __ref online_pages(unsigned long pfn, unsigned long 
> nr_pages,
>   undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
>  
>   /*
> -  * When exposing larger, physically contiguous memory areas to the
> -  * buddy, shuffling in the buddy (when freeing onlined pages, putting
> -  * them either to the head or the tail of the freelist) is only helpful
> -  * for maintaining the shuffle, but not for creating the initial
> -  * shuffle. Shuffle the whole zone to make sure the just onlined pages
> -  * are properly distributed across the whole freelist. Make sure to
> -  * shuffle once pageblocks are no longer isolated.
> +  * Freshly onlined pages aren't shuffled (e.g., all pages are placed to
> +  * the tail of the freelist when undoing isolation). Shuffle the whole
> +  * zone to make sure the just onlined pages are properly distributed
> +  * across the whole freelist - to create an initial shuffle.
>*/
>   shuffle_zone(zone);
>  
> -- 
> 2.26.2

-- 
Michal Hocko
SUSE Labs



Re: [PATCH v1 4/5] mm/page_alloc: place pages to tail in __free_pages_core()

2020-10-02 Thread Michal Hocko
On Mon 28-09-20 20:21:09, David Hildenbrand wrote:
> __free_pages_core() is used when exposing fresh memory to the buddy
> during system boot and when onlining memory in generic_online_page().
> 
> generic_online_page() is used in two cases:
> 
> 1. Direct memory onlining in online_pages().
> 2. Deferred memory onlining in memory-ballooning-like mechanisms (HyperV
>balloon and virtio-mem), when parts of a section are kept
>fake-offline to be fake-onlined later on.
> 
> In 1, we already place pages to the tail of the freelist. Pages will be
> freed to MIGRATE_ISOLATE lists first and moved to the tail of the freelists
> via undo_isolate_page_range().
> 
> In 2, we currently don't implement a proper rule. In case of virtio-mem,
> where we currently always online MAX_ORDER - 1 pages, the pages will be
> placed to the HEAD of the freelist - undesirable. While the hyper-v
> balloon calls generic_online_page() with single pages, usually it will
> call it on successive single pages in a larger block.
> 
> The pages are fresh, so place them to the tail of the freelists and avoid
> the PCP. In __free_pages_core(), remove the now superfluous call to
> set_page_refcounted() and add a comment regarding page initialization and
> the refcount.
> 
> Note: In 2. we currently don't shuffle. If ever relevant (page shuffling
> is usually of limited use in virtualized environments), we might want to
> shuffle after a sequence of generic_online_page() calls in the
> relevant callers.

It took some time to get through all the freeing paths with their subtle
differences but this looks reasonable. You mention that this influences
boot time free memory ordering as well, but only very briefly. I do not
expect this to make a huge difference but who knows. It makes some sense
to add pages in the order they show up in the physical address space.
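The net effect on __free_pages_core(), reduced to a sketch (my
simplification of the patch, so double check against the real diff
below):

void __free_pages_core(struct page *page, unsigned int order)
{
	unsigned int loop, nr_pages = 1 << order;
	struct page *p = page;

	/* memmap init marked these allocated; make them free (count 0) */
	for (loop = 0; loop < nr_pages; loop++, p++) {
		__ClearPageReserved(p);
		set_page_count(p, 0);
	}

	/* fresh pages bypass the pcp lists and go to the freelist tail */
	__free_pages_ok(page, order, FOP_TO_TAIL);
}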

> Reviewed-by: Vlastimil Babka 
> Reviewed-by: Oscar Salvador 
> Cc: Andrew Morton 
> Cc: Alexander Duyck 
> Cc: Mel Gorman 
> Cc: Michal Hocko 
> Cc: Dave Hansen 
> Cc: Vlastimil Babka 
> Cc: Wei Yang 
> Cc: Oscar Salvador 
> Cc: Mike Rapoport 
> Cc: "K. Y. Srinivasan" 
> Cc: Haiyang Zhang 
> Cc: Stephen Hemminger 
> Cc: Wei Liu 
> Signed-off-by: David Hildenbrand 

That being said I do not see any fundamental problems.
Acked-by: Michal Hocko 

> ---
>  mm/page_alloc.c | 37 -
>  1 file changed, 24 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d5a5f528b8ca..8a2134fe9947 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -270,7 +270,8 @@ bool pm_suspended_storage(void)
>  unsigned int pageblock_order __read_mostly;
>  #endif
>  
> -static void __free_pages_ok(struct page *page, unsigned int order);
> +static void __free_pages_ok(struct page *page, unsigned int order,
> + fop_t fop_flags);
>  
>  /*
>   * results with 256, 32 in the lowmem_reserve sysctl:
> @@ -682,7 +683,7 @@ static void bad_page(struct page *page, const char 
> *reason)
>  void free_compound_page(struct page *page)
>  {
>   mem_cgroup_uncharge(page);
> - __free_pages_ok(page, compound_order(page));
> + __free_pages_ok(page, compound_order(page), FOP_NONE);
>  }
>  
>  void prep_compound_page(struct page *page, unsigned int order)
> @@ -1419,17 +1420,15 @@ static void free_pcppages_bulk(struct zone *zone, int 
> count,
>   spin_unlock(&zone->lock);
>  }
>  
> -static void free_one_page(struct zone *zone,
> - struct page *page, unsigned long pfn,
> - unsigned int order,
> - int migratetype)
> +static void free_one_page(struct zone *zone, struct page *page, unsigned 
> long pfn,
> +   unsigned int order, int migratetype, fop_t fop_flags)
>  {
>   spin_lock(&zone->lock);
>   if (unlikely(has_isolate_pageblock(zone) ||
>   is_migrate_isolate(migratetype))) {
>   migratetype = get_pfnblock_migratetype(page, pfn);
>   }
> - __free_one_page(page, pfn, zone, order, migratetype, FOP_NONE);
> + __free_one_page(page, pfn, zone, order, migratetype, fop_flags);
>   spin_unlock(&zone->lock);
>  }
>  
> @@ -1507,7 +1506,8 @@ void __meminit reserve_bootmem_region(phys_addr_t 
> start, phys_addr_t end)
>   }
>  }
>  
> -static void __free_pages_ok(struct page *page, unsigned int order)
> +static void __free_pages_ok(struct page *page, unsigned int order,
> + fop_t fop_flags)
>  {
>   unsigned long flags;
>   int migratetype;
> @@ -1519,7 +1519,8 @@ static void __free_pages_ok(struct p

Re: [PATCH v1 3/5] mm/page_alloc: always move pages to the tail of the freelist in unset_migratetype_isolate()

2020-10-02 Thread Michal Hocko
On Mon 28-09-20 20:21:08, David Hildenbrand wrote:
> Page isolation doesn't actually touch the pages, it simply isolates
> pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
> 
> We already place pages to the tail of the freelists when undoing
> isolation via __putback_isolated_page(), let's do it in any case
> (e.g., if order <= pageblock_order) and document the behavior.
> 
> Add a "to_tail" parameter to move_freepages_block() but introduce
> a new move_to_free_list_tail() - similar to add_to_free_list_tail().
> 
> This change results in all pages getting onlined via online_pages() to
> be placed to the tail of the freelist.

Is there anything preventing us from doing this unconditionally? Or in
other words, does any of the existing callers of move_freepages_block
benefit from adding to the head?

> Reviewed-by: Oscar Salvador 
> Cc: Andrew Morton 
> Cc: Alexander Duyck 
> Cc: Mel Gorman 
> Cc: Michal Hocko 
> Cc: Dave Hansen 
> Cc: Vlastimil Babka 
> Cc: Wei Yang 
> Cc: Oscar Salvador 
> Cc: Mike Rapoport 
> Cc: Scott Cheloha 
> Cc: Michael Ellerman 
> Signed-off-by: David Hildenbrand 
> ---
>  include/linux/page-isolation.h |  4 ++--
>  mm/page_alloc.c| 35 +++---
>  mm/page_isolation.c| 12 +---
>  3 files changed, 35 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index 572458016331..3eca9b3c5305 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -36,8 +36,8 @@ static inline bool is_migrate_isolate(int migratetype)
>  struct page *has_unmovable_pages(struct zone *zone, struct page *page,
>int migratetype, int flags);
>  void set_pageblock_migratetype(struct page *page, int migratetype);
> -int move_freepages_block(struct zone *zone, struct page *page,
> - int migratetype, int *num_movable);
> +int move_freepages_block(struct zone *zone, struct page *page, int 
> migratetype,
> +  bool to_tail, int *num_movable);
>  
>  /*
>   * Changes migrate type in [start_pfn, end_pfn) to be MIGRATE_ISOLATE.
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9e3ed4a6f69a..d5a5f528b8ca 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -905,6 +905,15 @@ static inline void move_to_free_list(struct page *page, 
> struct zone *zone,
>   list_move(&page->lru, &area->free_list[migratetype]);
>  }
>  
> +/* Used for pages which are on another list */
> +static inline void move_to_free_list_tail(struct page *page, struct zone 
> *zone,
> +   unsigned int order, int migratetype)
> +{
> + struct free_area *area = &zone->free_area[order];
> +
> + list_move_tail(&page->lru, &area->free_list[migratetype]);
> +}
> +
>  static inline void del_page_from_free_list(struct page *page, struct zone 
> *zone,
>  unsigned int order)
>  {
> @@ -2338,9 +2347,9 @@ static inline struct page 
> *__rmqueue_cma_fallback(struct zone *zone,
>   * Note that start_page and end_pages are not aligned on a pageblock
>   * boundary. If alignment is required, use move_freepages_block()
>   */
> -static int move_freepages(struct zone *zone,
> -   struct page *start_page, struct page *end_page,
> -   int migratetype, int *num_movable)
> +static int move_freepages(struct zone *zone, struct page *start_page,
> +   struct page *end_page, int migratetype,
> +   bool to_tail, int *num_movable)
>  {
>   struct page *page;
>   unsigned int order;
> @@ -2371,7 +2380,10 @@ static int move_freepages(struct zone *zone,
>   VM_BUG_ON_PAGE(page_zone(page) != zone, page);
>  
>   order = page_order(page);
> - move_to_free_list(page, zone, order, migratetype);
> + if (to_tail)
> + move_to_free_list_tail(page, zone, order, migratetype);
> + else
> + move_to_free_list(page, zone, order, migratetype);
>   page += 1 << order;
>   pages_moved += 1 << order;
>   }
> @@ -2379,8 +2391,8 @@ static int move_freepages(struct zone *zone,
>   return pages_moved;
>  }
>  
> -int move_freepages_block(struct zone *zone, struct page *page,
> - int migratetype, int *num_movable)
> +int move_freepages_block(struct zone *zone, struct page *page, int 
> migratetype,
> +  bool to_tail, int *num_movable)
>  {
>   unsigned long start_pfn

Re: [PATCH v1 2/5] mm/page_alloc: place pages to tail in __putback_isolated_page()

2020-10-02 Thread Michal Hocko
On Mon 28-09-20 20:21:07, David Hildenbrand wrote:
> __putback_isolated_page() already documents that pages will be placed to
> the tail of the freelist - this is, however, not the case for
> "order >= MAX_ORDER - 2" (see buddy_merge_likely()) - which should be
> the case for all existing users.
> 
> This change affects two users:
> - free page reporting
> - page isolation, when undoing the isolation (including memory onlining).
> 
> This behavior is desirable for pages that haven't really been touched
> lately, so exactly the two users that don't actually read/write page
> content, but rather move untouched pages.
> 
> The new behavior is especially desirable for memory onlining, where we
> allow allocation of newly onlined pages via undo_isolate_page_range()
> in online_pages(). Right now, we always place them to the head of the
> free list, resulting in undesirable behavior: Assume we add
> individual memory chunks via add_memory() and online them right away to
> the NORMAL zone. We create a dependency chain of unmovable allocations
> e.g., via the memmap. The memmap of the next chunk will be placed onto
> previous chunks - if the last block cannot get offlined+removed, all
> dependent ones cannot get offlined+removed. While this can already be
> observed with individual DIMMs, it's more of an issue for virtio-mem
> (and I suspect also ppc DLPAR).
> 
> Document that this should only be used for optimizations, and no code
> should rely on this for correctness (if the order of freepage lists
> ever changes).
> 
> We won't care about page shuffling: memory onlining already properly
> shuffles after onlining. free page reporting doesn't care about
> physically contiguous ranges, and there are already cases where page
> isolation will simply move (physically close) free pages to (currently)
> the head of the freelists via move_freepages_block() instead of
> shuffling. If this becomes ever relevant, we should shuffle the whole
> zone when undoing isolation of larger ranges, and after
> free_contig_range().
> 
> Reviewed-by: Alexander Duyck 
> Reviewed-by: Oscar Salvador 
> Cc: Andrew Morton 
> Cc: Alexander Duyck 
> Cc: Mel Gorman 
> Cc: Michal Hocko 
> Cc: Dave Hansen 
> Cc: Vlastimil Babka 
> Cc: Wei Yang 
> Cc: Oscar Salvador 
> Cc: Mike Rapoport 
> Cc: Scott Cheloha 
> Cc: Michael Ellerman 
> Signed-off-by: David Hildenbrand 

Acked-by: Michal Hocko 

> ---
>  mm/page_alloc.c | 18 --
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index daab90e960fe..9e3ed4a6f69a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -89,6 +89,18 @@ typedef int __bitwise fop_t;
>   */
>  #define FOP_SKIP_REPORT_NOTIFY   ((__force fop_t)BIT(0))
>  
> +/*
> + * Place the (possibly merged) page to the tail of the freelist. Will ignore
> + * page shuffling (relevant code - e.g., memory onlining - is expected to
> + * shuffle the whole zone).
> + *
> + * Note: No code should rely onto this flag for correctness - it's purely
> + *   to allow for optimizations when handing back either fresh pages
> + *   (memory onlining) or untouched pages (page isolation, free page
> + *   reporting).
> + */
> +#define FOP_TO_TAIL  ((__force fop_t)BIT(1))
> +
>  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>  static DEFINE_MUTEX(pcp_batch_high_lock);
>  #define MIN_PERCPU_PAGELIST_FRACTION (8)
> @@ -1038,7 +1050,9 @@ static inline void __free_one_page(struct page *page, 
> unsigned long pfn,
>  done_merging:
>   set_page_order(page, order);
>  
> - if (is_shuffle_order(order))
> + if (fop_flags & FOP_TO_TAIL)
> + to_tail = true;
> + else if (is_shuffle_order(order))
>   to_tail = shuffle_pick_tail();
>   else
>   to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
> @@ -3300,7 +3314,7 @@ void __putback_isolated_page(struct page *page, 
> unsigned int order, int mt)
>  
>   /* Return isolated page to tail of freelist. */
>   __free_one_page(page, page_to_pfn(page), zone, order, mt,
> - FOP_SKIP_REPORT_NOTIFY);
> + FOP_SKIP_REPORT_NOTIFY | FOP_TO_TAIL);
>  }
>  
>  /*
> -- 
> 2.26.2

-- 
Michal Hocko
SUSE Labs



Re: [PATCH v1 1/5] mm/page_alloc: convert "report" flag of __free_one_page() to a proper flag

2020-10-02 Thread Michal Hocko
On Mon 28-09-20 20:21:06, David Hildenbrand wrote:
> Let's prepare for additional flags and avoid long parameter lists of bools.
> Follow-up patches will also make use of the flags in __free_pages_ok(),
> however, I wasn't able to come up with a better name for the type - should
> be good enough for internal purposes.
> 
> Reviewed-by: Alexander Duyck 
> Reviewed-by: Vlastimil Babka 
> Reviewed-by: Oscar Salvador 
> Cc: Andrew Morton 
> Cc: Alexander Duyck 
> Cc: Mel Gorman 
> Cc: Michal Hocko 
> Cc: Dave Hansen 
> Cc: Vlastimil Babka 
> Cc: Wei Yang 
> Cc: Oscar Salvador 
> Cc: Mike Rapoport 
> Signed-off-by: David Hildenbrand 

Hopefully this will not wreck the generated code. But considering we
would need another parameter there is not much choice left.
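As an aside, the __bitwise annotation at least buys some type safety. A
toy example (mine, not from the patch) of what sparse will flag once
fop_t exists in mm/page_alloc.c:

static void fop_type_safety_demo(void)
{
	fop_t flags = FOP_NONE | FOP_SKIP_REPORT_NOTIFY;	/* fine */
	int plain = flags;	/* sparse: restricted fop_t degrades to int */

	(void)plain;
}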

Acked-by: Michal Hocko 

> ---
>  mm/page_alloc.c | 28 
>  1 file changed, 20 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index df90e3654f97..daab90e960fe 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -77,6 +77,18 @@
>  #include "shuffle.h"
>  #include "page_reporting.h"
>  
> +/* Free One Page flags: for internal, non-pcp variants of free_pages(). */
> +typedef int __bitwise fop_t;
> +
> +/* No special request */
> +#define FOP_NONE ((__force fop_t)0)
> +
> +/*
> + * Skip free page reporting notification for the (possibly merged) page. 
> (will
> + * *not* mark the page reported, only skip the notification).
> + */
> +#define FOP_SKIP_REPORT_NOTIFY   ((__force fop_t)BIT(0))
> +
>  /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
>  static DEFINE_MUTEX(pcp_batch_high_lock);
>  #define MIN_PERCPU_PAGELIST_FRACTION (8)
> @@ -948,10 +960,9 @@ buddy_merge_likely(unsigned long pfn, unsigned long 
> buddy_pfn,
>   * -- nyc
>   */
>  
> -static inline void __free_one_page(struct page *page,
> - unsigned long pfn,
> - struct zone *zone, unsigned int order,
> - int migratetype, bool report)
> +static inline void __free_one_page(struct page *page, unsigned long pfn,
> +struct zone *zone, unsigned int order,
> +int migratetype, fop_t fop_flags)
>  {
>   struct capture_control *capc = task_capc(zone);
>   unsigned long buddy_pfn;
> @@ -1038,7 +1049,7 @@ static inline void __free_one_page(struct page *page,
>   add_to_free_list(page, zone, order, migratetype);
>  
>   /* Notify page reporting subsystem of freed page */
> - if (report)
> + if (!(fop_flags & FOP_SKIP_REPORT_NOTIFY))
>   page_reporting_notify_free(order);
>  }
>  
> @@ -1379,7 +1390,7 @@ static void free_pcppages_bulk(struct zone *zone, int 
> count,
>   if (unlikely(isolated_pageblocks))
>   mt = get_pageblock_migratetype(page);
>  
> - __free_one_page(page, page_to_pfn(page), zone, 0, mt, true);
> + __free_one_page(page, page_to_pfn(page), zone, 0, mt, FOP_NONE);
>   trace_mm_page_pcpu_drain(page, 0, mt);
>   }
>   spin_unlock(&zone->lock);
> @@ -1395,7 +1406,7 @@ static void free_one_page(struct zone *zone,
>   is_migrate_isolate(migratetype))) {
>   migratetype = get_pfnblock_migratetype(page, pfn);
>   }
> - __free_one_page(page, pfn, zone, order, migratetype, true);
> + __free_one_page(page, pfn, zone, order, migratetype, FOP_NONE);
>   spin_unlock(&zone->lock);
>  }
>  
> @@ -3288,7 +3299,8 @@ void __putback_isolated_page(struct page *page, 
> unsigned int order, int mt)
>   lockdep_assert_held(&zone->lock);
>  
>   /* Return isolated page to tail of freelist. */
> - __free_one_page(page, page_to_pfn(page), zone, order, mt, false);
> + __free_one_page(page, page_to_pfn(page), zone, order, mt,
> + FOP_SKIP_REPORT_NOTIFY);
>  }
>  
>  /*
> -- 
> 2.26.2

-- 
Michal Hocko
SUSE Labs



Re: [PATCH 3/3] memory: introduce an option to force onlining of hotplug memory

2020-07-24 Thread Michal Hocko
On Thu 23-07-20 19:39:54, David Hildenbrand wrote:
[...]
> Yeah, might require some code churn. It just feels wrong to involve
> buddy concepts (e.g., onlining pages, calling memory notifiers, exposing
> memory block devices) and introducing hacks (forced onlining) just to
> get a memmap+identity mapping+iomem resource. I think reserving such a
> region during boot as suggested is the easiest approach, but I am
> *absolutely* not an expert on all these XEN-specific things :)

I am late to the discussion but FTR I completely agree.
-- 
Michal Hocko
SUSE Labs



Re: [Xen-devel] [PATCH 0/3] Remove __online_page_set_limits()

2019-09-09 Thread Michal Hocko
On Sun 08-09-19 03:17:01, Souptick Joarder wrote:
> __online_page_set_limits() is a dummy function and an extra call
> to this can be avoided.
> 
> As both of the callers are now removed, __online_page_set_limits()
> can be removed permanently.
> 
> Souptick Joarder (3):
>   hv_ballon: Avoid calling dummy function __online_page_set_limits()
>   xen/ballon: Avoid calling dummy function __online_page_set_limits()
>   mm/memory_hotplug.c: Remove __online_page_set_limits()
> 
>  drivers/hv/hv_balloon.c| 1 -
>  drivers/xen/balloon.c  | 1 -
>  include/linux/memory_hotplug.h | 1 -
>  mm/memory_hotplug.c| 5 -
>  4 files changed, 8 deletions(-)

To the whole series
Acked-by: Michal Hocko 

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH 00/34] put_user_pages(): miscellaneous call sites

2019-08-07 Thread Michal Hocko
On Wed 07-08-19 10:37:26, Jan Kara wrote:
> On Fri 02-08-19 12:14:09, John Hubbard wrote:
> > On 8/2/19 7:52 AM, Jan Kara wrote:
> > > On Fri 02-08-19 07:24:43, Matthew Wilcox wrote:
> > > > On Fri, Aug 02, 2019 at 02:41:46PM +0200, Jan Kara wrote:
> > > > > On Fri 02-08-19 11:12:44, Michal Hocko wrote:
> > > > > > On Thu 01-08-19 19:19:31, john.hubb...@gmail.com wrote:
> > > > > > [...]
> > > > > > > 2) Convert all of the call sites for get_user_pages*(), to
> > > > > > > invoke put_user_page*(), instead of put_page(). This involves 
> > > > > > > dozens of
> > > > > > > call sites, and will take some time.
> > > > > > 
> > > > > > How do we make sure this is the case and it will remain the case in 
> > > > > > the
> > > > > > future? There must be some automagic to enforce/check that. It is 
> > > > > > simply
> > > > > > not manageable to do it every now and then because then 3) will
> > > > > > simply never be safe.
> > > > > > 
> > > > > > Have you considered coccinelle or some other scripted way to do the
> > > > > > transition? I have no idea how to deal with future changes that 
> > > > > > would
> > > > > > break the balance though.
> > 
> > Hi Michal,
> > 
> > Yes, I've thought about it, and coccinelle falls a bit short (it's not smart
> > enough to know which put_page()'s to convert). However, there is a debug
> > option planned: a yet-to-be-posted commit [1] uses struct page extensions
> > (obviously protected by CONFIG_DEBUG_GET_USER_PAGES_REFERENCES) to add
> > a redundant counter. That allows:
> > 
> > void __put_page(struct page *page)
> > {
> > ...
> > /* Someone called put_page() instead of put_user_page() */
> > WARN_ON_ONCE(atomic_read(&page_ext->pin_count) > 0);
> > 
> > > > > 
> > > > > Yeah, that's why I've been suggesting at LSF/MM that we may need to 
> > > > > create
> > > > > a gup wrapper - say vaddr_pin_pages() - and track which sites dropping
> > > > > references got converted by using this wrapper instead of gup. The
> > > > > counterpart would then be more logically named as unpin_page() or 
> > > > > whatever
> > > > > instead of put_user_page().  Sure this is not completely foolproof 
> > > > > (you can
> > > > > create new callsite using vaddr_pin_pages() and then just drop refs 
> > > > > using
> > > > > put_page()) but I suppose it would be a high enough barrier for missed
> > > > > conversions... Thoughts?
> > 
> > The debug option above is still a bit simplistic in its implementation
> > (and maybe not taking full advantage of the data it has), but I think
> > it's preferable, because it monitors the "core" and WARNs.
> > 
> > Instead of the wrapper, I'm thinking: documentation and the passage of
> > time, plus the debug option (perhaps enhanced--probably once I post it
> > someone will notice opportunities), yes?
> 
> So I think your debug option and my suggested renaming serve a bit
> different purposes (and thus both make sense). If you do the renaming, you
> can just grep to see unconverted sites. Also when someone merges new GUP
> user (unaware of the new rules) while you switch GUP to use pins instead of
> ordinary references, you'll get a compilation error in the renamed case
> instead of a hard-to-debug refcount leak without the renaming. And such
> conflict is almost bound to happen given the size of GUP patch set... Also
> the renaming serves against the "coding inertia" - i.e., GUP is around for
> ages so people just use it without checking any documentation or comments.
> After switching how GUP works, what used to be correct isn't anymore so
> renaming the function serves as a warning that something has really
> changed.

Fully agreed!

> Your refcount debug patches are good to catch bugs in the conversions done
> but that requires you to be able to exercise the code path in the first
> place which may require particular HW or so, and you also have to enable
> the debug option which means you already aim at verifying the GUP
> references are treated properly.
> 
>   Honza
> 
> -- 
> Jan Kara 
> SUSE Labs, CR

-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH 00/34] put_user_pages(): miscellaneous call sites

2019-08-02 Thread Michal Hocko
On Thu 01-08-19 19:19:31, john.hubb...@gmail.com wrote:
[...]
> 2) Convert all of the call sites for get_user_pages*(), to
> invoke put_user_page*(), instead of put_page(). This involves dozens of
> call sites, and will take some time.

How do we make sure this is the case and it will remain the case in the
future? There must be some automagic to enforce/check that. It is simply
not manageable to do it every now and then because then 3) will simply
never be safe.

Have you considered coccinelle or some other scripted way to do the
transition? I have no idea how to deal with future changes that would
break the balance though.
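For reference, the mechanical shape of each conversion would be roughly
this - an illustrative caller of mine, not one of the real call sites,
with the helper names from this series:

static int example_pin_and_release(unsigned long addr, int nr,
				   struct page **pages)
{
	int npages;

	npages = get_user_pages_fast(addr, nr, FOLL_WRITE, pages);
	if (npages <= 0)
		return npages;

	/* ... work on the pinned pages ... */

	/* before: a loop calling put_page(pages[i]) */
	put_user_pages(pages, npages);
	return 0;
}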
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH] mm: fix regression with deferred struct page init

2019-06-28 Thread Michal Hocko
On Fri 28-06-19 19:38:13, Juergen Gross wrote:
> On 28.06.19 17:17, Michal Hocko wrote:
> > On Thu 20-06-19 18:08:21, Juergen Gross wrote:
> > > Commit 0e56acae4b4dd4a9 ("mm: initialize MAX_ORDER_NR_PAGES at a time
> > > instead of doing larger sections") is causing a regression on some
> > > systems when the kernel is booted as Xen dom0.
> > > 
> > > The system will just hang in early boot.
> > > 
> > > The reason is an endless loop in get_page_from_freelist() in case the first
> > > zone looked at has no free memory. deferred_grow_zone() is always
> > 
> > Could you explain how we ended up with the zone having no memory? Is
> > xen "stealing" memblock memory without adding it to memory.reserved?
> > In other words, how do we end up with an empty zone that has non zero
> > end_pfn?
> 
> Why do you think Xen is stealing the memory in an odd way?
> 
> Doesn't deferred_init_mem_pfn_range_in_zone() return false when no free
> memory is found? So that will happen exactly when the memory was added
> to memory.reserved.

You are right. I managed to confuse myself and thought that __next_mem_range
returns an index to both memblock types. But I am wrong here and it excludes
type_b regions. I should have read the documentation. My bad and sorry
for the confusion.
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH] mm: fix regression with deferred struct page init

2019-06-28 Thread Michal Hocko
On Thu 20-06-19 18:08:21, Juergen Gross wrote:
> Commit 0e56acae4b4dd4a9 ("mm: initialize MAX_ORDER_NR_PAGES at a time
> instead of doing larger sections") is causing a regression on some
> systems when the kernel is booted as Xen dom0.
> 
> The system will just hang in early boot.
> 
> The reason is an endless loop in get_page_from_freelist() in case the first
> zone looked at has no free memory. deferred_grow_zone() is always

Could you explain how we ended up with the zone having no memory? Is
xen "stealing" memblock memory without adding it to memory.reserved?
In other words, how do we end up with an empty zone that has non zero
end_pfn?

> returning true due to the following code snippet:
> 
>   /* If the zone is empty somebody else may have cleared out the zone */
>   if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
>        first_deferred_pfn)) {
>   pgdat->first_deferred_pfn = ULONG_MAX;
>   pgdat_resize_unlock(pgdat, &flags);
>   return true;
>   }
> 
> This in turn results in the loop as get_page_from_freelist() is
> assuming forward progress can be made by doing some more struct page
> initialization.

The patch looks correct. The code is subtle but the comment helps.
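For the record, the consumer that turns an unconditional "return true"
into a livelock is this hunk of get_page_from_freelist() (paraphrased
from memory, details may differ between kernel versions):

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
	/* Try again if the zone has deferred pages */
	if (static_branch_unlikely(&deferred_pages)) {
		if (_deferred_grow_zone(zone, order))
			/* assumes forward progress was made */
			goto try_this_zone;
	}
#endif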

> Cc: Alexander Duyck 
> Fixes: 0e56acae4b4dd4a9 ("mm: initialize MAX_ORDER_NR_PAGES at a time instead 
> of doing larger sections")
> Suggested-by: Alexander Duyck 
> Signed-off-by: Juergen Gross 

Acked-by: Michal Hocko 
> ---
>  mm/page_alloc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d66bc8abe0af..8e3bc949ebcc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1826,7 +1826,8 @@ deferred_grow_zone(struct zone *zone, unsigned int 
> order)
>first_deferred_pfn)) {
>   pgdat->first_deferred_pfn = ULONG_MAX;
>   pgdat_resize_unlock(pgdat, &flags);
> -     return true;
> + /* Retry only once. */
> + return first_deferred_pfn != ULONG_MAX;
>   }
>  
>   /*
> -- 
> 2.16.4

-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH RFCv2 0/4] mm/memory_hotplug: Introduce memory block types

2018-12-20 Thread Michal Hocko
On Thu 20-12-18 13:58:16, David Hildenbrand wrote:
> On 30.11.18 18:59, David Hildenbrand wrote:
> > This is the second approach, introducing more meaningful memory block
> > types and not changing online behavior in the kernel. It is based on
> > latest linux-next.
> > 
> > As we found out during discussion, user space should always handle onlining
> > of memory, in any case. However in order to make smart decisions in user
> > space about if and how to online memory, we have to export more information
> > about memory blocks. This way, we can formulate rules in user space.
> > 
> > One such information is the type of memory block we are talking about.
> > This helps to answer some questions like:
> > - Does this memory block belong to a DIMM?
> > - Can this DIMM theoretically ever be unplugged again?
> > - Was this memory added by a balloon driver that will rely on balloon
> >   inflation to remove chunks of that memory again? Which zone is advised?
> > - Is this special standby memory on s390x that is usually not automatically
> >   onlined?
> > 
> > And in short it helps to answer to some extent (excluding zone imbalances)
> > - Should I online this memory block?
> > - To which zone should I online this memory block?
> > ... of course special use cases will result in different answers. But that's
> > why user space has control of onlining memory.
> > 
> > More details can be found in Patch 1 and Patch 3.
> > Tested on x86 with hotplugged DIMMs. Cross-compiled for PPC and s390x.
> > 
> > 
> > Example:
> > $ udevadm info -q all -a /sys/devices/system/memory/memory0
> > KERNEL=="memory0"
> > SUBSYSTEM=="memory"
> > DRIVER==""
> > ATTR{online}=="1"
> > ATTR{phys_device}=="0"
> > ATTR{phys_index}==""
> > ATTR{removable}=="0"
> > ATTR{state}=="online"
> > ATTR{type}=="boot"
> > ATTR{valid_zones}=="none"
> > $ udevadm info -q all -a /sys/devices/system/memory/memory90
> > KERNEL=="memory90"
> > SUBSYSTEM=="memory"
> > DRIVER==""
> > ATTR{online}=="1"
> > ATTR{phys_device}=="0"
> > ATTR{phys_index}=="005a"
> > ATTR{removable}=="1"
> > ATTR{state}=="online"
> > ATTR{type}=="dimm"
> > ATTR{valid_zones}=="Normal"
> > 
> > 
> > RFC -> RFCv2:
> > - Now also taking care of PPC (somehow missed it :/ )
> > - Split the series up to some degree (some ideas on how to split up patch 3
> >   would be very welcome)
> > - Introduce more memory block types. Turns out abstracting too much was
> >   rather confusing and not helpful. Properly document them.
> > 
> > Notes:
> > - I wanted to convert the enum of types into a named enum but this
> >   provoked all kinds of different errors. For now, I am doing it just like
> >   the other types (e.g. online_type) we are using in that context.
> > - The "removable" property should never have been named like that. It
> >   should have been "offlinable". Can we still rename that? E.g. boot memory
> >   is sometimes marked as removable ...
> > 
> 
> 
> Any feedback regarding the suggested block types would be very much
> appreciated!

I still do not like this much to be honest. I just didn't get to think
through this properly. My fear is that this is conflating an actual API
with the current implementation and as such will cause problems in the
future. But I haven't really looked into your patches closely so I might
be wrong. Anyway I won't be able to look into it by the end of the year.
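Just to make sure we are talking about the same user space contract,
the kind of policy these attributes would enable looks something like
this - an illustrative sketch of mine built on the paths and the
proposed "type" attribute from your udevadm example above:

#include <stdio.h>
#include <string.h>

/* online_if_dimm() is a made-up helper name, e.g. block = "memory90" */
static int online_if_dimm(const char *block)
{
	char path[256], type[32] = "";
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/memory/%s/type", block);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%31s", type) != 1)
		type[0] = '\0';
	fclose(f);

	if (strcmp(type, "dimm") != 0)
		return 0;	/* leave boot/balloon blocks to other policies */

	snprintf(path, sizeof(path),
		 "/sys/devices/system/memory/%s/state", block);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "online");
	return fclose(f);
}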
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH v1] mm/memory_hotplug: drop "online" parameter from add_memory_resource()

2018-11-23 Thread Michal Hocko
On Fri 23-11-18 13:58:16, David Hildenbrand wrote:
> On 23.11.18 13:54, Michal Hocko wrote:
> > On Fri 23-11-18 13:37:40, David Hildenbrand wrote:
> User space should always be in charge of how to online memory and
> whether memory should be onlined automatically in the kernel. Let's drop
> the parameter to override this - XEN passes memhp_auto_online, just like
> add_memory(), so we can directly use that instead internally.
> > 
> > Heh, I wanted to get rid of memhp_auto_online so much and now we have it
> > in the core memory_hotplug. Not a win on my side I would say :/
> 
> That is actually a good point: Can we remove memhp_auto_online or is it
> already some sort of kernel ABI?
> 
> (as it is exported via /sys/devices/system/memory/auto_online_blocks)

I have tried and there was a pushback [1]. That led to a rework of the
sysfs semantics of onlining btw. The biggest objection against removing it
was that userspace might be too slow to online memory and memmaps could
eat the available memory and trigger OOM. That is why I've started
working on the self-hosted memmaps but failed to finish it. Fortunately
Oscar is brave enough to continue that work.

[1] http://lkml.kernel.org/r/20170227092817.23571-1-mho...@kernel.org
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH v1] mm/memory_hotplug: drop "online" parameter from add_memory_resource()

2018-11-23 Thread Michal Hocko
On Fri 23-11-18 13:37:40, David Hildenbrand wrote:
> User space should always be in charge of how to online memory and
> whether memory should be onlined automatically in the kernel. Let's drop
> the parameter to override this - XEN passes memhp_auto_online, just like
> add_memory(), so we can directly use that instead internally.

Heh, I wanted to get rid of memhp_auto_online so much and now we have it
in the core memory_hotplug. Not a win on my side I would say :/
On the other hand this can be seen as a cleanup because it removes the
ambiguity where some callers might be unaware of memhp_auto_online
leading to a different behavior.

> Cc: Boris Ostrovsky 
> Cc: Juergen Gross 
> Cc: Stefano Stabellini 
> Cc: Andrew Morton 
> Cc: Dan Williams 
> Cc: Oscar Salvador 
> Cc: Pavel Tatashin 
> Cc: Michal Hocko 
> Cc: David Hildenbrand 
> Cc: Joonsoo Kim 
> Cc: Arun KS 
> Cc: Mathieu Malaterre 
> Cc: Stephen Rothwell 
> Signed-off-by: David Hildenbrand 

Acked-by: Michal Hocko 
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH v1 7/8] PM / Hibernate: use pfn_to_online_page()

2018-11-19 Thread Michal Hocko
On Mon 19-11-18 11:16:15, David Hildenbrand wrote:
> Let's use pfn_to_online_page() instead of pfn_to_page() when checking
> for saveable pages to not save/restore offline memory sections.
> 
> Cc: "Rafael J. Wysocki" 
> Cc: Pavel Machek 
> Cc: Len Brown 
> Cc: Andrew Morton 
> Cc: Matthew Wilcox 
> Cc: Michal Hocko 
> Cc: "Michael S. Tsirkin" 
> Suggested-by: Michal Hocko 
> Signed-off-by: David Hildenbrand 

I have only a very vague understanding of this specific code but I do
not really see any real reason for checking offlined ranges. Also
offline pfn ranges might have uninitialized struct pages so making
any decisions based on the struct page is basically undefined behavior.

Acked-by: Michal Hocko 

> ---
>  kernel/power/snapshot.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> index 640b2034edd6..87e6dd57819f 100644
> --- a/kernel/power/snapshot.c
> +++ b/kernel/power/snapshot.c
> @@ -1215,8 +1215,8 @@ static struct page *saveable_highmem_page(struct zone 
> *zone, unsigned long pfn)
>   if (!pfn_valid(pfn))
>   return NULL;
>  
> - page = pfn_to_page(pfn);
> - if (page_zone(page) != zone)
> + page = pfn_to_online_page(pfn);
> + if (!page || page_zone(page) != zone)
>   return NULL;
>  
>   BUG_ON(!PageHighMem(page));
> @@ -1277,8 +1277,8 @@ static struct page *saveable_page(struct zone *zone, 
> unsigned long pfn)
>   if (!pfn_valid(pfn))
>   return NULL;
>  
> - page = pfn_to_page(pfn);
> - if (page_zone(page) != zone)
> + page = pfn_to_online_page(pfn);
> + if (!page || page_zone(page) != zone)
>   return NULL;
>  
>   BUG_ON(PageHighMem(page));
> -- 
> 2.17.2

-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH RFC 6/6] PM / Hibernate: exclude all PageOffline() pages

2018-11-15 Thread Michal Hocko
On Wed 14-11-18 22:17:04, David Hildenbrand wrote:
[...]
> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> index b0308a2c6000..01db1d13481a 100644
> --- a/kernel/power/snapshot.c
> +++ b/kernel/power/snapshot.c
> @@ -1222,7 +1222,7 @@ static struct page *saveable_highmem_page(struct zone 
> *zone, unsigned long pfn)
>   BUG_ON(!PageHighMem(page));
>  
>   if (swsusp_page_is_forbidden(page) ||  swsusp_page_is_free(page) ||
> - PageReserved(page))
> + PageReserved(page) || PageOffline(page))
>   return NULL;
>  
>   if (page_is_guard(page))
> @@ -1286,6 +1286,9 @@ static struct page *saveable_page(struct zone *zone, 
> unsigned long pfn)
>   if (swsusp_page_is_forbidden(page) || swsusp_page_is_free(page))
>   return NULL;
>  
> + if (PageOffline(page))
> + return NULL;
> +
>   if (PageReserved(page)
>   && (!kernel_page_present(page) || pfn_is_nosave(pfn)))
>   return NULL;

Btw. now that you are touching this file could you also make
s@pfn_to_page@pfn_to_online_page@ please? We really do not want to touch
offline pfn ranges in general. A separate patch for that of course.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH RFC 2/6] mm: convert PG_balloon to PG_offline

2018-11-15 Thread Michal Hocko
[Cc Konstantin - the patch is 
http://lkml.kernel.org/r/20181114211704.6381-3-da...@redhat.com]

On Thu 15-11-18 10:21:13, David Hildenbrand wrote:
> On 15.11.18 03:07, Mike Rapoport wrote:
> > On Wed, Nov 14, 2018 at 11:49:15PM +0100, David Hildenbrand wrote:
> >> On 14.11.18 23:23, Matthew Wilcox wrote:
> >>> On Wed, Nov 14, 2018 at 10:17:00PM +0100, David Hildenbrand wrote:
> >>>> Rename PG_balloon to PG_offline. This is an indicator that the page is
> >>>> logically offline, the content stale and that it should not be touched
> >>>> (e.g. a hypervisor would have to allocate backing storage in order for 
> >>>> the
> >>>> guest to dump an unused page).  We can then e.g. exclude such pages from
> >>>> dumps.
> >>>>
> >>>> In following patches, we will make use of this bit also in other balloon
> >>>> drivers.  While at it, document PGTABLE.
> >>>
> >>> Thank you for documenting PGTABLE.  I didn't realise I also had this
> >>> document to update when I added PGTABLE.
> >>
> >> Thank you for looking into this :)
> >>
> >>>
> >>>> +++ b/Documentation/admin-guide/mm/pagemap.rst
> >>>> @@ -78,6 +78,8 @@ number of times a page is mapped.
> >>>>  23. BALLOON
> >>>>  24. ZERO_PAGE
> >>>>  25. IDLE
> >>>> +26. PGTABLE
> >>>> +27. OFFLINE
> >>>
> >>> So the offline *user* bit is new ... even though the *kernel* bit
> >>> just renames the balloon bit.  I'm not sure how I feel about this.
> >>> I'm going to think about it some more.  Could you share your decision
> >>> process with us?
> >>
> >> BALLOON was/is documented as
> >>
> >> "23 - BALLOON
> >> balloon compaction page
> >> "
> >>
> >> and only includes all virtio-balloon pages after the non-lru migration
> >> feature has been implemented for ballooned pages. Since then, this flag
> >> basically no longer stands for what it was actually supposed to mean.
> > 
> > Perhaps I am missing something, but how should the user interpret "23" when
> > he reads /proc/kpageflags?
> 
> Looking at the history in more detail:
> 
> commit 09316c09dde33aae14f34489d9e3d243ec0d5938
> Author: Konstantin Khlebnikov 
> Date:   Thu Oct 9 15:29:32 2014 -0700
> 
> mm/balloon_compaction: add vmstat counters and kpageflags bit
> 
> Always mark pages with PageBalloon even if balloon compaction is
> disabled
> and expose this mark in /proc/kpageflags as KPF_BALLOON.
> 
> 
> So KPF_BALLOON was exposed when virtio-balloon pages were always marked
> with PG_balloon. So the documentation is actually wrong ("balloon page"
> vs. "balloon compaction page").
> 
> I have no idea who actually used that information. I suspect this was
> just some debugging aid.
> 
> > 
> >> To not break uapi I decided to not rename it but instead to add a new flag.
> > 
> > I've got a feeling that uapi was anyway changed for the BALLOON flag
> > meaning.
> 
> Yes. If we *replace* KPF_BALLOON by KPF_OFFLINE
> 
> a) Some applications might no longer compile (I guess that's ok)
> b) Some old applications will treat KPF_OFFLINE like KPF_BALLOON (which
> should at least for virtio-balloon usage until now be fine - it is just
> more generic)

I do not think any compilation could break. If the semantic of the flag
is preserved then everything should work as expected.
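E.g. a userspace consumer along these lines keeps working unchanged - a
sketch of mine, with the bit numbers taken from the pagemap.rst hunk
quoted above:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define KPF_BALLOON	23	/* old, kept for compatibility */
#define KPF_OFFLINE	27	/* new, more generic */

/* fd is an open file descriptor for /proc/kpageflags */
static int pfn_is_logically_offline(int fd, uint64_t pfn)
{
	uint64_t flags;

	if (pread(fd, &flags, sizeof(flags), pfn * sizeof(flags)) !=
	    sizeof(flags))
		return -1;
	return !!(flags & ((1ULL << KPF_BALLOON) | (1ULL << KPF_OFFLINE)));
}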
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH RFC 3/6] kexec: export PG_offline to VMCOREINFO

2018-11-15 Thread Michal Hocko
On Thu 15-11-18 12:52:13, Borislav Petkov wrote:
> On Thu, Nov 15, 2018 at 12:20:40PM +0100, David Hildenbrand wrote:
> > Sorry to say, but that is the current practice without which
> > makedumpfile would not be able to work at all. (exclude user pages,
> > exclude page cache, exclude buddy pages). Let's not reinvent the wheel
> > here. This is how dumping works forever.
> 
> Sorry, but "we've always done this in the past" doesn't make it better.
> 
> > I don't see how there should be "set of pages which do not have
> > PG_offline".
> 
> It doesn't have to be a set of pages. Think a (mmconfig perhaps) region
> which the kdump kernel should completely skip because poking in it in
> the kdump kernel causes all kinds of havoc like machine checks, etc.
> We've had and still have one issue like that.
> 
> But let me clarify my note: I don't want to be discussing with you the
> design of makedumpfile and how it should or should not work - that ship
> has already sailed. Apparently there are valid reasons to do it this
> way.
> 
> I was *simply* stating that it feels wrong to export mm flags like that.
> 
> But as I said already, that is mm guys' call and looking at how we're
> already exporting a bunch of stuff in the vmcoreinfo - including other
> mm flags - I guess one more flag doesn't matter anymore.

I am not familiar enough with kexec to judge this particular patch but we
cannot simply define any range for these pages (same as for hwpoison
ones) because they can be almost anywhere in the available memory range,
and there can be countless numbers of them. There is no other way to rule
them out but to check the page state.

I am not really sure what the concern here is exactly. Kdump is so
closely tied to the specific kernel version that the api exported
specifically for its purpose cannot be seriously considered an ABI.
Kdump has to adapt all the time.
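For context, the export being debated is tiny. If I read the patch
correctly it boils down to something like the following in
crash_save_vmcoreinfo_init() - PG_offline lives in the page type, so it
is the mapcount value that gets exported:

#define PAGE_OFFLINE_MAPCOUNT_VALUE	(~PG_offline)
	VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);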
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH] mm: convert totalram_pages, totalhigh_pages and managed_pages to atomic.

2018-10-22 Thread Michal Hocko
On Mon 22-10-18 22:53:22, Arun KS wrote:
> Remove managed_page_count_lock spinlock and instead use atomic
> variables.

I assume this has been auto-generated. If yes, it would be better to
mention the script so that people can review it and regenerate for
comparison. Such a large change is hard to review manually.
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH v5 1/2] memory_hotplug: Free pages as higher order

2018-10-19 Thread Michal Hocko
On Thu 18-10-18 19:18:25, Andrew Morton wrote:
[...]
> So this patch needs more work, yes?

Yes, I've talked to Arun (he is offline until next week) offlist and he
will play with this some more.

-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH v5 1/2] memory_hotplug: Free pages as higher order

2018-10-11 Thread Michal Hocko
On Thu 11-10-18 10:07:02, Vlastimil Babka wrote:
> On 10/10/18 6:56 PM, Arun KS wrote:
> > On 2018-10-10 21:00, Vlastimil Babka wrote:
> >> On 10/5/18 10:10 AM, Arun KS wrote:
> >>> When freeing pages is done with higher order, the time spent on
> >>> coalescing pages by the buddy allocator can be reduced. With
> >>> section size of 256MB, hot add latency of a single section
> >>> shows improvement from 50-60 ms to less than 1 ms, hence
> >>> improving the hot add latency by 60 times. Modify external
> >>> providers of online callback to align with the change.
> >>>
> >>> Signed-off-by: Arun KS 
> >>
> >> [...]
> >>
> >>> @@ -655,26 +655,44 @@ void __online_page_free(struct page *page)
> >>>  }
> >>>  EXPORT_SYMBOL_GPL(__online_page_free);
> >>>
> >>> -static void generic_online_page(struct page *page)
> >>> +static int generic_online_page(struct page *page, unsigned int order)
> >>>  {
> >>> - __online_page_set_limits(page);
> >>
> >> This is now not called anymore, although the xen/hv variants still do
> >> it. The function seems empty these days, maybe remove it as a followup
> >> cleanup?
> >>
> >>> - __online_page_increment_counters(page);
> >>> - __online_page_free(page);
> >>> + __free_pages_core(page, order);
> >>> + totalram_pages += (1UL << order);
> >>> +#ifdef CONFIG_HIGHMEM
> >>> + if (PageHighMem(page))
> >>> + totalhigh_pages += (1UL << order);
> >>> +#endif
> >>
> >> __online_page_increment_counters() would have used
> >> adjust_managed_page_count() which would do the changes under
> >> managed_page_count_lock. Are we safe without the lock? If yes, there
> >> should perhaps be a comment explaining why.
> > 
> > Looks unsafe without managed_page_count_lock. I think it is better to
> > have a similar implementation of free_boot_core() in memory_hotplug.c
> > like we had in version 1 of the patch. And use
> > adjust_managed_page_count() instead of
> > page_zone(page)->managed_pages += nr_pages;
> > 
> > https://lore.kernel.org/patchwork/patch/989445/
> 
> Looks like deferred_free_range() has the same problem calling
> __free_pages_core() to adjust zone->managed_pages.

deferred initialization has one thread per node AFAIR so we cannot race
on managed_pages updates. Well, unless some of the mentioned callers can
run that early, which I don't know.

> __free_pages_bootmem() is OK because at that point the system is still
> single-threaded?
> Could be solved by moving that out of __free_pages_core().
> 
> But do we care about readers potentially seeing a store tear? If yes
> then maybe these counters should be converted to atomics...

I wanted to suggest that already but I have no idea whether the lock
instructions would cause more overhead.
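For reference, the locked path that the open-coded "+=" bypasses -
adjust_managed_page_count() as it looks in mm/page_alloc.c, quoted from
memory, so double check the details:

void adjust_managed_page_count(struct page *page, long count)
{
	spin_lock(&managed_page_count_lock);
	page_zone(page)->managed_pages += count;
	totalram_pages += count;
#ifdef CONFIG_HIGHMEM
	if (PageHighMem(page))
		totalhigh_pages += count;
#endif
	spin_unlock(&managed_page_count_lock);
}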
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH v5 1/2] memory_hotplug: Free pages as higher order

2018-10-11 Thread Michal Hocko
On Thu 11-10-18 07:59:32, Arun KS wrote:
> On 2018-10-10 23:03, Michal Hocko wrote:
> > On Wed 10-10-18 22:26:41, Arun KS wrote:
> > > On 2018-10-10 21:00, Vlastimil Babka wrote:
> > > > On 10/5/18 10:10 AM, Arun KS wrote:
> > > > > When free pages are done with higher order, time spend on
> > > > > coalescing pages by buddy allocator can be reduced. With
> > > > > section size of 256MB, hot add latency of a single section
> > > > > shows improvement from 50-60 ms to less than 1 ms, hence
> > > > > improving the hot add latency by 60%. Modify external
> > > > > providers of online callback to align with the change.
> > > > >
> > > > > Signed-off-by: Arun KS 
> > > >
> > > > [...]
> > > >
> > > > > @@ -655,26 +655,44 @@ void __online_page_free(struct page *page)
> > > > >  }
> > > > >  EXPORT_SYMBOL_GPL(__online_page_free);
> > > > >
> > > > > -static void generic_online_page(struct page *page)
> > > > > +static int generic_online_page(struct page *page, unsigned int order)
> > > > >  {
> > > > > - __online_page_set_limits(page);
> > > >
> > > > This is now not called anymore, although the xen/hv variants still do
> > > > it. The function seems empty these days, maybe remove it as a followup
> > > > cleanup?
> > > >
> > > > > - __online_page_increment_counters(page);
> > > > > - __online_page_free(page);
> > > > > + __free_pages_core(page, order);
> > > > > + totalram_pages += (1UL << order);
> > > > > +#ifdef CONFIG_HIGHMEM
> > > > > + if (PageHighMem(page))
> > > > > + totalhigh_pages += (1UL << order);
> > > > > +#endif
> > > >
> > > > __online_page_increment_counters() would have used
> > > > adjust_managed_page_count() which would do the changes under
> > > > managed_page_count_lock. Are we safe without the lock? If yes, there
> > > > should perhaps be a comment explaining why.
> > > 
> > > Looks unsafe without managed_page_count_lock.
> > 
> > Why does it matter actually? We cannot online/offline memory in
> > parallel. This is not the case for the boot where we initialize memory
> > in parallel on multiple nodes. So this seems to be safe currently unless
> > I am missing something. A comment explaining that would be helpful
> > though.
> 
> Other main callers of adjust_managed_page_count(),
> 
> static inline void free_reserved_page(struct page *page)
> {
> __free_reserved_page(page);
> adjust_managed_page_count(page, 1);
> }
> 
> static inline void mark_page_reserved(struct page *page)
> {
> SetPageReserved(page);
> adjust_managed_page_count(page, -1);
> }
> 
> Won't they race with memory hotplug?
> 
> Few more,
> ./drivers/xen/balloon.c:519:adjust_managed_page_count(page, -1);
> ./drivers/virtio/virtio_balloon.c:175:  adjust_managed_page_count(page, -1);
> ./drivers/virtio/virtio_balloon.c:196:  adjust_managed_page_count(page, 1);
> ./mm/hugetlb.c:2158:adjust_managed_page_count(page, 1 <<
> h->order);

They can, and I have missed those.

-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH v5 1/2] memory_hotplug: Free pages as higher order

2018-10-10 Thread Michal Hocko
On Wed 10-10-18 22:26:41, Arun KS wrote:
> On 2018-10-10 21:00, Vlastimil Babka wrote:
> > On 10/5/18 10:10 AM, Arun KS wrote:
> > > When free pages are done with higher order, time spend on
> > > coalescing pages by buddy allocator can be reduced. With
> > > section size of 256MB, hot add latency of a single section
> > > shows improvement from 50-60 ms to less than 1 ms, hence
> > > improving the hot add latency by 60%. Modify external
> > > providers of online callback to align with the change.
> > > 
> > > Signed-off-by: Arun KS 
> > 
> > [...]
> > 
> > > @@ -655,26 +655,44 @@ void __online_page_free(struct page *page)
> > >  }
> > >  EXPORT_SYMBOL_GPL(__online_page_free);
> > > 
> > > -static void generic_online_page(struct page *page)
> > > +static int generic_online_page(struct page *page, unsigned int order)
> > >  {
> > > - __online_page_set_limits(page);
> > 
> > This is now not called anymore, although the xen/hv variants still do
> > it. The function seems empty these days, maybe remove it as a followup
> > cleanup?
> > 
> > > - __online_page_increment_counters(page);
> > > - __online_page_free(page);
> > > + __free_pages_core(page, order);
> > > + totalram_pages += (1UL << order);
> > > +#ifdef CONFIG_HIGHMEM
> > > + if (PageHighMem(page))
> > > + totalhigh_pages += (1UL << order);
> > > +#endif
> > 
> > __online_page_increment_counters() would have used
> > adjust_managed_page_count() which would do the changes under
> > managed_page_count_lock. Are we safe without the lock? If yes, there
> > should perhaps be a comment explaining why.
> 
> Looks unsafe without managed_page_count_lock.

Why does it matter actually? We cannot online/offline memory in
parallel. This is not the case for the boot where we initialize memory
in parallel on multiple nodes. So this seems to be safe currently unless
I am missing something. A comment explaining that would be helpful
though.
-- 
Michal Hocko
SUSE Labs
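
One possible wording for the comment requested above (a sketch only, not
necessarily what was eventually merged):

/*
 * totalram_pages and zone->managed_pages can be updated here without
 * taking managed_page_count_lock: memory online/offline operations
 * are fully serialized by the hotplug locking, so no two onlining
 * operations race on these counters, unlike early boot where nodes
 * are initialized in parallel.
 */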


Re: [Xen-devel] [PATCH v5 1/2] memory_hotplug: Free pages as higher order

2018-10-09 Thread Michal Hocko
On Tue 09-10-18 15:24:13, Arun KS wrote:
> On 2018-10-09 14:59, Michal Hocko wrote:
> > On Fri 05-10-18 13:40:05, Arun KS wrote:
> > > When free pages are done with higher order, time spend on
> > > coalescing pages by buddy allocator can be reduced. With
> > > section size of 256MB, hot add latency of a single section
> > > shows improvement from 50-60 ms to less than 1 ms, hence
> > > improving the hot add latency by 60%. Modify external
> > > providers of online callback to align with the change.
> > 
> > Acked-by: Michal Hocko 
> > 
> > Thanks for your patience with all the resubmission.
> 
> Hello Michal,
> 
> I got the below email few days back. Do I still need to resubmit the patch?
> or is it already on the way to merge?

It is sitting in the queue for now. Andrew collects acks and
reviewed-bys and adds them to the patch.
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH v5 1/2] memory_hotplug: Free pages as higher order

2018-10-09 Thread Michal Hocko
On Fri 05-10-18 13:40:05, Arun KS wrote:
> When free pages are done with higher order, time spend on
> coalescing pages by buddy allocator can be reduced. With
> section size of 256MB, hot add latency of a single section
> shows improvement from 50-60 ms to less than 1 ms, hence
> improving the hot add latency by 60%. Modify external
> providers of online callback to align with the change.

Acked-by: Michal Hocko 

Thanks for your patience with all the resubmission.
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH v5 2/2] mm/page_alloc: remove software prefetching in __free_pages_core

2018-10-09 Thread Michal Hocko
On Fri 05-10-18 13:40:06, Arun KS wrote:
> They not only increase the code footprint, they actually make things
> slower rather than faster. Remove them as contemporary hardware doesn't
> need any hint.

I agree with the change but it is much better to add some numbers
whenever arguing about performance impact.

> 
> Suggested-by: Dan Williams 
> Signed-off-by: Arun KS 
> ---
>  mm/page_alloc.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7ab5274..90db431 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1258,14 +1258,10 @@ void __free_pages_core(struct page *page, unsigned 
> int order)
>   struct page *p = page;
>   unsigned int loop;
>  
> - prefetchw(p);
> - for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
> - prefetchw(p + 1);
> + for (loop = 0; loop < nr_pages ; loop++, p++) {
>   __ClearPageReserved(p);
>   set_page_count(p, 0);
>   }
> - __ClearPageReserved(p);
> - set_page_count(p, 0);
>  
>   page_zone(page)->managed_pages += nr_pages;
>   set_page_refcounted(page);
> -- 
> 1.9.1
> 

-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH v4] memory_hotplug: Free pages as higher order

2018-10-04 Thread Michal Hocko
On Wed 03-10-18 19:09:39, Arun KS wrote:
[...]
> +static int online_pages_blocks(unsigned long start, unsigned long nr_pages)
> +{
> + unsigned long end = start + nr_pages;
> + int order, ret, onlined_pages = 0;
> +
> + while (start < end) {
> + order = min(MAX_ORDER - 1UL, __ffs(start));
> +
> + while (start + (1UL << order) > end)
> + order--;

this really made me scratch my head. Wouldn't it be much simpler to do
the following?
order = min(MAX_ORDER - 1, get_order(end - start))?

> +
> + ret = (*online_page_callback)(pfn_to_page(start), order);
> + if (!ret)
> + onlined_pages += (1UL << order);
> + else if (ret > 0)
> + onlined_pages += ret;
> +
> + start += (1UL << order);
> + }
> + return onlined_pages;
>  }
[...]
> -static void __init __free_pages_boot_core(struct page *page, unsigned int 
> order)
> +void __free_pages_core(struct page *page, unsigned int order)
>  {
>   unsigned int nr_pages = 1 << order;
>   struct page *p = page;
>   unsigned int loop;
>  
> - prefetchw(p);
> - for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
> - prefetchw(p + 1);
> + for (loop = 0; loop < nr_pages; loop++, p++) {
>   __ClearPageReserved(p);
>   set_page_count(p, 0);
>   }
> - __ClearPageReserved(p);
> - set_page_count(p, 0);
>  
>   page_zone(page)->managed_pages += nr_pages;
>   set_page_refcounted(page);

I think this is worth a separate patch as it is unrelated to the rest of the patch.

-- 
Michal Hocko
SUSE Labs
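
A small user-space model may make the order selection concrete; pick_order
below mirrors the quoted loop and is illustrative only:

#include <stdio.h>

#define MAX_ORDER 11UL

/* Largest buddy order aligned at 'start' that still fits below 'end'. */
static unsigned long pick_order(unsigned long start, unsigned long end)
{
	unsigned long order = start ? __builtin_ctzl(start) : MAX_ORDER - 1;

	if (order > MAX_ORDER - 1)
		order = MAX_ORDER - 1;
	while (start + (1UL << order) > end)
		order--;
	return order;
}

int main(void)
{
	unsigned long pfn = 0x10000, end = 0x10007;

	while (pfn < end) {
		unsigned long order = pick_order(pfn, end);

		printf("pfn %#lx -> order %lu\n", pfn, order);
		pfn += 1UL << order;
	}
	return 0;
}

For the seven-page range above the loop yields orders 2, 1 and 0; a plain
get_order(end - start) would round up to order 3 and overshoot 'end', which
is why some round-down (the decrementing loop or an equivalent) is still
needed.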


Re: [Xen-devel] [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-04 Thread Michal Hocko
On Wed 03-10-18 19:00:29, David Hildenbrand wrote:
[...]
> Let me rephrase: You state that user space has to make the decision and
> that user should be able to set/reconfigure rules. That is perfectly fine.
> 
> But then we should give user space access to sufficient information to
> make a decision. This might be the type of memory as we learned (what
> some part of this patch proposes), but maybe later more, e.g. to which
> physical device memory belongs (e.g. to hotplug it all movable or all
> normal) ...

I am pretty sure that the user knows he/she wants to use ballooning in
HyperV or Xen, or that memory hotplug should be used as a "RAS"
feature to allow adding and removing DIMMs for reliability. Why shouldn't we
have a package to deploy an appropriate set of udev rules for each of
those use cases? I am pretty sure you need some other plumbing to enable
them anyway (e.g. RAS would require the movable_node kernel
parameter, ballooning a kernel module, etc.).

Really, one udev script to rule them all will simply never work.
-- 
Michal Hocko
SUSE Labs
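
As a concrete example of such plumbing, a package for the RAS/DIMM
replacement use case could ship a single udev rule along these lines (an
illustrative sketch, not something posted in this thread):

# online hotplugged memory blocks as movable so DIMMs stay unpluggable
SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online_movable"

while a ballooning-oriented package would ship the same rule with
ATTR{state}="online" instead.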


Re: [Xen-devel] [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-04 Thread Michal Hocko
On Wed 03-10-18 19:14:05, David Hildenbrand wrote:
> On 03/10/2018 16:34, Vitaly Kuznetsov wrote:
> > Dave Hansen  writes:
> > 
> >> On 10/03/2018 06:52 AM, Vitaly Kuznetsov wrote:
> >>> It is more than just memmaps (e.g. forking udev process doing memory
> >>> onlining also needs memory) but yes, the main idea is to make the
> >>> onlining synchronous with hotplug.
> >>
> >> That's a good theoretical concern.
> >>
> >> But, is it a problem we need to solve in practice?
> > 
> > Yes, unfortunately. It was previously discovered that when we try to
> > hotplug tons of memory to a low memory system (this is a common scenario
> > with VMs) we end up with OOM because for all new memory blocks we need
> > to allocate page tables, struct pages, ... and we need memory to do
> > that. The userspace program doing memory onlining also needs memory to
> > run and in case it prefers to fork to handle hundreds of notfifications
> > ... well, it may get OOMkilled before it manages to online anything.
> > 
> > Allocating all kernel objects from the newly hotplugged blocks would
> > definitely help to manage the situation but as I said this won't solve
> > the 'forking udev' problem completely (it will likely remain in
> > 'extreme' cases only. We can probably work around it by onlining with a
> > dedicated process which doesn't do memory allocation).
> > 
> 
> I guess the problem is even worse. We always have two phases
> 
> 1. add memory - requires memory allocation
> 2. online memory - might require memory allocations e.g. for slab/slub
> 
> So if we just added memory but don't have sufficient memory to start a
> user space process to trigger onlining, then we most likely also don't
> have sufficient memory to online the memory right away (in some scenarios).
> 
> We would have to allocate all new memory for 1 and 2 from the memory to
> be onlined. I guess the latter part is less trivial.
> 
> So while onlining the memory from the kernel might make things a little
> more robust, we would still have the chance for OOM / onlining failing.

Yes, _theoretically_. Is this a practical problem for reasonable
configurations though? I mean, this will never be perfect and we simply
cannot support all possible configurations. We should focus on a
reasonable subset of them. From my practical experience the vast
majority of the hotplug overhead is consumed by memmaps (roughly 1.5%
of the added memory). That is not a lot but I agree that allocating that
from the normal zone and off node is not great. Especially the second
part, which is noticeable for whole node hotplug.

I have a feeling that arguing about fork not being able to proceed or
OOMing during memory hotplug is a bit of a stretch and a sign of
misconfiguration.
-- 
Michal Hocko
SUSE Labs
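
The roughly 1.5% figure follows directly from the size of struct page; a
back-of-envelope check, assuming a 64-byte struct page and 4 KiB base
pages:

#include <stdio.h>

int main(void)
{
	const double page_size = 4096;			/* base page, bytes */
	const double struct_page = 64;			/* typical sizeof(struct page) */
	const double section = 128 * 1024.0 * 1024.0;	/* one memory section */

	printf("memmap overhead: %.2f%%\n", 100 * struct_page / page_size);
	printf("memmap per 128MiB section: %.0f KiB\n",
	       section / page_size * struct_page / 1024);
	return 0;
}

This prints 1.56% and 2048 KiB, matching the 2MB-per-section figure that
comes up elsewhere in these threads.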


Re: [Xen-devel] [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Michal Hocko
On Wed 03-10-18 15:52:24, Vitaly Kuznetsov wrote:
[...]
> > As David said some of the memory cannot be onlined without further steps
> > (e.g. when it is standby as David called it) and then I fail to see how
> > eBPF help in any way.
> 
> and also, we can fight till the end of days here trying to come up with
> an onlining solution which would work for everyone and eBPF would move
> this decision to distro level.

The point is that there is _no_ general onlining solution. This is
basically policy which belongs to the userspace.
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Michal Hocko
On Tue 02-10-18 17:25:19, David Hildenbrand wrote:
> On 02/10/2018 15:47, Michal Hocko wrote:
[...]
> > Zone imbalance is an inherent problem of the highmem zone. It is
> > essentially the highmem zone we all loved so much back in 32b days.
> > Yes the movable zone doesn't have any addressing limitations so it is a
> > bit more relaxed but considering the hotplug scenarios I have seen so
> > far people just want to have full NUMA nodes movable to allow replacing
> > DIMMs. And then we are back to square one and the zone imbalance issue.
> > You have those regardless where memmaps are allocated from.
> 
> Unfortunately yes. And things get more complicated as you are adding a
> whole DIMMs and get notifications in the granularity of memory blocks.
> Usually you are not interested in onlining any memory block of that DIMM
> as MOVABLE as soon as you would have to online one memory block of that
> DIMM as NORMAL - because that can already block the whole DIMM.

For the purpose of the hotremove, yes. But as Dave has noted people are
(ab)using zone movable for other purposes - e.g. large pages.
 
[...]
> > Then the immediate question would be why to use memory hotplug for that
> > at all? Why don't you simply start with a huge pre-allocated physical
> > address space and balloon memory in an out per demand. Why do you want
> > to inject new memory during the runtime?
> 
> Let's assume you have a guest with 20GB size and eventually want to
> allow to grow it to 4TB. You would have to allocate metadata for 4TB
> right from the beginning. That's definitely now what we want. That is
> why memory hotplug is used by e.g. XEN or Hyper-V. With Hyper-V, the
> hypervisor even tells you at which places additional memory has been
> made available.

Then you have to live with the fact that your hot added memory will be
self hosted and find a way for ballooning to work with that. The price
would be that some part of the memory is not really balloonable in the
end.

> >> 1. is a reason why distributions usually don't configure
> >> "MEMORY_HOTPLUG_DEFAULT_ONLINE", because you really want the option for
> >> MOVABLE zone. That however implies, that e.g. for x86, you have to
> >> handle all new memory in user space, especially also HyperV memory.
> >> There, you then have to check for things like "isHyperV()" to decide
> >> "oh, yes, this should definitely not go to the MOVABLE zone".
> > 
> > Why do you need a generic hotplug rule in the first place? Why don't you
> > simply provide different set of rules for different usecases? Let users
> > decide which usecase they prefer rather than try to be clever which
> > almost always hits weird corner cases.
> > 
> 
> Memory hotplug has to work as reliable as we can out of the box. Letting
> the user make simple decisions like "oh, I am on hyper-V, I want to
> online memory to the normal zone" does not feel right.

Users usually know what their use case is, and then it is just a matter of
plumbing (e.g. the distribution can provide proper tools to deploy those
use cases) to choose the right, and to the user obscure, way to make it work.

> But yes, we
> should definitely allow to make modifications. So some sane default rule
> + possible modification is usually a good idea.
> 
> I think Dave has a point with using MOVABLE for huge page use cases. And
> there might be other corner cases as you correctly state.
> 
> I wonder if this patch itself minus modifying online/offline might make
> sense. We can then implement simple rules in user space
> 
> if (normal) {
>   /* customers expect hotplugged DIMMs to be unpluggable */
>   online_movable();
> } else if (paravirt) {
>   /* paravirt memory should as default always go to the NORMAL */
>   online();
> } else {
>   /* standby memory will never get onlined automatically */
> }
> 
> Compared to having to guess what is to be done (isKVM(), isHyperV,
> isS390 ...) and failing once this is no longer unique (e.g. virtio-mem
> and ACPI support for x86 KVM).

I am worried that exporting a type will just push us even further into the
corner. The current design is really simple and two-stage, and that is good
because it allows for very different use cases. The more specific the API
is, the more likely we are to hit "I haven't even dreamed somebody
would be using hotplug for this thing". And I would bet this will happen
sooner or later.

Just look at how the whole auto onlining screwed the API to work around
an implementation detail. It has created a single-purpose behavior that
doesn't suit many use cases. Yet we have to live with that because
somebody really relies on it. Let's not repeat the same errors.
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-03 Thread Michal Hocko
On Wed 03-10-18 15:38:04, Vitaly Kuznetsov wrote:
> David Hildenbrand  writes:
> 
> > On 02/10/2018 15:47, Michal Hocko wrote:
> ...
> >> 
> >> Why do you need a generic hotplug rule in the first place? Why don't you
> >> simply provide different set of rules for different usecases? Let users
> >> decide which usecase they prefer rather than try to be clever which
> >> almost always hits weird corner cases.
> >> 
> >
> > Memory hotplug has to work as reliable as we can out of the box. Letting
> > the user make simple decisions like "oh, I am on hyper-V, I want to
> > online memory to the normal zone" does not feel right. But yes, we
> > should definitely allow to make modifications.
> 
> Last time I was thinking about the imperfectness of the auto-online
> solution we have and any other solution we're able to suggest an idea
> came to my mind - what if we add an eBPF attach point to the
> auto-onlining mechanism effecively offloading decision-making to
> userspace. We'll of couse need to provide all required data (e.g. how
> memory blocks are aligned with physical DIMMs as it makes no sense to
> online part of DIMM as normal and the rest as movable as it's going to
> be impossible to unplug such DIMM anyways).

And how does that differ from the notification mechanism we have? Just
by not relying on the process scheduling? If yes then this revolves
around the implementation detail that you care about time-to-hot-add
vs. time-to-online. And that is a solvable problem - just allocate
memmaps from the hot-added memory.

As David said some of the memory cannot be onlined without further steps
(e.g. when it is standby as David called it) and then I fail to see how
eBPF help in any way.
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-02 Thread Michal Hocko
On Mon 01-10-18 11:34:25, David Hildenbrand wrote:
> On 01/10/2018 10:40, Michal Hocko wrote:
> > On Fri 28-09-18 17:03:57, David Hildenbrand wrote:
> > [...]
> > 
> > I haven't read the patch itself but I just wanted to note one thing
> > about this part
> > 
> >> For paravirtualized devices it is relevant that memory is onlined as
> >> quickly as possible after adding - and that it is added to the NORMAL
> >> zone. Otherwise, it could happen that too much memory in a row is added
> >> (but not onlined), resulting in out-of-memory conditions due to the
> >> additional memory for "struct pages" and friends. MOVABLE zone as well
> >> as delays might be very problematic and lead to crashes (e.g. zone
> >> imbalance).
> > 
> > I have proposed (but haven't finished this due to other stuff) a
> > solution for this. Newly added memory can host memmaps itself and then
> > you do not have the problem in the first place. For vmemmap it would
> > have an advantage that you do not really have to beg for 2MB pages to
> > back the whole section but you would get it for free because the initial
> > part of the section is by definition properly aligned and unused.
> 
> So the plan is to "host metadata for new memory on the memory itself".
> Just want to note that this is basically impossible for s390x with the
> current mechanisms. (added memory is dead, until onlining notifies the
> hypervisor and memory is allocated). It will also be problematic for
> paravirtualized memory devices (e.g. XEN's "not backed by the
> hypervisor" hacks).

OK, I understand that not all use cases can use self-hosted memmaps;
the others do not have much choice left, though. You have to allocate from
somewhere. Well, an alternative would be to have no memmap until
onlining, but I am not sure how much work that would be.

> This would only be possible for memory DIMMs, memory that is completely
> accessible as far as I can see. Or at least, some specified "first part"
> is accessible.
> 
> Other problems are other metadata like extended struct pages and friends.

I wouldn't really worry about extended struct pages. Those should be
used for debugging purposes mostly. Or at least that was the case last
time I checked.

> (I really like the idea of adding memory without allocating memory in
> the hypervisor in the first place, please keep me tuned).
> 
> And please note: This solves some problematic part ("adding too much
> memory to the movable zone or not onlining it"), but not the issue of
> zone imbalance in the first place. And not one issue I try to tackle
> here: don't add paravirtualized memory to the movable zone.

Zone imbalance is an inherent problem of the highmem zone. It is
essentially the highmem zone we all loved so much back in 32b days.
Yes the movable zone doesn't have any addressing limitations so it is a
bit more relaxed but considering the hotplug scenarios I have seen so
far people just want to have full NUMA nodes movable to allow replacing
DIMMs. And then we are back to square one and the zone imbalance issue.
You have those regardless where memmaps are allocated from.

> > I yet have to think about the whole proposal but I am missing the most
> > important part. _Who_ is going to use the new exported information and
> > for what purpose. You said that distributions have hard time to
> > distinguish different types of onlinining policies but isn't this
> > something that is inherently usecase specific?
> > 
> 
> Let's think about a distribution. We have a clash of use cases here
> (just what you describe). What I propose solves one part of it ("handle
> what you know how to handle right in the kernel").
> 
> 1. Users of DIMMs usually expect that they can be unplugged again. That
> is why you want to control how to online memory in user space (== add it
> to the movable zone).

Which is only true if you really want to hotremove them. I am not going
to say how much I believe in this use case, but the movable policy is not
generally applicable here.

> 2. Users of standby memory (s390) expect that memory will never be
> onlined automatically. It will be onlined manually.

yeah

> 3. Users of paravirtualized devices (esp. Hyper-V) don't care about
> memory unplug in the sense of MOVABLE at all. They (or Hyper-V!) will
> add a whole bunch of memory and expect that everything works fine. So
> that memory is onlined immediately and that memory is added to the
> NORMAL zone. Users never want the MOVABLE zone.

Then the immediate question would be why use memory hotplug for that
at all? Why don't you simply start with a huge pre-allocated physical
address space and balloon memory in and out per demand? Why do you want
to inject new memory during the runtime?

Re: [Xen-devel] [PATCH RFC] mm/memory_hotplug: Introduce memory block types

2018-10-01 Thread Michal Hocko
On Fri 28-09-18 17:03:57, David Hildenbrand wrote:
[...]

I haven't read the patch itself but I just wanted to note one thing
about this part

> For paravirtualized devices it is relevant that memory is onlined as
> quickly as possible after adding - and that it is added to the NORMAL
> zone. Otherwise, it could happen that too much memory in a row is added
> (but not onlined), resulting in out-of-memory conditions due to the
> additional memory for "struct pages" and friends. MOVABLE zone as well
> as delays might be very problematic and lead to crashes (e.g. zone
> imbalance).

I have proposed (but haven't finished this due to other stuff) a
solution for this. Newly added memory can host memmaps itself and then
you do not have the problem in the first place. For vmemmap it would
have an advantage that you do not really have to beg for 2MB pages to
back the whole section but you would get it for free because the initial
part of the section is by definition properly aligned and unused.

I yet have to think about the whole proposal but I am missing the most
important part. _Who_ is going to use the new exported information and
for what purpose. You said that distributions have a hard time
distinguishing different types of onlining policies, but isn't this
something that is inherently use-case specific?
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH v2] memory_hotplug: Free pages as higher order

2018-09-25 Thread Michal Hocko
On Tue 25-09-18 11:59:09, Vlastimil Babka wrote:
[...]
> This seems like almost complete copy of __free_pages_boot_core(), could
> you do some code reuse instead? I think Michal Hocko also suggested that.

Yes, please try to reuse as much code as possible

-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 11:12:40, Jerome Glisse wrote:
[...]
> I am fine with Michal patch, i already said so couple month ago first time
> this discussion did pop up, Michal you can add:
> 
> Reviewed-by: Jérôme Glisse 

So I guess the below is the patch you were talking about?

From f7ac75277d526dccd011f343818dc6af627af2af Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Fri, 24 Aug 2018 15:32:24 +0200
Subject: [PATCH] mm, mmu_notifier: be explicit about range invalidation
 non-blocking mode

If invalidate_range_start is called for !blocking mode then all
callbacks have to guarantee they will not block/sleep. The same obviously
applies to invalidate_range_end because this operation pairs with the
former and they are called from the same context. Make sure this is
appropriately documented.

Signed-off-by: Michal Hocko 
---
 include/linux/mmu_notifier.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 133ba78820ee..698e371aafe3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -153,7 +153,9 @@ struct mmu_notifier_ops {
 *
 * If blockable argument is set to false then the callback cannot
 * sleep and has to return with -EAGAIN. 0 should be returned
-* otherwise.
+* otherwise. Please note that if invalidate_range_start approves
+* a non-blocking behavior then the same applies to
+* invalidate_range_end.
 *
 */
int (*invalidate_range_start)(struct mmu_notifier *mn,
-- 
2.18.0

-- 
Michal Hocko
SUSE Labs
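
To make the documented contract concrete, here is a sketch of a notifier
honoring it. The demo_* names are hypothetical and the callback signatures
follow the 2018-era API quoted in this thread:

#include <linux/mmu_notifier.h>
#include <linux/mutex.h>
#include <linux/errno.h>

struct demo_ctx {
	struct mmu_notifier mn;
	struct mutex lock;	/* taken for write by expensive operations */
};

static int demo_invalidate_range_start(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start,
				       unsigned long end,
				       bool blockable)
{
	struct demo_ctx *ctx = container_of(mn, struct demo_ctx, mn);

	if (blockable)
		mutex_lock(&ctx->lock);
	else if (!mutex_trylock(&ctx->lock))
		return -EAGAIN;	/* the oom path will simply skip us */

	/* nothing below this point may sleep when blockable == false */
	return 0;
}

static void demo_invalidate_range_end(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start,
				      unsigned long end)
{
	struct demo_ctx *ctx = container_of(mn, struct demo_ctx, mn);

	/* pairs with range_start; must not block if range_start did not */
	mutex_unlock(&ctx->lock);
}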


Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 15:44:03, Christian König wrote:
> On 24.08.2018 at 15:40, Michal Hocko wrote:
> > On Fri 24-08-18 15:28:33, Christian König wrote:
> > > On 24.08.2018 at 15:24, Michal Hocko wrote:
> > > > On Fri 24-08-18 15:10:08, Christian König wrote:
> > > > > On 24.08.2018 at 15:01, Michal Hocko wrote:
> > > > > > On Fri 24-08-18 14:52:26, Christian König wrote:
> > > > > > > On 24.08.2018 at 14:33, Michal Hocko wrote:
> > > > > > [...]
> > > > > > > > Thiking about it some more, I can imagine that a notifier 
> > > > > > > > callback which
> > > > > > > > performs an allocation might trigger a memory reclaim and that 
> > > > > > > > in turn
> > > > > > > > might trigger a notifier to be invoked and recurse. But notifier
> > > > > > > > shouldn't really allocate memory. They are called from deep MM 
> > > > > > > > code
> > > > > > > > paths and this would be extremely deadlock prone. Maybe Jerome 
> > > > > > > > can come
> > > > > > > > up some more realistic scenario. If not then I would propose to 
> > > > > > > > simplify
> > > > > > > > the locking here. We have lockdep to catch self deadlocks and 
> > > > > > > > it is
> > > > > > > > always better to handle a specific issue rather than having a 
> > > > > > > > code
> > > > > > > > without a clear indication how it can recurse.
> > > > > > > Well I agree that we should probably fix that, but I have some 
> > > > > > > concerns to
> > > > > > > remove the existing workaround.
> > > > > > > 
> > > > > > > See we added that to get rid of a real problem in a customer 
> > > > > > > environment and
> > > > > > > I don't want to that to show up again.
> > > > > > It would really help to know more about that case and fix it 
> > > > > > properly
> > > > > > rather than workaround it like this. Anyway, let me think how to 
> > > > > > handle
> > > > > > the non-blocking notifier invocation then. I was not able to come up
> > > > > > with anything remotely sane yet.
> > > > > With avoiding allocating memory in the write lock path I don't see an 
> > > > > issue
> > > > > any more with that.
> > > > > 
> > > > > All what the write lock path does now is adding items to a linked 
> > > > > lists,
> > > > > arrays etc
> > > > Can we change it to non-sleepable lock then?
> > > No, the write side doesn't sleep any more, but the read side does.
> > > 
> > > See amdgpu_mn_invalidate_node() and that is where you actually need to
> > > handle the non-blocking flag correctly.
> > Ohh, right you are. We already handle that by bailing out before calling
> > amdgpu_mn_invalidate_node in !blockable mode.
> 
> Yeah, that is sufficient.
> 
> It could be improved because we have something like 90% chance that
> amdgpu_mn_invalidate_node() actually doesn't need to do anything.
> 
> But I can take care of that when the patch set has landed.
> 
> > So does this looks good to you?
> 
> Yeah, that looks perfect to me. Reviewed-by: Christian König
> 

Cool! Thanks for your guidance and patience with me. Here is the full
patch. Feel free to take it and route per your preference.

From 4e297bf5a55906ee369d70bee9f7beeb3cba74bb Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Fri, 24 Aug 2018 15:45:52 +0200
Subject: [PATCH] drm/amd: clarify amdgpu_mn_read_lock !blocking mode
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Tetsuo has noticed that 93065ac753e4 ("mm, oom: distinguish blockable
mode for mmu notifiers") !blocking case for amdgpu_mn_read_lock is
incomplete because it might sleep on the notifier lock. This is true
but as it turned out from the discussion with Christian this doesn't
really matter.

The amd notifier lock doesn't block in the exclusive mode. It only ever
sleeps with the read lock held inside amdgpu_mn_invalidate_node. That one
is not called in !blockable state, so while we might sleep on the notifier
read_lock this will only be for a short while. The same applies to the
notifier lock.

Therefore remove the blockable handling from amdgpu_mn_read_lock and
document it properly.

Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 15:28:33, Christian König wrote:
> On 24.08.2018 at 15:24, Michal Hocko wrote:
> > On Fri 24-08-18 15:10:08, Christian König wrote:
> > > On 24.08.2018 at 15:01, Michal Hocko wrote:
> > > > On Fri 24-08-18 14:52:26, Christian König wrote:
> > > > > On 24.08.2018 at 14:33, Michal Hocko wrote:
> > > > [...]
> > > > > > Thiking about it some more, I can imagine that a notifier callback 
> > > > > > which
> > > > > > performs an allocation might trigger a memory reclaim and that in 
> > > > > > turn
> > > > > > might trigger a notifier to be invoked and recurse. But notifier
> > > > > > shouldn't really allocate memory. They are called from deep MM code
> > > > > > paths and this would be extremely deadlock prone. Maybe Jerome can 
> > > > > > come
> > > > > > up some more realistic scenario. If not then I would propose to 
> > > > > > simplify
> > > > > > the locking here. We have lockdep to catch self deadlocks and it is
> > > > > > always better to handle a specific issue rather than having a code
> > > > > > without a clear indication how it can recurse.
> > > > > Well I agree that we should probably fix that, but I have some 
> > > > > concerns to
> > > > > remove the existing workaround.
> > > > > 
> > > > > See we added that to get rid of a real problem in a customer 
> > > > > environment and
> > > > > I don't want to that to show up again.
> > > > It would really help to know more about that case and fix it properly
> > > > rather than workaround it like this. Anyway, let me think how to handle
> > > > the non-blocking notifier invocation then. I was not able to come up
> > > > with anything remotely sane yet.
> > > With avoiding allocating memory in the write lock path I don't see an 
> > > issue
> > > any more with that.
> > > 
> > > All what the write lock path does now is adding items to a linked lists,
> > > arrays etc
> > Can we change it to non-sleepable lock then?
> 
> No, the write side doesn't sleep any more, but the read side does.
> 
> See amdgpu_mn_invalidate_node() and that is where you actually need to
> handle the non-blocking flag correctly.

Ohh, right you are. We already handle that by bailing out before calling
amdgpu_mn_invalidate_node in !blockable mode. So does this looks good to
you?

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b39496..48fa152231be 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -180,11 +180,15 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  */
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
-   if (blockable)
-   mutex_lock(&amn->read_lock);
-   else if (!mutex_trylock(&amn->read_lock))
-   return -EAGAIN;
-
+   /*
+* We can take a sleepable lock even in !blockable mode because
+* read_lock is only ever taken from this path and the notifier
+* lock never really sleeps. In fact the only reason why the
+* latter is sleepable is because the notifier itself might sleep
+* in amdgpu_mn_invalidate_node but blockable mode is handled
+* before calling into that path.
+*/
+   mutex_lock(&amn->read_lock);
if (atomic_inc_return(&amn->recursion) == 1)
down_read_non_owner(&amn->lock);
mutex_unlock(&amn->read_lock);
-- 
Michal Hocko
SUSE Labs
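
For readers unfamiliar with the amdgpu code under discussion, the
recursion counter implements a shared, possibly nested read lock: only the
outermost of the nested range_start/range_end pairs actually acquires the
rwsem, via the non-owner variants, since (as noted elsewhere in the thread)
taking the read side recursively is illegal for a rw_semaphore. A
stripped-down sketch of the pattern, with illustrative names:

#include <linux/mutex.h>
#include <linux/rwsem.h>
#include <linux/atomic.h>

struct shared_read {
	struct mutex read_lock;		/* serializes the check-and-lock step */
	atomic_t recursion;		/* nesting depth of read-side holders */
	struct rw_semaphore lock;	/* writers take this exclusively */
};

static void shared_read_lock(struct shared_read *s)
{
	mutex_lock(&s->read_lock);
	if (atomic_inc_return(&s->recursion) == 1)
		down_read_non_owner(&s->lock);	/* first locker takes it */
	mutex_unlock(&s->read_lock);
}

static void shared_read_unlock(struct shared_read *s)
{
	if (atomic_dec_return(&s->recursion) == 0)
		up_read_non_owner(&s->lock);	/* last locker releases it */
}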


Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 22:02:23, Tetsuo Handa wrote:
> On 2018/08/24 20:36, Michal Hocko wrote:
> >> That is, this API seems to be currently used by only out-of-tree users. 
> >> Since
> >> we can't check that nobody has memory allocation dependency, I think that
> >> hmm_invalidate_range_start() should return -EAGAIN if blockable == false 
> >> for now.
> > 
> > The code expects that the invalidate_range_end doesn't block if
> > invalidate_range_start hasn't blocked. That is the reason why the end
> > callback doesn't have blockable parameter. If this doesn't hold then the
> > whole scheme is just fragile because those two calls should pair.
> > 
> That is
> 
>   More worrisome part in that patch is that I don't know whether using
>   trylock if blockable == false at entry is really sufficient.
> 
> . Since those two calls should pair, I think that we need to determine whether
> we need to return -EAGAIN at start call by evaluating both calls.

Yes, and I believe I have done that audit. Modulo my misunderstanding of
the code.

> Like mn_invl_range_start() involves schedule_delayed_work() which could be
> blocked on memory allocation under OOM situation,

It doesn't because that code path is not invoked for the !blockable
case.

> I worry that (currently
> out-of-tree) users of this API are involving work / recursion.

I do not care the slightest about out-of-tree modules. They will have to
accommodate the new API. I have no problem extending the
documentation and being explicit about this expectation.
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 133ba78820ee..698e371aafe3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -153,7 +153,9 @@ struct mmu_notifier_ops {
 *
 * If blockable argument is set to false then the callback cannot
 * sleep and has to return with -EAGAIN. 0 should be returned
-* otherwise.
+* otherwise. Please note that if invalidate_range_start approves
+* a non-blocking behavior then the same applies to
+* invalidate_range_end.
 *
 */
int (*invalidate_range_start)(struct mmu_notifier *mn,


> And hmm_release() says that
> 
>   /*
>* Drop mirrors_sem so callback can wait on any pending
>* work that might itself trigger mmu_notifier callback
>* and thus would deadlock with us.
>*/
> 
> and keeps "all operations protected by hmm->mirrors_sem held for write are
> atomic". This suggests that "some operations protected by hmm->mirrors_sem 
> held
> for read will sleep (and in the worst case involves memory allocation
> dependency)".

Yes, and so what? The clear expectation is that neither of the range
notifiers sleeps in !blocking mode. I really fail to see what you
are trying to say.

-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 15:10:08, Christian König wrote:
> On 24.08.2018 at 15:01, Michal Hocko wrote:
> > On Fri 24-08-18 14:52:26, Christian König wrote:
> > > On 24.08.2018 at 14:33, Michal Hocko wrote:
> > [...]
> > > > Thiking about it some more, I can imagine that a notifier callback which
> > > > performs an allocation might trigger a memory reclaim and that in turn
> > > > might trigger a notifier to be invoked and recurse. But notifier
> > > > shouldn't really allocate memory. They are called from deep MM code
> > > > paths and this would be extremely deadlock prone. Maybe Jerome can come
> > > > up some more realistic scenario. If not then I would propose to simplify
> > > > the locking here. We have lockdep to catch self deadlocks and it is
> > > > always better to handle a specific issue rather than having a code
> > > > without a clear indication how it can recurse.
> > > Well I agree that we should probably fix that, but I have some concerns to
> > > remove the existing workaround.
> > > 
> > > See we added that to get rid of a real problem in a customer environment 
> > > and
> > > I don't want to that to show up again.
> > It would really help to know more about that case and fix it properly
> > rather than workaround it like this. Anyway, let me think how to handle
> > the non-blocking notifier invocation then. I was not able to come up
> > with anything remotely sane yet.
> 
> With avoiding allocating memory in the write lock path I don't see an issue
> any more with that.
> 
> All what the write lock path does now is adding items to a linked lists,
> arrays etc

Can we change it to non-sleepable lock then?
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 14:52:26, Christian König wrote:
> On 24.08.2018 at 14:33, Michal Hocko wrote:
[...]
> > Thiking about it some more, I can imagine that a notifier callback which
> > performs an allocation might trigger a memory reclaim and that in turn
> > might trigger a notifier to be invoked and recurse. But notifier
> > shouldn't really allocate memory. They are called from deep MM code
> > paths and this would be extremely deadlock prone. Maybe Jerome can come
> > up some more realistic scenario. If not then I would propose to simplify
> > the locking here. We have lockdep to catch self deadlocks and it is
> > always better to handle a specific issue rather than having a code
> > without a clear indication how it can recurse.
> 
> Well I agree that we should probably fix that, but I have some concerns to
> remove the existing workaround.
> 
> See we added that to get rid of a real problem in a customer environment and
> I don't want to that to show up again.

It would really help to know more about that case and fix it properly
rather than work around it like this. Anyway, let me think about how to
handle the non-blocking notifier invocation then. I have not been able to
come up with anything remotely sane yet.
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 14:18:44, Christian König wrote:
> On 24.08.2018 at 14:03, Michal Hocko wrote:
> > On Fri 24-08-18 13:57:52, Christian König wrote:
> > > On 24.08.2018 at 13:52, Michal Hocko wrote:
> > > > On Fri 24-08-18 13:43:16, Christian König wrote:
> > [...]
> > > > > That won't work like this there might be multiple
> > > > > invalidate_range_start()/invalidate_range_end() pairs open at the 
> > > > > same time.
> > > > > E.g. the lock might be taken recursively and that is illegal for a
> > > > > rw_semaphore.
> > > > I am not sure I follow. Are you saying that one invalidate_range might
> > > > trigger another one from the same path?
> > > No, but what can happen is:
> > > 
> > > invalidate_range_start(A,B);
> > > invalidate_range_start(C,D);
> > > ...
> > > invalidate_range_end(C,D);
> > > invalidate_range_end(A,B);
> > > 
> > > Grabbing the read lock twice would be illegal in this case.
> > I am sorry but I still do not follow. What is the context the two are
> > called from?
> 
> I don't have the slightest idea.
> 
> > Can you give me an example. I simply do not see it in the
> > code, mostly because I am not familiar with it.
> 
> I'm neither.
> 
> We stumbled over that by pure observation and after discussing the problem
> with Jerome came up with this solution.
> 
> No idea where exactly that case comes from, but I can confirm that it indeed
> happens.

Thinking about it some more, I can imagine that a notifier callback which
performs an allocation might trigger a memory reclaim and that in turn
might trigger a notifier to be invoked and recurse. But notifiers
shouldn't really allocate memory. They are called from deep MM code
paths and this would be extremely deadlock prone. Maybe Jerome can come
up with some more realistic scenario. If not then I would propose to simplify
the locking here. We have lockdep to catch self deadlocks and it is
always better to handle a specific issue rather than having code
without a clear indication of how it can recurse.
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 13:57:52, Christian König wrote:
> On 24.08.2018 at 13:52, Michal Hocko wrote:
> > On Fri 24-08-18 13:43:16, Christian König wrote:
[...]
> > > That won't work like this there might be multiple
> > > invalidate_range_start()/invalidate_range_end() pairs open at the same 
> > > time.
> > > E.g. the lock might be taken recursively and that is illegal for a
> > > rw_semaphore.
> > I am not sure I follow. Are you saying that one invalidate_range might
> > trigger another one from the same path?
> 
> No, but what can happen is:
> 
> invalidate_range_start(A,B);
> invalidate_range_start(C,D);
> ...
> invalidate_range_end(C,D);
> invalidate_range_end(A,B);
> 
> Grabbing the read lock twice would be illegal in this case.

I am sorry but I still do not follow. What is the context the two are
called from? Can you give me an example? I simply do not see it in the
code, mostly because I am not familiar with it.
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 13:43:16, Christian König wrote:
> On 24.08.2018 at 13:32, Michal Hocko wrote:
> > On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:
> > > Two more worries for this patch.
> > > 
> > > 
> > > 
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > > > @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
> > > >*
> > > >* @amn: our notifier
> > > >*/
> > > > -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
> > > > +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
> > > >   {
> > > > -   mutex_lock(&amn->read_lock);
> > > > +   if (blockable)
> > > > +   mutex_lock(&amn->read_lock);
> > > > +   else if (!mutex_trylock(&amn->read_lock))
> > > > +   return -EAGAIN;
> > > > +
> > > >  if (atomic_inc_return(&amn->recursion) == 1)
> > > >  down_read_non_owner(&amn->lock);
> > > Why don't we need to use trylock here if blockable == false ?
> > > Want comment why it is safe to use blocking lock here.
> > Hmm, I am pretty sure I have checked the code but it was quite confusing
> > so I might have missed something. Double checking now, it seems that
> > this read_lock is not used anywhere else and it is not _the_ lock we are
> > interested about. It is the amn->lock (amdgpu_mn_lock) which matters as
> > it is taken in exclusive mode for expensive operations.
> 
> The write side of the lock is only taken in the command submission IOCTL.
> 
> So you actually don't need to change anything here (even the proposed
> changes are overkill) since we can't tear down the struct_mm while an IOCTL
> is still using.

I am not so sure. We are not in the mm destruction phase yet. This is
mostly about the oom context which might fire right during the IOCTL. If
any of the paths holding the write lock blocks for an unbounded
amount of time, or even worse allocates memory, then we are screwed. So
we need to back off when blockable == false.

> > Is that correct Christian? If this is correct then we need to update the
> > locking here. I am struggling to grasp the ref counting part. Why cannot
> > all readers simply take the lock rather than rely on somebody else to
> > take it? 1ed3d2567c800 didn't really help me to understand the locking
> > scheme here so any help would be appreciated.
> 
> That won't work like this there might be multiple
> invalidate_range_start()/invalidate_range_end() pairs open at the same time.
> E.g. the lock might be taken recursively and that is illegal for a
> rw_semaphore.

I am not sure I follow. Are you saying that one invalidate_range might
trigger another one from the same path?
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:
[...]
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, 
> > struct mm_struct *mm)
> > up_write(&hmm->mirrors_sem);
> >  }
> > 
> > -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> > +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
> >struct mm_struct *mm,
> >unsigned long start,
> > -  unsigned long end)
> > +  unsigned long end,
> > +  bool blockable)
> >  {
> > struct hmm *hmm = mm->hmm;
> > 
> > VM_BUG_ON(!hmm);
> > 
> > atomic_inc(&hmm->sequence);
> > +
> > +   return 0;
> >  }
> > 
> >  static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> 
> This assumes that hmm_invalidate_range_end() does not have memory
> allocation dependency. But hmm_invalidate_range() from
> hmm_invalidate_range_end() involves
> 
> down_read(&hmm->mirrors_sem);
> list_for_each_entry(mirror, &hmm->mirrors, list)
> mirror->ops->sync_cpu_device_pagetables(mirror, action,
> start, end);
> up_read(&hmm->mirrors_sem);
> 
> sequence. What is surprising is that there is no in-tree user who assigns
> sync_cpu_device_pagetables field.

Yes, HMM doesn't have any in-tree user yet.

>   $ grep -Fr sync_cpu_device_pagetables *
>   Documentation/vm/hmm.rst: /* sync_cpu_device_pagetables() - synchronize 
> page tables
>   include/linux/hmm.h: * will get callbacks through 
> sync_cpu_device_pagetables() operation (see
>   include/linux/hmm.h:/* sync_cpu_device_pagetables() - synchronize page 
> tables
>   include/linux/hmm.h:void (*sync_cpu_device_pagetables)(struct 
> hmm_mirror *mirror,
>   include/linux/hmm.h: * hmm_mirror_ops.sync_cpu_device_pagetables() 
> callback, so that CPU page
>   mm/hmm.c:   mirror->ops->sync_cpu_device_pagetables(mirror, 
> action,
> 
> That is, this API seems to be currently used by only out-of-tree users. Since
> we can't check that nobody has memory allocation dependency, I think that
> hmm_invalidate_range_start() should return -EAGAIN if blockable == false for 
> now.

The code expects that the invalidate_range_end doesn't block if
invalidate_range_start hasn't blocked. That is the reason why the end
callback doesn't have a blockable parameter. If this doesn't hold then the
whole scheme is just fragile because those two calls should pair.

-- 
Michal Hocko
SUSE Labs
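
For context, sync_cpu_device_pagetables is the hook a device mirror would
register. A minimal sketch with hypothetical demo_* names; the signature is
inferred from the call site quoted above, so treat the exact types as an
assumption:

#include <linux/hmm.h>

static void demo_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
					    enum hmm_update_type update,
					    unsigned long start,
					    unsigned long end)
{
	/* tear down the device's mirrored page tables for [start, end) */
}

static const struct hmm_mirror_ops demo_mirror_ops = {
	.sync_cpu_device_pagetables = demo_sync_cpu_device_pagetables,
};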


Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 19:54:19, Tetsuo Handa wrote:
> Two more worries for this patch.
> 
> 
> 
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > @@ -178,12 +178,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
> >   *
> >   * @amn: our notifier
> >   */
> > -static void amdgpu_mn_read_lock(struct amdgpu_mn *amn)
> > +static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
> >  {
> > -   mutex_lock(&amn->read_lock);
> > +   if (blockable)
> > +   mutex_lock(&amn->read_lock);
> > +   else if (!mutex_trylock(&amn->read_lock))
> > +   return -EAGAIN;
> > +
> > if (atomic_inc_return(&amn->recursion) == 1)
> > down_read_non_owner(&amn->lock);
> 
> Why don't we need to use trylock here if blockable == false ?
> Want comment why it is safe to use blocking lock here.

Hmm, I am pretty sure I have checked the code but it was quite confusing
so I might have missed something. Double checking now, it seems that
this read_lock is not used anywhere else and it is not _the_ lock we are
interested about. It is the amn->lock (amdgpu_mn_lock) which matters as
it is taken in exclusive mode for expensive operations.

Is that correct Christian? If this is correct then we need to update the
locking here. I am struggling to grasp the ref counting part. Why cannot
all readers simply take the lock rather than rely on somebody else to
take it? 1ed3d2567c800 didn't really help me to understand the locking
scheme here so any help would be appreciated.

I am wondering why we cannot do
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index e55508b39496..93034178673d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -180,14 +180,11 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  */
 static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
 {
-   if (blockable)
-   mutex_lock(&amn->read_lock);
-   else if (!mutex_trylock(&amn->read_lock))
-   return -EAGAIN;
-
-   if (atomic_inc_return(&amn->recursion) == 1)
-   down_read_non_owner(&amn->lock);
-   mutex_unlock(&amn->read_lock);
+   if (!down_read_trylock(&amn->lock)) {
+   if (!blockable)
+   return -EAGAIN;
+   down_read(&amn->lock);
+   }
 
return 0;
 }
@@ -199,8 +196,7 @@ static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool 
blockable)
  */
 static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
 {
-   if (atomic_dec_return(&amn->recursion) == 0)
-   up_read_non_owner(&amn->lock);
+   up_read(&amn->lock);
 }
 
 /**

-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH] xen/gntdev: fix up blockable calls to mn_invl_range_start

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 07:03:28, Juergen Gross wrote:
> On 23/08/18 21:09, Michal Hocko wrote:
> > On Thu 23-08-18 10:06:53, Boris Ostrovsky wrote:
> >> On 08/23/2018 09:51 AM, Michal Hocko wrote:
> >>> On Thu 23-08-18 22:44:07, Tetsuo Handa wrote:
> >>>> On 2018/08/23 21:07, Michal Hocko wrote:
> >>>>> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> >>>>> index 57390c7666e5..e7d8bb1bee2a 100644
> >>>>> --- a/drivers/xen/gntdev.c
> >>>>> +++ b/drivers/xen/gntdev.c
> >>>>> @@ -519,21 +519,20 @@ static int mn_invl_range_start(struct 
> >>>>> mmu_notifier *mn,
> >>>>> struct gntdev_grant_map *map;
> >>>>> int ret = 0;
> >>>>>  
> >>>>> -   /* TODO do we really need a mutex here? */
> >>>>> if (blockable)
> >>>>> mutex_lock(&priv->lock);
> >>>>> else if (!mutex_trylock(&priv->lock))
> >>>>> return -EAGAIN;
> >>>>>  
> >>>>> list_for_each_entry(map, &priv->maps, next) {
> >>>>> -   if (in_range(map, start, end)) {
> >>>>> +   if (!blockable && in_range(map, start, end)) {
> >>>> This still looks strange. Prior to 93065ac753e4, in_range() test was
> >>>> inside unmap_if_in_range(). But this patch removes in_range() test
> >>>> if blockable == true. That is, unmap_if_in_range() will unconditionally
> >>>> unmap if blockable == true, which seems to be an unexpected change.
> >>> You are right. I completely forgot I've removed in_range there. Does
> >>> this look any better?
> >>>
> >>> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> >>> index e7d8bb1bee2a..30f81004ea63 100644
> >>> --- a/drivers/xen/gntdev.c
> >>> +++ b/drivers/xen/gntdev.c
> >>> @@ -525,14 +525,20 @@ static int mn_invl_range_start(struct mmu_notifier *mn,
> >>>   return -EAGAIN;
> >>>  
> >>>   list_for_each_entry(map, &priv->maps, next) {
> >>> - if (!blockable && in_range(map, start, end)) {
> >>> + if (in_range(map, start, end)) {
> >>> + if (blockable)
> >>> + continue;
> >>> +
> >>>   ret = -EAGAIN;
> >>>   goto out_unlock;
> >>>   }
> >>>   unmap_if_in_range(map, start, end);
> >>
> >>
> >> (I obviously missed that too with my R-b).
> >>
> >> This will never get anything done either. How about
> > 
> > Yeah. I was half way out and posted a complete garbage. Sorry about
> > that!
> > 
> > Michal repeat after me
> > Never post patches when in hurry! Never post patches when in hurry!
> > Never post patches when in hurry! Never post patches when in hurry!
> > Never post patches when in hurry! Never post patches when in hurry!
> > Never post patches when in hurry! Never post patches when in hurry!
> > Never post patches when in hurry! Never post patches when in hurry! 
> > 
> > What I really meant was this
> > 
> > diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> > index e7d8bb1bee2a..6fcc5a44f29d 100644
> > --- a/drivers/xen/gntdev.c
> > +++ b/drivers/xen/gntdev.c
> > @@ -525,17 +525,25 @@ static int mn_invl_range_start(struct mmu_notifier *mn,
> > return -EAGAIN;
> >  
> > list_for_each_entry(map, &priv->maps, next) {
> > -   if (!blockable && in_range(map, start, end)) {
> > +   if (!in_range(map, start, end))
> > +   continue;
> > +
> > +   if (!blockable) {
> > ret = -EAGAIN;
> > goto out_unlock;
> > }
> > +
> > unmap_if_in_range(map, start, end);
> > }
> > list_for_each_entry(map, &priv->freeable_maps, next) {
> > -   if (!blockable && in_range(map, start, end)) {
> > +   if (!in_range(map, start, end))
> > +   continue;
> > +
> > +   if (!blockable) {
> > ret = -EAGAIN;
> > goto out_unlock;
> > }
> > +
> > unmap_if_in_range(map, start, end);
> > }
> >  
&

Re: [Xen-devel] [PATCH] xen/gntdev: fix up blockable calls to mn_invl_range_start

2018-08-23 Thread Michal Hocko
On Thu 23-08-18 10:06:53, Boris Ostrovsky wrote:
> On 08/23/2018 09:51 AM, Michal Hocko wrote:
> > On Thu 23-08-18 22:44:07, Tetsuo Handa wrote:
> >> On 2018/08/23 21:07, Michal Hocko wrote:
> >>> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> >>> index 57390c7666e5..e7d8bb1bee2a 100644
> >>> --- a/drivers/xen/gntdev.c
> >>> +++ b/drivers/xen/gntdev.c
> >>> @@ -519,21 +519,20 @@ static int mn_invl_range_start(struct mmu_notifier *mn,
> >>>   struct gntdev_grant_map *map;
> >>>   int ret = 0;
> >>>  
> >>> - /* TODO do we really need a mutex here? */
> >>>   if (blockable)
> >>>   mutex_lock(&priv->lock);
> >>>   else if (!mutex_trylock(&priv->lock))
> >>>   return -EAGAIN;
> >>>  
> >>>   list_for_each_entry(map, &priv->maps, next) {
> >>> - if (in_range(map, start, end)) {
> >>> + if (!blockable && in_range(map, start, end)) {
> >> This still looks strange. Prior to 93065ac753e4, in_range() test was
> >> inside unmap_if_in_range(). But this patch removes in_range() test
> >> if blockable == true. That is, unmap_if_in_range() will unconditionally
> >> unmap if blockable == true, which seems to be an unexpected change.
> > You are right. I completely forgot I've removed in_range there. Does
> > this look any better?
> >
> > diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> > index e7d8bb1bee2a..30f81004ea63 100644
> > --- a/drivers/xen/gntdev.c
> > +++ b/drivers/xen/gntdev.c
> > @@ -525,14 +525,20 @@ static int mn_invl_range_start(struct mmu_notifier *mn,
> > return -EAGAIN;
> >  
> > list_for_each_entry(map, &priv->maps, next) {
> > -   if (!blockable && in_range(map, start, end)) {
> > +   if (in_range(map, start, end)) {
> > +   if (blockable)
> > +   continue;
> > +
> > ret = -EAGAIN;
> > goto out_unlock;
> > }
> > unmap_if_in_range(map, start, end);
> 
> 
> (I obviously missed that too with my R-b).
> 
> This will never get anything done either. How about

Yeah. I was half way out and posted a complete garbage. Sorry about
that!

Michal repeat after me
Never post patches when in hurry! Never post patches when in hurry!
Never post patches when in hurry! Never post patches when in hurry!
Never post patches when in hurry! Never post patches when in hurry!
Never post patches when in hurry! Never post patches when in hurry!
Never post patches when in hurry! Never post patches when in hurry! 

What I really meant was this

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index e7d8bb1bee2a..6fcc5a44f29d 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -525,17 +525,25 @@ static int mn_invl_range_start(struct mmu_notifier *mn,
return -EAGAIN;
 
list_for_each_entry(map, &priv->maps, next) {
-   if (!blockable && in_range(map, start, end)) {
+   if (!in_range(map, start, end))
+   continue;
+
+   if (!blockable) {
ret = -EAGAIN;
goto out_unlock;
}
+
unmap_if_in_range(map, start, end);
}
list_for_each_entry(map, &priv->freeable_maps, next) {
-   if (!blockable && in_range(map, start, end)) {
+   if (!in_range(map, start, end))
+   continue;
+
+   if (!blockable) {
ret = -EAGAIN;
goto out_unlock;
}
+
unmap_if_in_range(map, start, end);
}
 
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH] xen/gntdev: fix up blockable calls to mn_invl_range_start

2018-08-23 Thread Michal Hocko
On Thu 23-08-18 22:44:07, Tetsuo Handa wrote:
> On 2018/08/23 21:07, Michal Hocko wrote:
> > diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> > index 57390c7666e5..e7d8bb1bee2a 100644
> > --- a/drivers/xen/gntdev.c
> > +++ b/drivers/xen/gntdev.c
> > @@ -519,21 +519,20 @@ static int mn_invl_range_start(struct mmu_notifier *mn,
> > struct gntdev_grant_map *map;
> > int ret = 0;
> >  
> > -   /* TODO do we really need a mutex here? */
> > if (blockable)
> > mutex_lock(&priv->lock);
> > else if (!mutex_trylock(&priv->lock))
> > return -EAGAIN;
> >  
> > list_for_each_entry(map, &priv->maps, next) {
> > -   if (in_range(map, start, end)) {
> > +   if (!blockable && in_range(map, start, end)) {
> 
> This still looks strange. Prior to 93065ac753e4, in_range() test was
> inside unmap_if_in_range(). But this patch removes in_range() test
> if blockable == true. That is, unmap_if_in_range() will unconditionally
> unmap if blockable == true, which seems to be an unexpected change.

You are right. I completely forgot I've removed in_range there. Does
this look any better?

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index e7d8bb1bee2a..30f81004ea63 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -525,14 +525,20 @@ static int mn_invl_range_start(struct mmu_notifier *mn,
return -EAGAIN;
 
list_for_each_entry(map, &priv->maps, next) {
-   if (!blockable && in_range(map, start, end)) {
+   if (in_range(map, start, end)) {
+   if (blockable)
+   continue;
+
ret = -EAGAIN;
goto out_unlock;
}
unmap_if_in_range(map, start, end);
}
list_for_each_entry(map, &priv->freeable_maps, next) {
-   if (!blockable && in_range(map, start, end)) {
+   if (in_range(map, start, end)) {
+   if (blockable)
+       continue;
+   
ret = -EAGAIN;
goto out_unlock;
}
-- 
Michal Hocko
SUSE Labs


[Xen-devel] [PATCH] xen/gntdev: fix up blockable calls to mn_invl_range_start

2018-08-23 Thread Michal Hocko
From: Michal Hocko 

93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
has introduced blockable parameter to all mmu_notifiers and the notifier
has to back off when called in !blockable case and it could block down
the road.

The above commit implemented that for mn_invl_range_start but both
in_range checks are done unconditionally regardless of the blockable
mode and as such they would fail all the time for regular calls.
Fix this by checking blockable parameter as well.

Once we are there we can remove the stale TODO. The lock has to be
sleepable because we wait for completion down in gnttab_unmap_refs_sync.

Fixes: 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu notifiers")
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Signed-off-by: Michal Hocko 
---
 drivers/xen/gntdev.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 57390c7666e5..e7d8bb1bee2a 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -519,21 +519,20 @@ static int mn_invl_range_start(struct mmu_notifier *mn,
struct gntdev_grant_map *map;
int ret = 0;
 
-   /* TODO do we really need a mutex here? */
if (blockable)
mutex_lock(&priv->lock);
else if (!mutex_trylock(&priv->lock))
return -EAGAIN;
 
list_for_each_entry(map, &priv->maps, next) {
-   if (in_range(map, start, end)) {
+   if (!blockable && in_range(map, start, end)) {
ret = -EAGAIN;
goto out_unlock;
}
unmap_if_in_range(map, start, end);
}
list_for_each_entry(map, &priv->freeable_maps, next) {
-   if (in_range(map, start, end)) {
+   if (!blockable && in_range(map, start, end)) {
ret = -EAGAIN;
goto out_unlock;
}
-- 
2.18.0



Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-25 Thread Michal Hocko
On Tue 24-07-18 12:53:07, Andrew Morton wrote:
[...]
> > On top of that the proposed cleanup looks as follows:
> > 
> 
> Looks good to me.  Seems a bit strange that we omit the pr_info()
> output if the mm was partially reaped - people would still want to know
> this?   Not very important though.

I think that having a single output once we are done is better but I do
not have a strong opinion on this.

Btw. here is the changelog for the cleanup.

"
Andrew has noticed some inconsistencies in oom_reap_task_mm. Notably
 - Undocumented return value.

 - comment "failed to reap part..." is misleading - sounds like it's
   referring to something which happened in the past, is in fact
   referring to something which might happen in the future.

 - fails to call trace_finish_task_reaping() in one case

 - code duplication.

 - Increases mmap_sem hold time a little by moving
   trace_finish_task_reaping() inside the locked region.  So sue me ;)

 - Sharing the finish: path means that the trace event won't
   distinguish between the two sources of finishing.

Add a short explanation for the return value and fix the rest by
reorganizing the function a bit to have unified function exit paths.

Suggested-by: Andrew Morton 
Signed-off-by: Michal Hocko 
"

-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-25 Thread Michal Hocko
On Tue 24-07-18 14:07:49, David Rientjes wrote:
[...]
> mm/oom_kill.c: clean up oom_reap_task_mm() fix
> 
> indicate reaping has been partially skipped so we can expect future skips 
> or another start before finish.

But we are not skipping. This is essentially the same case as mmap_sem
trylock fail. Maybe we can add a bool parameter to trace_finish_task_reaping
to denote partial success?

> Signed-off-by: David Rientjes 
> ---
>  mm/oom_kill.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -569,10 +569,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  
>   trace_start_task_reaping(tsk->pid);
>  
> - /* failed to reap part of the address space. Try again later */
>   ret = __oom_reap_task_mm(mm);
> - if (!ret)
> + if (!ret) {
> + /* Failed to reap part of the address space. Try again later */
> + trace_skip_task_reaping(tsk->pid);
>   goto out_finish;
> + }
>  
>   pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>   task_pid_nr(tsk), tsk->comm,

-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-24 Thread Michal Hocko
On Fri 20-07-18 17:09:02, Andrew Morton wrote:
[...]
> - Undocumented return value.
> 
> - comment "failed to reap part..." is misleading - sounds like it's
>   referring to something which happened in the past, is in fact
>   referring to something which might happen in the future.
> 
> - fails to call trace_finish_task_reaping() in one case
> 
> - code duplication.
> 
> - Increases mmap_sem hold time a little by moving
>   trace_finish_task_reaping() inside the locked region.  So sue me ;)
> 
> - Sharing the finish: path means that the trace event won't
>   distinguish between the two sources of finishing.
> 
> Please take a look?

oom_reap_task_mm should return false when __oom_reap_task_mm return
false. This is what my patch did but it seems this changed by
http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-oom-remove-oom_lock-from-oom_reaper.patch
so that one should be fixed.

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 104ef4a01a55..88657e018714 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -565,7 +565,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
/* failed to reap part of the address space. Try again later */
if (!__oom_reap_task_mm(mm)) {
up_read(&mm->mmap_sem);
-   return true;
+   return false;
}
 
pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",


On top of that the proposed cleanup looks as follows:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 88657e018714..4e185a282b3d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -541,8 +541,16 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
return ret;
 }
 
+/*
+ * Reaps the address space of the given task.
+ *
+ * Returns true on success and false if none or part of the address space
+ * has been reclaimed and the caller should retry later.
+ */
 static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
+   bool ret = true;
+
if (!down_read_trylock(&mm->mmap_sem)) {
trace_skip_task_reaping(tsk->pid);
return false;
@@ -555,28 +563,28 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 * down_write();up_write() cycle in exit_mmap().
 */
if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
-   up_read(&mm->mmap_sem);
trace_skip_task_reaping(tsk->pid);
-   return true;
+   goto out_unlock;
}
 
trace_start_task_reaping(tsk->pid);
 
/* failed to reap part of the address space. Try again later */
-   if (!__oom_reap_task_mm(mm)) {
-   up_read(&mm->mmap_sem);
-   return false;
-   }
+   ret = __oom_reap_task_mm(mm);
+   if (!ret)
+   goto out_finish;
 
pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
task_pid_nr(tsk), tsk->comm,
K(get_mm_counter(mm, MM_ANONPAGES)),
K(get_mm_counter(mm, MM_FILEPAGES)),
K(get_mm_counter(mm, MM_SHMEMPAGES)));
+out_finish:
+   trace_finish_task_reaping(tsk->pid);
+out_unlock:
up_read(&mm->mmap_sem);
 
-   trace_finish_task_reaping(tsk->pid);
-   return true;
+   return ret;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
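
(For context, the consumer of this return value is the retry loop on the
oom_reaper side. A from-memory sketch of roughly what that loop looks like
in this kernel, so treat the details as approximate rather than the exact
source:)

static void oom_reap_task(struct task_struct *tsk)
{
	int attempts = 0;
	struct mm_struct *mm = tsk->signal->oom_mm;

	/* keep retrying while oom_reap_task_mm() reports partial progress */
	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
		schedule_timeout_idle(HZ/10);

	/* after MAX_OOM_REAP_RETRIES failures the victim gets MMF_OOM_SKIP
	 * set so that another victim can be selected */
}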
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-23 Thread Michal Hocko
On Fri 20-07-18 16:01:25, Andrew Morton wrote:
> On Tue, 17 Jul 2018 10:12:01 +0200 Michal Hocko  wrote:
> 
> > > Any suggestions regarding how the driver developers can test this code
> > > path?  I don't think we presently have a way to fake an oom-killing
> > > event?  Perhaps we should add such a thing, given the problems we're
> > > having with that feature.
> > 
> > The simplest way is to wrap a userspace program which uses these notifiers
> > into a memcg and set the hard limit to hit the oom. This can be done
> > e.g. after the test faults in all the mmu notifier managed memory and
> > set the hard limit to something really small. Then we are looking for a
> > proper process tear down.
> 
> Chances are, some of the intended audience don't know how to do this
> and will either have to hunt down a lot of documentation or will just
> not test it.  But we want them to test it, so a little worked step-by-step
> example would help things along please.

I am willing to give more specific steps. Is anybody interested? From my
experience so far this is not something driver developers using mmu
notifiers would be unfamiliar with.

-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-23 Thread Michal Hocko
On Mon 23-07-18 09:11:54, Michal Hocko wrote:
> On Mon 23-07-18 09:03:06, Michal Hocko wrote:
> > On Fri 20-07-18 17:09:02, Andrew Morton wrote:
> > [...]
> > > Please take a look?
> > 
> > Are you OK to have these in a separate patch?
> 
> Btw. I will rebase this patch once oom stuff in linux-next settles. At
> least oom_lock removal from oom_reaper will conflict.

Hmm, I have just checked Andrew's akpm and the patch is already in and
Andrew has resolved the conflict with the oom_lock patch. It just seems
that linux-next (next-20180720) doesn't have the newest mmotm tree.

Anyway, I will go with the incremental cleanup patch per Andrew's
comments as soon as linux-next catches up.

-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-23 Thread Michal Hocko
On Mon 23-07-18 09:03:06, Michal Hocko wrote:
> On Fri 20-07-18 17:09:02, Andrew Morton wrote:
> [...]
> > Please take a look?
> 
> Are you OK to have these in a separate patch?

Btw. I will rebase this patch once oom stuff in linux-next settles. At
least oom_lock removal from oom_reaper will conflict.

-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-23 Thread Michal Hocko
On Fri 20-07-18 17:09:02, Andrew Morton wrote:
[...]
> Please take a look?

Are you OK to have these in a separate patch?

-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-17 Thread Michal Hocko
On Mon 16-07-18 16:12:49, Andrew Morton wrote:
> On Mon, 16 Jul 2018 13:50:58 +0200 Michal Hocko  wrote:
> 
> > From: Michal Hocko 
> > 
> > There are several blockable mmu notifiers which might sleep in
> > mmu_notifier_invalidate_range_start and that is a problem for the
> > oom_reaper because it needs to guarantee a forward progress so it cannot
> > depend on any sleepable locks.
> > 
> > Currently we simply back off and mark an oom victim with blockable mmu
> > notifiers as done after a short sleep. That can result in selecting a
> > new oom victim prematurely because the previous one still hasn't torn
> > its memory down yet.
> > 
> > We can do much better though. Even if mmu notifiers use sleepable locks
> > there is no reason to automatically assume those locks are held.
> > Moreover majority of notifiers only care about a portion of the address
> > space and there is absolutely zero reason to fail when we are unmapping an
> > unrelated range. Many notifiers do really block and wait for HW which is
> > harder to handle and we have to bail out though.
> > 
> > This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start
> > gets a blockable flag and callbacks are not allowed to sleep if the
> > flag is set to false. This is achieved by using trylock instead of the
> > sleepable lock for most callbacks and continue as long as we do not
> > block down the call chain.
> 
> I assume device driver developers are wondering "what does this mean
> for me".  As I understand it, the only time they will see
> blockable==false is when their driver is being called in response to an
> out-of-memory condition, yes?  So it is a very rare thing.

Yes, this is the case right now. Maybe we will grow other users in
future. Those other potential users are the reason why I used blockable
rather than oom as the parameter name.

> Any suggestions regarding how the driver developers can test this code
> path?  I don't think we presently have a way to fake an oom-killing
> event?  Perhaps we should add such a thing, given the problems we're
> having with that feature.

The simplest way is to wrap a userspace program which uses these notifiers
into a memcg and set the hard limit to hit the oom. This can be done
e.g. after the test faults in all the mmu notifier managed memory and
set the hard limit to something really small. Then we are looking for a
proper process tear down.
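
(To make that concrete, a minimal sketch of such a test. It assumes cgroup
v2 mounted at /sys/fs/cgroup and a root user; the cgroup name and the
1G/1M sizes are arbitrary, and the anonymous mmap is only a stand-in for
memory the tested mmu notifier user would actually register:)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define CG "/sys/fs/cgroup/mmu-notifier-test"

static void write_file(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0) {
		perror(path);
		exit(1);
	}
	close(fd);
}

int main(void)
{
	const size_t len = 1UL << 30;	/* 1G stand-in for notifier managed memory */
	char pid[16], *mem;
	size_t i;

	/* move ourselves into a fresh memcg */
	mkdir(CG, 0755);
	snprintf(pid, sizeof(pid), "%d", getpid());
	write_file(CG "/cgroup.procs", pid);

	mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* fault everything in first ... */
	for (i = 0; i < len; i += 4096)
		mem[i] = 1;

	/* ... then drop the hard limit below the usage so the memcg oom
	 * killer selects us; the tear down can then be watched in dmesg */
	write_file(CG "/memory.max", "1M");

	pause();
	return 0;
}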

> > I think we can improve that even further because there is a common
> > pattern to do a range lookup first and then do something about that.
> > The first part can be done without a sleeping lock in most cases AFAICS.
> > 
> > The oom_reaper end then simply retries if there is at least one notifier
> > which couldn't make any progress in !blockable mode. A retry loop is
> > already implemented to wait for the mmap_sem and this is basically the
> > same thing.
> > 
> > ...
> >
> > +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> > + unsigned long start, unsigned long end)
> > +{
> > +   int ret = 0;
> > +   if (mm_has_notifiers(mm))
> > +   ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
> > +
> > +   return ret;
> >  }
> 
> nit,
> 
> {
>   if (mm_has_notifiers(mm))
>   return __mmu_notifier_invalidate_range_start(mm, start, end, false);
>   return 0;
> }
> 
> would suffice.

Sure. Fixed
 
> > 
> > ...
> >
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm)
> >  * reliably test it.
> >  */
> > mutex_lock(&oom_lock);
> > -   __oom_reap_task_mm(mm);
> > +   (void)__oom_reap_task_mm(mm);
> > mutex_unlock(&oom_lock);
> 
> What does this do?

There is no error to be returned here as the comment above explains
 * Nothing can be holding mm->mmap_sem here and the above call
 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
 * __oom_reap_task_mm() will not block.
-- 
Michal Hocko
SUSE Labs


[Xen-devel] [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-16 Thread Michal Hocko
From: Michal Hocko 

There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks.

Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep. That can result in selecting a
new oom victim prematurely because the previous one still hasn't torn
its memory down yet.

We can do much better though. Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held.
Moreover majority of notifiers only care about a portion of the address
space and there is absolutely zero reason to fail when we are unmapping an
unrelated range. Many notifiers do really block and wait for HW which is
harder to handle and we have to bail out though.

This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start
gets a blockable flag and callbacks are not allowed to sleep if the
flag is set to false. This is achieved by using trylock instead of the
sleepable lock for most callbacks and continue as long as we do not
block down the call chain.

I think we can improve that even further because there is a common
pattern to do a range lookup first and then do something about that.
The first part can be done without a sleeping lock in most cases AFAICS.

The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode. A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.

Changes since rfc v1
- gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch
  on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure
  we bail out when we have an intersecting range for starter
- note in the log that a notifier failed, for easier debugging
- back off early in ib_umem_notifier_invalidate_range_start if the
  callback is called
- mn_invl_range_start waits for completion down the unmap_grant_pages
  path so we have to back off early on overlapping ranges

Cc: "David (ChunMing) Zhou" 
Cc: Paolo Bonzini 
Cc: "Radim Krčmář" 
Cc: Alex Deucher 
Cc: David Airlie 
Cc: Jani Nikula 
Cc: Joonas Lahtinen 
Cc: Rodrigo Vivi 
Cc: Doug Ledford 
Cc: Jason Gunthorpe 
Cc: Mike Marciniszyn 
Cc: Dennis Dalessandro 
Cc: Sudeep Dutt 
Cc: Ashutosh Dixit 
Cc: Dimitri Sivanich 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: "Jérôme Glisse" 
Cc: Andrea Arcangeli 
Cc: Felix Kuehling 
Cc: k...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: amd-...@lists.freedesktop.org
Cc: dri-de...@lists.freedesktop.org
Cc: intel-...@lists.freedesktop.org
Cc: linux-r...@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Cc: linux...@kvack.org
Acked-by: Christian König  # AMD notifiers
Acked-by: Leon Romanovsky  # mlx and umem_odp
Reported-by: David Rientjes 
Signed-off-by: Michal Hocko 
---

Hi,
there were no major objections when I sent this as an RFC the last time
[1]. I was hoping for more feedback in the drivers land because I am
touching the code I have no way to test. On the other hand the pattern
is quite simple and consistent over all users so there shouldn't be
any large surprises hopefully.

Any further review would be highly appreciated of course. But is this
something to put into the mm tree now?

[1] http://lkml.kernel.org/r/20180627074421.gf32...@dhcp22.suse.cz
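
(For readers heading into the hunks below: every conversion follows the
same basic pattern, summarized by this minimal sketch. example_ctx and its
mutex are made-up names for illustration, not any real driver's
structures; the per-driver details are in the diff itself.)

static int example_invalidate_range_start(struct mmu_notifier *mn,
					  struct mm_struct *mm,
					  unsigned long start,
					  unsigned long end,
					  bool blockable)
{
	struct example_ctx *ctx = container_of(mn, struct example_ctx, mn);

	if (blockable)
		mutex_lock(&ctx->lock);
	else if (!mutex_trylock(&ctx->lock))
		return -EAGAIN;	/* let the oom_reaper back off and retry */

	/* invalidate whatever intersects [start, end) without taking any
	 * further sleeping locks when blockable is false */

	mutex_unlock(&ctx->lock);
	return 0;
}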


 arch/x86/kvm/x86.c  |  7 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 43 +++-
 drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++--
 drivers/gpu/drm/radeon/radeon_mn.c  | 22 +++--
 drivers/infiniband/core/umem_odp.c  | 33 +++
 drivers/infiniband/hw/hfi1/mmu_rb.c | 11 ---
 drivers/infiniband/hw/mlx5/odp.c|  2 +-
 drivers/misc/mic/scif/scif_dma.c|  7 ++--
 drivers/misc/sgi-gru/grutlbpurge.c  |  7 ++--
 drivers/xen/gntdev.c| 44 -
 include/linux/kvm_host.h|  4 +--
 include/linux/mmu_notifier.h| 35 +++-
 include/linux/oom.h |  2 +-
 include/rdma/ib_umem_odp.h  |  3 +-
 mm/hmm.c|  7 ++--
 mm/mmap.c   |  2 +-
 mm/mmu_notifier.c   | 19 ---
 mm/oom_kill.c   | 29 
 virt/kvm/kvm_main.c | 15 ++---
 19 files changed, 225 insertions(+), 80 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6bcecc325e7e..ac08f5d711be 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
 }
 
-void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-   unsigned lo

Re: [Xen-devel] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-11 Thread Michal Hocko
On Wed 11-07-18 13:14:47, Leon Romanovsky wrote:
> On Wed, Jul 11, 2018 at 11:03:53AM +0200, Michal Hocko wrote:
> > On Tue 10-07-18 19:20:20, Leon Romanovsky wrote:
> > > On Tue, Jul 10, 2018 at 04:14:10PM +0200, Michal Hocko wrote:
> > > > On Tue 10-07-18 16:40:40, Leon Romanovsky wrote:
> > > > > On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote:
> > > > > > On Wed 27-06-18 09:44:21, Michal Hocko wrote:
> > > > > > > This is the v2 of RFC based on the feedback I've received so far. 
> > > > > > > The
> > > > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, 
> > > > > > > mostly
> > > > > > > because I have no idea how.
> > > > > > >
> > > > > > > Any further feedback is highly appreciated of course.
> > > > > >
> > > > > > Any other feedback before I post this as non-RFC?
> > > > >
> > > > > From mlx5 perspective, who is primary user of umem_odp.c your change 
> > > > > looks ok.
> > > >
> > > > Can I assume your Acked-by?
> > >
> > > I didn't have a chance to test it because it applies on our rdma-next, but
> > > fails to compile.
> >
> > What is the compilation problem? Is it caused by the patch or some other
> > unrelated changed?
> 
> Thanks for pushing me to take a look on it.
> Your patch needs the following hunk to properly compile at least on my system.

I suspect you were trying the original version. I've posted an updated
patch here http://lkml.kernel.org/r/20180627074421.gf32...@dhcp22.suse.cz
and all these issues should be fixed there. Including many other fixes.

Could you have a look at that one please?
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-11 Thread Michal Hocko
On Tue 10-07-18 19:20:20, Leon Romanovsky wrote:
> On Tue, Jul 10, 2018 at 04:14:10PM +0200, Michal Hocko wrote:
> > On Tue 10-07-18 16:40:40, Leon Romanovsky wrote:
> > > On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote:
> > > > On Wed 27-06-18 09:44:21, Michal Hocko wrote:
> > > > > This is the v2 of RFC based on the feedback I've received so far. The
> > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, 
> > > > > mostly
> > > > > because I have no idea how.
> > > > >
> > > > > Any further feedback is highly appreciated of course.
> > > >
> > > > Any other feedback before I post this as non-RFC?
> > >
> > > From mlx5 perspective, who is primary user of umem_odp.c your change 
> > > looks ok.
> >
> > Can I assume your Acked-by?
> 
> I didn't have a chance to test it because it applies on our rdma-next, but
> fails to compile.

What is the compilation problem? Is it caused by the patch or some other
unrelated changed?
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-10 Thread Michal Hocko
On Tue 10-07-18 16:40:40, Leon Romanovsky wrote:
> On Mon, Jul 09, 2018 at 02:29:08PM +0200, Michal Hocko wrote:
> > On Wed 27-06-18 09:44:21, Michal Hocko wrote:
> > > This is the v2 of RFC based on the feedback I've received so far. The
> > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> > > because I have no idea how.
> > >
> > > Any further feedback is highly appreciated of course.
> >
> > Any other feedback before I post this as non-RFC?
> 
> From mlx5 perspective, who is primary user of umem_odp.c your change looks ok.

Can I assume your Acked-by?

Thanks for your review!
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-09 Thread Michal Hocko
On Wed 27-06-18 09:44:21, Michal Hocko wrote:
> This is the v2 of RFC based on the feedback I've received so far. The
> code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> because I have no idea how.
> 
> Any further feedback is highly appreciated of course.

Any other feedback before I post this as non-RFC?

> ---
> From ec9a7241bf422b908532c4c33953b0da2655ad05 Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Wed, 20 Jun 2018 15:03:20 +0200
> Subject: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> There are several blockable mmu notifiers which might sleep in
> mmu_notifier_invalidate_range_start and that is a problem for the
> oom_reaper because it needs to guarantee a forward progress so it cannot
> depend on any sleepable locks.
> 
> Currently we simply back off and mark an oom victim with blockable mmu
> notifiers as done after a short sleep. That can result in selecting a
> new oom victim prematurely because the previous one still hasn't torn
> its memory down yet.
> 
> We can do much better though. Even if mmu notifiers use sleepable locks
> there is no reason to automatically assume those locks are held.
> Moreover majority of notifiers only care about a portion of the address
> space and there is absolutely zero reason to fail when we are unmapping an
> unrelated range. Many notifiers do really block and wait for HW which is
> harder to handle and we have to bail out though.
> 
> This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start
> gets a blockable flag and callbacks are not allowed to sleep if the
> flag is set to false. This is achieved by using trylock instead of the
> sleepable lock for most callbacks and continue as long as we do not
> block down the call chain.
> 
> I think we can improve that even further because there is a common
> pattern to do a range lookup first and then do something about that.
> The first part can be done without a sleeping lock in most cases AFAICS.
> 
> The oom_reaper end then simply retries if there is at least one notifier
> which couldn't make any progress in !blockable mode. A retry loop is
> already implemented to wait for the mmap_sem and this is basically the
> same thing.
> 
> Changes since rfc v1
> - gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch
>   on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure
>   we bail out when we have an intersecting range for starter
> - note in the log that a notifier failed, for easier debugging
> - back off early in ib_umem_notifier_invalidate_range_start if the
>   callback is called
> - mn_invl_range_start waits for completion down the unmap_grant_pages
>   path so we have to back off early on overlapping ranges
> 
> Cc: "David (ChunMing) Zhou" 
> Cc: Paolo Bonzini 
> Cc: "Radim Krčmář" 
> Cc: Alex Deucher 
> Cc: "Christian König" 
> Cc: David Airlie 
> Cc: Jani Nikula 
> Cc: Joonas Lahtinen 
> Cc: Rodrigo Vivi 
> Cc: Doug Ledford 
> Cc: Jason Gunthorpe 
> Cc: Mike Marciniszyn 
> Cc: Dennis Dalessandro 
> Cc: Sudeep Dutt 
> Cc: Ashutosh Dixit 
> Cc: Dimitri Sivanich 
> Cc: Boris Ostrovsky 
> Cc: Juergen Gross 
> Cc: "Jérôme Glisse" 
> Cc: Andrea Arcangeli 
> Cc: Felix Kuehling 
> Cc: k...@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86))
> Cc: linux-ker...@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
> Cc: amd-...@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS)
> Cc: dri-de...@lists.freedesktop.org (open list:DRM DRIVERS)
> Cc: intel-...@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...)
> Cc: linux-r...@vger.kernel.org (open list:INFINIBAND SUBSYSTEM)
> Cc: xen-devel@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE)
> Cc: linux...@kvack.org (open list:HMM - Heterogeneous Memory Management)
> Reported-by: David Rientjes 
> Signed-off-by: Michal Hocko 
> ---
>  arch/x86/kvm/x86.c  |  7 ++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 43 +++-
>  drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++--
>  drivers/gpu/drm/radeon/radeon_mn.c  | 22 +++--
>  drivers/infiniband/core/umem_odp.c  | 33 +++
>  drivers/infiniband/hw/hfi1/mmu_rb.c | 11 ---
>  drivers/infiniband/hw/mlx5/odp.c|  2 +-
>  drivers/misc/mic/scif/scif_dma.c|  7 ++--
>  drivers/misc/sgi-gru/grutlbpurge.c  |  7 ++--
>  drivers/xen/gntdev.c| 44 -
>

Re: [Xen-devel] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-02 Thread Michal Hocko
On Mon 02-07-18 14:39:50, Christian König wrote:
[...]
> Not wanting to block something as important as this, so feel free to add an
> Acked-by: Christian König  to the patch.

Thanks a lot!

> Let's rather face the next topic: Any idea how to runtime test this?

This is a good question indeed. One way to do that would be triggering
the OOM killer from the context which uses each of these mmu notifiers
(one at a time) and see how that works. You would see the note in the
log whenever the notifier would block. The primary thing to test is how
often the oom reaper really had to back off completely.

> I mean I can rather easily provide a test which crashes an AMD GPU, which in
> turn then would mean that the MMU notifier would block forever without this
> patch.

Well, you do not really have to go that far. It should be sufficient to
do the above. The current code would simply back off without releasing
any memory. The patch should help to reclaim some memory.
 
> But do you know a way to let the OOM killer kill a specific process?

Yes, you can set its oom_score_adj to 1000 which means always select
that task.
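
(For illustration, the victim can mark itself; /proc/<pid>/oom_score_adj
is the standard interface and the pause() is just a stand-in for the real
notifier-exercising workload:)

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	FILE *f = fopen("/proc/self/oom_score_adj", "w");

	/* 1000 makes the oom killer always pick this task */
	if (!f || fprintf(f, "1000\n") < 0 || fclose(f))
		return 1;

	pause();	/* run the real workload here instead */
	return 0;
}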
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-02 Thread Michal Hocko
On Mon 02-07-18 14:24:29, Christian König wrote:
> Am 02.07.2018 um 14:20 schrieb Michal Hocko:
> > On Mon 02-07-18 14:13:42, Christian König wrote:
> > > Am 02.07.2018 um 13:54 schrieb Michal Hocko:
> > > > On Mon 02-07-18 11:14:58, Christian König wrote:
> > > > > Am 27.06.2018 um 09:44 schrieb Michal Hocko:
> > > > > > This is the v2 of RFC based on the feedback I've received so far. 
> > > > > > The
> > > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, 
> > > > > > mostly
> > > > > > because I have no idea how.
> > > > > > 
> > > > > > Any further feedback is highly appreciated of course.
> > > > > That sounds like it should work and at least the amdgpu changes now 
> > > > > look
> > > > > good to me on first glance.
> > > > > 
> > > > > Can you split that up further in the usual way? E.g. adding the 
> > > > > blockable
> > > > > flag in one patch and fixing all implementations of the MMU notifier 
> > > > > in
> > > > > follow up patches.
> > > > But such a code would be broken, no? Ignoring the blockable state will
> > > > simply lead to lockups until the fixup parts get applied.
> > > Well to still be bisect-able you only need to get the interface change in
> > > first with fixing the function signature of the implementations.
> > That would only work if those functions return -EAGAIN unconditionally.
> > Otherwise they would pretend to not block while that would be obviously
> > incorrect. This doesn't sound correct to me.
> > 
> > > Then add all the new code to the implementations and last start to 
> > > actually
> > > use the new interface.
> > > 
> > > That is a pattern we use regularly and I think it's good practice to do
> > > this.
> > But we do rely on the proper blockable handling.
> 
> Yeah, but you could add the handling only after you have all the
> implementations in place. Don't you?

Yeah, but then I would be adding code with no user. And I really
prefer not to do so because then the code is harder to argue about.

> > > > Is the split up really worth it? I was thinking about that but had a hard
> > > > time ending up with something that would be bisectable. Well, except
> > > > for returning -EBUSY until all notifiers are implemented. Which I found
> > > > confusing.
> > > It at least makes reviewing changes much easier, cause as driver 
> > > maintainer
> > > I can concentrate on the stuff only related to me.
> > > 
> > > Additional to that when you cause some unrelated side effect in a driver 
> > > we
> > > can much easier pinpoint the actual change later on when the patch is
> > > smaller.
> > > 
> > > > > This way I'm pretty sure Felix and I can give an rb on the 
> > > > > amdgpu/amdkfd
> > > > > changes.
> > > > If you are worried to give r-b only for those then this can be done even
> > > > for larger patches. Just make your Reviewed-by more specific
> > > > R-b: name # For BLA BLA
> > > Yeah, possible alternative but more work for me when I review it :)
> > I definitely do not want to add more work to reviewers and I completely
> > see how massive "flag days" like these are not popular but I really
> > didn't find a reasonable way around that would be both correct and
> > wouldn't add much more churn on the way. So if you really insist then I
> > would really appreciate a hint on the way to achieve the same without any
> > above downsides.
> 
> Well, I don't insist on this. It's just from my point of view that this
> patch doesn't needs to be one patch, but could be split up.

Well, if there are more people with the same concern I can try to do
that. But if your only concern is to focus on your particular part then
I guess it would be easier both for you and me to simply apply the patch
and use git show $files_for_your_subsystem on your end. I have put the
patch to attempts/oom-vs-mmu-notifiers branch to my tree at
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-07-02 Thread Michal Hocko
On Mon 02-07-18 14:13:42, Christian König wrote:
> Am 02.07.2018 um 13:54 schrieb Michal Hocko:
> > On Mon 02-07-18 11:14:58, Christian König wrote:
> > > Am 27.06.2018 um 09:44 schrieb Michal Hocko:
> > > > This is the v2 of RFC based on the feedback I've received so far. The
> > > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> > > > because I have no idea how.
> > > > 
> > > > Any further feedback is highly appreciated of course.
> > > That sounds like it should work and at least the amdgpu changes now look
> > > good to me on first glance.
> > > 
> > > Can you split that up further in the usual way? E.g. adding the blockable
> > > flag in one patch and fixing all implementations of the MMU notifier in
> > > follow up patches.
> > But such a code would be broken, no? Ignoring the blockable state will
> > simply lead to lockups until the fixup parts get applied.
> 
> Well to still be bisect-able you only need to get the interface change in
> first with fixing the function signature of the implementations.

That would only work if those functions return -EAGAIN unconditionally.
Otherwise they would pretend to not block while that would be obviously
incorrect. This doesn't sound correct to me.

> Then add all the new code to the implementations and last start to actually
> use the new interface.
> 
> That is a pattern we use regularly and I think it's good practice to do
> this.

But we do rely on the proper blockable handling.

> > Is the split up really worth it? I was thinking about that but had a hard
> > time ending up with something that would be bisectable. Well, except
> > for returning -EBUSY until all notifiers are implemented. Which I found
> > confusing.
> 
> It at least makes reviewing changes much easier, cause as driver maintainer
> I can concentrate on the stuff only related to me.
> 
> Additional to that when you cause some unrelated side effect in a driver we
> can much easier pinpoint the actual change later on when the patch is
> smaller.
> 
> > 
> > > This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
> > > changes.
> > If you are worried to give r-b only for those then this can be done even
> > for larger patches. Just make your Reviewed-by more specific
> > R-b: name # For BLA BLA
> 
> Yeah, possible alternative but more work for me when I review it :)

I definitely do not want to add more work to reviewers and I completely
see how massive "flag days" like these are not popular but I really
didn't find a reasonable way around that would be both correct and
wouldn't add much more churn on the way. So if you really insist then I
would really appreciate a hint on the way to achieve the same without any
above downsides.
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-06-27 Thread Michal Hocko
This is the v2 of RFC based on the feedback I've received so far. The
code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
because I have no idea how.

Any further feedback is highly appreciated of course.
---
From ec9a7241bf422b908532c4c33953b0da2655ad05 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Wed, 20 Jun 2018 15:03:20 +0200
Subject: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks.

Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep. That can result in selecting a
new oom victim prematurely because the previous one still hasn't torn
its memory down yet.

We can do much better though. Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held.
Moreover majority of notifiers only care about a portion of the address
space and there is absolutely zero reason to fail when we are unmapping an
unrelated range. Many notifiers do really block and wait for HW which is
harder to handle and we have to bail out though.

This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start
gets a blockable flag and callbacks are not allowed to sleep if the
flag is set to false. This is achieved by using trylock instead of the
sleepable lock for most callbacks and continue as long as we do not
block down the call chain.

I think we can improve that even further because there is a common
pattern to do a range lookup first and then do something about that.
The first part can be done without a sleeping lock in most cases AFAICS.

The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode. A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.

Changes since rfc v1
- gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch
  on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure
  we bail out when we have an intersecting range for starter
- note in the log that a notifier failed, for easier debugging
- back off early in ib_umem_notifier_invalidate_range_start if the
  callback is called
- mn_invl_range_start waits for completion down the unmap_grant_pages
  path so we have to back off early on overlapping ranges

Cc: "David (ChunMing) Zhou" 
Cc: Paolo Bonzini 
Cc: "Radim Krčmář" 
Cc: Alex Deucher 
Cc: "Christian König" 
Cc: David Airlie 
Cc: Jani Nikula 
Cc: Joonas Lahtinen 
Cc: Rodrigo Vivi 
Cc: Doug Ledford 
Cc: Jason Gunthorpe 
Cc: Mike Marciniszyn 
Cc: Dennis Dalessandro 
Cc: Sudeep Dutt 
Cc: Ashutosh Dixit 
Cc: Dimitri Sivanich 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: "Jérôme Glisse" 
Cc: Andrea Arcangeli 
Cc: Felix Kuehling 
Cc: k...@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86))
Cc: linux-ker...@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
Cc: amd-...@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS)
Cc: dri-de...@lists.freedesktop.org (open list:DRM DRIVERS)
Cc: intel-...@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...)
Cc: linux-r...@vger.kernel.org (open list:INFINIBAND SUBSYSTEM)
Cc: xen-devel@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE)
Cc: linux...@kvack.org (open list:HMM - Heterogeneous Memory Management)
Reported-by: David Rientjes 
Signed-off-by: Michal Hocko 
---
 arch/x86/kvm/x86.c  |  7 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 43 +++-
 drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++--
 drivers/gpu/drm/radeon/radeon_mn.c  | 22 +++--
 drivers/infiniband/core/umem_odp.c  | 33 +++
 drivers/infiniband/hw/hfi1/mmu_rb.c | 11 ---
 drivers/infiniband/hw/mlx5/odp.c|  2 +-
 drivers/misc/mic/scif/scif_dma.c|  7 ++--
 drivers/misc/sgi-gru/grutlbpurge.c  |  7 ++--
 drivers/xen/gntdev.c| 44 -
 include/linux/kvm_host.h|  4 +--
 include/linux/mmu_notifier.h| 35 +++-
 include/linux/oom.h |  2 +-
 include/rdma/ib_umem_odp.h  |  3 +-
 mm/hmm.c|  7 ++--
 mm/mmap.c   |  2 +-
 mm/mmu_notifier.c   | 19 ---
 mm/oom_kill.c   | 29 
 virt/kvm/kvm_main.c | 15 ++---
 19 files changed, 225 insertions(+), 80 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6bcecc325e7e

Re: [Xen-devel] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-06-25 Thread Michal Hocko
On Mon 25-06-18 10:01:03, Michal Hocko wrote:
> On Fri 22-06-18 16:09:06, Felix Kuehling wrote:
> > On 2018-06-22 11:24 AM, Michal Hocko wrote:
> > > On Fri 22-06-18 17:13:02, Christian König wrote:
> > >> Hi Michal,
> > >>
> > >> [Adding Felix as well]
> > >>
> > >> Well first of all you have a misconception why at least the AMD graphics
> > >> driver need to be able to sleep in an MMU notifier: We need to sleep 
> > >> because
> > >> we need to wait for hardware operations to finish and *NOT* because we 
> > >> need
> > >> to wait for locks.
> > >>
> > >> I'm not sure if your flag now means that you generally can't sleep in MMU
> > >> notifiers any more, but if that's the case at least AMD hardware will 
> > >> break
> > >> badly. In our case the approach of waiting for a short time for the 
> > >> process
> > >> to be reaped and then select another victim actually sounds like the 
> > >> right
> > >> thing to do.
> > > Well, I do not need to make the notifier code non blocking all the time.
> > > All I need is to ensure that it won't sleep if the flag says so and
> > > return -EAGAIN instead.
> > >
> > > So here is what I do for amdgpu:
> > 
> > In the case of KFD we also need to take the DQM lock:
> > 
> > amdgpu_mn_invalidate_range_start_hsa -> amdgpu_amdkfd_evict_userptr ->
> > kgd2kfd_quiesce_mm -> kfd_process_evict_queues -> evict_process_queues_cpsch
> > 
> > So we'd need to pass the blockable parameter all the way through that
> > call chain.
> 
> Thanks, I have missed that part. So I guess I will start with something
> similar to intel-gfx and back off when the current range needs some
> treatment. So this on top. Does it look correct?
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index d138a526feff..e2d422b3eb0b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -266,6 +266,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   struct amdgpu_mn_node *node;
>   struct amdgpu_bo *bo;
>  
> + if (!blockable) {
> + amdgpu_mn_read_unlock(rmn);
> + return -EAGAIN;
> + }
> +
>   node = container_of(it, struct amdgpu_mn_node, it);
>   it = interval_tree_iter_next(it, start, end);

Ble, just noticed that half of the change didn't get to git index...
This is what I have
commit c4701b36ac2802b903db3d05cf77c030fccce3a8
Author: Michal Hocko 
Date:   Mon Jun 25 15:24:03 2018 +0200

fold me

- amd gpu notifiers can sleep deeper in the callchain 
(evict_process_queues_cpsch
  on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure
  we bail out when we have an intersecting range for starter

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index d138a526feff..3399a4a927fb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -225,6 +225,11 @@ static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
while (it) {
struct amdgpu_mn_node *node;
 
+   if (!blockable) {
+   amdgpu_mn_read_unlock(rmn);
+   return -EAGAIN;
+   }
+
node = container_of(it, struct amdgpu_mn_node, it);
it = interval_tree_iter_next(it, start, end);
 
@@ -266,6 +271,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
struct amdgpu_mn_node *node;
struct amdgpu_bo *bo;
 
+   if (!blockable) {
+   amdgpu_mn_read_unlock(rmn);
+   return -EAGAIN;
+   }
+
node = container_of(it, struct amdgpu_mn_node, it);
it = interval_tree_iter_next(it, start, end);
 
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-06-25 Thread Michal Hocko
On Fri 22-06-18 16:09:06, Felix Kuehling wrote:
> On 2018-06-22 11:24 AM, Michal Hocko wrote:
> > On Fri 22-06-18 17:13:02, Christian König wrote:
> >> Hi Michal,
> >>
> >> [Adding Felix as well]
> >>
> >> Well first of all you have a misconception why at least the AMD graphics
> >> driver need to be able to sleep in an MMU notifier: We need to sleep 
> >> because
> >> we need to wait for hardware operations to finish and *NOT* because we need
> >> to wait for locks.
> >>
> >> I'm not sure if your flag now means that you generally can't sleep in MMU
> >> notifiers any more, but if that's the case at least AMD hardware will break
> >> badly. In our case the approach of waiting for a short time for the process
> >> to be reaped and then select another victim actually sounds like the right
> >> thing to do.
> > Well, I do not need to make the notifier code non blocking all the time.
> > All I need is to ensure that it won't sleep if the flag says so and
> > return -EAGAIN instead.
> >
> > So here is what I do for amdgpu:
> 
> In the case of KFD we also need to take the DQM lock:
> 
> amdgpu_mn_invalidate_range_start_hsa -> amdgpu_amdkfd_evict_userptr ->
> kgd2kfd_quiesce_mm -> kfd_process_evict_queues -> evict_process_queues_cpsch
> 
> So we'd need to pass the blockable parameter all the way through that
> call chain.

Thanks, I have missed that part. So I guess I will start with something
similar to intel-gfx and back off when the current range needs some
treatment. So this on top. Does it look correct?

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index d138a526feff..e2d422b3eb0b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -266,6 +266,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
struct amdgpu_mn_node *node;
struct amdgpu_bo *bo;
 
+   if (!blockable) {
+   amdgpu_mn_read_unlock(rmn);
+   return -EAGAIN;
+   }
+
node = container_of(it, struct amdgpu_mn_node, it);
it = interval_tree_iter_next(it, start, end);
 
-- 
Michal Hocko
SUSE Labs


Re: [Xen-devel] [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-06-22 Thread Michal Hocko
[Resending with the CC list fixed]

On Fri 22-06-18 18:40:26, Michal Hocko wrote:
> On Fri 22-06-18 12:18:46, Jerome Glisse wrote:
> > On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote:
> > > On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > > > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > > > Hi,
> > > > > this is an RFC and not tested at all. I am not very familiar with the
> > > > > mmu notifier semantics so this is a crude attempt to achieve
> > > > > what I need basically. It might be completely wrong but I would like
> > > > > to discuss what would be a better way if that is the case.
> > > > > 
> > > > > get_maintainers gave me quite large list of people to CC so I had to 
> > > > > trim
> > > > > it down. If you think I have forgot somebody, please let me know
> > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > index 854bd51b9478..5285df9331fa 100644
> > > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> > > > >  	mo->attached = false;
> > > > >  }
> > > > >  
> > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > >  							struct mm_struct *mm,
> > > > >  							unsigned long start,
> > > > > -							unsigned long end)
> > > > > +							unsigned long end,
> > > > > +							bool blockable)
> > > > >  {
> > > > >  	struct i915_mmu_notifier *mn =
> > > > >  		container_of(_mn, struct i915_mmu_notifier, mn);
> > > > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > >  	LIST_HEAD(cancelled);
> > > > >  
> > > > >  	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > > > > -		return;
> > > > > +		return 0;
> > > > 
> > > > The principal wait here is for the HW (even after fixing all the locks
> > > > to be less coarse, we still have to wait for the HW to finish its
> > > > access).
> > > 
> > > Is this wait bounded, or can it take a basically arbitrary amount of time?
> > 
> > An arbitrary amount of time, but in the desktop use case you can assume
> > that it should never go above 16ms for 60-frames-per-second rendering of
> > your desktop (in the GPU compute case this kind of assumption does not
> > hold). Is the process exit_state already updated by the time these mmu
> > notifier callbacks happen?
> 
> What do you mean? The process is killed (by SIGKILL) at the time but we
> do not know much more than that. The task might be stuck anywhere in the
> kernel before handling that signal.
> 
> > > > The first pass would then be to not do anything here if
> > > > !blockable.
> > > 
> > > something like this? (incremental diff)
> > 
> > What I wanted to do with HMM and mmu notifiers is split the invalidation
> > into 2 passes. The first pass tells the drivers to stop/cancel pending
> > jobs that depend on the range and invalidates internal driver state (like
> > clearing the buffer object pages array in the GPU case, but not the GPU
> > page table), while the second callback would do the actual wait for the
> > GPU to be done and update the GPU page table.
> 
> What can you do after the first phase? Can I unmap the range?
> 
> > Now in this scheme, in case the task is already in some exit state and
> > all CPU threads are frozen/killed, we can probably find a way to do the
> > first pass mostly locklessly. AFAICR neither AMD nor Intel allow sharing
> > a userptr bo, hence a userptr bo should only ever be accessed through
Re: [Xen-devel] [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-06-22 Thread Michal Hocko
[Hmm, the cc list got mangled somehow - you have just made many people
work for suse ;) and for kvack.org in the previous one - fixed up
hopefully]

On Fri 22-06-18 17:07:21, Chris Wilson wrote:
> Quoting Michal Hocko (2018-06-22 16:57:16)
> > On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > > Hi,
> > > > this is an RFC and not tested at all. I am not very familiar with the
> > > > mmu notifier semantics, so this is a crude attempt to achieve what I
> > > > need, basically. It might be completely wrong but I would like to
> > > > discuss what would be a better way if that is the case.
> > > > 
> > > > get_maintainers gave me quite a large list of people to CC so I had to
> > > > trim it down. If you think I have forgotten somebody, please let me know
> > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > index 854bd51b9478..5285df9331fa 100644
> > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> > > >  	mo->attached = false;
> > > >  }
> > > >  
> > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > >  							struct mm_struct *mm,
> > > >  							unsigned long start,
> > > > -							unsigned long end)
> > > > +							unsigned long end,
> > > > +							bool blockable)
> > > >  {
> > > >  	struct i915_mmu_notifier *mn =
> > > >  		container_of(_mn, struct i915_mmu_notifier, mn);
> > > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > >  	LIST_HEAD(cancelled);
> > > >  
> > > >  	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > > > -		return;
> > > > +		return 0;
> > > 
> > > The principal wait here is for the HW (even after fixing all the locks
> > > to be less coarse, we still have to wait for the HW to finish its
> > > access).
> > 
> > Is this wait bounded, or can it take a basically arbitrary amount of time?
> 
> Arbitrary. It waits for the last operation in the queue that needs that
> set of backing pages, and that queue is unbounded and not even confined
> to the local driver. (Though each operation should be bounded to be
> completed within an interval or be cancelled, that interval is on the
> order of 10s!)

OK, I see. We should rather not wait that long, so backing off is just
better. The whole point of the oom_reaper is to tear down and free some
memory; we do not really need to reclaim all of it.

It would be great if we could do something like kick off the teardown of
the device memory but have it done in the background. We wouldn't tear
the vma down in that case, but the whole process would at least start.
I am not sure something like that is possible.
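Purely as an illustration of that idea (nothing like this exists in the
patch, and every name below is made up):

struct bo_teardown {
	struct work_struct work;
	struct gpu_buffer_object *bo;	/* hypothetical driver object */
};

static void bo_teardown_work(struct work_struct *work)
{
	struct bo_teardown *td = container_of(work, struct bo_teardown, work);

	/* The unbounded hardware wait happens here, off the OOM path. */
	gpu_wait_for_idle(td->bo);		/* hypothetical */
	gpu_release_backing_pages(td->bo);	/* hypothetical */
	kfree(td);
}

static int invalidate_range_nonblock(struct bo_teardown *td)
{
	/* Kick the teardown and return immediately; the oom_reaper can
	 * move on while the device drains in the background. */
	INIT_WORK(&td->work, bo_teardown_work);
	schedule_work(&td->work);
	return 0;
}

The catch is the one noted above: the vma itself cannot be torn down until
the background work completes, so this only helps when most of the
reclaimable memory is not pinned behind the device wait.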
 
> > > The first pass would then be to not do anything here if
> > > !blockable.
> > 
> > something like this? (incremental diff)
> 
> Yup.

Cool, I will start with that because even that is an improvement from
the oom_reaper POV.

Thanks!
-- 
Michal Hocko
SUSE Labs
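For completeness, the consumer side of this bargain as seen from the
oom_reaper, sketched against the mm of that era (a simplification; the
_nonblock helper and other details are assumptions, not the actual
mm/oom_kill.c code):

static bool __oom_reap_task_mm(struct mm_struct *mm)
{
	struct vm_area_struct *vma;

	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		struct mmu_gather tlb;

		if (!can_madv_dontneed_vma(vma))
			continue;

		/* Ask the notifiers in non-blocking mode first. */
		if (mmu_notifier_invalidate_range_start_nonblock(mm,
				vma->vm_start, vma->vm_end)) {
			/* Some notifier said -EAGAIN: bail out and let
			 * the reaper retry the whole mm later. */
			return false;
		}

		tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);
		unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end, NULL);
		mmu_notifier_invalidate_range_end(mm, vma->vm_start,
						  vma->vm_end);
		tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end);
	}

	return true;
}

Backing off instead of sleeping keeps the reaper's guarantee intact: it may
fail to reclaim a particular mm on a given pass, but it can never be blocked
indefinitely by a driver.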

