Re: [PATCH v2 6/6] mm: export dump_mm()

2023-01-25 Thread Michal Hocko
On Wed 25-01-23 00:38:51, Suren Baghdasaryan wrote:
> mmap_assert_write_locked() is used in vm_flags modifiers. Because
> mmap_assert_write_locked() uses dump_mm() and vm_flags are sometimes
> modified from inside a module, it's necessary to export the dump_mm()
> function.
> 
> Signed-off-by: Suren Baghdasaryan 

Acked-by: Michal Hocko 

> ---
>  mm/debug.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/debug.c b/mm/debug.c
> index 9d3d893dc7f4..96d594e16292 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -215,6 +215,7 @@ void dump_mm(const struct mm_struct *mm)
>   mm->def_flags, &mm->def_flags
>   );
>  }
> +EXPORT_SYMBOL(dump_mm);
>  
>  static bool page_init_poisoning __read_mostly = true;
>  
> -- 
> 2.39.1

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 2/6] mm: replace VM_LOCKED_CLEAR_MASK with VM_LOCKED_MASK

2023-01-25 Thread Michal Hocko
On Wed 25-01-23 00:38:47, Suren Baghdasaryan wrote:
> To simplify the usage of VM_LOCKED_CLEAR_MASK in clear_vm_flags(),
> replace it with the VM_LOCKED_MASK bitmask and convert all users.
>
> Signed-off-by: Suren Baghdasaryan 

Acked-by: Michal Hocko 

> ---
>  include/linux/mm.h | 4 ++--
>  kernel/fork.c  | 2 +-
>  mm/hugetlb.c   | 4 ++--
>  mm/mlock.c | 6 +++---
>  mm/mmap.c  | 6 +++---
>  mm/mremap.c| 2 +-
>  6 files changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b71f2809caac..da62bdd627bf 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -421,8 +421,8 @@ extern unsigned int kobjsize(const void *objp);
>  /* This mask defines which mm->def_flags a process can inherit its parent */
>  #define VM_INIT_DEF_MASK VM_NOHUGEPAGE
>  
> -/* This mask is used to clear all the VMA flags used by mlock */
> -#define VM_LOCKED_CLEAR_MASK (~(VM_LOCKED | VM_LOCKONFAULT))
> +/* This mask represents all the VMA flag bits used by mlock */
> +#define VM_LOCKED_MASK   (VM_LOCKED | VM_LOCKONFAULT)
>  
> /* Arch-specific flags to clear when updating VM flags on protection change */
>  #ifndef VM_ARCH_CLEAR
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 6683c1b0f460..03d472051236 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -669,7 +669,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>   tmp->anon_vma = NULL;
>   } else if (anon_vma_fork(tmp, mpnt))
>   goto fail_nomem_anon_vma_fork;
> - tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
> + clear_vm_flags(tmp, VM_LOCKED_MASK);
>   file = tmp->vm_file;
>   if (file) {
>   struct address_space *mapping = file->f_mapping;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d20c8b09890e..4ecdbad9a451 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6973,8 +6973,8 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
>   unsigned long s_end = sbase + PUD_SIZE;
>  
>   /* Allow segments to share if only one is marked locked */
> - unsigned long vm_flags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
> - unsigned long svm_flags = svma->vm_flags & VM_LOCKED_CLEAR_MASK;
> + unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED_MASK;
> + unsigned long svm_flags = svma->vm_flags & ~VM_LOCKED_MASK;
>  
>   /*
>* match the virtual addresses, permission and the alignment of the
> diff --git a/mm/mlock.c b/mm/mlock.c
> index 0336f52e03d7..5c4fff93cd6b 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -497,7 +497,7 @@ static int apply_vma_lock_flags(unsigned long start, size_t len,
>   if (vma->vm_start != tmp)
>   return -ENOMEM;
>  
> - newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
> + newflags = vma->vm_flags & ~VM_LOCKED_MASK;
>   newflags |= flags;
>   /* Here we know that  vma->vm_start <= nstart < vma->vm_end. */
>   tmp = vma->vm_end;
> @@ -661,7 +661,7 @@ static int apply_mlockall_flags(int flags)
>   struct vm_area_struct *vma, *prev = NULL;
>   vm_flags_t to_add = 0;
>  
> - current->mm->def_flags &= VM_LOCKED_CLEAR_MASK;
> + current->mm->def_flags &= ~VM_LOCKED_MASK;
>   if (flags & MCL_FUTURE) {
>   current->mm->def_flags |= VM_LOCKED;
>  
> @@ -681,7 +681,7 @@ static int apply_mlockall_flags(int flags)
>   for_each_vma(vmi, vma) {
>   vm_flags_t newflags;
>  
> - newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
> + newflags = vma->vm_flags & ~VM_LOCKED_MASK;
>   newflags |= to_add;
>  
>   /* Ignore errors */
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d4abc6feced1..323bd253b25a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2671,7 +2671,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>   if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
>   is_vm_hugetlb_page(vma) ||
>   vma == get_gate_vma(current->mm))
> - vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
> + clear_vm_flags(vma, VM_LOCKED_MASK);
>   else
>   mm->locked_vm += (len >> PAGE_SHIFT);
>   }
> @@ -3340,8 +3340,8 @@ static struct vm_area_struct *__install_special_mapping(

Re: [PATCH v2 4/6] mm: replace vma->vm_flags indirect modification in ksm_madvise

2023-01-25 Thread Michal Hocko
On Wed 25-01-23 00:38:49, Suren Baghdasaryan wrote:
> Replace indirect modifications to vma->vm_flags with calls to modifier
> functions to be able to track flag changes and to keep vma locking
> correctness. Add a BUG_ON check in ksm_madvise() to catch indirect
> vm_flags modification attempts.

Those BUG_ONs scream too much IMHO. KSM is MM internal code so I
guess we should be willing to trust it.

> Signed-off-by: Suren Baghdasaryan 

Acked-by: Michal Hocko 
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 1/6] mm: introduce vma->vm_flags modifier functions

2023-01-25 Thread Michal Hocko
On Wed 25-01-23 00:38:46, Suren Baghdasaryan wrote:
> vm_flags are among VMA attributes which affect decisions like VMA merging
> and splitting. Therefore all vm_flags modifications are performed after
> taking exclusive mmap_lock to prevent vm_flags updates racing with such
> operations. Introduce modifier functions for vm_flags to be used whenever
> flags are updated. This way we can better check and control correct
> locking behavior during these updates.
> 
> Signed-off-by: Suren Baghdasaryan 

Acked-by: Michal Hocko 

> ---
>  include/linux/mm.h   | 37 +
>  include/linux/mm_types.h |  8 +++-
>  2 files changed, 44 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c2f62bdce134..b71f2809caac 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -627,6 +627,43 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
>   INIT_LIST_HEAD(&vma->anon_vma_chain);
>  }
>  
> +/* Use when VMA is not part of the VMA tree and needs no locking */
> +static inline void init_vm_flags(struct vm_area_struct *vma,
> +  unsigned long flags)
> +{
> + vma->vm_flags = flags;
> +}
> +
> +/* Use when VMA is part of the VMA tree and modifications need coordination */
> +static inline void reset_vm_flags(struct vm_area_struct *vma,
> +   unsigned long flags)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + init_vm_flags(vma, flags);
> +}
> +
> +static inline void set_vm_flags(struct vm_area_struct *vma,
> + unsigned long flags)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + vma->vm_flags |= flags;
> +}
> +
> +static inline void clear_vm_flags(struct vm_area_struct *vma,
> +   unsigned long flags)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + vma->vm_flags &= ~flags;
> +}
> +
> +static inline void mod_vm_flags(struct vm_area_struct *vma,
> + unsigned long set, unsigned long clear)
> +{
> + mmap_assert_write_locked(vma->vm_mm);
> + vma->vm_flags |= set;
> + vma->vm_flags &= ~clear;
> +}
> +
>  static inline void vma_set_anonymous(struct vm_area_struct *vma)
>  {
>   vma->vm_ops = NULL;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 2d6d790d9bed..6c7c70bf50dd 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -491,7 +491,13 @@ struct vm_area_struct {
>* See vmf_insert_mixed_prot() for discussion.
>*/
>   pgprot_t vm_page_prot;
> - unsigned long vm_flags; /* Flags, see mm.h. */
> +
> + /*
> +  * Flags, see mm.h.
> +  * WARNING! Do not modify directly.
> +  * Use {init|reset|set|clear|mod}_vm_flags() functions instead.
> +  */
> + unsigned long vm_flags;
>  
>   /*
>* For areas with an address space and backing store,
> -- 
> 2.39.1

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 3/6] mm: replace vma->vm_flags direct modifications with modifier calls

2023-01-25 Thread Michal Hocko
On Wed 25-01-23 00:38:48, Suren Baghdasaryan wrote:
> Replace direct modifications to vma->vm_flags with calls to modifier
> functions to be able to track flag changes and to keep vma locking
> correctness.

Is this manual (git grep) based work or have you used Coccinelle for
the patch generation?

My potentially incomplete check
$ git grep ">[[:space:]]*vm_flags[[:space:]]*[&|^]="

shows that nothing should be left after this. There are still quite a lot
of direct checks of the flags (more than 600). Maybe it would be good to
make the flags accessible only via accessors, which would also prevent any
future direct setting of those flags in an uncontrolled way as well.
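
For illustration only (the names below are made up, not a concrete
proposal), a renamed field plus a read accessor would catch any direct
modification at compile time:

        /* sketch: vm_flags renamed to __vm_flags so old code fails to build */
        static inline unsigned long vma_flags(const struct vm_area_struct *vma)
        {
                return READ_ONCE(vma->__vm_flags);
        }

        /* readers then do: if (vma_flags(vma) & VM_LOCKED) ... */

with the modifier functions from patch 1 being the only writers.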

Anyway
Acked-by: Michal Hocko 
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 5/6] mm: introduce mod_vm_flags_nolock and use it in untrack_pfn

2023-01-25 Thread Michal Hocko
ST_USER_ADDRESS,
>next ? next->vm_start : USER_PGTABLES_CEILING);
>   tlb_finish_mmu(&tlb);
> @@ -2391,7 +2391,11 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
>   mmap_write_downgrade(mm);
>   }
>  
> - unmap_region(mm, &mt_detach, vma, prev, next, start, end);
> + /*
> +  * We can free page tables without write-locking mmap_lock because VMAs
> +  * were isolated before we downgraded mmap_lock.
> +  */
> + unmap_region(mm, &mt_detach, vma, prev, next, start, end, !downgrade);
>   /* Statistics and freeing VMAs */
>   mas_set(&mas_detach, start);
>   remove_mt(mm, &mas_detach);
> @@ -2704,7 +2708,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  
>   /* Undo any partial mapping done by a device driver. */
>   unmap_region(mm, &mm->mm_mt, vma, prev, next, vma->vm_start,
> -  vma->vm_end);
> +  vma->vm_end, true);
>   }
>   if (file && (vm_flags & VM_SHARED))
>   mapping_unmap_writable(file->f_mapping);
> @@ -3031,7 +3035,7 @@ void exit_mmap(struct mm_struct *mm)
>   tlb_gather_mmu_fullmm(&tlb, mm);
>   /* update_hiwater_rss(mm) here? but nobody should be looking */
>   /* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
> - unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
> + unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX, false);
>   mmap_read_unlock(mm);
>  
>   /*
> -- 
> 2.39.1

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 4/6] mm: replace vma->vm_flags indirect modification in ksm_madvise

2023-01-25 Thread Michal Hocko
On Wed 25-01-23 08:57:48, Suren Baghdasaryan wrote:
> On Wed, Jan 25, 2023 at 1:38 AM 'Michal Hocko' via kernel-team
>  wrote:
> >
> > On Wed 25-01-23 00:38:49, Suren Baghdasaryan wrote:
> > > Replace indirect modifications to vma->vm_flags with calls to modifier
> > > functions to be able to track flag changes and to keep vma locking
> > > correctness. Add a BUG_ON check in ksm_madvise() to catch indirect
> > > vm_flags modification attempts.
> >
> > Those BUG_ONs scream too much IMHO. KSM is MM internal code so I
> > guess we should be willing to trust it.
> 
> Yes, but I really want to prevent an indirect misuse since it was not
> easy to find these. If you feel strongly about it I will remove them
> or if you have a better suggestion I'm all for it.

You can avoid that by making the flags inaccessible directly, right?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 1/4] memcg: Track exported dma-buffers

2023-01-25 Thread Michal Hocko
On Tue 24-01-23 10:55:21, T.J. Mercier wrote:
> On Tue, Jan 24, 2023 at 7:00 AM Michal Hocko  wrote:
> >
> > On Mon 23-01-23 19:17:23, T.J. Mercier wrote:
> > > When a buffer is exported to userspace, use memcg to attribute the
> > > buffer to the allocating cgroup until all buffer references are
> > > released.
> >
> > Is there any reason why this memory cannot be charged during the
> > allocation (__GFP_ACCOUNT used)?
> 
> My main motivation was to keep code changes away from exporters and
> implement the accounting in one common spot for all of them. This is a
> bit of a carryover from a previous approach [1] where there was some
> objection to pushing off this work onto exporters and forcing them to
> adapt, but __GFP_ACCOUNT does seem like a smaller burden than before
> at least initially. However in order to support charge transfer
> between cgroups with __GFP_ACCOUNT we'd need to be able to get at the
> pages backing dmabuf objects, and the exporters are the ones with that
> access. Meaning I think we'd have to add some additional dma_buf_ops
> to achieve that, which was the objection from [1].
> 
> [1] https://lore.kernel.org/lkml/5cc27a05-8131-ce9b-dea1-5c75e9942...@amd.com/
> 
> >
> > Also you do charge and account the memory but underlying pages do not
> > know about their memcg (this is normally done with commit_charge for
> > user mapped pages). This would become a problem if the memory is
> > migrated for example.
> 
> Hmm, what problem do you see in this situation? If the backing pages
> are to be migrated that requires the cooperation of the exporter,
> which currently has no influence on how the cgroup charging is done
> and that seems fine. (Unless you mean migrating the charge across
> cgroups? In which case that's the next patch.)

My main concern was that page migration could lose the external tracking
without some additional steps on the dmabuf front.

> > This also means that you have to maintain memcg
> > reference outside of the memcg proper which is not really nice either.
> > This mimics the tcp kmem limit implementation, which I really have to say I
> > am not a great fan of, and this pattern shouldn't be copied.
> >
> Ah, what can I say. This way looked simple to me. I think otherwise
> we're back to making all exporters do more stuff for the accounting.
> 
> > Also you are not really saying anything about the oom behavior. With
> > this implementation the kernel will try to reclaim the memory and even
> > trigger the memcg oom killer if the request size is <= 8 pages. Is this
> > a desirable behavior?
> 
> It will try to reclaim some memory, but not the dmabuf pages right?
> Not *yet* anyway. This behavior sounds expected to me.

Yes, we have discussed that shrinkers will follow up later which is
fine. The question is how much reclaim actually makes sense at this
stage. Charging interface usually copes with sizes resulting from
allocation requests (so usually 1<

Re: [PATCH v2 1/4] memcg: Track exported dma-buffers

2023-01-25 Thread Michal Hocko
On Tue 24-01-23 19:46:28, Shakeel Butt wrote:
> On Tue, Jan 24, 2023 at 03:59:58PM +0100, Michal Hocko wrote:
> > On Mon 23-01-23 19:17:23, T.J. Mercier wrote:
> > > When a buffer is exported to userspace, use memcg to attribute the
> > > buffer to the allocating cgroup until all buffer references are
> > > released.
> > 
> > Is there any reason why this memory cannot be charged during the
> > allocation (__GFP_ACCOUNT used)?
> > Also you do charge and account the memory but underlying pages do not
> > know about their memcg (this is normally done with commit_charge for
> > user mapped pages). This would become a problem if the memory is
> > migrated for example.
> 
> I don't think this is movable memory.
> 
> > This also means that you have to maintain memcg
> > reference outside of the memcg proper which is not really nice either.
> > This mimics the tcp kmem limit implementation, which I really have to say I
> > am not a great fan of, and this pattern shouldn't be copied.
> > 
> 
> I think we should keep the discussion on technical merits instead of
> personal preference. To me using an skmem like interface is totally fine
> but the pros/cons need to be very explicit and the clear reasons to
> select that option should be included.

I do agree with that. I didn't want to sound personal wrt tcp kmem
accounting but the overall code maintenance cost is higher because
of how the tcp take on accounting differs from anything else in the memcg
proper. I would prefer not to grow another example like that.

> To me there are two options:
> 
> 1. Using skmem like interface as this patch series:
> 
> The main pros of this option is that it is very simple. Let me list down
> the cons of this approach:
> 
> a. There is a time window between the actual memory allocation/free and
> the charge and uncharge, and the [un]charge happens when the whole memory is
> allocated or freed. I think for the charge path that might not be a big
> issue but on the uncharge, this can cause issues. The application and
> the potential shrinkers have freed some of this dmabuf memory but until
> the whole dmabuf is freed, the memcg uncharge will not happen. This can
> have consequences on the reclaim and oom behavior of the application.
> 
> b. Due to the usage model i.e. a central daemon allocating the dmabuf
> memory upfront, there is a requirement to have a memcg charge transfer
> functionality to transfer the charge from the central daemon to the
> client applications. This does introduce complexity and avenues of weird
> reclaim and oom behavior.
> 
> 
> 2. Allocate and charge the memory on page fault by actual user
> 
> In this approach, the memory is not allocated upfront by the central
> daemon but rather on the page fault by the client application and the
> memcg charge happen at the same time.
> 
> The only con I can think of is that this approach is more involved and may
> need some clever tricks to track the page on the free path i.e. we need to
> decrement the dmabuf memcg stat on the free path. Maybe a page flag.
> 
> The pros of this approach are that there is no need to have charge transfer
> functionality and that the charge/uncharge is closely tied to the actual
> memory allocation and free.
> 
> Personally I would prefer the second approach but I don't want to just
> block this work if the dmabuf folks are ok with the cons mentioned of
> the first approach.

I am not familiar enough with dmabuf internals to judge the complexity on
their end but I fully agree that charge-when-used is much easier to reason
about and it should have fewer subtle surprises.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 1/4] memcg: Track exported dma-buffers

2023-01-24 Thread Michal Hocko
On Mon 23-01-23 19:17:23, T.J. Mercier wrote:
> When a buffer is exported to userspace, use memcg to attribute the
> buffer to the allocating cgroup until all buffer references are
> released.

Is there any reason why this memory cannot be charged during the
allocation (__GFP_ACCOUNT used)?
Also you do charge and account the memory but underlying pages do not
know about their memcg (this is normally done with commit_charge for
user mapped pages). This would become a problem if the memory is
migrated for example. This also means that you have to maintain memcg
reference outside of the memcg proper which is not really nice either.
This mimics the tcp kmem limit implementation, which I really have to say I
am not a great fan of, and this pattern shouldn't be copied.

Also you are not really saying anything about the oom behavior. With
this implementation the kernel will try to reclaim the memory and even
trigger the memcg oom killer if the request size is <= 8 pages. Is this
a desirable behavior?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 0/4] Track exported dma-buffers with memcg

2023-01-12 Thread Michal Hocko
On Thu 12-01-23 07:56:31, Shakeel Butt wrote:
> On Wed, Jan 11, 2023 at 11:56:45PM +0100, Daniel Vetter wrote:
> > 
> [...]
> > I think eventually, at least for other "account gpu stuff in cgroups" use
> > case we do want to actually charge the memory.
> > 
> > The problem is a bit that with gpu allocations reclaim is essentially "we
> > pass the error to userspace and they get to sort the mess out". There are
> > some exceptions (some gpu drivers do have shrinkers); would we need to make
> > sure these shrinkers are tied into the cgroup stuff before we could enable
> > charging for them?
> > 
> 
> No, there is no requirement to have shrinkers or making such memory
> reclaimable before charging it. Though existing shrinkers and the
> possible future shrinkers would need to be converted into memcg aware
> shrinkers.
> 
> Though there will be a need to update user expectations that if they 
> use memcgs with hard limits, they may start seeing memcg OOMs after the
> charging of dmabuf.

Agreed. This wouldn't be the first in-kernel charged memory that
is not directly reclaimable. With a dedicated counter an excessive
dmabuf usage would be visible in the oom report because we do print
memcg stats.

It is definitely preferable to have a shrinker mechanism but if that is
to be done in a follow up step then this is acceptable. But leaving out
charging from early on sounds like a bad choice to me.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/4] memcg: Track exported dma-buffers

2023-01-10 Thread Michal Hocko
On Mon 09-01-23 21:38:04, T.J. Mercier wrote:
> When a buffer is exported to userspace, use memcg to attribute the
> buffer to the allocating cgroup until all buffer references are
> released.
> 
> Unlike the dmabuf sysfs stats implementation, this memcg accounting
> avoids contention over the kernfs_rwsem incurred when creating or
> removing nodes.

I am not familiar with dmabuf infrastructure so please bear with me.
AFAIU this patch adds a dmabuf specific counter to find out the amount
of dmabuf memory used. But I do not see any actual charging implemented
for that memory.

I have looked at two random users of dma_buf_export: cma_heap_allocate
allocates pages to back the dmabuf (AFAIU) via cma_alloc,
which doesn't account to memcg, and system_heap_allocate uses
alloc_largest_available, which relies on order_flags and doesn't seem
to ever use __GFP_ACCOUNT.

This would mean that the counter doesn't represent any actual memory
reflected in the overall memory consumption of a memcg. I believe this
is rather unexpected and confusing behavior. While some counters
overlap and their sum would exceed the charged memory we do not have any
that doesn't correspond to any memory (at least not for non-root memcgs).
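
Just for illustration, charging at allocation time in a page based heap
would be a matter of adding __GFP_ACCOUNT at the allocation site (a
sketch only, simplified):

        /* sketch: the pages get charged to the allocating task's memcg */
        page = alloc_pages(GFP_HIGHUSER | __GFP_ACCOUNT, order);

cma_alloc doesn't take gfp flags so the CMA heap would need a different
way to charge (AFAIU).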

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v5] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-21 Thread Michal Hocko
On Wed 21-04-21 10:37:11, peter.enderb...@sony.com wrote:
> On 4/21/21 11:15 AM, Daniel Vetter wrote:
[...]
> > We need to understand what the "correct" value is. Not in terms of kernel
> > code, but in terms of semantics. Like if userspace allocates a GL texture,
> > is this supposed to show up in your metric or not. Stuff like that.
> That is like saying you would like to have only one pointer type. You need
> to know what you are pointing at to know what it is; it might be a hardware
> pointer or some other pointer.
> 
> If there is a limitation on your pointers it is a good metric to count them
> even if you don't know what they are. The same goes for dma-buf: they
> are generic, but they consume some resources that are counted in pages.
> 
> It would be very good if there were a subdivision where you could measure
> all possible types separately. We have the details in debugfs, but nothing
> for the user. A summary in meminfo seems to be the best place for such a
> metric.

I strongly suspect that the main problem of this patch (and its previous
versions) is that you are failing to explain the purpose of the counter
to others. From what you have said so far it sounds like this is a
number which is nice to have because it gives us more than nothing. And
while this is not really hard to agree with it doesn't really meet the
justification for exporting yet another counter to the userspace with
all the headache of the future maintenance. I think it would hugely help
to describe a typical scenario when the counter would be useful and the
conclusion you can draw from the exported value or a set of values over
time.

And please note that a mixed bag of different memory resources in a
single counter doesn't make this any easier because then it essentially
makes it impossible to know whether an excessive usage contributes to RAM
or device memory depletion - hence this is completely bogus for the OOM
report.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Michal Hocko
On Tue 20-04-21 09:25:51, peter.enderb...@sony.com wrote:
> On 4/20/21 11:12 AM, Michal Hocko wrote:
> > On Tue 20-04-21 09:02:57, peter.enderb...@sony.com wrote:
> >>>> But that isn't really system memory at all, it's just allocated device
> >>>> memory.
> >>> OK, that was not really clear to me. So this is not really accounted to
> >>> MemTotal? If that is really the case then reporting it into the oom
> >>> report is completely pointless and I am not even sure /proc/meminfo is
> >>> the right interface either. It would just add more confusion I am
> >>> afraid.
> >>>  
> >> Why is it confusing? Documentation is quite clear:
> > Because a single counter without a wider context cannot be put into any
> > reasonable context. There is no notion of the total amount of device
> > memory usable for dma-buf. As Christian explained some of it can be RAM
> > based. So a single number is rather pointless on its own in many cases.
> >
> > Or let me just ask. What can you tell from dma-buf: $FOO kB in its
> > current form?
> 
> Is it better to be blind?

No it is better to have a sensible counter that can be reasoned about.
So far you are only claiming that having something is better than
nothing and I would agree with you if that was a debugging one off
interface. But /proc/meminfo and other proc files have to be maintained
with future portability in mind. This is not a dumping ground for _some_
counters that might be interesting at the _current_ moment. E.g. what
happens if somebody wants to have a per device resp. memory based
dma-buf data? Are you going to change the semantic or add another
2 counters?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Michal Hocko
On Tue 20-04-21 09:02:57, peter.enderb...@sony.com wrote:
> 
> >> But that isn't really system memory at all, it's just allocated device
> >> memory.
> > OK, that was not really clear to me. So this is not really accounted to
> > MemTotal? If that is really the case then reporting it into the oom
> > report is completely pointless and I am not even sure /proc/meminfo is
> > the right interface either. It would just add more confusion I am
> > afraid.
> >  
> 
> Why is it confusing? Documentation is quite clear:

Because a single counter without a wider context cannot be put into any
reasonable context. There is no notion of the total amount of device
memory usable for dma-buf. As Christian explained some of it can be RAM
based. So a single number is rather pointless on its own in many cases.

Or let me just ask. What can you tell from dma-buf: $FOO kB in its
current form?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 0/2 V6]Add dma-buf counter

2021-04-20 Thread Michal Hocko
On Tue 20-04-21 10:22:18, Peter Enderborg wrote:
> The dma-buf counter is a metric for mapped memory used by its clients.
> It is a shared buffer that is typically used for interprocess communication
> or process to hardware communication. In Android we used to have ION, but
> it is now replaced with dma-buf. ION had some overview metrics that were
> similar.

The discussion around the previous version is still not over and as it
seems your proposed approach is not really viable. So please do not send
new versions until that is sorted out.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Michal Hocko
On Tue 20-04-21 10:00:07, Christian König wrote:
> On 20.04.21 09:46, Michal Hocko wrote:
> > On Tue 20-04-21 09:32:14, Christian König wrote:
> > > On 20.04.21 09:04, Michal Hocko wrote:
> > > > On Mon 19-04-21 18:37:13, Christian König wrote:
> > > > > On 19.04.21 18:11, Michal Hocko wrote:
> > [...]
> > > > What I am trying to bring up with NUMA side is that the same problem can
> > > > happen on per-node basis. Let's say that some user consumes unexpectedly
> > > > large amount of dma-buf on a certain node. This can lead to observable
> > > > performance impact on anybody on allocating from that node and even
> > > > worse cause an OOM for node bound consumers. How do I find out that it
> > > > was dma-buf that has caused the problem?
> > > Yes, that is the direction my thinking goes as well, but also even 
> > > further.
> > > 
> > > See DMA-buf is also used to share device local memory between processes as
> > > well. In other words VRAM on graphics hardware.
> > > 
> > > On my test system here I have 32GB of system memory and 16GB of VRAM. I 
> > > can
> > > use DMA-buf to allocate that 16GB of VRAM quite easily which then shows up
> > > under /proc/meminfo as used memory.
> > This is something that would be really interesting in the changelog. I
> > mean the expected and extreme memory consumption of this memory. Ideally
> > with some hints on what to do when the number is really high (e.g. mount
> > debugfs and have a look here and there to check whether this is just too
> > many users or an unexpected pattern to be reported).
> > 
> > > But that isn't really system memory at all, it's just allocated device
> > > memory.
> > OK, that was not really clear to me. So this is not really accounted to
> > MemTotal?
> 
> It depends. In a lot of embedded systems you only have system memory and in
> this case that value here is indeed really useful.
> 
> > If that is really the case then reporting it into the oom
> > report is completely pointless and I am not even sure /proc/meminfo is
> > the right interface either. It would just add more confusion I am
> > afraid.
> 
> I kind of agree. As I said a DMA-buf could be backed by system memory or
> device memory.
> 
> In the case when it is backed by system memory it does make sense to report
> this in an OOM dump.
> 
> But only the exporting driver knows what the DMA-buf handle represents, the
> framework just provides the common ground for inter driver communication.

Then those drivers need to account for meminfo/oom report purposes.

> > > > See where I am heading?
> > > Yeah, totally. Thanks for pointing this out.
> > > 
> > > Suggestions how to handle that?
> > As I've pointed out in previous reply we do have an API to account per
> > node memory but now that you have brought up that this is not something
> > we account as a regular memory then this doesn't really fit into that
> > model. But maybe I am just confused.
> 
> Well, does that API also have a counter for memory used by device drivers?

I think that "memory used by device drivers" is immaterial. The only
important thing is to account that memory where it makes sense. So for
RAM based allocations to report them via meminfo and find other way to
report device memory allocations.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Michal Hocko
On Tue 20-04-21 10:20:43, Mike Rapoport wrote:
> On Tue, Apr 20, 2021 at 09:04:51AM +0200, Michal Hocko wrote:
> > On Mon 19-04-21 18:37:13, Christian König wrote:
> > > Am 19.04.21 um 18:11 schrieb Michal Hocko:
> > [...]
> > > > The question is not whether it is NUMA aware but whether it is useful to
> > > > know per-numa data for the purpose the counter is supposed to serve.
> > > 
> > > No, not at all. The pages of a single DMA-buf could even be from different
> > > NUMA nodes if the exporting driver decides that this is somehow useful.
> > 
> > As the use of the counter hasn't been explained yet I can only
> > speculate. One thing that I can imagine to be useful is to fill gaps in
> > our accounting. It is quite often that the memory accounted in
> > /proc/meminfo (or oom report) doesn't add up to the overall memory
> > usage. In some workloads the gap can be huge! In many cases there
> > are other means to find out additional memory by a subsystem specific
> > interfaces (e.g. networking buffers). I do assume that dma-buf is just
> > one of those and the counter can fill the said gap at least partially
> > for some workloads. That is definitely useful.
> 
> A bit off-topic.
> 
> Michal, I think it would have been nice to have an explanation like above
> in Documentation/proc/meminfo, what do you say?

Not sure which specific parts (likely the unaccounted memory?) but sure
why not. Our /proc/meminfo is rather underdocumented. More information
cannot hurt.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Michal Hocko
On Tue 20-04-21 09:32:14, Christian König wrote:
> On 20.04.21 09:04, Michal Hocko wrote:
> > On Mon 19-04-21 18:37:13, Christian König wrote:
> > > On 19.04.21 18:11, Michal Hocko wrote:
[...]
> > What I am trying to bring up with NUMA side is that the same problem can
> > happen on per-node basis. Let's say that some user consumes unexpectedly
> > large amount of dma-buf on a certain node. This can lead to observable
> > performance impact on anybody on allocating from that node and even
> > worse cause an OOM for node bound consumers. How do I find out that it
> > was dma-buf that has caused the problem?
> 
> Yes, that is the direction my thinking goes as well, but also even further.
> 
> See DMA-buf is also used to share device local memory between processes as
> well. In other words VRAM on graphics hardware.
> 
> On my test system here I have 32GB of system memory and 16GB of VRAM. I can
> use DMA-buf to allocate that 16GB of VRAM quite easily which then shows up
> under /proc/meminfo as used memory.

This is something that would be really interesting in the changelog. I
mean the expected and extreme memory consumption of this memory. Ideally
with some hints on what to do when the number is really high (e.g. mount
debugfs and have a look here and there to check whether this is just too
many users or an unexpected pattern to be reported).

> But that isn't really system memory at all, it's just allocated device
> memory.

OK, that was not really clear to me. So this is not really accounted to
MemTotal? If that is really the case then reporting it into the oom
report is completely pointless and I am not even sure /proc/meminfo is
the right interface either. It would just add more confusion I am
afraid.
 
> > See where I am heading?
> 
> Yeah, totally. Thanks for pointing this out.
> 
> Suggestions how to handle that?

As I've pointed out in previous reply we do have an API to account per
node memory but now that you have brought up that this is not something
we account as a regular memory then this doesn't really fit into that
model. But maybe I am just confused.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-20 Thread Michal Hocko
On Mon 19-04-21 18:37:13, Christian König wrote:
> On 19.04.21 18:11, Michal Hocko wrote:
[...]
> > The question is not whether it is NUMA aware but whether it is useful to
> > know per-numa data for the purpose the counter is supposed to serve.
> 
> No, not at all. The pages of a single DMA-buf could even be from different
> NUMA nodes if the exporting driver decides that this is somehow useful.

As the use of the counter hasn't been explained yet I can only
speculate. One thing that I can imagine to be useful is to fill gaps in
our accounting. It is quite often that the memory accounted in
/proc/meminfo (or oom report) doesn't add up to the overall memory
usage. In some workloads the gap can be huge! In many cases there
are other means to find out additional memory by a subsystem specific
interfaces (e.g. networking buffers). I do assume that dma-buf is just
one of those and the counter can fill the said gap at least partially
for some workloads. That is definitely useful.

What I am trying to bring up with NUMA side is that the same problem can
happen on per-node basis. Let's say that some user consumes unexpectedly
large amount of dma-buf on a certain node. This can lead to observable
performance impact on anybody on allocating from that node and even
worse cause an OOM for node bound consumers. How do I find out that it
was dma-buf that has caused the problem?

See where I am heading?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-19 Thread Michal Hocko
On Mon 19-04-21 17:44:13, Christian König wrote:
> On 19.04.21 17:19, peter.enderb...@sony.com wrote:
> > On 4/19/21 5:00 PM, Michal Hocko wrote:
> > > On Mon 19-04-21 12:41:58, peter.enderb...@sony.com wrote:
> > > > On 4/19/21 2:16 PM, Michal Hocko wrote:
> > > > > On Sat 17-04-21 12:40:32, Peter Enderborg wrote:
> > > > > > This adds a total used dma-buf memory. Details
> > > > > > can be found in debugfs, however it is not for everyone
> > > > > > and not always available. dma-buf are indirectly allocated by
> > > > > > userspace. So with this value we can monitor and detect
> > > > > > userspace applications that have problems.
> > > > > The changelog would benefit from more background on why this is needed,
> > > > > and who is the primary consumer of that value.
> > > > > 
> > > > > I cannot really comment on the dma-buf internals but I have two remarks.
> > > > > Documentation/filesystems/proc.rst needs an update with the counter
> > > > > explanation and secondly is this information useful for OOM situations
> > > > > analysis? If yes then show_mem should dump the value as well.
> > > > > 
> > > > >  From the implementation point of view, is there any reason why this
> > > > > hasn't used the existing global_node_page_state infrastructure?
> > > > I'll fix the doc in the next version. I'm not sure what you expect the
> > > > commit message to include.
> > > As I've said, the usual justification covers answers to the following questions
> > >   - Why do we need it?
> > >   - Why the existing data is insuficient?
> > >   - Who is supposed to use the data and for what?
> > > 
> > > I can see an answer for the first two questions (because this can be a
> > > lot of memory and the existing infrastructure is not production suitable
> > > - debugfs). But the changelog doesn't really explain who is going to use
> > > the new data. Is this a monitoring to raise an early alarm when the
> > > value grows? Is this for debugging misbehaving drivers? How is it
> > > valuable for those?
> > > 
> > > > The function of the meminfo is: (From Documentation/filesystems/proc.rst)
> > > > 
> > > > "Provides information about distribution and utilization of memory."
> > > True. Yet we do not export any random counters, do we?
> > > 
> > > > I'm not the designer of dma-buf; I think of global_node_page_state as a
> > > > kernel internal.
> > > It provides a node specific and optimized counters. Is this a good fit
> > > with your new counter? Or the NUMA locality is of no importance?
> > Sounds good to me; if Christian König thinks it is good, I will use that.
> > It is only virtio among drivers that uses global_node_page_state, if
> > that matters.
> 
> DMA-buf are not NUMA aware at all. On which node the pages are allocated
> (and if we use pages at all and not internal device memory) is up to the
> exporter and importer.

The question is not whether it is NUMA aware but whether it is useful to
know per-numa data for the purpose the counter is supposed to serve.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-19 Thread Michal Hocko
On Mon 19-04-21 12:41:58, peter.enderb...@sony.com wrote:
> On 4/19/21 2:16 PM, Michal Hocko wrote:
> > On Sat 17-04-21 12:40:32, Peter Enderborg wrote:
> >> This adds a total used dma-buf memory. Details
> >> can be found in debugfs, however it is not for everyone
> >> and not always available. dma-buf are indirectly allocated by
> >> userspace. So with this value we can monitor and detect
> >> userspace applications that have problems.
> > The changelog would benefit from more background on why this is needed,
> > and who is the primary consumer of that value.
> >
> > I cannot really comment on the dma-buf internals but I have two remarks.
> > Documentation/filesystems/proc.rst needs an update with the counter
> > explanation and secondly is this information useful for OOM situations
> > analysis? If yes then show_mem should dump the value as well.
> >
> > From the implementation point of view, is there any reason why this
> > hasn't used the existing global_node_page_state infrastructure?
> 
> I'll fix the doc in the next version. I'm not sure what you expect the commit
> message to include.

As I've said, the usual justification covers answers to the following questions
- Why do we need it?
- Why the existing data is insuficient?
- Who is supposed to use the data and for what?

I can see an answer for the first two questions (because this can be a
lot of memory and the existing infrastructure is not production suitable
- debugfs). But the changelog doesn't really explain who is going to use
the new data. Is this a monitoring to raise an early alarm when the
value grows? Is this for debugging misbehaving drivers? How is it
valuable for those?

> The function of the meminfo is: (From Documentation/filesystems/proc.rst)
> 
> "Provides information about distribution and utilization of memory."

True. Yet we do not export any random counters, do we?

> I'm not the designer of dma-buf; I think of global_node_page_state as a kernel
> internal.

It provides a node specific and optimized counters. Is this a good fit
with your new counter? Or the NUMA locality is of no importance?
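
To illustrate what I mean - NR_DMABUF_PAGES below is a made-up item,
not an existing one:

        /* sketch: account on the node the backing page came from */
        mod_node_page_state(page_pgdat(page), NR_DMABUF_PAGES, 1L << order);

        /* meminfo/oom code could then print global_node_page_state(NR_DMABUF_PAGES) */

That would give you both the global value and a per-node breakdown for free.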

> dma-buf is a device driver that provides a function so I might be
> on the outside. However I also see that it might be relevant for an OOM.
> It is memory that can be freed by killing userspace processes.
> 
> The show_mem thing. Should it be a separate patch?

This is up to you but if you want to expose the counter then send it in
one series.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-19 Thread Michal Hocko
On Sat 17-04-21 12:40:32, Peter Enderborg wrote:
> This adds a total used dma-buf memory. Details
> can be found in debugfs, however it is not for everyone
> and not always available. dma-buf are indirectly allocated by
> userspace. So with this value we can monitor and detect
> userspace applications that have problems.

The changelog would benefit from more background on why this is needed,
and who is the primary consumer of that value.

I cannot really comment on the dma-buf internals but I have two remarks.
Documentation/filesystems/proc.rst needs an update with the counter
explanation and secondly is this information useful for OOM situations
analysis? If yes then show_mem should dump the value as well.

From the implementation point of view, is there any reason why this
hasn't used the existing global_node_page_state infrastructure?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] drm/ttm: stop warning on TT shrinker failure

2021-03-23 Thread Michal Hocko
On Tue 23-03-21 14:56:54, Christian König wrote:
> On 23.03.21 14:41, Michal Hocko wrote:
[...]
> > Anyway, I am wondering whether the overall approach is sound. Why don't
> > you simply use shmem as your backing storage from the beginning and pin
> > those pages if they are used by the device?
> 
> Yeah, that is exactly what the Intel guys are doing for their integrated
> GPUs :)
> 
> Problem is for TTM I need to be able to handle dGPUs and those have all
> kinds of funny allocation restrictions. In other words I need to guarantee
> that the allocated memory is coherently accessible to the GPU without using
> SWIOTLB.
> 
> The simple case is that the device can only do DMA32, but you also get
> devices which can only do 40 bits or 48 bits.
> 
> On top of that you also got AGP, CMA and stuff like CPU cache behavior
> changes (write back vs. write through, vs. uncached).

OK, so the underlying problem seems to be that gfp mask (thus
mapping_gfp_mask) cannot really reflect your requirements, right?  Would
it help if shmem would allow to provide an allocation callback to
override alloc_page_vma which is used currently? I am pretty sure there
will be more to handle but going through shmem for the whole life time
is just so much easier to reason about than some tricks to abuse shmem
just for the swapout path.
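
Roughly (and IIUC this is the model the Intel guys use), with error
handling omitted - a sketch only:

        /* sketch: shmem backed object; an extra page reference acts as the pin */
        filp = shmem_file_setup("ttm-backing", size, VM_NORESERVE);
        page = shmem_read_mapping_page(filp->f_mapping, index); /* takes a ref */
        /* the device uses the page; the reference keeps reclaim away */
        put_page(page); /* unpin - the page is swappable like any shmem page */
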
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] drm/ttm: stop warning on TT shrinker failure

2021-03-23 Thread Michal Hocko
On Tue 23-03-21 14:15:05, Daniel Vetter wrote:
> On Tue, Mar 23, 2021 at 01:04:03PM +0100, Michal Hocko wrote:
> > On Tue 23-03-21 12:48:58, Christian König wrote:
> > On 23.03.21 12:28, Daniel Vetter wrote:
> > > > On Tue, Mar 23, 2021 at 08:38:33AM +0100, Michal Hocko wrote:
> > > > > I think this is where I don't get yet what Christian tries to do: We
> > > > > really shouldn't do different tricks and calling contexts between 
> > > > > direct
> > > > > reclaim and kswapd reclaim. Otherwise very hard to track down bugs are
> > > > > pretty much guaranteed. So whether we use explicit gfp flags or the
> > > > > context apis, result is exactly the same.
> > > 
> > > Ok let us recap what TTMs TT shrinker does here:
> > > 
> > > 1. We got memory which is not swappable because it might be accessed by the
> > > GPU at any time.
> > > 2. Make sure the memory is not accessed by the GPU and drivers need to grab a
> > > lock before they can make it accessible again.
> > > 3. Allocate a shmem file and copy over the not swappable pages.
> > 
> > This is quite tricky because the shrinker operates in the PF_MEMALLOC
> > context so such an allocation would be allowed to completely deplete
> > memory unless you explicitly mark that context as __GFP_NOMEMALLOC. Also
> > note that if the allocation cannot succeed it will not trigger reclaim
> > again because you are already called from the reclaim context.
> 
> [Limiting to that discussion]
> 
> Yes it's not emulating real (direct) reclaim correctly, but ime the
> biggest issue with direct reclaim is when you do mutex_lock instead of
> mutex_trylock or in general block on stuff that you cant. And lockdep +
> fs_reclaim annotations gets us that, so pretty good to make sure our
> shrinker is correct.

I have to confess that I manage to (happily) forget all the nasty
details about fs_reclaim lockdep internals so I am not sure the use by
the proposed patch is actually reasonable. Talk to lockdep guys about
that and make sure to put a big fat comment explaining what is going on.
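
IIUC the pattern in question boils down to (a sketch):

        /* make lockdep treat the test hook as if it were reclaim */
        fs_reclaim_acquire(GFP_KERNEL);
        /* ... invoke the TT shrinker ... */
        fs_reclaim_release(GFP_KERNEL);

so that any lock taken inside the shrinker is checked against reclaim
dependencies.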

In general allocating from the reclaim context is a bad idea and you
should avoid that. As already said a simple allocation request from the
reclaim context is not constrained and it will not recurse back into
the reclaim. Calling into shmem from the shrinker context might be
really tricky as well. I am not even sure this is possible for anything
other than full (GFP_KERNEL) reclaim context.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] drm/ttm: stop warning on TT shrinker failure

2021-03-23 Thread Michal Hocko
On Tue 23-03-21 14:06:25, Christian König wrote:
> On 23.03.21 13:37, Michal Hocko wrote:
> > On Tue 23-03-21 13:21:32, Christian König wrote:
[...]
> > > Ideally I would like to be able to trigger swapping out the shmem page I
> > > allocated immediately after doing the copy.
> > So let me try to rephrase to make sure I understand. You would like to
> > swap out the existing content from the shrinker and you use shmem as a
> > way to achieve that. The swapout should happen at the time of copying
> > (shrinker context) or shortly afterwards?
> > 
> > So effectively to call pageout() on the shmem page after the copy?
> 
> Yes, exactly that.

OK, good. I see what you are trying to achieve now. I do not think we
would want to allow pageout from the shrinker's context but what you can
do is to instantiate the shmem page into the tail of the inactive list
so the next reclaim attempt will swap it out (assuming swap is available
of course).

This is not really something that our existing infrastructure gives you
though, I am afraid. There is no way to tell that a newly allocated shmem
page should be in fact cold and the first one to swap out. But there are
people more familiar with shmem and its peculiarities so I might be wrong
here.

Anyway, I am wondering whether the overall approach is sound. Why don't
you simply use shmem as your backing storage from the beginning and pin
those pages if they are used by the device?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] drm/ttm: stop warning on TT shrinker failure

2021-03-23 Thread Michal Hocko
On Tue 23-03-21 13:21:32, Christian König wrote:
> On 23.03.21 13:04, Michal Hocko wrote:
> > On Tue 23-03-21 12:48:58, Christian König wrote:
> > > On 23.03.21 12:28, Daniel Vetter wrote:
> > > > On Tue, Mar 23, 2021 at 08:38:33AM +0100, Michal Hocko wrote:
> > > > > On Mon 22-03-21 20:34:25, Christian König wrote:
> > [...]
> > > > > > My only concern is that if I could rely on memalloc_no* being used 
> > > > > > we could
> > > > > > optimize this quite a bit further.
> > > > > Yes you can use the scope API and you will be guaranteed that _any_
> > > > > allocation from the enclosed context will inherit GFP_NO* semantic.
> > > The question is if this is also guaranteed the other way around?
> > > 
> > > In other words if somebody calls get_free_page(GFP_NOFS) are the context
> > > flags set as well?
> > gfp mask is always restricted in the page allocator. So say you have
> > noio scope context and call get_free_page/kmalloc(GFP_NOFS) then the
> > scope would restrict the allocation flags to GFP_NOIO (aka drop
> > __GFP_IO). For further details, have a look at current_gfp_context
> > and its callers.
> > 
> > Does this answer your question?
> 
> But what happens if you don't have noio scope and somebody calls
> get_free_page(GFP_NOFS)?

Then this will be a regular NOFS request. Let me repeat scope API will
further restrict any requested allocation mode.

> Is then the noio scope added automatically? And is it possible that the
> shrinker gets called without noio scope even though we would need it?

Here you have lost me again.

> > > > > I think this is where I don't get yet what Christian tries to do: We
> > > > > really shouldn't do different tricks and calling contexts between 
> > > > > direct
> > > > > reclaim and kswapd reclaim. Otherwise very hard to track down bugs are
> > > > > pretty much guaranteed. So whether we use explicit gfp flags or the
> > > > > context apis, result is exactly the same.
> > > Ok let us recap what TTMs TT shrinker does here:
> > > 
> > > 1. We got memory which is not swappable because it might be accessed by the
> > > GPU at any time.
> > > 2. Make sure the memory is not accessed by the GPU and drivers need to grab a
> > > lock before they can make it accessible again.
> > > 3. Allocate a shmem file and copy over the not swappable pages.
> > This is quite tricky because the shrinker operates in the PF_MEMALLOC
> > context so such an allocation would be allowed to completely deplete
> > memory unless you explicitly mark that context as __GFP_NOMEMALLOC.
> 
> Thanks, exactly that was one thing I was absolutely not sure about. And yes
> I agree that this is really tricky.
> 
> Ideally I would like to be able to trigger swapping out the shmem page I
> allocated immediately after doing the copy.

So let me try to rephrase to make sure I understand. You would like to
swap out the existing content from the shrinker and you use shmem as a
way to achieve that. The swapout should happen at the time of copying
(shrinker context) or shortly afterwards?

So effectively to call pageout() on the shmem page after the copy?
 
> This way I would only need a single page for the whole shrink operation at
> any given time.

What do you mean by that? You want to share the same shmem page for
other copy+swapout?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] drm/ttm: stop warning on TT shrinker failure

2021-03-23 Thread Michal Hocko
On Tue 23-03-21 12:51:13, Christian König wrote:
> 
> 
> On 23.03.21 12:46, Michal Hocko wrote:
> > On Tue 23-03-21 12:28:20, Daniel Vetter wrote:
> > > On Tue, Mar 23, 2021 at 08:38:33AM +0100, Michal Hocko wrote:
> > [...]
> > > > > > fs_reclaim_acquire is there to make sure lockdep understands that 
> > > > > > this
> > > > > > is a shrinker and that it checks all the dependencies for us like if
> > > > > > we'd be in real reclaim. There is some drop caches interfaces in 
> > > > > > proc
> > > > > > iirc, but those drop everything, and they don't have the fs_reclaim
> > > > > > annotations to teach lockdep about what we're doing.
> > > > ... I really do not follow this. You shouldn't really care whether this
> > > > is a reclaim interface or not. Or maybe I just do not understand this...
> > > We're heavily relying on lockdep and fs_reclaim to make sure we get it all
> > > right. So any drop caches interface that isn't wrapped in fs_reclaim
> > > context is kinda useless for testing. Plus ideally we want to only hit our
> > > own paths, and not trash every other cache in the system. Speed matters in
> > > CI.
> > But what is special about this path to hack around and make it pretend
> > it is part of the fs reclaim path?
> 
> That's just to teach lockdep that there is a dependency.
> 
> In other words we pretend in the debugfs file that it is part of the fs
> reclaim path to check for the case when it really becomes part of the fs
> reclaim path.

OK, our emails crossed and I can see your response only after replying
to your other email. OK, this makes more sense now. But as pointed in
other email this will likely not do what you think. Let's continue
discussing in the other subthread to reduce the further confusion.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] drm/ttm: stop warning on TT shrinker failure

2021-03-23 Thread Michal Hocko
On Tue 23-03-21 12:48:58, Christian König wrote:
> On 23.03.21 12:28, Daniel Vetter wrote:
> > On Tue, Mar 23, 2021 at 08:38:33AM +0100, Michal Hocko wrote:
> > > On Mon 22-03-21 20:34:25, Christian König wrote:
[...]
> > > > My only concern is that if I could rely on memalloc_no* being used we 
> > > > could
> > > > optimize this quite a bit further.
> > > Yes you can use the scope API and you will be guaranteed that _any_
> > > allocation from the enclosed context will inherit GFP_NO* semantic.
> 
> The question is if this is also guaranteed the other way around?
> 
> In other words if somebody calls get_free_page(GFP_NOFS) are the context
> flags set as well?

The gfp mask is always restricted in the page allocator. So, say you
have a noio scope context and call get_free_page/kmalloc(GFP_NOFS); the
scope then restricts the allocation flags to GFP_NOIO (i.e. it drops
__GFP_IO). For further details, have a look at current_gfp_context()
and its callers.
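
For illustration, the restriction boils down to something like this (a
simplified sketch of current_gfp_context(); the in-tree version handles
more cases):

	static inline gfp_t current_gfp_context(gfp_t flags)
	{
		if (unlikely(current->flags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS))) {
			/* NOIO is stronger than NOFS, hence the ordering */
			if (current->flags & PF_MEMALLOC_NOIO)
				flags &= ~(__GFP_IO | __GFP_FS);
			else if (current->flags & PF_MEMALLOC_NOFS)
				flags &= ~__GFP_FS;
		}
		return flags;
	}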

Does this answer your question?

> > > I think this is where I don't get yet what Christian tries to do: We
> > > really shouldn't do different tricks and calling contexts between direct
> > > reclaim and kswapd reclaim. Otherwise very hard to track down bugs are
> > > pretty much guaranteed. So whether we use explicit gfp flags or the
> > > context apis, result is exactly the same.
> 
> Ok, let us recap what TTM's TT shrinker does here:
> 
> 1. We got memory which is not swappable because it might be accessed by the
> GPU at any time.
> 2. Make sure the memory is not accessed by the GPU, and drivers need to grab a
> lock before they can make it accessible again.
> 3. Allocate a shmem file and copy over the non-swappable pages.

This is quite tricky because the shrinker operates in PF_MEMALLOC
context, so such an allocation would be allowed to completely deplete
memory unless you explicitly mark the allocation with __GFP_NOMEMALLOC.
Also note that if the allocation cannot succeed, it will not trigger
reclaim again, because you are already called from the reclaim context.
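
Concretely, the allocation for the shmem page would need to look
something like this (a sketch; the exact gfp flags are up for
discussion):

	/*
	 * Called from shrinker context: PF_MEMALLOC is set, so
	 * __GFP_NOMEMALLOC is needed to stay away from the memory
	 * reserves, and the failure must be handled because no further
	 * reclaim will be triggered from here.
	 */
	page = alloc_page(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
	if (!page)
		return SHRINK_STOP;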

> 4. Free the non-swappable/reclaimable pages.
> 
> The pages we got from the shmem file are easily swappable to disk after the
> copy is completed. But only if IO is not already blocked because the
> shrinker was called from an allocation restricted by GFP_NOFS or GFP_NOIO.

Sorry for being dense here, but I still do not follow the actual problem
(well, except for the above-mentioned one). Is the sole point of this to
emulate a GFP_NO* allocation context and see how the shrinker behaves?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] drm/ttm: stop warning on TT shrinker failure

2021-03-23 Thread Michal Hocko
On Tue 23-03-21 12:28:20, Daniel Vetter wrote:
> On Tue, Mar 23, 2021 at 08:38:33AM +0100, Michal Hocko wrote:
[...]
> > > > fs_reclaim_acquire is there to make sure lockdep understands that this
> > > > is a shrinker and that it checks all the dependencies for us like if
> > > > we'd be in real reclaim. There is some drop caches interfaces in proc
> > > > iirc, but those drop everything, and they don't have the fs_reclaim
> > > > annotations to teach lockdep about what we're doing.
> > 
> > ... I really do not follow this. You shouldn't really care whether this
> > is a reclaim interface or not. Or maybe I just do not understand this...
> 
> We're heavily relying on lockdep and fs_reclaim to make sure we get it all
> right. So any drop caches interface that isn't wrapped in fs_reclaim
> context is kinda useless for testing. Plus ideally we want to only hit our
> own paths, and not trash every other cache in the system. Speed matters in
> CI.

But what is special about this path to hack around and make it pretend
it is part of the fs reclaim path?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] drm/ttm: stop warning on TT shrinker failure

2021-03-23 Thread Michal Hocko
On Mon 22-03-21 20:34:25, Christian König wrote:
> Am 22.03.21 um 18:02 schrieb Daniel Vetter:
> > On Mon, Mar 22, 2021 at 5:06 PM Michal Hocko  wrote:
> > > On Mon 22-03-21 14:05:48, Matthew Wilcox wrote:
> > > > On Mon, Mar 22, 2021 at 02:49:27PM +0100, Daniel Vetter wrote:
> > > > > On Sun, Mar 21, 2021 at 03:18:28PM +0100, Christian König wrote:
> > > > > > Am 20.03.21 um 14:17 schrieb Daniel Vetter:
> > > > > > > On Sat, Mar 20, 2021 at 10:04 AM Christian König
> > > > > > >  wrote:
> > > > > > > > Am 19.03.21 um 20:06 schrieb Daniel Vetter:
> > > > > > > > > On Fri, Mar 19, 2021 at 07:53:48PM +0100, Christian König 
> > > > > > > > > wrote:
> > > > > > > > > > Am 19.03.21 um 18:52 schrieb Daniel Vetter:
> > > > > > > > > > > On Fri, Mar 19, 2021 at 03:08:57PM +0100, Christian König 
> > > > > > > > > > > wrote:
> > > > > > > > > > > > Don't print a warning when we fail to allocate a page 
> > > > > > > > > > > > for swapping things out.
> > > > > > > > > > > > 
> > > > > > > > > > > > Also rely on memalloc_nofs_save/memalloc_nofs_restore 
> > > > > > > > > > > > instead of GFP_NOFS.
> > > > > > > > > > > Uh this part doesn't make sense. Especially since you 
> > > > > > > > > > > only do it for the
> > > > > > > > > > > debugfs file, not in general. Which means you've just 
> > > > > > > > > > > completely broken
> > > > > > > > > > > the shrinker.
> > > > > > > > > > Are you sure? My impression is that GFP_NOFS should now 
> > > > > > > > > > work much more out
> > > > > > > > > > of the box with the 
> > > > > > > > > > memalloc_nofs_save()/memalloc_nofs_restore().
> > > > > > > > > Yeah, if you'd put it in the right place :-)
> > > > > > > > > 
> > > > > > > > > But also -mm folks are very clear that memalloc_no*() family 
> > > > > > > > > is for dire
> > > > > > > > > situation where there's really no other way out. For anything 
> > > > > > > > > where you
> > > > > > > > > know what you're doing, you really should use explicit gfp 
> > > > > > > > > flags.
> > > > > > > > My impression is just the other way around. You should try to 
> > > > > > > > avoid the
> > > > > > > > NOFS/NOIO flags and use the memalloc_no* approach instead.
> > > > > > > Where did you get that idea?
> > > > > > Well from the kernel comment on GFP_NOFS:
> > > > > > 
> > > > > >   * %GFP_NOFS will use direct reclaim but will not use any 
> > > > > > filesystem
> > > > > > interfaces.
> > > > > >   * Please try to avoid using this flag directly and instead use
> > > > > >   * memalloc_nofs_{save,restore} to mark the whole scope which
> > > > > > cannot/shouldn't
> > > > > >   * recurse into the FS layer with a short explanation why. All 
> > > > > > allocation
> > > > > >   * requests will inherit GFP_NOFS implicitly.
> > > > > Huh that's interesting, since iirc Willy or Dave told me the 
> > > > > opposite, and
> > > > > the memalloc_no* stuff is for e.g. nfs calling into network layer 
> > > > > (needs
> > > > > GFP_NOFS) or swap on top of a filesystems (even needs GFP_NOIO I 
> > > > > think).
> > > > > 
> > > > > Adding them, maybe I got confused.
> > > > My impression is that the scoped API is preferred these days.
> > > > 
> > > > https://www.kernel.org/doc/html/latest/core-api/gfp_mask-from-fs-io.html
> > > > 
> > > > I'd probably need to spend a few months learning the DRM subsystem to
> > > > have a more detailed opinion on whether passing GFP flags around 
> > > > explicitly
> > > > or using the scope API is the better approach for your situation.

Re: [PATCH] drm/ttm: stop warning on TT shrinker failure

2021-03-22 Thread Michal Hocko
On Mon 22-03-21 14:05:48, Matthew Wilcox wrote:
> On Mon, Mar 22, 2021 at 02:49:27PM +0100, Daniel Vetter wrote:
> > On Sun, Mar 21, 2021 at 03:18:28PM +0100, Christian König wrote:
> > > Am 20.03.21 um 14:17 schrieb Daniel Vetter:
> > > > On Sat, Mar 20, 2021 at 10:04 AM Christian König
> > > >  wrote:
> > > > > Am 19.03.21 um 20:06 schrieb Daniel Vetter:
> > > > > > On Fri, Mar 19, 2021 at 07:53:48PM +0100, Christian König wrote:
> > > > > > > Am 19.03.21 um 18:52 schrieb Daniel Vetter:
> > > > > > > > On Fri, Mar 19, 2021 at 03:08:57PM +0100, Christian König wrote:
> > > > > > > > > Don't print a warning when we fail to allocate a page for 
> > > > > > > > > swapping things out.
> > > > > > > > > 
> > > > > > > > > Also rely on memalloc_nofs_save/memalloc_nofs_restore instead 
> > > > > > > > > of GFP_NOFS.
> > > > > > > > Uh this part doesn't make sense. Especially since you only do 
> > > > > > > > it for the
> > > > > > > > debugfs file, not in general. Which means you've just 
> > > > > > > > completely broken
> > > > > > > > the shrinker.
> > > > > > > Are you sure? My impression is that GFP_NOFS should now work much 
> > > > > > > more out
> > > > > > > of the box with the memalloc_nofs_save()/memalloc_nofs_restore().
> > > > > > Yeah, if you'd put it in the right place :-)
> > > > > > 
> > > > > > But also -mm folks are very clear that memalloc_no*() family is for 
> > > > > > dire
> > > > > > situation where there's really no other way out. For anything where 
> > > > > > you
> > > > > > know what you're doing, you really should use explicit gfp flags.
> > > > > My impression is just the other way around. You should try to avoid 
> > > > > the
> > > > > NOFS/NOIO flags and use the memalloc_no* approach instead.
> > > > Where did you get that idea?
> > > 
> > > Well from the kernel comment on GFP_NOFS:
> > > 
> > >  * %GFP_NOFS will use direct reclaim but will not use any filesystem
> > > interfaces.
> > >  * Please try to avoid using this flag directly and instead use
> > >  * memalloc_nofs_{save,restore} to mark the whole scope which
> > > cannot/shouldn't
> > >  * recurse into the FS layer with a short explanation why. All allocation
> > >  * requests will inherit GFP_NOFS implicitly.
> > 
> > Huh that's interesting, since iirc Willy or Dave told me the opposite, and
> > the memalloc_no* stuff is for e.g. nfs calling into network layer (needs
> > GFP_NOFS) or swap on top of a filesystems (even needs GFP_NOIO I think).
> > 
> > Adding them, maybe I got confused.
> 
> My impression is that the scoped API is preferred these days.
> 
> https://www.kernel.org/doc/html/latest/core-api/gfp_mask-from-fs-io.html
> 
> I'd probably need to spend a few months learning the DRM subsystem to
> have a more detailed opinion on whether passing GFP flags around explicitly
> or using the scope API is the better approach for your situation.

Yes, in an ideal world we would have a clearly defined scope of the
reclaim recursion wrt the FS/IO constraints associated with it. I've gone
back to
https://lore.kernel.org/amd-gfx/20210319140857.2262-1-christian.koe...@amd.com/
and two things stand out. Why does ttm_tt_debugfs_shrink_show really
require NOFS semantics? And why does it play with fs_reclaim_acquire?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] procfs/dmabuf: Add /proc/<pid>/task/<tid>/dmabuf_fds

2021-01-28 Thread Michal Hocko
On Wed 27-01-21 12:01:55, Christian König wrote:
[...]
> Some years ago I proposed a change to improve the OOM killer to take
> into account how much memory is referenced through special file descriptors
> like device drivers or DMA-buf.
> 
> But that never went anywhere because of concerns that this would add too
> much overhead to the OOM killer.

I have to admit I do not remember such a discussion. There were many on
related topics; the concerns were mostly about proper accounting of
shared resources or the locking required to gather the information. I do
not remember overhead being the main concern, as OOM is a really cold path.

Anyway, if you want to revamp those patches then feel free to CC me and
we can discuss further. I do not want to hijack this thread by an
unrelated topic.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] procfs/dmabuf: Add /proc/<pid>/task/<tid>/dmabuf_fds

2021-01-28 Thread Michal Hocko
On Wed 27-01-21 12:08:50, Christian König wrote:
> Am 27.01.21 um 12:02 schrieb Michal Hocko:
> > On Wed 27-01-21 11:53:55, Christian König wrote:
> > [...]
> > > In general processes are currently not held accountable for memory they
> > > reference through their file descriptors. DMA-buf is just one special 
> > > case.
> > True
> > 
> > > In other words you can currently do something like this
> > > 
> > > fd = memfd_create("test", 0);
> > > while (1)
> > >      write(fd, buf, 1024);
> > > 
> > > and the OOM killer will terminate random processes, but never the one
> > > holding the memfd reference.
> > memfd is just shmem under cover, no? And that means that the memory gets
> > accounted to MM_SHMEMPAGES. But you are right that this on its own
> > doesn't help much if the fd is shared and the memory stays behind a
> > killed victim.
> 
> I think so, yes. But I just tested this and it doesn't seem to work
> correctly.
> 
> When I run the few lines above the OOM killer starts to terminate processes,
> but since my small test program uses very very little memory basically
> everything else gets terminated (including X, desktop, sshd etc..) before it
> is terminated as well.

Something worth looking into. Maybe those pages are not really accounted
properly after all. Can you send a separate email with the details and the
oom reports, please?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] procfs/dmabuf: Add /proc/<pid>/task/<tid>/dmabuf_fds

2021-01-28 Thread Michal Hocko
r's dmabuf_fds.
> +  */
> + group_leader_files = priv->task->group_leader->files;
> + if (priv->task != priv->task->group_leader && priv->task->files == 
> group_leader_files) {
> + task_unlock(priv->task);
> + put_task_struct(priv->task);
> + priv->task = NULL;
> + return NULL;
> + }
> +
> + return next_dmabuf(priv, pos);
> +}
> +
> +static void *dmabuf_fds_seq_next(struct seq_file *s, void *v, loff_t *pos)
> +{
> + ++*pos;
> + return next_dmabuf(s->private, pos);
> +}
> +
> +static void dmabuf_fds_seq_stop(struct seq_file *s, void *v)
> +{
> + struct dmabuf_fds_private *priv = s->private;
> +
> + if (priv->task) {
> + task_unlock(priv->task);
> + put_task_struct(priv->task);
> +
> + }
> + if (priv->dmabuf_file)
> + fput(priv->dmabuf_file);
> +}
> +
> +static int dmabuf_fds_seq_show(struct seq_file *s, void *v)
> +{
> + struct dmabuf_fds_private *priv = s->private;
> + struct file *file = priv->dmabuf_file;
> + struct dma_buf *dmabuf = file->private_data;
> +
> + if (!dmabuf)
> + return -ESRCH;
> +
> + seq_printf(s, "%8lu ", file_inode(file)->i_ino);
> +
> + fput(priv->dmabuf_file);
> + priv->dmabuf_file = NULL;
> +
> + return 0;
> +}
> +
> +static const struct seq_operations proc_tid_dmabuf_fds_seq_ops = {
> + .start = dmabuf_fds_seq_start,
> + .next  = dmabuf_fds_seq_next,
> + .stop  = dmabuf_fds_seq_stop,
> + .show  = dmabuf_fds_seq_show
> +};
> +
> +static int proc_dmabuf_fds_open(struct inode *inode, struct file *file,
> +  const struct seq_operations *ops)
> +{
> + struct dmabuf_fds_private *priv;
> + struct task_struct *task;
> + bool allowed = false;
> +
> + task = get_proc_task(inode);
> + if (!task)
> + return -ESRCH;
> +
> + allowed = ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS);
> + put_task_struct(task);
> +
> + if (!allowed)
> + return -EACCES;
> +
> + priv = __seq_open_private(file, ops, sizeof(*priv));
> + if (!priv)
> + return -ENOMEM;
> +
> + priv->inode = inode;
> + priv->task = NULL;
> + priv->dmabuf_file = NULL;
> +
> + return 0;
> +}
> +
> +static int proc_dmabuf_fds_release(struct inode *inode, struct file *file)
> +{
> + return seq_release_private(inode, file);
> +}
> +
> +static int tid_dmabuf_fds_open(struct inode *inode, struct file *file)
> +{
> + return proc_dmabuf_fds_open(inode, file,
> + &proc_tid_dmabuf_fds_seq_ops);
> +}
> +
> +const struct file_operations proc_tid_dmabuf_fds_operations = {
> + .open   = tid_dmabuf_fds_open,
> + .read   = seq_read,
> + .llseek = seq_lseek,
> + .release    = proc_dmabuf_fds_release,
> +};
> +
> diff --git a/fs/proc/internal.h b/fs/proc/internal.h
> index f60b379dcdc7..4ca74220db9c 100644
> --- a/fs/proc/internal.h
> +++ b/fs/proc/internal.h
> @@ -303,6 +303,7 @@ extern const struct file_operations 
> proc_pid_smaps_operations;
>  extern const struct file_operations proc_pid_smaps_rollup_operations;
>  extern const struct file_operations proc_clear_refs_operations;
>  extern const struct file_operations proc_pagemap_operations;
> +extern const struct file_operations proc_tid_dmabuf_fds_operations;
>  
>  extern unsigned long task_vsize(struct mm_struct *);
>  extern unsigned long task_statm(struct mm_struct *,
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index cf72699cb2bc..087e11f7f193 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -27,6 +27,11 @@ struct device;
>  struct dma_buf;
>  struct dma_buf_attachment;
>  
> +/**
> + * Check if struct file* is associated with dma_buf.
> + */
> +int is_dma_buf_file(struct file *file);
> +
>  /**
>   * struct dma_buf_ops - operations possible on struct dma_buf
>   * @vmap: [optional] creates a virtual mapping for the buffer into kernel
> -- 
> 2.30.0.280.ga3ce27912f-goog

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] procfs/dmabuf: Add /proc/<pid>/task/<tid>/dmabuf_fds

2021-01-28 Thread Michal Hocko
On Wed 27-01-21 11:47:29, Jann Horn wrote:
> +jeffv from Android
> 
> On Tue, Jan 26, 2021 at 11:51 PM Kalesh Singh  wrote:
> > In order to measure how much memory a process actually consumes, it is
> > necessary to include the DMA buffer sizes for that process in the memory
> > accounting. Since the handle to DMA buffers are raw FDs, it is important
> > to be able to identify which processes have FD references to a DMA buffer.
> 
> Or you could try to let the DMA buffer take a reference on the
> mm_struct and account its size into the mm_struct? That would probably
> be nicer to work with than having to poke around in procfs separately
> for DMA buffers.

Yes, that would make some sense to me as well, but how do you know that
the process actually uses a buffer? If it mmaps it, then you already have
that information via /proc/<pid>/maps. My understanding of dma-buf is
really coarse, but my impression is that you can consume the memory via
the standard read syscall as well. How would you account for that?

[...]
Skipping over a large part of your response, but I do agree that the
interface makes it really elaborate to drill down to the information.

> I'm not convinced that introducing a new procfs file for this is the
> right way to go. And the idea of having to poke into multiple
> different files in procfs and in sysfs just to be able to compute a
> proper memory usage score for a process seems weird to me. "How much
> memory is this process using" seems like the kind of question the
> kernel ought to be able to answer (and the kernel needs to be able to
> answer somewhat accurately so that its own OOM killer can do its job
> properly)?

Well, shared buffers are tricky, but it is true that we already consider
shmem in badness, so this wouldn't be out of line. The kernel OOM killer
could be more clever with these special fds, though, and query for the
buffer size directly.
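
To make that concrete, here is a sketch of what "more clever" could
mean; note that dmabuf_bytes is a hypothetical per-task counter that
would have to be maintained by the dma-buf core, not an existing field:

	/*
	 * oom_badness() already sums rss, swap entries and page tables;
	 * a dma-buf aware variant could add the buffer sizes on top.
	 */
	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
		 mm_pgtables_bytes(p->mm) / PAGE_SIZE;
	points += atomic_long_read(&p->dmabuf_bytes) / PAGE_SIZE; /* hypothetical */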
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] procfs/dmabuf: Add /proc/<pid>/task/<tid>/dmabuf_fds

2021-01-28 Thread Michal Hocko
On Wed 27-01-21 11:53:55, Christian König wrote:
[...]
> In general processes are currently not held accountable for memory they
> reference through their file descriptors. DMA-buf is just one special case.

True

> In other words you can currently do something like this
> 
> fd = memfd_create("test", 0);
> while (1)
>     write(fd, buf, 1024);
> 
> and the OOM killer will terminate random processes, but never the one
> holding the memfd reference.

memfd is just shmem under cover, no? And that means that the memory gets
accounted to MM_SHMEMPAGES. But you are right that this on its own
doesn't help much if the fd is shared and the memory stays behind a
killed victim.

But I do agree with you that there are resources which are bound to a
process lifetime that the oom killer has no idea about, because they are
not accounted at the per-process level and/or oom_badness doesn't take
them into consideration.
-- 
Michal Hocko
SUSE Labs


Re: [patch 00/13] preempt: Make preempt count unconditional

2020-09-30 Thread Michal Hocko
On Wed 16-09-20 23:43:02, Daniel Vetter wrote:
> I can
> then figure out whether it's better to risk not spotting issues with
> call_rcu vs slapping a memalloc_noio_save/restore around all these
> critical section which force-degrades any allocation to GFP_ATOMIC at

did you mean memalloc_noreclaim_* here?

> most, but has the risk that we run into code that assumes "GFP_KERNEL
> never fails for small stuff" and has a decidedly less tested fallback
> path than rcu code.

Even if that is the case, please note that memalloc_noreclaim_* or
PF_MEMALLOC should be used with extreme care, essentially only by
internal memory reclaimers. It grants access to _all_ the available
memory, so any abuse can be detrimental to the overall system operation.
Allocation failure in this mode means that we are out of memory, and any
code relying on such an allocation has to carefully consider failure.
This is not a random allocation mode.
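
For completeness, the scoped form and the kind of failure handling it
requires look like this (a minimal sketch):

	/*
	 * Everything inside the scope runs with PF_MEMALLOC, i.e. full
	 * access to the memory reserves. A NULL here really means OOM
	 * and must be handled rather than retried.
	 */
	unsigned int noreclaim_flag = memalloc_noreclaim_save();

	page = alloc_page(GFP_KERNEL);
	memalloc_noreclaim_restore(noreclaim_flag);
	if (!page)
		return -ENOMEM;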

-- 
Michal Hocko
SUSE Labs


Re: [patch 00/13] preempt: Make preempt count unconditional

2020-09-30 Thread Michal Hocko
On Tue 29-09-20 11:00:03, Daniel Vetter wrote:
> On Tue, Sep 29, 2020 at 10:19:38AM +0200, Michal Hocko wrote:
> > On Wed 16-09-20 23:43:02, Daniel Vetter wrote:
> > > I can
> > > then figure out whether it's better to risk not spotting issues with
> > > call_rcu vs slapping a memalloc_noio_save/restore around all these
> > > critical section which force-degrades any allocation to GFP_ATOMIC at
> > 
> > did you mean memalloc_noreclaim_* here?
> 
> Yeah I picked the wrong one of that family of functions.
> 
> > > most, but has the risk that we run into code that assumes "GFP_KERNEL
> > > never fails for small stuff" and has a decidedly less tested fallback
> > > path than rcu code.
> > 
> > Even if the above then please note that memalloc_noreclaim_* or
> > PF_MEMALLOC should be used with an extreme care. Essentially only for
> > internal memory reclaimers. It grants access to _all_ the available
> > memory so any abuse can be detrimental to the overall system operation.
> > Allocation failure in this mode means that we are out of memory and any
> > code relying on such an allocation has to carefuly consider failure.
> > This is not a random allocation mode.
> 
> Agreed, that's why I don't like having this kind of automagic critical
> sections. It's a bit of a shotgun approach. Paul said that the code would
> handle failures, but the problem is that it applies everywhere.

Ohh, in an ideal world we wouldn't need anything like that. But then
reality kicks in:
* PF_MEMALLOC (resp. memalloc_noreclaim_* for that matter) is primarily used
  to make sure that allocations from inside the memory reclaim - yeah, that
  happens - will not recurse.
* PF_MEMALLOC_NO{FS,IO} (resp. memalloc_no{fs,io}*) are used to mark critical
  sections with no fs/io reclaim recursion, because controlling that for
  each allocation inside an fs transaction (or other sensitive or IO
  context) turned out to be unmaintainable, and people simply fell into
  using NOFS/NOIO unconditionally, which is causing reclaim imbalance
  problems. A usage sketch follows the list.
* PF_MEMALLOC_NOCMA (resp. memalloc_nocma*) is used for long-term pinning,
  when CMA pages cannot be pinned because that would break the CMA
  guarantees. Communicating this to all potential allocations during
  pinning is simply unfeasible.
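
As an example of the second point, marking such a critical section is
just a matter of bracketing it (a minimal sketch):

	/*
	 * Any allocation between save and restore implicitly behaves as
	 * if __GFP_FS was dropped, without touching each call site.
	 */
	unsigned int nofs_flag = memalloc_nofs_save();

	page = alloc_page(GFP_KERNEL);	/* effectively GFP_NOFS here */
	memalloc_nofs_restore(nofs_flag);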

So you are absolutely right that these critical sections with side
effects on all allocations are far from ideal from the API point of view,
but they mostly mirror a demand for functionality which is _practically_
impossible to achieve with our current code base. Not that we couldn't go
back to the drawing board and come up with something saner and rework the
world...
-- 
Michal Hocko
SUSE Labs


Re: next-20200515: Xorg killed due to "OOM"

2020-06-01 Thread Michal Hocko
On Sun 31-05-20 14:16:01, Pavel Machek wrote:
> On Thu 2020-05-28 14:07:50, Michal Hocko wrote:
> > On Thu 28-05-20 14:03:54, Pavel Machek wrote:
> > > On Thu 2020-05-28 11:05:17, Michal Hocko wrote:
> > > > On Tue 26-05-20 11:10:54, Pavel Machek wrote:
> > > > [...]
> > > > > [38617.276517] oom_reaper: reaped process 31769 (chromium), now 
> > > > > anon-rss:0kB, file-rss:0kB, shmem-rss:7968kB
> > > > > [38617.277232] Xorg invoked oom-killer: gfp_mask=0x0(), order=0, 
> > > > > oom_score_adj=0
> > > > > [38617.277247] CPU: 0 PID: 2978 Comm: Xorg Not tainted 
> > > > > 5.7.0-rc5-next-20200515+ #117
> > > > > [38617.277256] Hardware name: LENOVO 17097HU/17097HU, BIOS 7BETD8WW 
> > > > > (2.19 ) 03/31/2011
> > > > > [38617.277266] Call Trace:
> > > > > [38617.277286]  dump_stack+0x54/0x6e
> > > > > [38617.277300]  dump_header+0x45/0x321
> > > > > [38617.277313]  oom_kill_process.cold+0x9/0xe
> > > > > [38617.277324]  ? out_of_memory+0x167/0x420
> > > > > [38617.277336]  out_of_memory+0x1f2/0x420
> > > > > [38617.277348]  pagefault_out_of_memory+0x34/0x56
> > > > > [38617.277361]  mm_fault_error+0x4a/0x130
> > > > > [38617.277372]  do_page_fault+0x3ce/0x416
> > > > 
> > > > The reason the OOM killer has been invoked is that the page fault
> > > > handler has returned VM_FAULT_OOM. So this is not a result of the page
> > > > allocator struggling to allocate a memory. It would be interesting to
> > > > check which code path has returned this. 
> > > 
> > > Should the core WARN_ON if that happens and there's enough memory, or
> > > something like that?
> > 
> > I wish it would simply go away. There shouldn't really be any reason for
> > VM_FAULT_OOM to exist. A real low-on-memory situation is already
> > handled in the page allocator.
> 
> Umm. Maybe the WARN_ON is the first step in that direction? So we can see
> which driver actually did that, and complain to its authors?

This is much harder to do than it seems. But maybe this doesn't really
need full coverage. Some of the code paths which return VM_FAULT_OOM
will simply not fail. But checking for vma->vm_ops->fault() failures
might be interesting. Does the following tell you more about the failure
you are seeing?

diff --git a/mm/memory.c b/mm/memory.c
index 9ab00dcb95d4..5ff023ab7b49 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3442,8 +3442,11 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 
ret = vma->vm_ops->fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
-   VM_FAULT_DONE_COW)))
+   VM_FAULT_DONE_COW))) {
+   if (unlikely(ret & VM_FAULT_OOM))
+   pr_warn("VM_FAULT_OOM returned from %ps\n", vma->vm_ops->fault);
return ret;
+   }
 
if (unlikely(PageHWPoison(vmf->page))) {
if (ret & VM_FAULT_LOCKED)

-- 
Michal Hocko
SUSE Labs


Re: next-20200515: Xorg killed due to "OOM"

2020-05-28 Thread Michal Hocko
On Thu 28-05-20 14:03:54, Pavel Machek wrote:
> On Thu 2020-05-28 11:05:17, Michal Hocko wrote:
> > On Tue 26-05-20 11:10:54, Pavel Machek wrote:
> > [...]
> > > [38617.276517] oom_reaper: reaped process 31769 (chromium), now 
> > > anon-rss:0kB, file-rss:0kB, shmem-rss:7968kB
> > > [38617.277232] Xorg invoked oom-killer: gfp_mask=0x0(), order=0, 
> > > oom_score_adj=0
> > > [38617.277247] CPU: 0 PID: 2978 Comm: Xorg Not tainted 
> > > 5.7.0-rc5-next-20200515+ #117
> > > [38617.277256] Hardware name: LENOVO 17097HU/17097HU, BIOS 7BETD8WW (2.19 
> > > ) 03/31/2011
> > > [38617.277266] Call Trace:
> > > [38617.277286]  dump_stack+0x54/0x6e
> > > [38617.277300]  dump_header+0x45/0x321
> > > [38617.277313]  oom_kill_process.cold+0x9/0xe
> > > [38617.277324]  ? out_of_memory+0x167/0x420
> > > [38617.277336]  out_of_memory+0x1f2/0x420
> > > [38617.277348]  pagefault_out_of_memory+0x34/0x56
> > > [38617.277361]  mm_fault_error+0x4a/0x130
> > > [38617.277372]  do_page_fault+0x3ce/0x416
> > 
> > The reason the OOM killer has been invoked is that the page fault
> > handler has returned VM_FAULT_OOM. So this is not a result of the page
> > allocator struggling to allocate memory. It would be interesting to
> > check which code path has returned this. 
> 
> Should the core WARN_ON if that happens and there's enough memory, or
> something like that?

I wish it would simply go away. There shouldn't really be any reason for
VM_FAULT_OOM to exist. A real low-on-memory situation is already
handled in the page allocator.

-- 
Michal Hocko
SUSE Labs


Re: next-20200515: Xorg killed due to "OOM"

2020-05-28 Thread Michal Hocko
On Tue 26-05-20 11:10:54, Pavel Machek wrote:
[...]
> [38617.276517] oom_reaper: reaped process 31769 (chromium), now anon-rss:0kB, 
> file-rss:0kB, shmem-rss:7968kB
> [38617.277232] Xorg invoked oom-killer: gfp_mask=0x0(), order=0, 
> oom_score_adj=0
> [38617.277247] CPU: 0 PID: 2978 Comm: Xorg Not tainted 
> 5.7.0-rc5-next-20200515+ #117
> [38617.277256] Hardware name: LENOVO 17097HU/17097HU, BIOS 7BETD8WW (2.19 ) 
> 03/31/2011
> [38617.277266] Call Trace:
> [38617.277286]  dump_stack+0x54/0x6e
> [38617.277300]  dump_header+0x45/0x321
> [38617.277313]  oom_kill_process.cold+0x9/0xe
> [38617.277324]  ? out_of_memory+0x167/0x420
> [38617.277336]  out_of_memory+0x1f2/0x420
> [38617.277348]  pagefault_out_of_memory+0x34/0x56
> [38617.277361]  mm_fault_error+0x4a/0x130
> [38617.277372]  do_page_fault+0x3ce/0x416

The reason the OOM killer has been invoked is that the page fault
handler has returned VM_FAULT_OOM. So this is not a result of the page
allocator struggling to allocate memory. It would be interesting to
check which code path has returned this.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 0/9] Huge page-table entries for TTM

2020-03-03 Thread Michal Hocko
On Fri 28-02-20 14:08:04, Thomas Hellström (VMware) wrote:
> Andrew, Michal
> 
> I'm wondering what's the best way here to get the patches touching mm
> reviewed and accepted?

I am sorry, but I am busy with other stuff and unlikely to find time to
review this series.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v4 1/2] mm: Add a vmf_insert_mixed_prot() function

2019-12-20 Thread Michal Hocko
On Thu 12-12-19 09:47:40, Thomas Hellström (VMware) wrote:
> From: Thomas Hellstrom 
> 
> The TTM module today uses a hack to be able to set a different page
> protection than struct vm_area_struct::vm_page_prot. To be able to do
> this properly, add the needed vm functionality as vmf_insert_mixed_prot().
> 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: "Matthew Wilcox (Oracle)" 
> Cc: "Kirill A. Shutemov" 
> Cc: Ralph Campbell 
> Cc: "Jérôme Glisse" 
> Cc: "Christian König" 
> Signed-off-by: Thomas Hellstrom 
> Acked-by: Christian König 

I cannot say I am happy about this because it adds a discrepancy and
that is always tricky but I do agree that a formalized discrepancy is
better than ad-hoc hacks so

Acked-by: Michal Hocko 

Thanks for extending the documentation.

> ---
>  include/linux/mm.h   |  2 ++
>  include/linux/mm_types.h |  7 ++-
>  mm/memory.c  | 43 
>  3 files changed, 47 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index cc292273e6ba..29575d3c1e47 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2548,6 +2548,8 @@ vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct 
> *vma, unsigned long addr,
>   unsigned long pfn, pgprot_t pgprot);
>  vm_fault_t vmf_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
>   pfn_t pfn);
> +vm_fault_t vmf_insert_mixed_prot(struct vm_area_struct *vma, unsigned long 
> addr,
> + pfn_t pfn, pgprot_t pgprot);
>  vm_fault_t vmf_insert_mixed_mkwrite(struct vm_area_struct *vma,
>   unsigned long addr, pfn_t pfn);
>  int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned 
> long len);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index fa795284..ac96afdbb4bc 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -307,7 +307,12 @@ struct vm_area_struct {
>   /* Second cache line starts here. */
>  
>   struct mm_struct *vm_mm;/* The address space we belong to. */
> - pgprot_t vm_page_prot;  /* Access permissions of this VMA. */
> +
> + /*
> +  * Access permissions of this VMA.
> +  * See vmf_insert_mixed() for discussion.
> +  */
> + pgprot_t vm_page_prot;
>   unsigned long vm_flags; /* Flags, see mm.h. */
>  
>   /*
> diff --git a/mm/memory.c b/mm/memory.c
> index b1ca51a079f2..269a8a871e83 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1646,6 +1646,9 @@ static vm_fault_t insert_pfn(struct vm_area_struct 
> *vma, unsigned long addr,
>   * vmf_insert_pfn_prot should only be used if using multiple VMAs is
>   * impractical.
>   *
> + * See vmf_insert_mixed_prot() for a discussion of the implication of using
> + * a value of @pgprot different from that of @vma->vm_page_prot.
> + *
>   * Context: Process context.  May allocate using %GFP_KERNEL.
>   * Return: vm_fault_t value.
>   */
> @@ -1719,9 +1722,9 @@ static bool vm_mixed_ok(struct vm_area_struct *vma, 
> pfn_t pfn)
>  }
>  
>  static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
> - unsigned long addr, pfn_t pfn, bool mkwrite)
> + unsigned long addr, pfn_t pfn, pgprot_t pgprot,
> + bool mkwrite)
>  {
> - pgprot_t pgprot = vma->vm_page_prot;
>   int err;
>  
>   BUG_ON(!vm_mixed_ok(vma, pfn));
> @@ -1764,10 +1767,42 @@ static vm_fault_t __vm_insert_mixed(struct 
> vm_area_struct *vma,
>   return VM_FAULT_NOPAGE;
>  }
>  
> +/**
> + * vmf_insert_mixed_prot - insert single pfn into user vma with specified 
> pgprot
> + * @vma: user vma to map to
> + * @addr: target user address of this page
> + * @pfn: source kernel pfn
> + * @pgprot: pgprot flags for the inserted page
> + *
> + * This is exactly like vmf_insert_mixed(), except that it allows drivers to
> + * to override pgprot on a per-page basis.
> + *
> + * Typically this function should be used by drivers to set caching- and
> + * encryption bits different than those of @vma->vm_page_prot, because
> + * the caching- or encryption mode may not be known at mmap() time.
> + * This is ok as long as @vma->vm_page_prot is not used by the core vm
> + * to set caching and encryption bits for those vmas (except for COW pages).
> + * This is ensured by core vm only modifying these page table entries using
> + * functions that don't touch caching- or encryption bits, using pte_modify()
> + * if needed. (See for example mprotect()).
> + * Also when new page-table entries are created, this is 

Re: [PATCH v3 2/2] mm, drm/ttm: Fix vm page protection handling

2019-12-06 Thread Michal Hocko
On Fri 06-12-19 14:16:10, Thomas Hellstrom wrote:
> Hi Michal,
> 
> On Fri, 2019-12-06 at 11:30 +0100, Michal Hocko wrote:
> > On Fri 06-12-19 09:24:26, Thomas Hellström (VMware) wrote:
> > [...]
> > > @@ -283,11 +282,26 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct
> > > vm_fault *vmf,
> > >   pfn = page_to_pfn(page);
> > >   }
> > >  
> > > + /*
> > > +  * Note that the value of @prot at this point may
> > > differ from
> > > +  * the value of @vma->vm_page_prot in the caching- and
> > > +  * encryption bits. This is because the exact location
> > > of the
> > > +  * data may not be known at mmap() time and may also
> > > change
> > > +  * at arbitrary times while the data is mmap'ed.
> > > +  * This is ok as long as @vma->vm_page_prot is not used
> > > by
> > > +  * the core vm to set caching- and encryption bits.
> > > +  * This is ensured by core vm using pte_modify() to
> > > modify
> > > +  * page table entry protection bits (that function
> > > preserves
> > > +  * old caching- and encryption bits), and the @fault
> > > +  * callback being the only function that creates new
> > > +  * page table entries.
> > > +  */
> > 
> > While this is a very valuable piece of information I believe we need
> > to
> > document this in the generic code where everybody will find it.
> > vmf_insert_mixed_prot sounds like a good place to me. So being
> > explicit
> > about VM_MIXEDMAP. Also a reference from vm_page_prot to this
> > function
> > would be really helpful.
> > 
> > Thanks!
> > 
> 
> Just to make sure I understand correctly. You'd prefer this (or
> similar) text to be present at the vmf_insert_mixed_prot() and
> vmf_insert_pfn_prot() definitions for MIXEDMAP and PFNMAP respectively,
> and a pointer from vm_page_prot to that text. Is that correct?

Yes. You can keep whatever is specific to TTM here, but the rest should
be somewhere visible to users of the interface, and a note at
vm_page_prot should help anybody touching the generic/core code not to
break those expectations.

Thanks!
-- 
Michal Hocko
SUSE Labs

Re: [PATCH v3 2/2] mm, drm/ttm: Fix vm page protection handling

2019-12-06 Thread Michal Hocko
On Fri 06-12-19 09:24:26, Thomas Hellström (VMware) wrote:
[...]
> @@ -283,11 +282,26 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault 
> *vmf,
>   pfn = page_to_pfn(page);
>   }
>  
> + /*
> +  * Note that the value of @prot at this point may differ from
> +  * the value of @vma->vm_page_prot in the caching- and
> +  * encryption bits. This is because the exact location of the
> +  * data may not be known at mmap() time and may also change
> +  * at arbitrary times while the data is mmap'ed.
> +  * This is ok as long as @vma->vm_page_prot is not used by
> +  * the core vm to set caching- and encryption bits.
> +  * This is ensured by core vm using pte_modify() to modify
> +  * page table entry protection bits (that function preserves
> +  * old caching- and encryption bits), and the @fault
> +  * callback being the only function that creates new
> +  * page table entries.
> +  */

While this is a very valuable piece of information, I believe we need to
document it in the generic code where everybody will find it.
vmf_insert_mixed_prot() sounds like a good place to me, being explicit
about VM_MIXEDMAP. Also, a reference from vm_page_prot to this function
would be really helpful.

Thanks!

-- 
Michal Hocko
SUSE Labs

Re: [PATCH v2 2/2] drm/ttm: Fix vm page protection handling

2019-12-04 Thread Michal Hocko
On Wed 04-12-19 16:19:27, Thomas Hellström (VMware) wrote:
> On 12/4/19 3:42 PM, Michal Hocko wrote:
> > On Wed 04-12-19 15:36:58, Thomas Hellström (VMware) wrote:
> > > On 12/4/19 3:35 PM, Michal Hocko wrote:
> > > > On Wed 04-12-19 15:16:09, Thomas Hellström (VMware) wrote:
> > > > > On 12/4/19 2:52 PM, Michal Hocko wrote:
> > > > > > On Tue 03-12-19 11:48:53, Thomas Hellström (VMware) wrote:
> > > > > > > From: Thomas Hellstrom 
> > > > > > > 
> > > > > > > TTM graphics buffer objects may, transparently to user-space,  
> > > > > > > move
> > > > > > > between IO and system memory. When that happens, all PTEs 
> > > > > > > pointing to the
> > > > > > > old location are zapped before the move and then faulted in again 
> > > > > > > if
> > > > > > > needed. When that happens, the page protection caching mode- and
> > > > > > > encryption bits may change and be different from those of
> > > > > > > struct vm_area_struct::vm_page_prot.
> > > > > > > 
> > > > > > > We were using an ugly hack to set the page protection correctly.
> > > > > > > Fix that and instead use vmf_insert_mixed_prot() and / or
> > > > > > > vmf_insert_pfn_prot().
> > > > > > > Also get the default page protection from
> > > > > > > struct vm_area_struct::vm_page_prot rather than using 
> > > > > > > vm_get_page_prot().
> > > > > > > This way we catch modifications done by the vm system for drivers 
> > > > > > > that
> > > > > > > want write-notification.
> > > > > > So essentially this should have any new side effect on 
> > > > > > functionality it
> > > > > > is just making a hacky/ugly code less so?
> > > > > Functionality is unchanged. The use of a on-stack vma copy was 
> > > > > severely
> > > > > frowned upon in an earlier thread, which also points to another 
> > > > > similar
> > > > > example using vmf_insert_pfn_prot().
> > > > > 
> > > > > https://lore.kernel.org/lkml/20190905103541.4161-2-thomas...@shipmail.org/
> > > > > 
> > > > > > In other words what are the
> > > > > > consequences of having page protection inconsistent from vma's?
> > > > > During the years, it looks like the caching- and encryption flags of
> > > > > vma::vm_page_prot have been largely removed from usage. From what I 
> > > > > can
> > > > > tell, there are no more places left that can affect TTM. We discussed
> > > > > __split_huge_pmd_locked() towards the end of that thread, but that 
> > > > > doesn't
> > > > > affect TTM even with huge page-table entries.
> > > > Please state all those details/assumptions you are operating on in the
> > > > changelog.
> > > Thanks. I'll update the patchset and add that.
> > And thinking about that, this also begs for a comment in the code to
> > explain that some (which?) mappings might have a mismatch and the
> > generic code has to be careful. Because as things stand now, this seems
> > to be really subtle: it happens to work _now_ and might break in the future.
> > Or what prevents generic code from stumbling over this discrepancy?
> 
> Yes, we had that discussion in the thread I pointed to. I initially suggested
> and argued for updating the vma::vm_page_prot using a WRITE_ONCE() (we only
> have the mmap_sem in read mode); there seem to be other places in generic
> code that do the same.
> 
> But I was convinced by Andy that this was the right way and also was used
> elsewhere.
> 
> (See also 
> https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/vdso/vma.c#L116)
> 
> I guess to have this properly formulated, what's required is that generic
> code doesn't build page-table entries using vma::vm_page_prot for VM_PFNMAP
> and VM_MIXEDMAP outside of driver control.

Let me repeat that this belongs in the code somewhere everybody can see
it, rather than in a "random" discussion on a mailing list.

Thanks!
-- 
Michal Hocko
SUSE Labs

Re: [PATCH v2 2/2] drm/ttm: Fix vm page protection handling

2019-12-04 Thread Michal Hocko
On Wed 04-12-19 15:36:58, Thomas Hellström (VMware) wrote:
> On 12/4/19 3:35 PM, Michal Hocko wrote:
> > On Wed 04-12-19 15:16:09, Thomas Hellström (VMware) wrote:
> > > On 12/4/19 2:52 PM, Michal Hocko wrote:
> > > > On Tue 03-12-19 11:48:53, Thomas Hellström (VMware) wrote:
> > > > > From: Thomas Hellstrom 
> > > > > 
> > > > > TTM graphics buffer objects may, transparently to user-space,  move
> > > > > between IO and system memory. When that happens, all PTEs pointing to 
> > > > > the
> > > > > old location are zapped before the move and then faulted in again if
> > > > > needed. When that happens, the page protection caching mode- and
> > > > > encryption bits may change and be different from those of
> > > > > struct vm_area_struct::vm_page_prot.
> > > > > 
> > > > > We were using an ugly hack to set the page protection correctly.
> > > > > Fix that and instead use vmf_insert_mixed_prot() and / or
> > > > > vmf_insert_pfn_prot().
> > > > > Also get the default page protection from
> > > > > struct vm_area_struct::vm_page_prot rather than using 
> > > > > vm_get_page_prot().
> > > > > This way we catch modifications done by the vm system for drivers that
> > > > > want write-notification.
> > > > So essentially this should have any new side effect on functionality it
> > > > is just making a hacky/ugly code less so?
> > > Functionality is unchanged. The use of a on-stack vma copy was severely
> > > frowned upon in an earlier thread, which also points to another similar
> > > example using vmf_insert_pfn_prot().
> > > 
> > > https://lore.kernel.org/lkml/20190905103541.4161-2-thomas...@shipmail.org/
> > > 
> > > > In other words what are the
> > > > consequences of having page protection inconsistent from vma's?
> > > During the years, it looks like the caching- and encryption flags of
> > > vma::vm_page_prot have been largely removed from usage. From what I can
> > > tell, there are no more places left that can affect TTM. We discussed
> > > __split_huge_pmd_locked() towards the end of that thread, but that doesn't
> > > affect TTM even with huge page-table entries.
> > Please state all those details/assumptions you are operating on in the
> > changelog.
> 
> Thanks. I'll update the patchset and add that.

And thinking about that, this also begs for a comment in the code to
explain that some (which?) mappings might have a mismatch and the
generic code has to be careful. Because as things stand now, this seems
to be really subtle: it happens to work _now_ and might break in the future.
Or what prevents generic code from stumbling over this discrepancy?
-- 
Michal Hocko
SUSE Labs

Re: [PATCH v2 2/2] drm/ttm: Fix vm page protection handling

2019-12-04 Thread Michal Hocko
On Wed 04-12-19 15:16:09, Thomas Hellström (VMware) wrote:
> On 12/4/19 2:52 PM, Michal Hocko wrote:
> > On Tue 03-12-19 11:48:53, Thomas Hellström (VMware) wrote:
> > > From: Thomas Hellstrom 
> > > 
> > > TTM graphics buffer objects may, transparently to user-space,  move
> > > between IO and system memory. When that happens, all PTEs pointing to the
> > > old location are zapped before the move and then faulted in again if
> > > needed. When that happens, the page protection caching mode- and
> > > encryption bits may change and be different from those of
> > > struct vm_area_struct::vm_page_prot.
> > > 
> > > We were using an ugly hack to set the page protection correctly.
> > > Fix that and instead use vmf_insert_mixed_prot() and / or
> > > vmf_insert_pfn_prot().
> > > Also get the default page protection from
> > > struct vm_area_struct::vm_page_prot rather than using vm_get_page_prot().
> > > This way we catch modifications done by the vm system for drivers that
> > > want write-notification.
> > So essentially this shouldn't have any new side effects on functionality;
> > it is just making hacky/ugly code less so?
> 
> Functionality is unchanged. The use of a on-stack vma copy was severely
> frowned upon in an earlier thread, which also points to another similar
> example using vmf_insert_pfn_prot().
> 
> https://lore.kernel.org/lkml/20190905103541.4161-2-thomas...@shipmail.org/
> 
> > In other words what are the
> > consequences of having page protection inconsistent from vma's?
> 
> During the years, it looks like the caching- and encryption flags of
> vma::vm_page_prot have been largely removed from usage. From what I can
> tell, there are no more places left that can affect TTM. We discussed
> __split_huge_pmd_locked() towards the end of that thread, but that doesn't
> affect TTM even with huge page-table entries.

Please state all those details/assumptions you are operating on in the
changelog.
-- 
Michal Hocko
SUSE Labs

Re: [PATCH v2 2/2] drm/ttm: Fix vm page protection handling

2019-12-04 Thread Michal Hocko
On Tue 03-12-19 11:48:53, Thomas Hellström (VMware) wrote:
> From: Thomas Hellstrom 
> 
> TTM graphics buffer objects may, transparently to user-space,  move
> between IO and system memory. When that happens, all PTEs pointing to the
> old location are zapped before the move and then faulted in again if
> needed. When that happens, the page protection caching mode- and
> encryption bits may change and be different from those of
> struct vm_area_struct::vm_page_prot.
> 
> We were using an ugly hack to set the page protection correctly.
> Fix that and instead use vmf_insert_mixed_prot() and / or
> vmf_insert_pfn_prot().
> Also get the default page protection from
> struct vm_area_struct::vm_page_prot rather than using vm_get_page_prot().
> This way we catch modifications done by the vm system for drivers that
> want write-notification.

So essentially this shouldn't have any new side effects on functionality;
it is just making hacky/ugly code less so? In other words, what are the
consequences of having page protection inconsistent with the vma's?

> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: "Matthew Wilcox (Oracle)" 
> Cc: "Kirill A. Shutemov" 
> Cc: Ralph Campbell 
> Cc: "Jérôme Glisse" 
> Cc: "Christian König" 
> Signed-off-by: Thomas Hellstrom 
> Reviewed-by: Christian König 
> ---
>  drivers/gpu/drm/ttm/ttm_bo_vm.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> index e6495ca2630b..2098f8d4dfc5 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
> @@ -173,7 +173,6 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
>   pgoff_t num_prefault)
>  {
>   struct vm_area_struct *vma = vmf->vma;
> - struct vm_area_struct cvma = *vma;
>   struct ttm_buffer_object *bo = vma->vm_private_data;
>   struct ttm_bo_device *bdev = bo->bdev;
>   unsigned long page_offset;
> @@ -244,7 +243,7 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
>   goto out_io_unlock;
>   }
>  
> - cvma.vm_page_prot = ttm_io_prot(bo->mem.placement, prot);
> + prot = ttm_io_prot(bo->mem.placement, prot);
>   if (!bo->mem.bus.is_iomem) {
>   struct ttm_operation_ctx ctx = {
>   .interruptible = false,
> @@ -260,7 +259,7 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault *vmf,
>   }
>   } else {
>   /* Iomem should not be marked encrypted */
> - cvma.vm_page_prot = pgprot_decrypted(cvma.vm_page_prot);
> + prot = pgprot_decrypted(prot);
>   }
>  
>   /*
> @@ -284,10 +283,11 @@ vm_fault_t ttm_bo_vm_fault_reserved(struct vm_fault 
> *vmf,
>   }
>  
>   if (vma->vm_flags & VM_MIXEDMAP)
> - ret = vmf_insert_mixed(&cvma, address,
> - __pfn_to_pfn_t(pfn, PFN_DEV));
> + ret = vmf_insert_mixed_prot(vma, address,
> + __pfn_to_pfn_t(pfn, 
> PFN_DEV),
> + prot);
>   else
> - ret = vmf_insert_pfn(&cvma, address, pfn);
> + ret = vmf_insert_pfn_prot(vma, address, pfn, prot);
>  
>   /* Never error on prefaulted PTEs */
>   if (unlikely((ret & VM_FAULT_ERROR))) {
> @@ -319,7 +319,7 @@ vm_fault_t ttm_bo_vm_fault(struct vm_fault *vmf)
>   if (ret)
>   return ret;
>  
> - prot = vm_get_page_prot(vma->vm_flags);
> + prot = vma->vm_page_prot;
>   ret = ttm_bo_vm_fault_reserved(vmf, prot, TTM_BO_VM_NUM_PREFAULT);
>   if (ret == VM_FAULT_RETRY && !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
>   return ret;
> -- 
> 2.21.0

-- 
Michal Hocko
SUSE Labs

Re: [PATCH v2 1/2] mm: Add and export vmf_insert_mixed_prot()

2019-12-04 Thread Michal Hocko
On Tue 03-12-19 11:48:52, Thomas Hellström (VMware) wrote:
> From: Thomas Hellstrom 
> 
> The TTM module today uses a hack to be able to set a different page
> protection than struct vm_area_struct::vm_page_prot. To be able to do
> this properly, add and export vmf_insert_mixed_prot().

A new symbol should be added in the patch that adds its new user, which
is not the case here.
-- 
Michal Hocko
SUSE Labs

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-10 Thread Michal Hocko
On Tue 10-09-19 09:03:29, Tejun Heo wrote:
> Hello, Michal.
> 
> On Tue, Sep 10, 2019 at 01:54:48PM +0200, Michal Hocko wrote:
> > > So, while it'd be great to have shrinkers in the longer term, it's not a
> > > strict requirement to be accounted in memcg.  It already accounts a
> > > lot of memory which isn't reclaimable (a lot of slabs and socket
> > > buffers).
> > 
> > Yeah, having a shrinker is preferred, but the memory had better be
> > reclaimable in some form. If not by any other means, then at least bound
> > to a user process context so that it goes away with the task being killed
> > by the OOM killer. If that is not the case, then we cannot really charge
> > it, because then the memcg controller is of no use. We can tolerate it to
> > some degree if the amount of memory charged like that is negligible
> > compared to the overall size. But from the discussion it seems that these
> > buffers are really large.
> 
> Yeah, oom kills should be able to reduce the usage; however, please
> note that tmpfs, among other things, can already escape this
> restriction and we can have cgroups which are over max and empty.
> It's obviously not ideal but the system doesn't melt down from it
> either.

Right, and that is a reason why access to tmpfs should be restricted
when containing a workload with memcg. My understanding of this particular
feature is that memcg should be the primary containment method, and
that's why I brought this up.

-- 
Michal Hocko
SUSE Labs

Re: [PATCH RFC v4 00/16] new cgroup controller for gpu/drm subsystem

2019-09-10 Thread Michal Hocko
On Fri 06-09-19 08:45:39, Tejun Heo wrote:
> Hello, Daniel.
> 
> On Fri, Sep 06, 2019 at 05:34:16PM +0200, Daniel Vetter wrote:
> > > Hmm... what'd be the fundamental difference from slab or socket memory
> > > which are handled through memcg?  Is system memory used by GPUs have
> > > further global restrictions in addition to the amount of physical
> > > memory used?
> > 
> > Sometimes, but that would be specific resources (kinda like vram),
> > e.g. CMA regions used by a gpu. But probably not something you'll run
> > in a datacenter and want cgroups for ...
> > 
> > I guess we could try to integrate with the memcg group controller. One
> > trouble is that aside from i915 most gpu drivers do not really have a
> > full shrinker, so not sure how that would all integrate.
> 
> So, while it'd be great to have shrinkers in the longer term, it's not a
> strict requirement to be accounted in memcg.  It already accounts a
> lot of memory which isn't reclaimable (a lot of slabs and socket
> buffers).

Yeah, having a shrinker is preferred, but the memory had better be
reclaimable in some form. If not by any other means, then at least bound
to a user process context so that it goes away with the task being killed
by the OOM killer. If that is not the case, then we cannot really charge
it, because then the memcg controller is of no use. We can tolerate it to
some degree if the amount of memory charged like that is negligible
compared to the overall size. But from the discussion it seems that these
buffers are really large.
-- 
Michal Hocko
SUSE Labs

Re: [PATCH 3/5] kernel.h: Add non_block_start/end()

2019-08-28 Thread Michal Hocko
On Mon 26-08-19 22:14:23, Daniel Vetter wrote:
> In some special cases we must not block, but there's not a
> spinlock, preempt-off, irqs-off or similar critical section already
> that arms the might_sleep() debug checks. Add a non_block_start/end()
> pair to annotate these.
> 
> This will be used in the oom paths of mmu-notifiers, where blocking is
> not allowed to make sure there's forward progress. Quoting Michal:
> 
> "The notifier is called from quite a restricted context - oom_reaper -
> which shouldn't depend on any locks or sleepable conditionals. The code
> should be swift as well but we mostly do care about it to make a forward
> progress. Checking for sleepable context is the best thing we could come
> up with that would describe these demands at least partially."
> 
> Peter also asked whether we want to catch spinlocks on top, but Michal
> said those are less of a problem because spinlocks can't have an
> indirect dependency upon the page allocator and hence close the loop
> with the oom reaper.
> 
> Suggested by Michal Hocko.
> 
> v2:
> - Improve commit message (Michal)
> - Also check in schedule, not just might_sleep (Peter)
> 
> v3: It works better when I actually squash in the fixup I had lying
> around :-/
> 
> v4: Pick the suggestion from Andrew Morton to give non_block_start/end
> some good kerneldoc comments. I added that other blocking calls like
> wait_event pose similar issues, since that's the other example we
> discussed.
> 
> Cc: Jason Gunthorpe 
> Cc: Peter Zijlstra 
> Cc: Ingo Molnar 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: David Rientjes 
> Cc: "Christian König" 
> Cc: Daniel Vetter 
> Cc: "Jérôme Glisse" 
> Cc: linux...@kvack.org
> Cc: Masahiro Yamada 
> Cc: Wei Wang 
> Cc: Andy Shevchenko 
> Cc: Thomas Gleixner 
> Cc: Jann Horn 
> Cc: Feng Tang 
> Cc: Kees Cook 
> Cc: Randy Dunlap 
> Cc: linux-ker...@vger.kernel.org
> Acked-by: Christian König  (v1)
> Acked-by: Peter Zijlstra (Intel) 
> Signed-off-by: Daniel Vetter 

Acked-by: Michal Hocko 

Thanks and sorry for being mostly silent/slow in discussions here.
ETOOBUSY.

> ---
>  include/linux/kernel.h | 25 -
>  include/linux/sched.h  |  4 
>  kernel/sched/core.c| 19 ++-
>  3 files changed, 42 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 4fa360a13c1e..82f84cfe372f 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -217,7 +217,9 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset);
>   * might_sleep - annotation for functions that can sleep
>   *
>   * this macro will print a stack trace if it is executed in an atomic
> - * context (spinlock, irq-handler, ...).
> + * context (spinlock, irq-handler, ...). Additional sections where blocking is
> + * not allowed can be annotated with non_block_start() and non_block_end()
> + * pairs.
>   *
>   * This is a useful debugging help to be able to catch problems early and not
>   * be bitten later when the calling function happens to sleep when it is not
> @@ -233,6 +235,25 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset);
>  # define cant_sleep() \
>  	do { __cant_sleep(__FILE__, __LINE__, 0); } while (0)
>  # define sched_annotate_sleep()	(current->task_state_change = 0)
> +/**
> + * non_block_start - annotate the start of section where sleeping is prohibited
> + *
> + * This is on behalf of the oom reaper, specifically when it is calling the mmu
> + * notifiers. The problem is that if the notifier were to block on, for example,
> + * mutex_lock() and if the process which holds that mutex were to perform a
> + * sleeping memory allocation, the oom reaper is now blocked on completion of
> + * that memory allocation. Other blocking calls like wait_event() pose similar
> + * issues.
> + */
> +# define non_block_start() \
> +	do { current->non_block_count++; } while (0)
> +/**
> + * non_block_end - annotate the end of section where sleeping is prohibited
> + *
> + * Closes a section opened by non_block_start().
> + */
> +# define non_block_end() \
> +	do { WARN_ON(current->non_block_count-- == 0); } while (0)
>  #else
>    static inline void ___might_sleep(const char *file, int line,
>  				     int preempt_offset) { }
> @@ -241,6 +262,8 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset);
>  # define might_sleep() do { might_resched(); } while (0)
>  # define cant_sleep() do { } while (0)
>  # define sched_annotate_sleep() do { } while (0)
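
For illustration, a minimal sketch of the intended consumer (simplified
from later patches in the series, not part of the hunk above): the mmu
notifier core wraps the !blockable callback invocation so that any
attempt to schedule inside it triggers the new warning.

    /* sketch: inside __mmu_notifier_invalidate_range_start(), simplified */
    if (!mmu_notifier_range_blockable(range))
            non_block_start();
    ret = mn->ops->invalidate_range_start(mn, range);
    if (!mmu_notifier_range_blockable(range))
            non_block_end();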

Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-20 Thread Michal Hocko
On Fri 16-08-19 11:31:45, Jason Gunthorpe wrote:
> On Fri, Aug 16, 2019 at 02:26:25PM +0200, Michal Hocko wrote:
[...]
> > I believe I have given some examples when introducing __GFP_NOLOCKDEP.
> 
> Okay, I think that is 7e7844226f10 ("lockdep: allow to disable reclaim
> lockup detection") Hmm, sadly the lkml link in the commit is broken.
> 
> Hum. There are no users of __GFP_NOLOCKDEP today?? Could all the false
> positives have been fixed??

I would be more than surprised. Those false positives were usually
papered over by using GFP_NOFS. And the original plan was to convert
those back to GFP_KERNEL-like allocations and use __GFP_NOLOCKDEP.
 
> Keep in mind that this fs_reclaim was reworked away from the hacky
> interrupt context flag to a fairly elegant real lock map.

I am glad to hear that because that was just too ugly to live.

> Based on my read of the commit message, my first reaction would be to
> suggest lockdep_set_class() to solve the problem described, ie
> different classes for 'inside transaction' and 'outside transaction'
> on xfs_refcountbt_init_cursor()

No, this just turned out to be unmaintainable. The number of lock classes
was growing too high. I recommend one of Dave Chinner's rants on the
subject. Sorry, no link handy.

> I understood that generally that is the way to handle lockdep false
> positives.
> 
> Anyhow, if you are willing to consider that lockdep isn't broken, I
> have some ideas on how to make this clearer and increase
> coverage. Would you be willing to look at patches on this topic? (not
> soon, I have to finish my mmu notifier stuff)

I haven't claimed that lockdep is broken. It just had problems capturing
some code paths and generated false positives. I would recommend talking
to the lockdep maintainers much more than to me because I would have to
dive into the code much more to be useful. I can still comment on the MM
side of things, of course, if that is helpful.

> > > I would like to inject it into the notifier path as this is very
> > > difficult for driver authors to discover and know about, but I'm
> > > worried about your false positive remark.
> > > 
> > > I think I understand we can use only GFP_ATOMIC in the notifiers, but
> > > we need a strategy to handle OOM to guarantee forward progress.
> > 
> > Your example is from the notifier registration IIUC. 
> 
> Yes, that is where this commit hit it.. Triggering this under an
> actual notifier to get a lockdep report is hard.

All you need is to generate some memory pressure. That shouldn't be that
hard.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-16 Thread Michal Hocko
On Fri 16-08-19 09:19:06, Jason Gunthorpe wrote:
> On Fri, Aug 16, 2019 at 10:10:29AM +0200, Michal Hocko wrote:
> > On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:
> > > 
> > > > > The last detail is I'm still unclear what a GFP flags a blockable
> > > > > invalidate_range_start() should use. Is GFP_KERNEL OK?
> > > > 
> > > > I hope I will not make this muddy again ;)
> > > > invalidate_range_start in the blockable mode can use/depend on any 
> > > > sleepable
> > > > allocation allowed in the context it is called from. 
> > > 
> > > 'in the context it is called from' is the magic phrase, as
> > > invalidate_range_start is called while holding several different mm
> > > related locks. I know at least write mmap_sem and i_mmap_rwsem
> > > (write?)
> > > 
> > > Can GFP_KERNEL be called while holding those locks?
> > 
> > i_mmap_rwsem would be problematic because it is taken during the
> > reclaim.
> 
> Okay.. So the fs_reclaim debugging does catch errors.

I do not think fs_reclaim is the underlying mechanism to catch this
deadlock. It is a simple AA deadlock: you take i_mmap_rwsem, then go
down the allocation path into direct reclaim and take the lock again.
Nothing really surprising. fs_reclaim is really there to catch a
GFP_NOFS context calling into a less restricted (e.g. GFP_KERNEL)
allocation context.
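
To make that AA pattern concrete, a minimal hypothetical sketch (the
function name and allocation size are illustrative, not taken from the
thread):

    static void aa_deadlock_sketch(struct address_space *mapping)
    {
            void *p;

            i_mmap_lock_write(mapping);     /* takes i_mmap_rwsem */
            /*
             * GFP_KERNEL may enter direct reclaim; the rmap walk done
             * during reclaim takes i_mmap_rwsem again -> AA deadlock.
             */
            p = kmalloc(64, GFP_KERNEL);
            kfree(p);
            i_mmap_unlock_write(mapping);
    }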

> Do you have any
> reference for what a false positive looks like? 

I believe I have given some examples when introducing __GFP_NOLOCKDEP.
 
> I would like to inject it into the notifier path as this is very
> difficult for driver authors to discover and know about, but I'm
> worried about your false positive remark.
> 
> I think I understand we can use only GFP_ATOMIC in the notifiers, but
> we need a strategy to handle OOM to guarantee forward progress.

Your example is from the notifier registration IIUC. Can you
pre-allocate before taking locks? Could you point me to some examples
when the allocation is necessary in the range notifier callback?
-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-16 Thread Michal Hocko
On Thu 15-08-19 22:16:43, Daniel Vetter wrote:
> On Thu, Aug 15, 2019 at 9:35 PM Michal Hocko  wrote:
[...]
> > > The last detail is I'm still unclear what a GFP flags a blockable
> > > invalidate_range_start() should use. Is GFP_KERNEL OK?
> >
> > I hope I will not make this muddy again ;)
> > invalidate_range_start in the blockable mode can use/depend on any sleepable
> > allocation allowed in the context it is called from. So in other words
> > it is no different from any other function in the kernel that calls into
> > allocator. As the API is missing gfp context then I hope it is not
> > called from any restricted contexts (except from the oom which we have
> > !blockable for).
> 
> Hm, that's new to me. I thought mmu notifiers very much can be called
> from direct reclaim paths, so you have to be extremely careful with
> getting back into that one.

Correct, I should have added that notifier callbacks ideally do not
allocate any memory. They can block and even that is quite a pain to be
honest.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-16 Thread Michal Hocko
On Thu 15-08-19 15:15:09, Andrew Morton wrote:
> On Thu, 15 Aug 2019 10:44:29 +0200 Michal Hocko  wrote:
> 
> > > I continue to struggle with this.  It introduces a new kernel state
> > > "running preemptibly but must not call schedule()".  How does this make
> > > any sense?
> > > 
> > > Perhaps a much, much more detailed description of the oom_reaper
> > > situation would help out.
> >  
> > The primary point here is that there is a demand for non-blockable mmu
> > notifiers to be called when the oom reaper tears down the address space.
> > As the oom reaper is the primary guarantee of oom handling forward
> > progress, it cannot be blocked on anything that might depend on blockable
> > memory allocations. These are not really easy to track because they
> > might be indirect - e.g. a notifier blocks on a lock which another
> > context holds while allocating memory or waiting for a flusher that
> > needs memory to perform its work. If such a blocking state happens, we
> > can end up in a silent hang with an unusable machine.
> > Now, we hope for reasonable implementations of mmu notifiers (strong
> > words, I know ;) and this should be a relatively simple and effective
> > catch-all tool to detect that something suspicious is going on.
> > 
> > Does that make the situation more clear?
> 
> Yes, thanks, much.  Maybe a code comment along the lines of
> 
>   This is on behalf of the oom reaper, specifically when it is
>   calling the mmu notifiers.  The problem is that if the notifier were
>   to block on, for example, mutex_lock() and if the process which holds
>   that mutex were to perform a sleeping memory allocation, the oom
>   reaper is now blocked on completion of that memory allocation.

reaper is now blocked on completion of that memory allocation
and cannot provide the guarantee of the OOM forward progress.

OK. 
 
> btw, do we need task_struct.non_block_count?  Perhaps the oom reaper
> thread could set a new PF_NONBLOCK (which would be more general than
> PF_OOM_REAPER).  If we run out of PF_ flags, use (current == oom_reaper_th).

Well, I do not have a strong opinion here. A simple check for the value
seems to be trivial. There are quite some holes in task_struct to hide
this counter without increasing the size.
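
For reference, the state the annotation macros rely on amounts to the
following (a sketch; the actual hunk is in the patch's
include/linux/sched.h change, which is not shown in this thread):

    struct task_struct {
            ...
            /* non-zero inside a non_block_start()/non_block_end() section */
            int                     non_block_count;
            ...
    };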
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-16 Thread Michal Hocko
On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:
> 
> > > The last detail is I'm still unclear what a GFP flags a blockable
> > > invalidate_range_start() should use. Is GFP_KERNEL OK?
> > 
> > I hope I will not make this muddy again ;)
> > invalidate_range_start in the blockable mode can use/depend on any sleepable
> > allocation allowed in the context it is called from. 
> 
> 'in the context it is called from' is the magic phrase, as
> invalidate_range_start is called while holding several different mm
> related locks. I know at least write mmap_sem and i_mmap_rwsem
> (write?)
> 
> Can GFP_KERNEL be called while holding those locks?

i_mmap_rwsem would be problematic because it is taken during the
reclaim.

> This is the question of indirect dependency on reclaim via locks you
> raised earlier.
> 
> > So in other words it is no different from any other function in the
> > kernel that calls into allocator. As the API is missing gfp context
> > then I hope it is not called from any restricted contexts (except
> > from the oom which we have !blockable for).
> 
> Yes, the callers are exactly my concern.
>  
> > > Lockdep has
> > > complained on that in past due to fs_reclaim - how do you know if it
> > > is a false positive?
> > 
> > I would have to see the specific lockdep splat.
> 
> See below. I found it when trying to understand why the registration
> of the mmu notififer was so oddly coded.
> 
> The situation was:
> 
>   down_write(&mm->mmap_sem);
>   mm_take_all_locks(mm);
>   kmalloc(GFP_KERNEL);  <--- lockdep warning

Ugh. mm_take_all_locks :/

> I understood Daniel said he saw this directly on a recent kernel when
> working with his lockdep patch?
> 
> Checking myself, on todays kernel I see a call chain:
> 
> shrink_all_memory
>   fs_reclaim_acquire(sc.gfp_mask);
>   [..]
>   do_try_to_free_pages
>shrink_zones
> shrink_node
>  shrink_node_memcg
>   shrink_list
>shrink_active_list
> page_referenced
>  rmap_walk
>   rmap_walk_file
>i_mmap_lock_read
> down_read(i_mmap_rwsem)
> 
> So it is possible that the down_read() above will block on
> i_mmap_rwsem being held in the caller of invalidate_range_start which
> is doing kmalloc(GPF_KERNEL).
> 
> Is this OK? The lockdep annotation says no..

It's not OK, as per the above call chain, which is easily possible
because mm_take_all_locks will lock all file vmas.

-- 
Michal Hocko
SUSE Labs

Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Michal Hocko
On Thu 15-08-19 16:18:10, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 09:05:25PM +0200, Michal Hocko wrote:
> 
> > This is what you claim and I am saying that fs_reclaim is about a
> > restricted reclaim context and it is an ugly hack. It has proven to
> > report false positives. Maybe it can be extended to a generic reclaim.
> > I haven't tried that. I do not aim to try it.
> 
> Okay, great, I think this has been very helpful, at least for me,
> thanks. I did not know fs_reclaim was so problematic, or the special
> cases about OOM 'reclaim'. 

I am happy that this is more clear now.

> On this patch, I have no general objection to enforcing drivers to be
> non-blocking, I'd just like to see it done with the existing lockdep
> can't sleep detection rather than inventing some new debugging for it.
> 
> I understand this means the debugging requires lockdep enabled and
> will not run in production, but I'm of the view that is OK and in line
> with general kernel practice.

Yes and I do agree with this in general.

> The last detail is I'm still unclear what a GFP flags a blockable
> invalidate_range_start() should use. Is GFP_KERNEL OK?

I hope I will not make this muddy again ;)
invalidate_range_start in the blockable mode can use/depend on any sleepable
allocation allowed in the context it is called from. So in other words
it is no different from any other function in the kernel that calls into
allocator. As the API is missing gfp context then I hope it is not
called from any restricted contexts (except from the oom which we have
!blockable for).

> Lockdep has
> complained on that in past due to fs_reclaim - how do you know if it
> is a false positive?

I would have to see the specific lockdep splat.
-- 
Michal Hocko
SUSE Labs

Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Michal Hocko
On Thu 15-08-19 15:24:48, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 07:42:07PM +0200, Michal Hocko wrote:
> > On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> > > 
> > > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > > > __GFP_DIRECT_RECLAIM..
> > > > >
> > > > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > > > allocations during OOM, then I think fs_reclaim already matches what
> > > > > you described?
> > > > 
> > > > No, GFP_NOFS is equally bad. Please read my other email explaining what
> > > > the oom_reaper actually requires. In short, no blocking on a direct or
> > > > indirect dependency on memory allocation that might sleep.
> > > 
> > > It is much easier to follow with some hints on code, so the true
> > > requirement is that the OOM reaper not block on GFP_FS and GFP_IO
> > > allocations, great, that constraint is now clear.
> > 
> > I still do not get why you put FS/IO into the picture. This is really
> > about __GFP_DIRECT_RECLAIM.
> 
> Like I said this is complicated, translating "no blocking on direct or
> indirect dependency on memory allocation that might sleep" into GFP
> flags is hard for us outside the mm community.
> 
> So the contraint here is no __GFP_DIRECT_RECLAIM?

OK, I am obviously failing to explain that. Sorry about that. You are
right that this is not simple. Let me try again.

The context we are calling !blockable notifiers from has to finish in a
_finite_ amount of time (and swiftness is hugely appreciated by users of
an otherwise non-responsive system that is under OOM). We are out of
memory, so we cannot be blocked waiting for memory, directly or
indirectly (via a lock, or waiting for an event that needs memory to
finish, in general). So you need to track the dependency over more
complicated contexts than the direct call path (think of a workqueue,
for example).
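
A hypothetical sketch of such an indirect dependency (illustrative
driver code, not from the thread): the notifier itself never allocates,
yet it still depends on a sleepable allocation through the work item it
flushes.

    static void my_teardown_work(struct work_struct *work)
    {
            /* may sleep in direct reclaim -> needs memory to finish */
            void *tmp = kmalloc(PAGE_SIZE, GFP_KERNEL);

            /* ... tear down device mappings ... */
            kfree(tmp);
    }
    static DECLARE_WORK(my_work, my_teardown_work);

    /* inside the (supposedly non-blocking) notifier: */
    queue_work(system_wq, &my_work);
    flush_work(&my_work);   /* blocks until my_teardown_work completes */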

> I bring up FS/IO because that is what Tejun mentioned when I asked him
> about reclaim restrictions, and is what fs_reclaim_acquire() is
> already sensitive too. It is pretty confusing if we have places using
> the word 'reclaim' with different restrictions. :(

fs_reclaim has been invented to catch potential deadlocks when a
GFP_NO{FS/IO} allocation hits into fs/io reclaim. This is a subset of
the reclaim that we have. The oom context is more restricted and it
cannot depend on _any_ memory reclaim because by the time we have got to
this context all the reclaim has already failed and wait some more will
simply not help.

> > >CPU0 CPU1
> > > mutex_lock()
> > > kmalloc(GFP_KERNEL)
> > 
> > no I mean __GFP_DIRECT_RECLAIM here.
> > 
> > > mutex_unlock()
> > >   fs_reclaim_acquire()
> > >   mutex_lock() <- illegal: lock dep assertion
> > 
> > I cannot really comment on how that is achieveable by lockdep.
> 
> ??? I am trying to explain this is already done and working today. The
> above example will already fault with lockdep enabled.
> 
> This is existing debugging we can use and improve upon rather that
> invent new debugging.

This is what you claim and I am saying that fs_reclaim is about a
restricted reclaim context and it is an ugly hack. It has proven to
report false positives. Maybe it can be extended to a generic reclaim.
I haven't tried that. I do not aim to try it.
-- 
Michal Hocko
SUSE Labs

Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Michal Hocko
On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> 
> > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > __GFP_DIRECT_RECLAIM..
> > >
> > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > allocations during OOM, then I think fs_reclaim already matches what
> > > you described?
> > 
> > No, GFP_NOFS is equally bad. Please read my other email explaining what
> > the oom_reaper actually requires. In short, no blocking on a direct or
> > indirect dependency on memory allocation that might sleep.
> 
> It is much easier to follow with some hints on code, so the true
> requirement is that the OOM reaper not block on GFP_FS and GFP_IO
> allocations, great, that constraint is now clear.

I still do not get why you put FS/IO into the picture. This is really
about __GFP_DIRECT_RECLAIM.

> 
> > If you can express that in the existing lockdep machinery, all
> > fine. But then consider deployments where lockdep is a no-no because
> > of the overhead.
> 
> This is all for driver debugging. The point of lockdep is to find all
> these paths without have to hit them as actual races, using debug
> kernels.
> 
> I don't think we need this kind of debugging on production kernels?

Again, the primary motivation was a simple debugging aid that could be
used without worrying about overhead. So lockdep is very often out of
the question.

> > > The best we got was drivers tested the VA range and returned success
> > > if they had no interest. Which is a big win to be sure, but it looks
> like getting any more is not really possible.
> > 
And that is already a great win! Because many notifiers only care
about particular mappings. Please note that backing off unconditionally
would simply mean that the oom reaper has to back off without tearing
anything down.
> 
> Well, I'm working to propose that we do the VA range test under core
> mmu notifier code that cannot block and then we simply remove the idea
> of blockable from drivers using this new 'range notifier'. 
> 
> I think this pretty much solves the concern?

Well, my idea was that a range check and early bail-out was a first step,
and then each specific notifier would be able to do a more specific
check. I was not able to do the second step because that requires a deep
understanding of the respective subsystem.

Really, all I care about is reclaiming as much memory from the
oom_reaper context as possible. And that cannot really be an unbounded
process. Quite the contrary, it should be as swift as possible. From my
cursory look, some notifiers are able to achieve their task without
blocking or depending on memory just fine. So bailing out
unconditionally on the range of interest would just put us back.
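
A sketch of that "range test and early bail-out" pattern (a
hypothetical driver notifier; struct my_mirror and its fields are
illustrative):

    static int my_invalidate_range_start(struct mmu_notifier *mn,
                    const struct mmu_notifier_range *range)
    {
            struct my_mirror *m = container_of(mn, struct my_mirror, mn);

            /* not our VA range -> nothing to do, succeed immediately */
            if (range->end <= m->start || range->start >= m->end)
                    return 0;
            /* cannot make progress without sleeping -> back off */
            if (!mmu_notifier_range_blockable(range))
                    return -EAGAIN;
            mutex_lock(&m->lock);
            /* ... tear down the overlapping device mappings ... */
            mutex_unlock(&m->lock);
            return 0;
    }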

> > > However, we could (probably even should) make the drivers fs_reclaim
> > > safe.
> > > 
> > > If that is enough to guarantee progress of OOM, then lets consider
> > > something like using current_gfp_context() to force PF_MEMALLOC_NOFS
> > > allocation behavior on the driver callback and lockdep to try and keep
> > > pushing on the the debugging, and dropping !blocking.
> > 
> > How are you going to enforce indirect dependency? E.g. a lock that is
> > also used in other context which depend on sleepable memory allocation
> > to move forward.
> 
> You mean like this:
> 
>CPU0 CPU1
> mutex_lock()
> kmalloc(GFP_KERNEL)

no I mean __GFP_DIRECT_RECLAIM here.

> mutex_unlock()
>   fs_reclaim_acquire()
>   mutex_lock() <- illegal: lock dep assertion

I cannot really comment on how that is achievable by lockdep. I managed
to forget the details about FS/IO reclaim recursion tracking already and
I do not have time to learn them again. It was quite a hack. Anyway, let
me repeat that the primary motivation was a simple aid, not something as
powerful as lockdep.
-- 
Michal Hocko
SUSE Labs

Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Michal Hocko
On Thu 15-08-19 11:12:19, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 03:21:27PM +0200, Michal Hocko wrote:
> > On Thu 15-08-19 09:23:44, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote:
> > > > On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote:
> > > > > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> > > > > > In some special cases we must not block, but there's not a
> > > > > > spinlock, preempt-off, irqs-off or similar critical section already
> > > > > > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > > > > > pair to annotate these.
> > > > > > 
> > > > > > This will be used in the oom paths of mmu-notifiers, where blocking is
> > > > > > not allowed to make sure there's forward progress. Quoting Michal:
> > > > > > 
> > > > > > "The notifier is called from quite a restricted context - oom_reaper -
> > > > > > which shouldn't depend on any locks or sleepable conditionals. The code
> > > > > > should be swift as well but we mostly do care about it to make a forward
> > > > > > progress. Checking for sleepable context is the best thing we could come
> > > > > > up with that would describe these demands at least partially."
> > > > > 
> > > > > But this describes fs_reclaim_acquire() - is there some reason we are
> > > > > conflating fs_reclaim with non-sleeping?
> > > > 
> > > > No idea why you tie this into fs_reclaim. We can definitely sleep in there,
> > > > and for e.g. kswapd (which also wraps everything in fs_reclaim) we're
> > > > even supposed to, I thought. To make sure we can get at the last bit of
> > > > memory by flushing all the queues and waiting for everything to be cleaned
> > > > out.
> > > 
> > > AFAIK the point of fs_reclaim is to prevent "indirect dependency upon
> > > the page allocator" ie a justification that was given this !blockable
> > > stuff.
> > > 
> > > For instance:
> > > 
> > >   fs_reclaim_acquire()
> > >   kmalloc(GFP_KERNEL) <- lock dep assertion
> > > 
> > > And further, Michal's concern about indirectness through locks is also
> > > handled by lockdep:
> > > 
> > >CPU0 CPU1
> > > mutex_lock()
> > > kmalloc(GFP_KERNEL)
> > > mutex_unlock()
> > >   fs_reclaim_acquire()
> > >   mutex_lock() <- lock dep assertion
> > > 
> > > In other words, to prevent recursion into the page allocator you use
> > > fs_reclaim_acquire(), and lockdep verifies it in its usual robust way.
> > 
> > fs_reclaim_acquire is about FS/IO recursions IIUC. We are talking about
> > any !GFP_NOWAIT allocation context here and any {in}direct dependency on
> > it. 
> 
> AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> __GFP_DIRECT_RECLAIM..
>
> This matches the existing test in __need_fs_reclaim() - so if you are
> OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> allocations during OOM, then I think fs_reclaim already matches what
> you described?

No, GFP_NOFS is equally bad. Please read my other email explaining what
the oom_reaper actually requires. In short, no blocking on a direct or
indirect dependency on memory allocation that might sleep. If you can
express that in the existing lockdep machinery, all fine. But then
consider deployments where lockdep is a no-no because of the overhead.

> > No, non-blocking is a very coarse approximation of what we really need.
> > But it should give us even a stronger condition. Essentially any sleep
> > other than a preemption shouldn't be allowed in that context.
> 
> But it is a nonsense API to give the driver invalidate_range_start,
> the blocking alternative to the non-blocking invalidate_range and then
> demand it to be non-blocking.
> 
> Inspecting the code, no drivers are actually able to progress their
> side in non-blocking mode.
> 
> The best we got was drivers tested the VA range and returned success
> if they had no interest.

Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Michal Hocko
On Thu 15-08-19 10:04:15, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote:
> 
> > As the oom reaper is the primary guarantee of the oom handling forward
> > progress it cannot be blocked on anything that might depend on blockable
> > memory allocations. These are not really easy to track because they
> > might be indirect - e.g. notifier blocks on a lock which other context
> > holds while allocating memory or waiting for a flusher that needs memory
> > to perform its work.
> 
> But lockdep *does* track all this and fs_reclaim_acquire() was created
> to solve exactly this problem.
> 
> fs_reclaim is a lock and it flows through all the usual lockdep
> schemes like any other lock. Any time the page allocator wants to do
> something the would deadlock with reclaim it takes the lock.

Our emails have crossed. Even if fs_reclaim can be re-purposed for
contexts other than FS/IO reclaim, which I am not sure about, I think
lockdep is too heavyweight for the purpose of this debugging aid in the
first place. The non-block mode is really something as simple as "hey,
mmu notifier, you are only allowed to do lightweight processing; if you
cannot guarantee that then just back off". The annotation will give us a
warning if the notifier breaks this promise.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Michal Hocko
On Thu 15-08-19 09:23:44, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote:
> > On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> > > > In some special cases we must not block, but there's not a
> > > > spinlock, preempt-off, irqs-off or similar critical section already
> > > > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > > > pair to annotate these.
> > > > 
> > > > This will be used in the oom paths of mmu-notifiers, where blocking is
> > > > not allowed to make sure there's forward progress. Quoting Michal:
> > > > 
> > > > "The notifier is called from quite a restricted context - oom_reaper -
> > > > which shouldn't depend on any locks or sleepable conditionals. The code
> > > > should be swift as well but we mostly do care about it to make a forward
> > > > progress. Checking for sleepable context is the best thing we could come
> > > > up with that would describe these demands at least partially."
> > > 
> > > But this describes fs_reclaim_acquire() - is there some reason we are
> > > conflating fs_reclaim with non-sleeping?
> > 
> > No idea why you tie this into fs_reclaim. We can definitely sleep in there,
> > and for e.g. kswapd (which also wraps everything in fs_reclaim) we're
> > even supposed to, I thought. To make sure we can get at the last bit of
> > memory by flushing all the queues and waiting for everything to be cleaned
> > out.
> 
> AFAIK the point of fs_reclaim is to prevent "indirect dependency upon
> the page allocator" ie a justification that was given this !blockable
> stuff.
> 
> For instance:
> 
>   fs_reclaim_acquire()
>   kmalloc(GFP_KERNEL) <- lock dep assertion
> 
> And further, Michal's concern about indirectness through locks is also
> handled by lockdep:
> 
>CPU0 CPU1
> mutex_lock()
> kmalloc(GFP_KERNEL)
> mutex_unlock()
>   fs_reclaim_acquire()
>   mutex_lock() <- lock dep assertion
> 
> In other words, to prevent recursion into the page allocator you use
> fs_reclaim_acquire(), and lockdep verifies it in its usual robust way.

fs_reclaim_acquire is about FS/IO recursions IIUC. We are talking about
any !GFP_NOWAIT allocation context here and any {in}direct dependency on
it. Whether fs_reclaim_acquire can be reused for that I do not know,
because I am not familiar enough with the lockdep machinery.
 
> I asked Tejun about this once in regards to WQ_MEM_RECLAIM and he
> explained that it means you can't call the allocator functions in a
> way that would recurse into reclaim (ie instead use GFP_ATOMIC, or
> tolerate allocation failure, or various other things).
> 
> So, the reason I bring it up is half the justifications you posted for
> blockable had to do with not recursing into reclaim and deadlocking,
> and didn't seem to have much to do with blocking.
> 
> I'm asking if *non-blocking* is really the requirement or if this is
> just the usual 'do not deadlock on the allocator' thing reclaim paths
> alread have?

No, non-blocking is a very coarse approximation of what we really need.
But it should give us even a stronger condition. Essentially any sleep
other than a preemption shouldn't be allowed in that context.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Michal Hocko
On Wed 14-08-19 13:45:58, Andrew Morton wrote:
> On Wed, 14 Aug 2019 22:20:24 +0200 Daniel Vetter  wrote:
> 
> > In some special cases we must not block, but there's not a
> > spinlock, preempt-off, irqs-off or similar critical section already
> > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > pair to annotate these.
> > 
> > This will be used in the oom paths of mmu-notifiers, where blocking is
> > not allowed to make sure there's forward progress. Quoting Michal:
> > 
> > "The notifier is called from quite a restricted context - oom_reaper -
> > which shouldn't depend on any locks or sleepable conditionals. The code
> > should be swift as well but we mostly do care about it to make a forward
> > progress. Checking for sleepable context is the best thing we could come
> > up with that would describe these demands at least partially."
> > 
> > Peter also asked whether we want to catch spinlocks on top, but Michal
> > said those are less of a problem because spinlocks can't have an
> > indirect dependency upon the page allocator and hence close the loop
> > with the oom reaper.
> 
> I continue to struggle with this.  It introduces a new kernel state
> "running preemptibly but must not call schedule()".  How does this make
> any sense?
> 
> Perhaps a much, much more detailed description of the oom_reaper
> situation would help out.
 
The primary point here is that there is a demand for non-blockable mmu
notifiers to be called when the oom reaper tears down the address space.
As the oom reaper is the primary guarantee of oom handling forward
progress, it cannot be blocked on anything that might depend on blockable
memory allocations. These are not really easy to track because they
might be indirect - e.g. a notifier blocks on a lock which another
context holds while allocating memory or waiting for a flusher that
needs memory to perform its work. If such a blocking state happens, we
can end up in a silent hang with an unusable machine.
Now, we hope for reasonable implementations of mmu notifiers (strong
words, I know ;) and this should be a relatively simple and effective
catch-all tool to detect that something suspicious is going on.

Does that make the situation more clear?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 00/34] put_user_pages(): miscellaneous call sites

2019-08-07 Thread Michal Hocko
On Wed 07-08-19 10:37:26, Jan Kara wrote:
> On Fri 02-08-19 12:14:09, John Hubbard wrote:
> > On 8/2/19 7:52 AM, Jan Kara wrote:
> > > On Fri 02-08-19 07:24:43, Matthew Wilcox wrote:
> > > > On Fri, Aug 02, 2019 at 02:41:46PM +0200, Jan Kara wrote:
> > > > > On Fri 02-08-19 11:12:44, Michal Hocko wrote:
> > > > > > On Thu 01-08-19 19:19:31, john.hubb...@gmail.com wrote:
> > > > > > [...]
> > > > > > > 2) Convert all of the call sites for get_user_pages*(), to
> > > > > > > invoke put_user_page*(), instead of put_page(). This involves dozens of
> > > > > > > call sites, and will take some time.
> > > > > > > 
> > > > > > How do we make sure this is the case and it will remain the case in the
> > > > > > future? There must be some automagic to enforce/check that. It is simply
> > > > > > not manageable to do it every now and then because then 3) will simply
> > > > > > never be safe.
> > > > > > 
> > > > > > Have you considered coccinelle or some other scripted way to do the
> > > > > > transition? I have no idea how to deal with future changes that would
> > > > > > break the balance though.
> > 
> > Hi Michal,
> > 
> > Yes, I've thought about it, and coccinelle falls a bit short (it's not smart
> > enough to know which put_page()'s to convert). However, there is a debug
> > option planned: a yet-to-be-posted commit [1] uses struct page extensions
> > (obviously protected by CONFIG_DEBUG_GET_USER_PAGES_REFERENCES) to add
> > a redundant counter. That allows:
> > 
> > void __put_page(struct page *page)
> > {
> > ...
> > /* Someone called put_page() instead of put_user_page() */
> > WARN_ON_ONCE(atomic_read(&page_ext->pin_count) > 0);
> > 
> > > > > 
> > > > > Yeah, that's why I've been suggesting at LSF/MM that we may need to create
> > > > > a gup wrapper - say vaddr_pin_pages() - and track which sites dropping
> > > > > references got converted by using this wrapper instead of gup. The
> > > > > counterpart would then be more logically named as unpin_page() or whatever
> > > > > instead of put_user_page().  Sure this is not completely foolproof (you can
> > > > > create new callsite using vaddr_pin_pages() and then just drop refs using
> > > > > put_page()) but I suppose it would be a high enough barrier for missed
> > > > > conversions... Thoughts?
> > 
> > The debug option above is still a bit simplistic in its implementation
> > (and maybe not taking full advantage of the data it has), but I think
> > it's preferable, because it monitors the "core" and WARNs.
> > 
> > Instead of the wrapper, I'm thinking: documentation and the passage of
> > time, plus the debug option (perhaps enhanced--probably once I post it
> > someone will notice opportunities), yes?
> 
> So I think your debug option and my suggested renaming serve a bit
> different purposes (and thus both make sense). If you do the renaming, you
> can just grep to see unconverted sites. Also when someone merges new GUP
> user (unaware of the new rules) while you switch GUP to use pins instead of
> ordinary references, you'll get compilation error in case of renaming
> instead of hard to debug refcount leak without the renaming. And such
> conflict is almost bound to happen given the size of GUP patch set... Also
> the renaming serves against the "coding inertia" - i.e., GUP is around for
> ages so people just use it without checking any documentation or comments.
> After switching how GUP works, what used to be correct isn't anymore so
> renaming the function serves as a warning that something has really
> changed.

Fully agreed!

> Your refcount debug patches are good to catch bugs in the conversions done
> but that requires you to be able to exercise the code path in the first
> place which may require particular HW or so, and you also have to enable
> the debug option which means you already aim at verifying the GUP
> references are treated properly.
> 
>   Honza
> 
> -- 
> Jan Kara 
> SUSE Labs, CR

-- 
Michal Hocko
SUSE Labs

Re: [PATCH 00/34] put_user_pages(): miscellaneous call sites

2019-08-02 Thread Michal Hocko
On Thu 01-08-19 19:19:31, john.hubb...@gmail.com wrote:
[...]
> 2) Convert all of the call sites for get_user_pages*(), to
> invoke put_user_page*(), instead of put_page(). This involves dozens of
> call sites, and will take some time.

How do we make sure this is the case and it will remain the case in the
future? There must be some automagic to enforce/check that. It is simply
not manageable to do it every now and then because then 3) will simply
never be safe.

Have you considered coccinelle or some other scripted way to do the
transition? I have no idea how to deal with future changes that would
break the balance though.
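
For context, the conversion in question is mechanical at each call site
but hard to verify globally - a sketch (illustrative fragment, not
taken from the series):

    /* before: ordinary refcount pairing */
    ret = get_user_pages(addr, 1, FOLL_WRITE, &page, NULL);
    /* ... use the page ... */
    put_page(page);

    /* after: the gup reference must be dropped via the new helper */
    ret = get_user_pages(addr, 1, FOLL_WRITE, &page, NULL);
    /* ... use the page ... */
    put_user_page(page);    /* pairs with gup, undoes the pin accounting */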
-- 
Michal Hocko
SUSE Labs

Re: [PATCH] mm/hmm: replace hmm_update with mmu_notifier_range

2019-07-24 Thread Michal Hocko
On Wed 24-07-19 20:56:17, Michal Hocko wrote:
> On Wed 24-07-19 15:08:37, Jason Gunthorpe wrote:
> > On Wed, Jul 24, 2019 at 07:58:58PM +0200, Michal Hocko wrote:
> [...]
> > > Maybe new users have started relying on a new semantic in the meantime;
> > > back then, none of the notifiers had even started any action in blocking
> > > mode on an EAGAIN bailout. Most of them simply did trylock early in the
> > > process and bailed out, so there was nothing to do for the range_end
> > > callback.
> > 
> > Single notifiers are not the problem. I tried to make this clear in
> > the commit message, but lets be more explicit.
> > 
> > We have *two* notifiers registered to the mm, A and B:
> > 
> > A invalidate_range_start: (has no blocking)
> > spin_lock()
> > counter++
> > spin_unlock()
> > 
> > A invalidate_range_end:
> > spin_lock()
> > counter--
> > spin_unlock()
> > 
> > And this one:
> > 
> > B invalidate_range_start: (has blocking)
> > if (!try_mutex_lock())
> > return -EAGAIN;
> > counter++
> > mutex_unlock()
> > 
> > B invalidate_range_end:
> > spin_lock()
> > counter--
> > spin_unlock()
> > 
> > So now the oom path does:
> > 
> > invalidate_range_start_non_blocking:
> >  for each mn:
> >a->invalidate_range_start
> >b->invalidate_range_start
> >rc = EAGAIN
> > 
> > Now we SKIP A's invalidate_range_end even though A had no idea this
> > would happen has state that needs to be unwound. A is broken.
> > 
> > B survived just fine.
> > 
> > A and B *alone* work fine, combined they fail.
> 
> But that requires that they share some state, right?
> 
> > When the commit was landed you can use KVM as an example of A and RDMA
> > ODP as an example of B
> 
> Could you point me where those two share the state please? KVM seems to
> be using kvm->mmu_notifier_count but I do not know where to look for the
> RDMA...

Scratch that. ELONGDAY... I can see your point. It is the all-or-nothing
semantic that doesn't really work here. Looking back at your patch, it
seems reasonable, but I am not sure what the behavior is supposed to be
for notifiers that failed.
-- 
Michal Hocko
SUSE Labs

Re: [PATCH] mm/hmm: replace hmm_update with mmu_notifier_range

2019-07-24 Thread Michal Hocko
On Wed 24-07-19 15:08:37, Jason Gunthorpe wrote:
> On Wed, Jul 24, 2019 at 07:58:58PM +0200, Michal Hocko wrote:
[...]
> > Maybe new users have started relying on a new semantic in the meantime;
> > back then, none of the notifiers had even started any action in blocking
> > mode on an EAGAIN bailout. Most of them simply did trylock early in the
> > process and bailed out, so there was nothing to do for the range_end
> > callback.
> 
> Single notifiers are not the problem. I tried to make this clear in
> the commit message, but lets be more explicit.
> 
> We have *two* notifiers registered to the mm, A and B:
> 
> A invalidate_range_start: (has no blocking)
> spin_lock()
> counter++
> spin_unlock()
> 
> A invalidate_range_end:
> spin_lock()
> counter--
> spin_unlock()
> 
> And this one:
> 
> B invalidate_range_start: (has blocking)
> if (!try_mutex_lock())
> return -EAGAIN;
> counter++
> mutex_unlock()
> 
> B invalidate_range_end:
> spin_lock()
> counter--
> spin_unlock()
> 
> So now the oom path does:
> 
> invalidate_range_start_non_blocking:
>  for each mn:
>a->invalidate_range_start
>b->invalidate_range_start
>rc = EAGAIN
> 
> Now we SKIP A's invalidate_range_end even though A had no idea this
> would happen has state that needs to be unwound. A is broken.
> 
> B survived just fine.
> 
> A and B *alone* work fine, combined they fail.

But that requires that they share some state, right?

> When the commit was landed you can use KVM as an example of A and RDMA
> ODP as an example of B

Could you point me where those two share the state please? KVM seems to
be using kvm->mmu_notifier_count but I do not know where to look for the
RDMA...
-- 
Michal Hocko
SUSE Labs

Re: [PATCH] mm/hmm: replace hmm_update with mmu_notifier_range

2019-07-24 Thread Michal Hocko
On Wed 24-07-19 17:33:05, Christoph Hellwig wrote:
> On Wed, Jul 24, 2019 at 12:28:58PM -0300, Jason Gunthorpe wrote:
> > Humm. Actually having looked this some more, I wonder if this is a
> > problem:
> 
> What a mess.
> 
> Question: do we even care for the non-blocking events?  The only place
> where mmu_notifier_invalidate_range_start_nonblock is called is the oom
> killer, which means the process is about to die and the pagetable will
> get torn down ASAP.  Should we just skip them unconditionally?

How do you tell whether they are going to block or not without trying?
-- 
Michal Hocko
SUSE Labs

Re: [PATCH] mm/hmm: replace hmm_update with mmu_notifier_range

2019-07-24 Thread Michal Hocko
On Wed 24-07-19 12:28:58, Jason Gunthorpe wrote:
> On Wed, Jul 24, 2019 at 09:05:53AM +0200, Christoph Hellwig wrote:
> > Looks good:
> > 
> > Reviewed-by: Christoph Hellwig 
> > 
> > One comment on a related cleanup:
> > 
> > >   list_for_each_entry(mirror, &hmm->mirrors, list) {
> > >   int rc;
> > >  
> > > - rc = mirror->ops->sync_cpu_device_pagetables(mirror, &update);
> > > + rc = mirror->ops->sync_cpu_device_pagetables(mirror, nrange);
> > >   if (rc) {
> > > - if (WARN_ON(update.blockable || rc != -EAGAIN))
> > > + if (WARN_ON(mmu_notifier_range_blockable(nrange) ||
> > > + rc != -EAGAIN))
> > >   continue;
> > >   ret = -EAGAIN;
> > >   break;
> > 
> > This magic handling of error seems odd.  I think we should merge rc and
> > ret into one variable and just break out if any error happens instead
> > or claiming in the comments -EAGAIN is the only valid error and then
> > ignoring all others here.
> 
> The WARN_ON is enforcing the rules already commented near
> mmuu_notifier_ops.invalidate_start - we could break or continue, it
> doesn't much matter how to recover from a broken driver, but since we
> did the WARN_ON this should sanitize the ret to EAGAIN or 0
> 
> Humm. Actually having looked this some more, I wonder if this is a
> problem:
> 
> I see in __oom_reap_task_mm():
> 
>   if (mmu_notifier_invalidate_range_start_nonblock(&range)) {
>           tlb_finish_mmu(&tlb, range.start, range.end);
>           ret = false;
>           continue;
>   }
>   unmap_page_range(&tlb, vma, range.start, range.end, NULL);
>   mmu_notifier_invalidate_range_end(&range);
> 
> Which looks like it creates an unbalanced start/end pairing if any
> start returns EAGAIN?
> 
> This does not seem OK.. Many users require start/end to be paired to
> keep track of their internal locking. Ie for instance hmm breaks
> because the hmm->notifiers counter becomes unable to get to 0.
> 
> Below is the best idea I've had so far..
> 
> Michal, what do you think?

IIRC we discussed this with Jerome back when I introduced this code and,
unless I misremember, he said the current code was OK.
Maybe new users have started relying on a new semantic in the meantime;
back then, none of the notifiers had even started any action in blocking
mode on an EAGAIN bailout. Most of them simply did trylock early in the
process and bailed out, so there was nothing to do for the range_end
callback.

Has this changed?
-- 
Michal Hocko
SUSE Labs

Re: [PATCH 05/22] mm: export alloc_pages_vma

2019-06-26 Thread Michal Hocko
On Wed 26-06-19 09:14:32, Dan Williams wrote:
> On Tue, Jun 25, 2019 at 10:46 PM Michal Hocko  wrote:
> >
> > On Tue 25-06-19 12:52:18, Dan Williams wrote:
> > [...]
> > > > Documentation/process/stable-api-nonsense.rst
> > >
> > > That document has failed to preclude symbol export fights in the past
> > > and there is a reasonable argument to try not to retract functionality
> > > that had been previously exported regardless of that document.
> >
> > Can you point me to any specific example where this would be the case
> > for the core kernel symbols please?
> 
> The most recent example that comes to mind was the thrash around
> __kernel_fpu_{begin,end} [1].

Well, this seems more like a disagreement over functionality whose
visibility was reduced rather than enforcement of a specific API. And I
do agree that the above document states that this is perfectly
legitimate and that no out-of-tree code can rely on _any_ functionality
being preserved.

On the other hand, I am not really surprised about the discussion,
because d63e79b114c02 is a mere cleanup that does not explain why the
functionality should be restricted to GPL-only code. So there certainly
is room for clarification, especially when the code had been exported
without this restriction in the past (see 8546c008924d5). So to me this
sounds more like the usual EXPORT_SYMBOL{_GPL} mess.

In any case I really do not see any relation to the maintenance cost
here. GPL symbols are not in any sense more stable than any other
exported symbol. They can change at any time. The only maintenance
burden is to update all _in_kernel_ users of the said symbol. Any
out-of-tree code is on its own to deal with this. Full stop.

GPL or non-GPL symbols solely define the scope of their usage.
Nothing less and nothing more.
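
As an illustration of that scope-only difference (one or the other
would be used; neither promises any API stability):

    EXPORT_SYMBOL(alloc_pages_vma);         /* linkable by any module */
    EXPORT_SYMBOL_GPL(alloc_pages_vma);     /* GPL-compatible modules only */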

> I referenced that when debating _GPL symbol policy with Jérôme [2].
> 
> [1]: https://lore.kernel.org/lkml/20190522100959.ga15...@kroah.com/
> [2]: 
> https://lore.kernel.org/lkml/CAPcyv4gb+r==rikfxkvz7ggdnke62ybmz7xoa4ubbbyhnk9...@mail.gmail.com/

-- 
Michal Hocko
SUSE Labs

Re: [PATCH 06/25] mm: export alloc_pages_vma

2019-06-26 Thread Michal Hocko
On Wed 26-06-19 14:27:05, Christoph Hellwig wrote:
> nouveau is currently using this through an odd hmm wrapper, and I plan
> to switch it to the real thing later in this series.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: John Hubbard 

Acked-by: Michal Hocko 

Thanks!

> ---
>  mm/mempolicy.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 01600d80ae01..f48569aa1863 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2098,6 +2098,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  out:
>   return page;
>  }
> +EXPORT_SYMBOL(alloc_pages_vma);
>  
>  /**
>   *   alloc_pages_current - Allocate pages.
> -- 
> 2.20.1

-- 
Michal Hocko
SUSE Labs

Re: [PATCH 05/22] mm: export alloc_pages_vma

2019-06-25 Thread Michal Hocko
On Tue 25-06-19 12:52:18, Dan Williams wrote:
[...]
> > Documentation/process/stable-api-nonsense.rst
> 
> That document has failed to preclude symbol export fights in the past
> and there is a reasonable argument to try not to retract functionality
> that had been previously exported regardless of that document.

Can you point me to any specific example where this would be the case
for the core kernel symbols please?
-- 
Michal Hocko
SUSE Labs

Re: [PATCH 18/22] mm: mark DEVICE_PUBLIC as broken

2019-06-25 Thread Michal Hocko
On Tue 25-06-19 20:15:28, John Hubbard wrote:
> On 6/19/19 12:27 PM, Jason Gunthorpe wrote:
> > On Thu, Jun 13, 2019 at 06:23:04PM -0700, John Hubbard wrote:
> >> On 6/13/19 5:43 PM, Ira Weiny wrote:
> >>> On Thu, Jun 13, 2019 at 07:58:29PM +, Jason Gunthorpe wrote:
> >>>> On Thu, Jun 13, 2019 at 12:53:02PM -0700, Ralph Campbell wrote:
> >>>>>
> >> ...
> >>> So I think it is ok.  Frankly I was wondering if we should remove the public
> >>> type altogether but conceptually it seems ok.  But I don't see any users of it
> >>> so...  should we get rid of it in the code rather than turning the config off?
> >>>
> >>> Ira
> >>
> >> That seems reasonable. I recall that the hope was for those IBM Power 9
> >> systems to use _PUBLIC, as they have hardware-based coherent device (GPU)
> >> memory, and so the memory really is visible to the CPU. And the IBM team
> >> was thinking of taking advantage of it. But I haven't seen anything on
> >> that front for a while.
> > 
> > Does anyone know who those people are and can we encourage them to
> > send some patches? :)
> > 
> 
> I asked about this, and it seems that the idea was: DEVICE_PUBLIC was there
> in order to provide an alternative way to do things (such as migrate memory
> to and from a device), in case the combination of existing and near-future
> NUMA APIs was insufficient. This probably came as a follow-up to the early
> 2017-ish conversations about NUMA, in which the linux-mm recommendation was
> "try using HMM mechanisms, and if those are inadequate, then maybe we can
> look at enhancing NUMA so that it has better handling of advanced (GPU-like)
> devices".

Yes, that was the original idea. It sounds so much better to use a common
framework rather than awkward special-cased cpuless NUMA nodes with
a weird semantic. Users of neither of the two have shown up, so I guess
that the envisioned HW just didn't materialize. Or has a completely
different approach been chosen?

> In the end, however, _PUBLIC was never used, nor does anyone in the local
> (NVIDIA + IBM) kernel vicinity seem to have plans to use it.  So it really
> does seem safe to remove, although of course it's good to start with 
> BROKEN and see if anyone pops up and complains.

Well, I do not really see much of a difference. Preserving unused
code which doesn't have any user in sight just adds a maintenance burden,
whether the code depends on BROKEN or not. We can always revert the patches
which remove the code once a real user shows up.
-- 
Michal Hocko
SUSE Labs

Re: [PATCH 05/22] mm: export alloc_pages_vma

2019-06-25 Thread Michal Hocko
On Tue 25-06-19 11:03:53, Dan Williams wrote:
> On Tue, Jun 25, 2019 at 8:01 AM Michal Hocko  wrote:
> >
> > On Tue 25-06-19 09:23:17, Christoph Hellwig wrote:
> > > On Mon, Jun 24, 2019 at 11:24:48AM -0700, Dan Williams wrote:
> > > > I asked for this simply because it was not exported historically. In
> > > > general I want to establish explicit export-type criteria so the
> > > > community can spend less time debating when to use EXPORT_SYMBOL_GPL
> > > > [1].
> > > >
> > > > The thought in this instance is that it is not historically exported
> > > > to modules and it is safer from a maintenance perspective to start
> > > > with GPL-only for new symbols in case we don't want to maintain that
> > > > interface long-term for out-of-tree modules.
> > > >
> > > > Yes, we always reserve the right to remove / change interfaces
> > > > regardless of the export type, but history has shown that external
> > > > pressure to keep an interface stable (contrary to
> > > > Documentation/process/stable-api-nonsense.rst) tends to be less for
> > > > GPL-only exports.
> > >
> > > Fully agreed.  In the end the decision is with the MM maintainers,
> > > though, although I'd prefer to keep it as in this series.
> >
> > I am sorry but I am not really convinced by the above reasoning wrt.
> > the allocator API, which has been the subject of many changes over time. I
> > do not remember a single case where we would be bending the allocator
> > API because of external modules and I am pretty sure we will push back
> > heavily if that was the case in the future.
> 
> This seems to say that you have no direct experience of dealing with
> changing symbols that a prominent out-of-tree module needs? GPU
> drivers and the core-mm are on a path to increase their cooperation on
> memory management mechanisms over time, and symbol export changes for
> out-of-tree GPU drivers have been a significant source of friction in
> the past.

I have experience with, e.g., reworking the semantics of some gfp flags,
which is something users usually get wrong, and I have never heard of
out-of-tree code insisting on an old semantic and pushing us into a
corner.

> > So in this particular case I would go with consistency and export the
> > same way we do with other functions. Also, we do not want people to
> > reinvent this API and screw it up like we have seen in other cases where
> > external modules try to reimplement core functionality themselves.
> 
> Consistency is a weak argument when the cost to the upstream community
> is negligible. If the same functionality was available via another /
> already exported interface *that* would be an argument to maintain the
> existing export policy. "Consistency" in and of itself is not a
> precedent we can use more widely in default export-type decisions.
> 
> Effectively I'm arguing EXPORT_SYMBOL_GPL by default with a later
> decision to drop the _GPL. Similar to how we are careful to mark sysfs
> interfaces in Documentation/ABI/ that we are not fully committed to
> maintaining over time, or are otherwise so new that there is not yet a
> good read on whether they can be made permanent.

Documentation/process/stable-api-nonsense.rst
Really. If you want to play the GPL vs. EXPORT_SYMBOL game, that is up
to you, but I do not see any technical argument to make this particular
interface to the page allocator any different from all the others that
are exported to modules.
-- 
Michal Hocko
SUSE Labs

Re: [PATCH 05/22] mm: export alloc_pages_vma

2019-06-25 Thread Michal Hocko
On Tue 25-06-19 09:23:17, Christoph Hellwig wrote:
> On Mon, Jun 24, 2019 at 11:24:48AM -0700, Dan Williams wrote:
> > I asked for this simply because it was not exported historically. In
> > general I want to establish explicit export-type criteria so the
> > community can spend less time debating when to use EXPORT_SYMBOL_GPL
> > [1].
> > 
> > The thought in this instance is that it is not historically exported
> > to modules and it is safer from a maintenance perspective to start
> > with GPL-only for new symbols in case we don't want to maintain that
> > interface long-term for out-of-tree modules.
> > 
> > Yes, we always reserve the right to remove / change interfaces
> > regardless of the export type, but history has shown that external
> > pressure to keep an interface stable (contrary to
> > Documentation/process/stable-api-nonsense.rst) tends to be less for
> > GPL-only exports.
> 
> Fully agreed.  In the end the decision is with the MM maintainers,
> although I'd prefer to keep it as in this series.

I am sorry but I am not really convinced by the above reasoning wrt. the
allocator API, which has been the subject of many changes over time. I
do not remember a single case where we bent the allocator API because of
external modules, and I am pretty sure we will push back heavily if that
happens in the future.

So in this particular case I would go with consistency and export the
same way we do with other functions. Also we do not want people to
reinvent this API and screw it up like we have seen in other cases when
external modules try to reimplement core functionality themselves.

Thanks!
-- 
Michal Hocko
SUSE Labs

Re: [PATCH 04/22] mm: don't clear ->mapping in hmm_devmem_free

2019-06-20 Thread Michal Hocko
On Thu 13-06-19 11:43:07, Christoph Hellwig wrote:
> ->mapping isn't even used by HMM users, and the field at the same offset
> in the zone_device part of the union is declared as pad.  (Which btw is
> rather confusing, as DAX uses ->pgmap and ->mapping from two different
> sides of the union, but DAX doesn't use hmm_devmem_free).
> 
> Signed-off-by: Christoph Hellwig 

I cannot really judge here but setting mapping here without any comment
is quite confusing. So if this is safe to remove then it is certainly an
improvement.

> ---
>  mm/hmm.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 0c62426d1257..e1dc98407e7b 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -1347,8 +1347,6 @@ static void hmm_devmem_free(struct page *page, void *data)
>  {
>   struct hmm_devmem *devmem = data;
>  
> - page->mapping = NULL;
> -
>   devmem->ops->free(devmem, page);
>  }
>  
> -- 
> 2.20.1

-- 
Michal Hocko
SUSE Labs

Re: [PATCH 03/22] mm: remove hmm_devmem_add_resource

2019-06-20 Thread Michal Hocko
On Thu 13-06-19 11:43:06, Christoph Hellwig wrote:
> This function has never been used since it was first added to the kernel
> more than a year and a half ago, and if we ever grow a consumer of the
> MEMORY_DEVICE_PUBLIC infrastructure it can easily use devm_memremap_pages
> directly now that we've simplified the API for it.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Michal Hocko 

> ---
>  include/linux/hmm.h |  3 ---
>  mm/hmm.c| 54 -
>  2 files changed, 57 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 4867b9da1b6c..5761a39221a6 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -688,9 +688,6 @@ struct hmm_devmem {
>  struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
> struct device *device,
> unsigned long size);
> -struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
> -struct device *device,
> -struct resource *res);
>  
>  /*
>   * hmm_devmem_page_set_drvdata - set per-page driver data field
> diff --git a/mm/hmm.c b/mm/hmm.c
> index ff2598eb7377..0c62426d1257 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -1445,58 +1445,4 @@ struct hmm_devmem *hmm_devmem_add(const struct hmm_devmem_ops *ops,
>   return devmem;
>  }
>  EXPORT_SYMBOL_GPL(hmm_devmem_add);
> -
> -struct hmm_devmem *hmm_devmem_add_resource(const struct hmm_devmem_ops *ops,
> -struct device *device,
> -struct resource *res)
> -{
> - struct hmm_devmem *devmem;
> - void *result;
> - int ret;
> -
> - if (res->desc != IORES_DESC_DEVICE_PUBLIC_MEMORY)
> - return ERR_PTR(-EINVAL);
> -
> - dev_pagemap_get_ops();
> -
> - devmem = devm_kzalloc(device, sizeof(*devmem), GFP_KERNEL);
> - if (!devmem)
> - return ERR_PTR(-ENOMEM);
> -
> - init_completion(&devmem->completion);
> - devmem->pfn_first = -1UL;
> - devmem->pfn_last = -1UL;
> - devmem->resource = res;
> - devmem->device = device;
> - devmem->ops = ops;
> -
> - ret = percpu_ref_init(&devmem->ref, &hmm_devmem_ref_release,
> -   0, GFP_KERNEL);
> - if (ret)
> - return ERR_PTR(ret);
> -
> - ret = devm_add_action_or_reset(device, hmm_devmem_ref_exit,
> - &devmem->ref);
> - if (ret)
> - return ERR_PTR(ret);
> -
> - devmem->pfn_first = devmem->resource->start >> PAGE_SHIFT;
> - devmem->pfn_last = devmem->pfn_first +
> -(resource_size(devmem->resource) >> PAGE_SHIFT);
> - devmem->page_fault = hmm_devmem_fault;
> -
> - devmem->pagemap.type = MEMORY_DEVICE_PUBLIC;
> - devmem->pagemap.res = *devmem->resource;
> - devmem->pagemap.page_free = hmm_devmem_free;
> - devmem->pagemap.altmap_valid = false;
> - devmem->pagemap.ref = &devmem->ref;
> - devmem->pagemap.data = devmem;
> - devmem->pagemap.kill = hmm_devmem_ref_kill;
> -
> - result = devm_memremap_pages(devmem->device, &devmem->pagemap);
> - if (IS_ERR(result))
> - return result;
> - return devmem;
> -}
> -EXPORT_SYMBOL_GPL(hmm_devmem_add_resource);
>  #endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
> -- 
> 2.20.1

-- 
Michal Hocko
SUSE Labs

Re: [PATCH 18/22] mm: mark DEVICE_PUBLIC as broken

2019-06-20 Thread Michal Hocko
On Thu 13-06-19 11:43:21, Christoph Hellwig wrote:
> The code hasn't been used since it was added to the tree, and doesn't
> appear to actually be usable.  Mark it as BROKEN until either a user
> comes along or we finally give up on it.

I would go even further and simply remove all the DEVICE_PUBLIC code.

> Signed-off-by: Christoph Hellwig 

Anyway
Acked-by: Michal Hocko 

> ---
>  mm/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0d2ba7e1f43e..406fa45e9ecc 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -721,6 +721,7 @@ config DEVICE_PRIVATE
>  config DEVICE_PUBLIC
>   bool "Addressable device memory (like GPU memory)"
>   depends on ARCH_HAS_HMM
> + depends on BROKEN
>   select HMM
>   select DEV_PAGEMAP_OPS
>  
> -- 
> 2.20.1

-- 
Michal Hocko
SUSE Labs

Re: [PATCH 05/22] mm: export alloc_pages_vma

2019-06-20 Thread Michal Hocko
On Thu 13-06-19 11:43:08, Christoph Hellwig wrote:
> nouveau is currently using this through an odd hmm wrapper, and I plan
> to switch it to the real thing later in this series.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  mm/mempolicy.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 01600d80ae01..f9023b5fba37 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2098,6 +2098,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  out:
>   return page;
>  }
> +EXPORT_SYMBOL_GPL(alloc_pages_vma);

All exported allocator symbols are EXPORT_SYMBOL; what is the reason to
make this one special?

-- 
Michal Hocko
SUSE Labs
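
For context, a sketch of the two export flavours being debated in this
thread - illustrative only, not part of the series; the alloc_pages_vma()
prototype is paraphrased from mm internals of that era:

struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
			     unsigned long addr, int node, bool hugepage);

/* A symbol carries exactly one of these annotations; the series under
 * discussion picks the second. */
EXPORT_SYMBOL(alloc_pages_vma);     /* resolvable by any module */
EXPORT_SYMBOL_GPL(alloc_pages_vma); /* resolvable only by modules whose
				     * MODULE_LICENSE() is GPL-compatible */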

Re: [PATCH] kernel.h: Add non_block_start/end()

2019-05-21 Thread Michal Hocko
On Tue 21-05-19 14:43:38, Christoph Lameter wrote:
> On Tue, 21 May 2019, Daniel Vetter wrote:
> 
> > In some special cases we must not block, but there's not a
> > spinlock, preempt-off, irqs-off or similar critical section already
> > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > pair to annotate these.
> 
> Just putting preempt on/off around these is not sufficient?

It is not a critical section. It is a _debugging_ facility to help
discover blocking contexts.
-- 
Michal Hocko
SUSE Labs
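
To make the annotation concrete, a sketch of a call site that must not
block - the wrapper function below is hypothetical; only the
non_block_start()/non_block_end() macros come from the patch under
discussion:

/* With CONFIG_DEBUG_ATOMIC_SLEEP enabled, any might_sleep() reached
 * between the two annotations will now produce a splat. */
static int notify_invalidate_nonblocking(struct mmu_notifier *mn,
					 struct mm_struct *mm,
					 unsigned long start,
					 unsigned long end)
{
	int ret;

	non_block_start();	/* bumps current->non_block_count */
	ret = mn->ops->invalidate_range_start(mn, mm, start, end, false);
	non_block_end();	/* WARNs if the count underflows */
	return ret;
}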

Re: [PATCH] kernel.h: Add non_block_start/end()

2019-05-21 Thread Michal Hocko
On Tue 21-05-19 12:06:11, Daniel Vetter wrote:
> In some special cases we must not block, but there's not a
> spinlock, preempt-off, irqs-off or similar critical section already
> that arms the might_sleep() debug checks. Add a non_block_start/end()
> pair to annotate these.
> 
> This will be used in the oom paths of mmu-notifiers, where blocking is
> not allowed to make sure there's forward progress. Quoting Michal:
> 
> "The notifier is called from quite a restricted context - oom_reaper -
> which shouldn't depend on any locks or sleepable conditionals. The code
> should be swift as well, but we mostly care about it making forward
> progress. Checking for a sleepable context is the best thing we could
> come up with that would describe these demands at least partially."
> 
> Peter also asked whether we want to catch spinlocks on top, but Michal
> said those are less of a problem because spinlocks can't have an
> indirect dependency upon the page allocator and hence close the loop
> with the oom reaper.
> 
> Suggested by Michal Hocko.
> 
> v2:
> - Improve commit message (Michal)
> - Also check in schedule, not just might_sleep (Peter)
> 
> v3: It works better when I actually squash in the fixup I had lying
> around :-/
> 
> Cc: Peter Zijlstra 
> Cc: Ingo Molnar 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: David Rientjes 
> Cc: "Christian König" 
> Cc: Daniel Vetter 
> Cc: "Jérôme Glisse" 
> Cc: linux...@kvack.org
> Cc: Masahiro Yamada 
> Cc: Wei Wang 
> Cc: Andy Shevchenko 
> Cc: Thomas Gleixner 
> Cc: Jann Horn 
> Cc: Feng Tang 
> Cc: Kees Cook 
> Cc: Randy Dunlap 
> Cc: linux-ker...@vger.kernel.org
> Acked-by: Christian König  (v1)
> Signed-off-by: Daniel Vetter 

I like this in general. The implementation looks reasonable to me but I
didn't check deeply enough to give my R-by or A-by.

> ---
>  include/linux/kernel.h | 10 +-
>  include/linux/sched.h  |  4 
>  kernel/sched/core.c| 19 ++-
>  3 files changed, 27 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 74b1ee9027f5..b5f2c2ff0eab 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -214,7 +214,9 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset);
>   * might_sleep - annotation for functions that can sleep
>   *
>   * this macro will print a stack trace if it is executed in an atomic
> - * context (spinlock, irq-handler, ...).
> + * context (spinlock, irq-handler, ...). Additional sections where blocking is
> + * not allowed can be annotated with non_block_start() and non_block_end()
> + * pairs.
>   *
>   * This is a useful debugging help to be able to catch problems early and not
>   * be bitten later when the calling function happens to sleep when it is not
> @@ -230,6 +232,10 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset);
>  # define cant_sleep() \
>   do { __cant_sleep(__FILE__, __LINE__, 0); } while (0)
>  # define sched_annotate_sleep()  (current->task_state_change = 0)
> +# define non_block_start() \
> + do { current->non_block_count++; } while (0)
> +# define non_block_end() \
> + do { WARN_ON(current->non_block_count-- == 0); } while (0)
>  #else
>static inline void ___might_sleep(const char *file, int line,
>  int preempt_offset) { }
> @@ -238,6 +244,8 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset);
>  # define might_sleep() do { might_resched(); } while (0)
>  # define cant_sleep() do { } while (0)
>  # define sched_annotate_sleep() do { } while (0)
> +# define non_block_start() do { } while (0)
> +# define non_block_end() do { } while (0)
>  #endif
>  
>  #define might_sleep_if(cond) do { if (cond) might_sleep(); } while (0)
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 11837410690f..7f5b293e72df 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -908,6 +908,10 @@ struct task_struct {
>   struct mutex_waiter *blocked_on;
>  #endif
>  
> +#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> + int non_block_count;
> +#endif
> +
>  #ifdef CONFIG_TRACE_IRQFLAGS
>   unsigned intirq_events;
>   unsigned long   hardirq_enable_ip;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 102dfcf0a29a..ed7755a28465 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3264,13 +3264,22 @@ static noinline void __schedule_bug(struct task_struct *p

Re: [PATCH] mm: Don't let userspace spam allocations warnings

2019-02-20 Thread Michal Hocko
On Wed 20-02-19 21:40:58, Daniel Vetter wrote:
> memdup_user() usually gets fed unchecked userspace input. Blasting a
> full backtrace into dmesg every time is a bit excessive - I'm not sure
> about the kernel rule in general, but at least in drm we're trying not to
> let unprivileged userspace spam the logs freely. Definitely not with
> entire warning backtraces.

Yes, this makes sense to me. This API sounds like an example where
returning ENOMEM to the userspace right away is much better than
spamming the log for large allocation requests. Smaller allocations
simply do not fail and the OOM killer report will be printed regardless
of __GFP_NOWARN.

> It also means more filtering for our CI, because our testsuite
> exercises these corner cases and so hits these a lot.
> 
> Signed-off-by: Daniel Vetter 
> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> Cc: Michal Hocko 
> Cc: Roman Gushchin 
> Cc: Vlastimil Babka 
> Cc: Jan Stancek 
> Cc: Kees Cook 
> Cc: Andrey Ryabinin 
> Cc: "Michael S. Tsirkin" 
> Cc: Huang Ying 
> Cc: Bartosz Golaszewski 
> Cc: linux...@kvack.org

Acked-by: Michal Hocko 

> ---
>  mm/util.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/util.c b/mm/util.c
> index 1ea055138043..379319b1bcfd 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -150,7 +150,7 @@ void *memdup_user(const void __user *src, size_t len)
>  {
>   void *p;
>  
> - p = kmalloc_track_caller(len, GFP_USER);
> + p = kmalloc_track_caller(len, GFP_USER | __GFP_NOWARN);
>   if (!p)
>   return ERR_PTR(-ENOMEM);
>  
> -- 
> 2.20.1
> 

-- 
Michal Hocko
SUSE Labs
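
For readers of the archive, the caller-visible effect sketched - the ioctl
handler below is hypothetical; only memdup_user() and its error-pointer
convention are from the kernel:

/* Untrusted `len` comes straight from userspace. With __GFP_NOWARN
 * inside memdup_user(), an absurd length fails quietly with -ENOMEM
 * instead of dumping a backtrace into dmesg. */
static long my_ioctl_set_blob(const void __user *ubuf, size_t len)
{
	void *blob = memdup_user(ubuf, len);

	if (IS_ERR(blob))
		return PTR_ERR(blob);	/* typically -ENOMEM or -EFAULT */
	/* ... consume blob ... */
	kfree(blob);
	return 0;
}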

Re: [PATCH 2/4] kernel.h: Add non_block_start/end()

2018-12-10 Thread Michal Hocko
On Mon 10-12-18 16:22:53, Peter Zijlstra wrote:
> On Mon, Dec 10, 2018 at 04:01:59PM +0100, Michal Hocko wrote:
> > On Mon 10-12-18 15:47:11, Peter Zijlstra wrote:
> > > On Mon, Dec 10, 2018 at 03:13:37PM +0100, Michal Hocko wrote:
> > > > I do not see any scheduler guys Cced and it would be really great to get
> > > > their opinion here.
> > > > 
> > > > On Mon 10-12-18 11:36:39, Daniel Vetter wrote:
> > > > > In some special cases we must not block, but there's not a
> > > > > spinlock, preempt-off, irqs-off or similar critical section already
> > > > > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > > > > pair to annotate these.
> > > > > 
> > > > > This will be used in the oom paths of mmu-notifiers, where blocking is
> > > > > not allowed to make sure there's forward progress.
> > > > 
> > > > Considering the only alternative would be to abuse
> > > > preempt_{disable,enable}, and that really has different semantics, I
> > > > think this makes some sense. The context is preemptible but we do not
> > > > want the notifier to sleep on any locks, WQ etc.
> > > 
> > > I'm confused... what is this supposed to do?
> > > 
> > > And what does 'block' mean here? Without preempt_disable/IRQ-off we're
> > > subject to regular preemption and execution can stall for arbitrary
> > > amounts of time.
> > 
> > The notifier is called from quite a restricted context - oom_reaper - 
> > which shouldn't depend on any locks or sleepable conditionals. 
> 
> You want to exclude spinlocks too? We could maybe frob something with
> lockdep if you need that?

Spinlocks are less of a problem because you cannot have an (in)direct
dependency on the page allocator that would deadlock. Spinlocks, or
preemption disabled in general, should be short enough to guarantee
forward progress.

> > The code
> > should be swift as well, but we mostly care about it making forward
> > progress. Checking for a sleepable context is the best thing we could
> > come up with that would describe these demands at least partially.
> 
> OK, no real objections to the thing.  Just so long we're all on the same
> page as to what it does and doesn't do ;-)

I am not really sure whether there are other potential users besides
this one and whether the check as such is justified.

> I suppose you could extend the check to include schedule_debug() as
> well, maybe something like:

Do you mean to make the check cheaper?

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f66920173370..b1aaa278f1af 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3278,13 +3278,18 @@ static noinline void __schedule_bug(struct task_struct *prev)
>  /*
>   * Various schedule()-time debugging checks and statistics:
>   */
> -static inline void schedule_debug(struct task_struct *prev)
> +static inline void schedule_debug(struct task_struct *prev, bool preempt)
>  {
>  #ifdef CONFIG_SCHED_STACK_END_CHECK
>   if (task_stack_end_corrupted(prev))
>   panic("corrupted stack end detected inside scheduler\n");
>  #endif
>  
> +#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> + if (!preempt && prev->state && prev->non_block_count)
> + // splat
> +#endif
> +
>   if (unlikely(in_atomic_preempt_off())) {
>   __schedule_bug(prev);
>   preempt_count_set(PREEMPT_DISABLED);
> @@ -3391,7 +3396,7 @@ static void __sched notrace __schedule(bool preempt)
>   rq = cpu_rq(cpu);
>   prev = rq->curr;
>  
> - schedule_debug(prev);
> + schedule_debug(prev, preempt);
>  
>   if (sched_feat(HRTICK))
>   hrtick_clear(rq);

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/4] kernel.h: Add non_block_start/end()

2018-12-10 Thread Michal Hocko
On Mon 10-12-18 15:47:11, Peter Zijlstra wrote:
> On Mon, Dec 10, 2018 at 03:13:37PM +0100, Michal Hocko wrote:
> > I do not see any scheduler guys Cced and it would be really great to get
> > their opinion here.
> > 
> > On Mon 10-12-18 11:36:39, Daniel Vetter wrote:
> > > In some special cases we must not block, but there's not a
> > > spinlock, preempt-off, irqs-off or similar critical section already
> > > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > > pair to annotate these.
> > > 
> > > This will be used in the oom paths of mmu-notifiers, where blocking is
> > > not allowed to make sure there's forward progress.
> > 
> > Considering the only alternative would be to abuse
> > preempt_{disable,enable}, and that really has different semantics, I
> > think this makes some sense. The context is preemptible but we do not
> > want the notifier to sleep on any locks, WQ etc.
> 
> I'm confused... what is this supposed to do?
> 
> And what does 'block' mean here? Without preempt_disable/IRQ-off we're
> subject to regular preemption and execution can stall for arbitrary
> amounts of time.

The notifier is called from quite a restricted context - oom_reaper - 
which shouldn't depend on any locks or sleepable conditionals. The code
should be swift as well, but we mostly care about it making forward
progress. Checking for a sleepable context is the best thing we could
come up with that would describe these demands at least partially.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/4] kernel.h: Add non_block_start/end()

2018-12-10 Thread Michal Hocko
I do not see any scheduler guys Cced and it would be really great to get
their opinion here.

On Mon 10-12-18 11:36:39, Daniel Vetter wrote:
> In some special cases we must not block, but there's not a
> spinlock, preempt-off, irqs-off or similar critical section already
> that arms the might_sleep() debug checks. Add a non_block_start/end()
> pair to annotate these.
> 
> This will be used in the oom paths of mmu-notifiers, where blocking is
> not allowed to make sure there's forward progress.

Considering the only alternative would be to abuse
preempt_{disable,enable}, and that really has different semantics, I
think this makes some sense. The context is preemptible but we do not
want the notifier to sleep on any locks, WQ etc.

> Suggested by Michal Hocko.
> 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: David Rientjes 
> Cc: "Christian König" 
> Cc: Daniel Vetter 
> Cc: "Jérôme Glisse" 
> Cc: linux...@kvack.org
> Signed-off-by: Daniel Vetter 
> ---
>  include/linux/kernel.h | 10 +-
>  include/linux/sched.h  |  4 
>  kernel/sched/core.c|  6 +++---
>  3 files changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index d6aac75b51ba..c2cf31515b3d 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -251,7 +251,9 @@ extern int _cond_resched(void);
>   * might_sleep - annotation for functions that can sleep
>   *
>   * this macro will print a stack trace if it is executed in an atomic
> - * context (spinlock, irq-handler, ...).
> + * context (spinlock, irq-handler, ...). Additional sections where blocking 
> is
> + * not allowed can be annotated with non_block_start() and non_block_end()
> + * pairs.
>   *
>   * This is a useful debugging help to be able to catch problems early and not
>   * be bitten later when the calling function happens to sleep when it is not
> @@ -260,6 +262,10 @@ extern int _cond_resched(void);
>  # define might_sleep() \
>   do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0)
>  # define sched_annotate_sleep()  (current->task_state_change = 0)
> +# define non_block_start() \
> + do { current->non_block_count++; } while (0)
> +# define non_block_end() \
> + do { WARN_ON(current->non_block_count-- == 0); } while (0)
>  #else
>static inline void ___might_sleep(const char *file, int line,
>  int preempt_offset) { }
> @@ -267,6 +273,8 @@ extern int _cond_resched(void);
>  int preempt_offset) { }
>  # define might_sleep() do { might_resched(); } while (0)
>  # define sched_annotate_sleep() do { } while (0)
> +# define non_block_start() do { } while (0)
> +# define non_block_end() do { } while (0)
>  #endif
>  
>  #define might_sleep_if(cond) do { if (cond) might_sleep(); } while (0)
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ecffd4e37453..41249dbf8f27 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -916,6 +916,10 @@ struct task_struct {
>   struct mutex_waiter *blocked_on;
>  #endif
>  
> +#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
> + int non_block_count;
> +#endif
> +
>  #ifdef CONFIG_TRACE_IRQFLAGS
>   unsigned intirq_events;
>   unsigned long   hardirq_enable_ip;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 6fedf3a98581..969d7a71f30c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6113,7 +6113,7 @@ void ___might_sleep(const char *file, int line, int preempt_offset)
>   rcu_sleep_check();
>  
>   if ((preempt_count_equals(preempt_offset) && !irqs_disabled() &&
> -  !is_idle_task(current)) ||
> +  !is_idle_task(current) && !current->non_block_count) ||
>   system_state == SYSTEM_BOOTING || system_state > SYSTEM_RUNNING ||
>   oops_in_progress)
>   return;
> @@ -6129,8 +6129,8 @@ void ___might_sleep(const char *file, int line, int preempt_offset)
>   "BUG: sleeping function called from invalid context at %s:%d\n",
>   file, line);
>   printk(KERN_ERR
> - "in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",
> - in_atomic(), irqs_disabled(),
> + "in_atomic(): %d, irqs_disabled(): %d, non_block: %d, pid: %d, name: %s\n",
> + in_atomic(), irqs_disabled(), current->non_block_count,
>   current->pid, current->comm);
>  
>   if (task_stack_end_corrupted(current))
> -- 
> 2.20.0.rc1
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/4] mm: Check if mmu notifier callbacks are allowed to fail

2018-12-10 Thread Michal Hocko
On Mon 10-12-18 11:36:38, Daniel Vetter wrote:
> Just a bit of paranoia, since if we start pushing this deep into
> callchains it's hard to spot all places where an mmu notifier
> implementation might fail when it's not allowed to.
> 
> Inspired by some confusion we had discussing i915 mmu notifiers and
> whether we could use the newly-introduced return value to handle some
> corner cases. Until we realized that these are only for when a task
> has been killed by the oom reaper.
> 
> An alternative approach would be to split the callback into two
> versions, one with the int return value, and the other with void
> return value like in older kernels. But that's a lot more churn for
> fairly little gain I think.
> 
> Summary from the m-l discussion on why we want something at warning
> level: This allows automated tooling in CI to catch bugs without
> humans having to look at everything. If we just upgrade the existing
> pr_info to a pr_warn, then we'll have false positives. And as-is, no
> one will ever spot the problem since it's lost in the massive amounts
> of overall dmesg noise.

OK, fair enough. If this is going to help with testing then I do not
have any objections of course.

> v2: Drop the full WARN_ON backtrace in favour of just a pr_warn for
> the problematic case (Michal Hocko).

Thanks!

> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: "Christian König" 
> Cc: David Rientjes 
> Cc: Daniel Vetter 
> Cc: "Jérôme Glisse" 
> Cc: linux...@kvack.org
> Cc: Paolo Bonzini 
> Signed-off-by: Daniel Vetter 
> ---
>  mm/mmu_notifier.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 5119ff846769..ccc22f21b735 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -190,6 +190,9 @@ int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>   pr_info("%pS callback failed with %d in %sblockable context.\n",
>   mn->ops->invalidate_range_start, _ret,
>   !blockable ? "non-" : "");
> + if (blockable)
> + pr_warn("%pS callback failure not allowed\n",
> + mn->ops->invalidate_range_start);
>   ret = _ret;
>   }
>   }
> -- 
> 2.20.0.rc1
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/3] mm: Check if mmu notifier callbacks are allowed to fail

2018-11-23 Thread Michal Hocko
On Fri 23-11-18 14:15:11, Daniel Vetter wrote:
> On Fri, Nov 23, 2018 at 1:43 PM Michal Hocko  wrote:
> > On Fri 23-11-18 13:30:57, Daniel Vetter wrote:
> > > On Fri, Nov 23, 2018 at 12:15:57PM +0100, Michal Hocko wrote:
> > > > On Thu 22-11-18 17:51:04, Daniel Vetter wrote:
> > > > > Just a bit of paranoia, since if we start pushing this deep into
> > > > > callchains it's hard to spot all places where an mmu notifier
> > > > > implementation might fail when it's not allowed to.
> > > >
> > > > What does WARN give you more than the existing pr_info? Is the
> > > > backtrace really that interesting?
> > >
> > > Automated tools have to ignore everything at info level (there's too much
> > > of that). I guess I could do something like
> > >
> > > if (blockable)
> > >   pr_warn(...)
> > > else
> > >   pr_info(...)
> > >
> > > WARN() is simply my goto tool for getting something at warning level
> > > dumped into dmesg. But I think the pr_warn with the callback function
> > > should be enough indeed.
> >
> > I wouldn't mind s@pr_info@pr_warn@
> 
> Well that's too much, because then it would misfire in the oom
> testcase, where failing is ok (desirable even, we want to avoid
> blocking after all). So it needs to be a switch (or else we need to
> filter it in results, and that's a bit of a maintenance headache from a
> CI pov).

I thought the failures should be rare enough that warning about them can
actually be useful. E.g. in the oom case we can live with the failure
because we want to release _some_ memory, but knowing about a callback
that prevents us from going the full way might be interesting.

But I do not really feel strongly about this. I find WARN a bit of an
abuse because the trace is unlikely to help us much. If you want to make
the verbosity depend on the blockable context then I will surely not
stand in the way.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/3] mm, notifier: Catch sleeping/blocking for !blockable

2018-11-23 Thread Michal Hocko
On Fri 23-11-18 13:38:38, Daniel Vetter wrote:
> On Fri, Nov 23, 2018 at 12:12:37PM +0100, Michal Hocko wrote:
> > On Thu 22-11-18 17:51:05, Daniel Vetter wrote:
> > > We need to make sure implementations don't cheat and don't have a
> > > possible schedule/blocking point deeply buried where review can't
> > > catch it.
> > > 
> > > I'm not sure whether this is the best way to make sure all the
> > > might_sleep() callsites trigger, and it's a bit ugly in the code flow.
> > > But it gets the job done.
> > 
> > Yeah, it is quite ugly. Especially because it makes the DEBUG config
> > behavior much different. So is this really worth it? Has this already
> > discovered any existing bug?
> 
> Given that we need an oom trigger to hit this we're not hitting this in CI
> (oom is just way too unpredictable to even try). I'd kinda like to also add
> some debug interface so I can provoke an oom kill of a specially prepared
> process, to make sure we can reliably exercise this path without killing
> the kernel accidentally. We do similar tricks for our shrinker already.

Create a task with oom_score_adj = 1000 and trigger the oom killer via
sysrq and you should get a predictable oom invocation and execution.

[...]
> Wrt the behavior difference: I guess we could put another counter into the
> task struct, and change might_sleep() to check it. All under
> CONFIG_DEBUG_ATOMIC_SLEEP only ofc. That would avoid the preempt-disable
> side effect. My worry with that is that people will spot it, and abuse it
> in creative ways that do affect semantics. See horrors like
> drm_can_sleep() (and I'm sure gfx folks are not the only ones who
> seriously lacked taste here).
> 
> Up to the experts really how to best paint this shed I think.

Actually I like having a way to say non_block_{begin,end} with
might_sleep firing inside that context.
-- 
Michal Hocko
SUSE Labs
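
Michal's recipe above can be scripted; a minimal userspace sketch, for a
throwaway test box only - it deliberately invokes the OOM killer and assumes
root, procfs and CONFIG_MAGIC_SYSRQ:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void write_str(const char *path, const char *s)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, s, strlen(s)) < 0)
		perror(path);
	if (fd >= 0)
		close(fd);
}

int main(void)
{
	if (fork() == 0) {
		/* Make this child the preferred OOM victim. */
		write_str("/proc/self/oom_score_adj", "1000");
		pause();	/* wait to be OOM-killed */
		return 0;
	}
	sleep(1);		/* let the child raise its score first */
	/* SysRq 'f' invokes the OOM killer once, predictably. */
	write_str("/proc/sysrq-trigger", "f");
	return 0;
}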


Re: [PATCH 1/3] mm: Check if mmu notifier callbacks are allowed to fail

2018-11-23 Thread Michal Hocko
On Fri 23-11-18 13:30:57, Daniel Vetter wrote:
> On Fri, Nov 23, 2018 at 12:15:57PM +0100, Michal Hocko wrote:
> > On Thu 22-11-18 17:51:04, Daniel Vetter wrote:
> > > Just a bit of paranoia, since if we start pushing this deep into
> > > callchains it's hard to spot all places where an mmu notifier
> > > implementation might fail when it's not allowed to.
> > 
> > What does WARN give you more than the existing pr_info? Is the
> > backtrace really that interesting?
> 
> Automated tools have to ignore everything at info level (there's too much
> of that). I guess I could do something like
> 
> if (blockable)
>   pr_warn(...)
> else
>   pr_info(...)
> 
> WARN() is simply my goto tool for getting something at warning level
> dumped into dmesg. But I think the pr_warn with the callback function
> should be enough indeed.

I wouldn't mind s@pr_info@pr_warn@
 
> If you wonder where all the info level stuff happens that we have to
> ignore: suspend/resume is a primary culprit (fairly important for
> gfx/desktops), but there's a bunch of other places. Even if we ignore
> everything at info and below we still need filters because some drivers
> are a bit too trigger-happy (i915 definitely included I guess, so everyone
> contributes to this problem).

Thanks for the clarification.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 1/3] mm: Check if mmu notifier callbacks are allowed to fail

2018-11-23 Thread Michal Hocko
On Thu 22-11-18 17:51:04, Daniel Vetter wrote:
> Just a bit of paranoia, since if we start pushing this deep into
> callchains it's hard to spot all places where an mmu notifier
> implementation might fail when it's not allowed to.

What does WARN give you more than the existing pr_info? Is the
backtrace really that interesting?

> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: "Christian König" 
> Cc: David Rientjes 
> Cc: Daniel Vetter 
> Cc: "Jérôme Glisse" 
> Cc: linux...@kvack.org
> Cc: Paolo Bonzini 
> Signed-off-by: Daniel Vetter 
> ---
>  mm/mmu_notifier.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 5119ff846769..59e102589a25 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -190,6 +190,8 @@ int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>   pr_info("%pS callback failed with %d in %sblockable context.\n",
>   mn->ops->invalidate_range_start, _ret,
>   !blockable ? "non-" : "");
> + WARN(blockable,"%pS callback failure not allowed\n",
> +      mn->ops->invalidate_range_start);
>   ret = _ret;
>   }
>   }
> -- 
> 2.19.1
> 

-- 
Michal Hocko
SUSE Labs


Re: [Intel-gfx] [PATCH 1/3] mm: Check if mmu notifier callbacks are allowed to fail

2018-11-23 Thread Michal Hocko
On Fri 23-11-18 09:49:34, Daniel Vetter wrote:
> On Thu, Nov 22, 2018 at 04:53:34PM +, Chris Wilson wrote:
> > Quoting Daniel Vetter (2018-11-22 16:51:04)
> > > Just a bit of paranoia, since if we start pushing this deep into
> > > callchains it's hard to spot all places where an mmu notifier
> > > implementation might fail when it's not allowed to.
> > 
> > Most callers could handle the failure correctly. It looks like the
> > failure was not propagated for convenience.
> 
> I have no idea whether the mm is semantically ok if pte shootdown doesn't
> work for all sorts of strange reasons. From the commit that introduced the
> error code it sounded like this was very much only ok in the limited case
> of an already killed process, in the oom killer path, where it's really
> only about trying to free any kind of memory. And where the process is
> gone already, so semantics of what exactly happens don't matter that much
> anymore.

Yes, this was indeed the case. There is still the exit path which will
do the rest of the work, so we are not leaving anything behind.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 2/3] mm, notifier: Catch sleeping/blocking for !blockable

2018-11-23 Thread Michal Hocko
On Thu 22-11-18 17:51:05, Daniel Vetter wrote:
> We need to make sure implementations don't cheat and don't have a
> possible schedule/blocking point deeply buried where review can't
> catch it.
> 
> I'm not sure whether this is the best way to make sure all the
> might_sleep() callsites trigger, and it's a bit ugly in the code flow.
> But it gets the job done.

Yeah, it is quite ugly. Especially because it makes the DEBUG config
behavior much different. So is this really worth it? Has this already
discovered any existing bug?

> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: David Rientjes 
> Cc: "Christian König" 
> Cc: Daniel Vetter 
> Cc: "Jérôme Glisse" 
> Cc: linux...@kvack.org
> Signed-off-by: Daniel Vetter 
> ---
>  mm/mmu_notifier.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 59e102589a25..4d282cfb296e 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -185,7 +185,13 @@ int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>   id = srcu_read_lock(&srcu);
>   hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
>   if (mn->ops->invalidate_range_start) {
> - int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
> + int _ret;
> +
> + if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && !blockable)
> + preempt_disable();
> + _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
> + if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && !blockable)
> + preempt_enable();
>   if (_ret) {
>   pr_info("%pS callback failed with %d in %sblockable context.\n",
>   mn->ops->invalidate_range_start, _ret,
> -- 
> 2.19.1
> 

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v8 1/7] mm, devm_memremap_pages: Mark devm_memremap_pages() EXPORT_SYMBOL_GPL

2018-11-23 Thread Michal Hocko
On Thu 22-11-18 17:38:58, Christoph Hellwig wrote:
> On Thu, Nov 22, 2018 at 02:30:13PM +0100, Michal Hocko wrote:
> > Whoever needs a wrapper around arch_add_memory can do so because this
> > symbol has no restriction on its usage.
> 
> arch_add_memory is not exported, and it really should not be.

It is not, but nobody really prevents anyone from wrapping it and
exporting the wrapper. I am definitely not arguing for that, and I would
even agree with you that it shouldn't be exported at all unless there is
a _very_ good reason for that. Because use cases are what we care about
here.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v8 1/7] mm, devm_memremap_pages: Mark devm_memremap_pages() EXPORT_SYMBOL_GPL

2018-11-22 Thread Michal Hocko
On Tue 20-11-18 15:12:54, Dan Williams wrote:
> devm_memremap_pages() is a facility that can create struct page entries
> for any arbitrary range and give drivers the ability to subvert core
> aspects of page management.
> 
> Specifically the facility is tightly integrated with the kernel's memory
> hotplug functionality. It injects an altmap argument deep into the
> architecture specific vmemmap implementation to allow allocating from
> specific reserved pages, and it has Linux specific assumptions about
> page structure reference counting relative to get_user_pages() and
> get_user_pages_fast(). It was an oversight and a mistake that this was
> not marked EXPORT_SYMBOL_GPL from the outset.
> 
> Again, devm_memremap_pages() exposes and relies upon core kernel
> internal assumptions and will continue to evolve along with 'struct
> page', memory hotplug, and support for new memory types / topologies.
> Only an in-kernel GPL-only driver is expected to keep up with this
> ongoing evolution. This interface, and functionality derived from this
> interface, is not suitable for kernel-external drivers.

As I've said earlier, I do not buy this justification because there is
simply no stable API for modules by definition
(Documentation/process/stable-api-nonsense.rst). I do understand
your reasoning that you as an author never intended to export the symbol
this way. That is a fair and justified reason for this patch.

Whoever needs a wrapper around arch_add_memory can do so because this
symbol has no restriction on its usage. It will still be the same
fiddling with struct page and deep mm internals. Do we care? I am not
convinced, because once we grow any in-tree user we have to cope with any
potential abuse like we have in other areas in the past. And out-of-tree
modules? Who cares. Those are completely on their own and have their
own ways around it.

> Cc: Michal Hocko 
> Cc: "Jérôme Glisse" 
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Dan Williams 

That being said
Acked-by: Michal Hocko 

> ---
>  kernel/memremap.c |2 +-
>  tools/testing/nvdimm/test/iomap.c |2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 9eced2cc9f94..61dbcaa95530 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -233,7 +233,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
>   err_array:
>   return ERR_PTR(error);
>  }
> -EXPORT_SYMBOL(devm_memremap_pages);
> +EXPORT_SYMBOL_GPL(devm_memremap_pages);
>  
>  unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
>  {
> diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
> index ff9d3a5825e1..ed18a0cbc0c8 100644
> --- a/tools/testing/nvdimm/test/iomap.c
> +++ b/tools/testing/nvdimm/test/iomap.c
> @@ -113,7 +113,7 @@ void *__wrap_devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
>   return nfit_res->buf + offset - nfit_res->res.start;
>   return devm_memremap_pages(dev, pgmap);
>  }
> -EXPORT_SYMBOL(__wrap_devm_memremap_pages);
> +EXPORT_SYMBOL_GPL(__wrap_devm_memremap_pages);
>  
>  pfn_t __wrap_phys_to_pfn_t(phys_addr_t addr, unsigned long flags)
>  {

-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm: convert totalram_pages, totalhigh_pages and managed_pages to atomic.

2018-10-23 Thread Michal Hocko
On Mon 22-10-18 22:53:22, Arun KS wrote:
> Remove managed_page_count_lock spinlock and instead use atomic
> variables.

I assume this has been auto-generated. If yes, it would be better to
mention the script so that people can review it and regenerate it for
comparison. Such a large change is hard to review manually.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 11:12:40, Jerome Glisse wrote:
[...]
> I am fine with Michal's patch, I already said so a couple of months ago
> the first time this discussion popped up. Michal, you can add:
> 
> Reviewed-by: Jérôme Glisse 

So I guess the below is the patch you were talking about?

From f7ac75277d526dccd011f343818dc6af627af2af Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Fri, 24 Aug 2018 15:32:24 +0200
Subject: [PATCH] mm, mmu_notifier: be explicit about range invalidation
 non-blocking mode

If invalidate_range_start is called in !blocking mode then all
callbacks have to guarantee they will not block/sleep. The same obviously
applies to invalidate_range_end because this operation pairs with the
former and they are called from the same context. Make sure this is
appropriately documented.

Signed-off-by: Michal Hocko 
---
 include/linux/mmu_notifier.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 133ba78820ee..698e371aafe3 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -153,7 +153,9 @@ struct mmu_notifier_ops {
 *
 * If blockable argument is set to false then the callback cannot
 * sleep and has to return with -EAGAIN. 0 should be returned
-* otherwise.
+* otherwise. Please note that if invalidate_range_start approves
+* a non-blocking behavior then the same applies to
+* invalidate_range_end.
 *
 */
int (*invalidate_range_start)(struct mmu_notifier *mn,
-- 
2.18.0

-- 
Michal Hocko
SUSE Labs
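
For driver authors following the thread, a sketch of a callback honouring
the contract documented above - the mirror driver is hypothetical; only the
blockable-aware invalidate_range_start() signature comes from the actual
interface:

struct my_mirror {
	struct mmu_notifier mn;
	struct mutex lock;
};

static int my_invalidate_range_start(struct mmu_notifier *mn,
				     struct mm_struct *mm,
				     unsigned long start, unsigned long end,
				     bool blockable)
{
	struct my_mirror *m = container_of(mn, struct my_mirror, mn);

	if (!blockable) {
		/* Non-blocking mode (e.g. oom_reaper): sleeping is not
		 * allowed, so only try the lock and bail out otherwise. */
		if (!mutex_trylock(&m->lock))
			return -EAGAIN;
	} else {
		mutex_lock(&m->lock);
	}
	/* ... tear down device mappings covering [start, end) ... */
	mutex_unlock(&m->lock);
	return 0;
}

The same no-sleep rule then applies to the paired invalidate_range_end(),
as the patch above spells out.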


Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Michal Hocko
On Fri 24-08-18 23:52:25, Tetsuo Handa wrote:
> On 2018/08/24 22:32, Michal Hocko wrote:
> > On Fri 24-08-18 22:02:23, Tetsuo Handa wrote:
> >> I worry that (currently
> >> out-of-tree) users of this API are involving work / recursion.
> > 
> > I do not care in the slightest about out-of-tree modules. They will
> > have to accommodate to the new API. I have no problem extending the
> > documentation and being explicit about this expectation.
> 
> You don't need to care about out-of-tree modules. But you need to hear from
> mm/hmm.c authors/maintainers when making changes for mmu-notifiers.
> 
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 133ba78820ee..698e371aafe3 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
> >  *
> >  * If blockable argument is set to false then the callback cannot
> >  * sleep and has to return with -EAGAIN. 0 should be returned
> > -* otherwise.
> > +* otherwise. Please note that if invalidate_range_start approves
> > +* a non-blocking behavior then the same applies to
> > +* invalidate_range_end.
> 
> Prior to 93065ac753e44438 ("mm, oom: distinguish blockable mode for mmu
> notifiers"), whether to utilize MMU_INVALIDATE_DOES_NOT_BLOCK was up to
> mmu-notifiers users.
> 
>   -* If both of these callbacks cannot block, and invalidate_range
>   -* cannot block, mmu_notifier_ops.flags should have
>   -* MMU_INVALIDATE_DOES_NOT_BLOCK set.
>   +* If blockable argument is set to false then the callback cannot
>   +* sleep and has to return with -EAGAIN. 0 should be returned
>   +* otherwise.
> 
> Even out-of-tree mmu-notifiers users had the right not to accommodate (i.e.
> make changes) immediately by not setting MMU_INVALIDATE_DOES_NOT_BLOCK.
> 
> Now we are in a merge window. And we noticed a possibility that out-of-tree
> mmu-notifiers users might have trouble with making changes immediately in
> order to follow 93065ac753e44438 if the expectation for mm/hmm.c changes
> immediately. And you are trying to ignore such a possibility by just
> updating the expected behavior description instead of giving out-of-tree
> users a grace period to check and update their code.

This is just ridiculous. I have no idea what you are trying to achieve
here but please read through Documentation/process/stable-api-nonsense.rst
before you try to make strong statements again.

I have changed an in-kernel interface. I have gone through all users and
fixed them up. Double-checking after me is really appreciated and I am
willing to fix up any fallout. But that is just about it. I do not
give a whit about _any_ out-of-tree drivers when changing the interface.
I am willing to answer any questions regarding this change so developers
of those drivers know how to do their change properly, but doing so is
completely their business.
 
> >> and keeps "all operations protected by hmm->mirrors_sem held for write are
> >> atomic". This suggests that "some operations protected by hmm->mirrors_sem 
> >> held
> >> for read will sleep (and in the worst case involves memory allocation
> >> dependency)".
> > 
> > Yes and so what? The clear expectation is that neither of the range
> > notifiers sleeps in !blocking mode. I really fail to see what you
> > are trying to say.
> 
> I'm saying "Get ACK from Jérôme about mm/hmm.c changes".

HMM is a library layer for other drivers; until those get merged the same
applies to them as well.
-- 
Michal Hocko
SUSE Labs

