Re: [PATCH v6 08/13] mm: call pgmap->ops->page_free for DEVICE_GENERIC pages

2021-08-20 Thread Jerome Glisse
On Thu, Aug 19, 2021 at 10:05 PM Christoph Hellwig  wrote:
>
> On Tue, Aug 17, 2021 at 11:44:54AM -0400, Felix Kuehling wrote:
> > >> That's a good catch. Existing drivers shouldn't need a page_free
> > >> callback if they didn't have one before. That means we need to add a
> > >> NULL-pointer check in free_device_page.
> > > Also the other state clearing (__ClearPageWaiters/mem_cgroup_uncharge/
> > > ->mapping = NULL).
> > >
> > > In many ways this seems like you want to bring back the DEVICE_PUBLIC
> > > pgmap type that was removed a while ago due to the lack of users
> > > instead of overloading the generic type.
> >
> > I think so. I'm not clear about how DEVICE_PUBLIC differed from what
> > DEVICE_GENERIC is today. As I understand it, DEVICE_PUBLIC was removed
> > because it was unused and also known to be broken in some ways.
> > DEVICE_GENERIC seemed close enough to what we need, other than not being
> > supported in the migration helpers.
> >
> > Would you see benefit in re-introducing DEVICE_PUBLIC as a distinct
> > memory type from DEVICE_GENERIC? What would be the benefits of making
> > that distinction?
>
> The old DEVICE_PUBLIC mostly differed in that it allowed the page
> to be returned from vm_normal_page, which I think was horribly buggy.

Why was that buggy? If I were to do it now, I would return DEVICE_PUBLIC
pages from vm_normal_page, but I would ban pinning, as pinning is
exceptionally wrong for GPUs. If you migrate some random anonymous or
file-backed memory to your GPU memory and it gets pinned there, then
there is no way for the GPU to migrate the page out. Quickly you will
run out of physically contiguous memory and things like big graphics
buffer allocations (anything that needs physically contiguous memory)
will fail. It is less of an issue on hardware that relies less and
less on physically contiguous memory, but I do not think the requirement
is completely gone from all hardware.
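
As a purely hypothetical sketch of what "ban pinning" could mean (this is
not code from this series; MEMORY_DEVICE_PUBLIC and the helper name are
made up here), GUP could simply refuse FOLL_PIN/FOLL_LONGTERM on such
pages:

	/*
	 * Hypothetical sketch: refuse pins on coherent device memory so the
	 * GPU can always migrate it back out. MEMORY_DEVICE_PUBLIC stands
	 * for the re-introduced pgmap type discussed above; it does not
	 * exist in current kernels.
	 */
	static bool device_public_page_pinnable(struct page *page,
						unsigned int gup_flags)
	{
		if (!is_zone_device_page(page))
			return true;

		if (page->pgmap->type == MEMORY_DEVICE_PUBLIC &&
		    (gup_flags & (FOLL_PIN | FOLL_LONGTERM)))
			return false;	/* never pin GPU-migratable memory */

		return true;
	}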

> But the point is not to bring back these old semantics.  The idea
> is to be able to differentiate between your new coherent on-device
> memory and the existing DEVICE_GENERIC.  That is, call the
> code in free_devmap_managed_page that is currently only used
> for device private pages also for your new public device pages without
> affecting the devdax and xen use cases.

Yes, I would rather bring back DEVICE_PUBLIC than try to use
DEVICE_GENERIC; the GENERIC type was added for users that closely
match DAX semantics, and that is not the case here, at least not from
my point of view.

Jerome


Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount

2021-08-20 Thread Jerome Glisse
On Thu, Aug 19, 2021 at 11:00 AM Sierra Guiza, Alejandro (Alex)
 wrote:
>
>
> On 8/18/2021 2:28 PM, Ralph Campbell wrote:
> > On 8/17/21 5:35 PM, Felix Kuehling wrote:
> >> Am 2021-08-17 um 8:01 p.m. schrieb Ralph Campbell:
> >>> On 8/12/21 11:31 PM, Alex Sierra wrote:
>  From: Ralph Campbell 
> 
>  ZONE_DEVICE struct pages have an extra reference count that
>  complicates the
>  code for put_page() and several places in the kernel that need to
>  check the
>  reference count to see that a page is not being used (gup, compaction,
>  migration, etc.). Clean up the code so the reference count doesn't
>  need to
>  be treated specially for ZONE_DEVICE.
> 
>  v2:
>  AS: merged this patch in linux 5.11 version
> 
>  v5:
>  AS: add condition at try_grab_page to check for the zone device type,
>  while
>  page ref counter is checked less/equal to zero. In case of device
>  zone, pages
>  ref counter are initialized to zero.
> 
>  Signed-off-by: Ralph Campbell 
>  Signed-off-by: Alex Sierra 
>  ---
> arch/powerpc/kvm/book3s_hv_uvmem.c |  2 +-
> drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
> fs/dax.c   |  4 +-
> include/linux/dax.h|  2 +-
> include/linux/memremap.h   |  7 +--
> include/linux/mm.h | 13 +
> lib/test_hmm.c |  2 +-
> mm/internal.h  |  8 +++
> mm/memremap.c  | 68 +++---
> mm/migrate.c   |  5 --
> mm/page_alloc.c|  3 ++
> mm/swap.c  | 45 ++---
> 12 files changed, 46 insertions(+), 115 deletions(-)
> 
> >>> I haven't seen a response to the issues I raised back at v3 of this
> >>> series.
> >>> https://lore.kernel.org/linux-mm/4f6dd918-d79b-1aa7-3a4c-caa67ddc29bc@nvidia.com/
> >>>
> >>>
> >>>
> >>> Did I miss something?
> >> I think part of the response was that we did more testing. Alex added
> >> support for DEVICE_GENERIC pages to test_hmm and he ran DAX tests
> >> recommended by Theodore Tso. In that testing he ran into a WARN_ON_ONCE
> >> about a zero page refcount in try_get_page. The fix is in the latest
> >> version of patch 2. But it's already obsolete because John Hubbard is
> >> about to remove that function altogether.
> >>
> >> I think the issues you raised were more uncertainty than known bugs. It
> >> seems the fact that you can have DAX pages with 0 refcount is a feature
> >> more than a bug.
> >>
> >> Regards,
> >>Felix
> >
> > Did you test on a system without CONFIG_ARCH_HAS_PTE_SPECIAL defined?
> > In that case, mmap() of a DAX device will call insert_page() which calls
> > get_page() which would trigger VM_BUG_ON_PAGE().
> >
> > I can believe it is OK for PTE_SPECIAL page table entries to have no
> > struct page or that MEMORY_DEVICE_GENERIC struct pages be mapped with
> > a zero reference count using insert_pfn().
> Hi Ralph,
> We have tried the DAX tests with and without CONFIG_ARCH_HAS_PTE_SPECIAL
> defined. Apparently none of the tests touches that condition for a DAX
> device. Of course, that doesn't mean it couldn't happen.
>
> Regards,
> Alex S.
>
> >
> >
> > I find it hard to believe that other MM developers don't see an issue
> > with a struct page with refcount == 0 and mapcount == 1.
> >
> > I don't see where init_page_count() is being called for the
> > MEMORY_DEVICE_GENERIC or MEMORY_DEVICE_PRIVATE struct pages the AMD
> > driver allocates and passes to migrate_vma_setup().
> > Looks like svm_migrate_get_vram_page() needs to call init_page_count()
> > instead of get_page(). (I'm looking at branch
> > origin/alexsierrag/device_generic
> > https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver.git)
> Yes, you're right. My bad. Thanks for catching this. I didn't realize
> I had failed to define CONFIG_DEBUG_VM in my build, so this BUG was never
> caught. It worked after I replaced get_page with init_page_count in
> svm_migrate_get_vram_page. However, I don't think this is the

Re: [PATCH v6 02/13] mm: remove extra ZONE_DEVICE struct page refcount

2021-08-20 Thread Jerome Glisse
Note that you do not want GUP to succeed on device pages; I do not see
where that is handled in the new code.
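
As a rough illustration (a sketch only, not code from this series), a grab
helper could refuse ZONE_DEVICE pages outright and leave the CPU fault /
migrate_to_ram() path to deal with the access:

	/*
	 * Illustrative sketch: with ZONE_DEVICE pages starting at a zero
	 * refcount, GUP could refuse them explicitly instead of relying on
	 * the old extra reference.
	 */
	static bool gup_try_grab_page(struct page *page)
	{
		page = compound_head(page);

		if (is_zone_device_page(page))
			return false;	/* device pages belong to the driver */

		return try_get_page(page);
	}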

On Sun, Aug 15, 2021 at 1:40 PM John Hubbard  wrote:
>
> On 8/15/21 8:37 AM, Christoph Hellwig wrote:
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index 8ae31622deef..d48a1f0889d1 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -1218,7 +1218,7 @@ __maybe_unused struct page *try_grab_compound_head(struct page *page, int refs,
> >>   static inline __must_check bool try_get_page(struct page *page)
> >>   {
> >>  page = compound_head(page);
> >> -if (WARN_ON_ONCE(page_ref_count(page) <= 0))
> >> +if (WARN_ON_ONCE(page_ref_count(page) < (int)!is_zone_device_page(page)))
> >
> > Please avoid the overly long line.  In fact I'd be tempted to just not
> > bother here and keep the old, more loose check.  Especially given that
> > John has a patch ready that removes try_get_page entirely.
> >
>
> Yes. Andrew has accepted it into mmotm.
>
> Ralph's patch here was written well before my cleanup that removed
> try_grab_page() [1]. But now that we're here, if you drop this hunk then
> it will make merging easier, I think.
>
>
> [1] https://lore.kernel.org/r/20210813044133.1536842-4-jhubb...@nvidia.com
>
> thanks,
> --
> John Hubbard
> NVIDIA
>


Re: [PATCH 0/9] Add support for SVM atomics in Nouveau

2021-02-09 Thread Jerome Glisse
On Tue, Feb 09, 2021 at 09:35:20AM -0400, Jason Gunthorpe wrote:
> On Tue, Feb 09, 2021 at 11:57:28PM +1100, Alistair Popple wrote:
> > On Tuesday, 9 February 2021 9:27:05 PM AEDT Daniel Vetter wrote:
> > > >
> > > > Recent changes to pin_user_pages() prevent the creation of pinned pages in
> > > > ZONE_MOVABLE. This series allows pinned pages to be created in ZONE_MOVABLE
> > > > as attempts to migrate may fail which would be fatal to userspace.
> > > >
> > > > In this case migration of the pinned page is unnecessary as the page can be
> > > > unpinned at anytime by having the driver revoke atomic permission as it
> > > > does for the migrate_to_ram() callback. However a method of calling this
> > > > when memory needs to be moved has yet to be resolved so any discussion is
> > > > welcome.
> > > 
> > > Why do we need to pin for gpu atomics? You still have the callback for
> > > cpu faults, so you
> > > can move the page as needed, and hence a long-term pin sounds like the
> > > wrong approach.
> > 
> > Technically a real long term unmoveable pin isn't required, because as you
> > say the page can be moved as needed at any time. However I needed some way
> > of stopping the CPU page from being freed once the userspace mappings for
> > it had been removed.
> 
> The issue is you took the page out of the PTE it belongs to, which
> makes it orphaned and unlocatable by the rest of the mm?
> 
> Ideally this would leave the PTE in place so everything continues to
> work, just disable CPU access to it.
> 
> Maybe some kind of special swap entry?
> 
> I also don't much like the use of ZONE_DEVICE here, that should only
> be used for actual device memory, not as a temporary proxy for CPU
> pages.. Having two struct pages refer to the same physical memory is
> pretty ugly.
> 
> > The normal solution of registering an MMU notifier to unpin the page when
> > it needs to be moved also doesn't work as the CPU page tables now point to
> > the device-private page and hence the migration code won't call any
> > invalidate notifiers for the CPU page.
> 
> The fact the page is lost from the MM seems to be the main issue here.
> 
> > Yes, I would like to avoid the long term pin constraints as well if
> > possible I just haven't found a solution yet. Are you suggesting it might
> > be possible to add a callback in the page migration logic to specially
> > deal with moving these pages?
> 
> How would migration even find the page?

Migration can scan memory by physical address (isolate_migratepages_range()),
so the CPU mapping is not the only path to get to a page.

Cheers,
Jérôme



Re: HMM fence (was Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD)

2021-01-14 Thread Jerome Glisse
On Thu, Jan 14, 2021 at 02:37:36PM +0100, Christian König wrote:
> Am 14.01.21 um 12:52 schrieb Daniel Vetter:
> > [SNIP]
> > > > I had a new idea, i wanted to think more about it but have not yet,
> > > > anyway here it is. Adding a new callback to dma fence which ask the
> > > > question can it dead lock ? Any time a GPU driver has pending page
> > > > fault (ie something calling into the mm) it answer yes, otherwise
> > > > no. The GPU shrinker would ask the question before waiting on any
> > > > dma-fence and back of if it gets yes. Shrinker can still try many
> > > > dma buf object for which it does not get a yes on associated fence.
> > > > 
> > > > This does not solve the mmu notifier case, for this you would just
> > > > invalidate the gem userptr object (with a flag but not releasing the
> > > > page refcount) but you would not wait for the GPU (ie no dma fence
> > > > wait in that code path anymore). The userptr API never really made
> > > > the contract that it will always be in sync with the mm view of the
> > > > world so if different page get remapped to same virtual address
> > > > while GPU is still working with the old pages it should not be an
> > > > issue (it would not be in our usage of userptr for compositor and
> > > > what not).
> > > The current working idea in my mind goes into a similar direction.
> > > 
> > > But instead of a callback I'm adding a complete new class of HMM fences.
> > > 
> > > Waiting in the MMU notfier, scheduler, TTM etc etc is only allowed for
> > > the dma_fences and HMM fences are ignored in container objects.
> > > 
> > > When you handle an implicit or explicit synchronization request from
> > > userspace you need to block for HMM fences to complete before taking any
> > > resource locks.
> > Isnt' that what I call gang scheduling? I.e. you either run in HMM
> > mode, or in legacy fencing mode (whether implicit or explicit doesn't
> > really matter I think). By forcing that split we avoid the problem,
> > but it means occasionally full stalls on mixed workloads.
> > 
> > But that's not what Jerome wants (afaiui at least), I think his idea
> > is to track the reverse dependencies of all the fences floating
> > around, and then skip evicting an object if you have to wait for any
> > fence that is problematic for the current calling context. And I don't
> > think that's very feasible in practice.
> > 
> > So what kind of hmm fences do you have in mind here?
> 
> It's a bit more relaxed than your gang schedule.
> 
> See the requirements are as follow:
> 
> 1. dma_fences never depend on hmm_fences.
> 2. hmm_fences can never preempt dma_fences.
> 3. dma_fences must be able to preempt hmm_fences or we always reserve enough
> hardware resources (CUs) to guarantee forward progress of dma_fences.
> 
> Critical sections are MMU notifiers, page faults, GPU schedulers and
> dma_reservation object locks.
> 
> 4. It is valid to wait for a dma_fences in critical sections.
> 5. It is not valid to wait for hmm_fences in critical sections.
> 
> Fence creation either happens during command submission or by adding
> something like a barrier or signal command to your userspace queue.
> 
> 6. If we have an hmm_fence as implicit or explicit dependency for creating a
> dma_fence we must wait for that before taking any locks or reserving
> resources.
> 7. If we have a dma_fence as implicit or explicit dependency for creating an
> hmm_fence we can wait later on. So busy waiting or special WAIT hardware
> commands are valid.
> 
> This prevents hard cuts, e.g. can mix hmm_fences and dma_fences at the same
> time on the hardware.
> 
> In other words we can have a high priority gfx queue running jobs based on
> dma_fences and a low priority compute queue running jobs based on
> hmm_fences.
> 
> Only when we switch from hmm_fence to dma_fence we need to block the
> submission until all the necessary resources (both memory as well as CUs)
> are available.
> 
> This is somewhat an extension to your gang submit idea.

What is an hmm_fence? You should not have fences with HMM at all.
So I am kind of scared now.

Cheers,
Jérôme



Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD

2021-01-13 Thread Jerome Glisse
On Wed, Jan 13, 2021 at 09:31:11PM +0100, Daniel Vetter wrote:
> On Wed, Jan 13, 2021 at 5:56 PM Jerome Glisse  wrote:
> > On Fri, Jan 08, 2021 at 03:40:07PM +0100, Daniel Vetter wrote:
> > > On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
> > > > Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
> > > > > On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
> > > > >> This is the first version of our HMM based shared virtual memory 
> > > > >> manager
> > > > >> for KFD. There are still a number of known issues that we're working 
> > > > >> through
> > > > >> (see below). This will likely lead to some pretty significant 
> > > > >> changes in
> > > > >> MMU notifier handling and locking on the migration code paths. So 
> > > > >> don't
> > > > >> get hung up on those details yet.
> > > > >>
> > > > >> But I think this is a good time to start getting feedback. We're 
> > > > >> pretty
> > > > >> confident about the ioctl API, which is both simple and extensible 
> > > > >> for the
> > > > >> future. (see patches 4,16) The user mode side of the API can be 
> > > > >> found here:
> > > > >> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
> > > > >>
> > > > >> I'd also like another pair of eyes on how we're interfacing with the 
> > > > >> GPU VM
> > > > >> code in amdgpu_vm.c (see patches 12,13), retry page fault handling 
> > > > >> (24,25),
> > > > >> and some retry IRQ handling changes (32).
> > > > >>
> > > > >>
> > > > >> Known issues:
> > > > >> * won't work with IOMMU enabled, we need to dma_map all pages 
> > > > >> properly
> > > > >> * still working on some race conditions and random bugs
> > > > >> * performance is not great yet
> > > > > Still catching up, but I think there's another one for your list:
> > > > >
> > > > >  * hmm gpu context preempt vs page fault handling. I've had a short
> > > > >discussion about this one with Christian before the holidays, and 
> > > > > also
> > > > >some private chats with Jerome. It's nasty since no easy fix, much 
> > > > > less
> > > > >a good idea what's the best approach here.
> > > >
> > > > Do you have a pointer to that discussion or any more details?
> > >
> > > Essentially if you're handling an hmm page fault from the gpu, you can
> > > deadlock by calling dma_fence_wait on a (chain of, possibly) other command
> > > submissions or compute contexts with dma_fence_wait. Which deadlocks if
> > > you can't preempt while you have that page fault pending. Two solutions:
> > >
> > > - your hw can (at least for compute ctx) preempt even when a page fault is
> > >   pending
> > >
> > > - lots of screaming in trying to come up with an alternate solution. They
> > >   all suck.
> > >
> > > Note that the dma_fence_wait is hard requirement, because we need that for
> > > mmu notifiers and shrinkers, disallowing that would disable dynamic memory
> > > management. Which is the current "ttm is self-limited to 50% of system
> > > memory" limitation Christian is trying to lift. So that's really not
> > > a restriction we can lift, at least not in upstream where we need to also
> > > support old style hardware which doesn't have page fault support and
> > > really has no other option to handle memory management than
> > > dma_fence_wait.
> > >
> > > Thread was here:
> > >
> > > https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=e...@mail.gmail.com/
> > >
> > > There's a few ways to resolve this (without having preempt-capable
> > > hardware), but they're all supremely nasty.
> > > -Daniel
> > >
> >
> > I had a new idea, i wanted to think more about it but have not yet,
> > anyway here it is. Adding a new callback to dma fence which ask the
> > question can it dead lock ? Any time a GPU driver has pending page
> > fault (ie something calling into the mm) it answer yes, 

Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD

2021-01-13 Thread Jerome Glisse
On Fri, Jan 08, 2021 at 03:40:07PM +0100, Daniel Vetter wrote:
> On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
> > Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
> > > On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
> > >> This is the first version of our HMM based shared virtual memory manager
> > >> for KFD. There are still a number of known issues that we're working 
> > >> through
> > >> (see below). This will likely lead to some pretty significant changes in
> > >> MMU notifier handling and locking on the migration code paths. So don't
> > >> get hung up on those details yet.
> > >>
> > >> But I think this is a good time to start getting feedback. We're pretty
> > >> confident about the ioctl API, which is both simple and extensible for 
> > >> the
> > >> future. (see patches 4,16) The user mode side of the API can be found 
> > >> here:
> > >> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wip/src/svm.c
> > >>
> > >> I'd also like another pair of eyes on how we're interfacing with the GPU 
> > >> VM
> > >> code in amdgpu_vm.c (see patches 12,13), retry page fault handling 
> > >> (24,25),
> > >> and some retry IRQ handling changes (32).
> > >>
> > >>
> > >> Known issues:
> > >> * won't work with IOMMU enabled, we need to dma_map all pages properly
> > >> * still working on some race conditions and random bugs
> > >> * performance is not great yet
> > > Still catching up, but I think there's another one for your list:
> > >
> > >  * hmm gpu context preempt vs page fault handling. I've had a short
> > >discussion about this one with Christian before the holidays, and also
> > >some private chats with Jerome. It's nasty since no easy fix, much less
> > >a good idea what's the best approach here.
> > 
> > Do you have a pointer to that discussion or any more details?
> 
> Essentially if you're handling an hmm page fault from the gpu, you can
> deadlock by calling dma_fence_wait on a (chain of, possibly) other command
> submissions or compute contexts with dma_fence_wait. Which deadlocks if
> you can't preempt while you have that page fault pending. Two solutions:
> 
> - your hw can (at least for compute ctx) preempt even when a page fault is
>   pending
> 
> - lots of screaming in trying to come up with an alternate solution. They
>   all suck.
> 
> Note that the dma_fence_wait is hard requirement, because we need that for
> mmu notifiers and shrinkers, disallowing that would disable dynamic memory
> management. Which is the current "ttm is self-limited to 50% of system
> memory" limitation Christian is trying to lift. So that's really not
> a restriction we can lift, at least not in upstream where we need to also
> support old style hardware which doesn't have page fault support and
> really has no other option to handle memory management than
> dma_fence_wait.
> 
> Thread was here:
> 
> https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1NBM7K0=e...@mail.gmail.com/
> 
> There's a few ways to resolve this (without having preempt-capable
> hardware), but they're all supremely nasty.
> -Daniel
> 

I had a new idea. I wanted to think more about it but have not yet;
anyway, here it is: add a new callback to dma fence which asks the
question "can it deadlock?". Any time a GPU driver has a pending page
fault (ie something calling into the mm) it answers yes, otherwise
no. The GPU shrinker would ask the question before waiting on any
dma-fence and back off if it gets a yes. The shrinker can still try many
dma-buf objects for which it does not get a yes on the associated fence.
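
A minimal sketch of what such a callback could look like (entirely
hypothetical; no such hook exists in the dma_fence API today, and the
structure and field names here are made up):

	struct gpu_fence {
		struct dma_fence base;
		/* non-zero while the owning context has a fault calling into mm */
		atomic_t mm_fault_pending;
	};

	/* the proposed hook: "can waiting on this fence dead lock?" */
	static bool gpu_fence_may_deadlock(struct dma_fence *f)
	{
		struct gpu_fence *gf = container_of(f, struct gpu_fence, base);

		return atomic_read(&gf->mm_fault_pending) != 0;
	}

	/* shrinker side: back off instead of waiting when the answer is yes */
	static bool shrinker_may_wait_on(struct dma_fence *f)
	{
		return !gpu_fence_may_deadlock(f);
	}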

This does not solve the mmu notifier case; for that you would just
invalidate the GEM userptr object (with a flag, but without releasing the
page refcount) and you would not wait for the GPU (ie no dma fence
wait in that code path anymore). The userptr API never really made
the contract that it will always be in sync with the mm view of the
world, so if a different page gets remapped to the same virtual address
while the GPU is still working with the old pages it should not be an
issue (it would not be in our usage of userptr for compositors and
what not).

Maybe I overlook something there.

Cheers,
Jérôme



Re: [PATCH 00/35] Add HMM-based SVM memory manager to KFD

2021-01-13 Thread Jerome Glisse
On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
> This is the first version of our HMM based shared virtual memory manager
> for KFD. There are still a number of known issues that we're working through
> (see below). This will likely lead to some pretty significant changes in
> MMU notifier handling and locking on the migration code paths. So don't
> get hung up on those details yet.

[...]

> Known issues:
> * won't work with IOMMU enabled, we need to dma_map all pages properly
> * still working on some race conditions and random bugs
> * performance is not great yet

What would those changes look like? Seeing the issues below, I do not
see how they interplay with the mmu notifiers. Can you elaborate?

Cheers,
Jérôme



Re: [PATCH 1/2] drm/ttm: rework ttm_tt page limit v2

2020-12-17 Thread Jerome Glisse
On Thu, Dec 17, 2020 at 07:19:07PM +0100, Daniel Vetter wrote:
> On Thu, Dec 17, 2020 at 7:09 PM Jerome Glisse  wrote:
> >
> > Adding few folks on cc just to raise awareness and so that i
> > could get corrected if i said anything wrong.
> >
> > On Thu, Dec 17, 2020 at 04:45:55PM +0100, Daniel Vetter wrote:
> > > On Thu, Dec 17, 2020 at 4:36 PM Christian König
> > >  wrote:
> > > > Am 17.12.20 um 16:26 schrieb Daniel Vetter:
> > > > > On Thu, Dec 17, 2020 at 4:10 PM Christian König
> > > > >  wrote:
> > > > >> Am 17.12.20 um 15:36 schrieb Daniel Vetter:
> > > > >>> On Thu, Dec 17, 2020 at 2:46 PM Christian König
> > > > >>>  wrote:
> > > > >>>> Am 16.12.20 um 16:09 schrieb Daniel Vetter:
> > > > >>>>> On Wed, Dec 16, 2020 at 03:04:26PM +0100, Christian König wrote:
> > > > >>>>> [SNIP]
> > > > >>>>>> +
> > > > >>>>>> +/* As long as pages are available make sure to release at least 
> > > > >>>>>> one */
> > > > >>>>>> +static unsigned long ttm_tt_shrinker_scan(struct shrinker 
> > > > >>>>>> *shrink,
> > > > >>>>>> +  struct shrink_control *sc)
> > > > >>>>>> +{
> > > > >>>>>> +struct ttm_operation_ctx ctx = {
> > > > >>>>>> +.no_wait_gpu = true
> > > > >>>>> Iirc there's an eventual shrinker limit where it gets desperate. 
> > > > >>>>> I think
> > > > >>>>> once we hit that, we should allow gpu waits. But it's not passed 
> > > > >>>>> to
> > > > >>>>> shrinkers for reasons, so maybe we should have a second round 
> > > > >>>>> that tries
> > > > >>>>> to more actively shrink objects if we fell substantially short of 
> > > > >>>>> what
> > > > >>>>> reclaim expected us to do?
> > > > >>>> I think we should try to avoid waiting for the GPU in the shrinker 
> > > > >>>> callback.
> > > > >>>>
> > > > >>>> When we get HMM we will have cases where the shrinker is called 
> > > > >>>> from
> > > > >>>> there and we can't wait for the GPU then without causing deadlocks.
> > > > >>> Uh that doesn't work. Also, the current rules are that you are 
> > > > >>> allowed
> > > > >>> to call dma_fence_wait from shrinker callbacks, so that shipped 
> > > > >>> sailed
> > > > >>> already. This is because shrinkers are a less restrictive context 
> > > > >>> than
> > > > >>> mmu notifier invalidation, and we wait in there too.
> > > > >>>
> > > > >>> So if you can't wait in shrinkers, you also can't wait in mmu
> > > > >>> notifiers (and also not in HMM, which is the same thing). Why do you
> > > > >>> need this?
> > > > >> The core concept of HMM is that pages are faulted in on demand and 
> > > > >> it is
> > > > >> perfectly valid for one of those pages to be on disk.
> > > > >>
> > > > >> So when a page fault happens we might need to be able to allocate 
> > > > >> memory
> > > > >> and fetch something from disk to handle that.
> > > > >>
> > > > >> When this memory allocation then in turn waits for the GPU which is
> > > > >> running the HMM process we are pretty much busted.
> > > > > Yeah you can't do that. That's the entire infinite fences discussions.
> > > >
> > > > Yes, exactly.
> > > >
> > > > > For HMM to work, we need to stop using dma_fence for userspace sync,
> > > >
> > > > I was considering of separating that into a dma_fence and a hmm_fence.
> > > > Or something like this.
> > >
> > > The trouble is that dma_fence it all its forms is uapi. And on gpus
> > > without page fault support dma_fence_wait is still required in
> > > allocation contexts. So creating a new kernel structure doesn't really

Re: [PATCH 1/2] drm/ttm: rework ttm_tt page limit v2

2020-12-17 Thread Jerome Glisse
Adding a few folks on cc just to raise awareness and so that I
can get corrected if I said anything wrong.

On Thu, Dec 17, 2020 at 04:45:55PM +0100, Daniel Vetter wrote:
> On Thu, Dec 17, 2020 at 4:36 PM Christian König
>  wrote:
> > Am 17.12.20 um 16:26 schrieb Daniel Vetter:
> > > On Thu, Dec 17, 2020 at 4:10 PM Christian König
> > >  wrote:
> > >> Am 17.12.20 um 15:36 schrieb Daniel Vetter:
> > >>> On Thu, Dec 17, 2020 at 2:46 PM Christian König
> > >>>  wrote:
> >  Am 16.12.20 um 16:09 schrieb Daniel Vetter:
> > > On Wed, Dec 16, 2020 at 03:04:26PM +0100, Christian König wrote:
> > > [SNIP]
> > >> +
> > >> +/* As long as pages are available make sure to release at least one 
> > >> */
> > >> +static unsigned long ttm_tt_shrinker_scan(struct shrinker *shrink,
> > >> +  struct shrink_control *sc)
> > >> +{
> > >> +struct ttm_operation_ctx ctx = {
> > >> +.no_wait_gpu = true
> > > Iirc there's an eventual shrinker limit where it gets desperate. I 
> > > think
> > > once we hit that, we should allow gpu waits. But it's not passed to
> > > shrinkers for reasons, so maybe we should have a second round that 
> > > tries
> > > to more actively shrink objects if we fell substantially short of what
> > > reclaim expected us to do?
> >  I think we should try to avoid waiting for the GPU in the shrinker 
> >  callback.
> > 
> >  When we get HMM we will have cases where the shrinker is called from
> >  there and we can't wait for the GPU then without causing deadlocks.
> > >>> Uh that doesn't work. Also, the current rules are that you are allowed
> > >>> to call dma_fence_wait from shrinker callbacks, so that shipped sailed
> > >>> already. This is because shrinkers are a less restrictive context than
> > >>> mmu notifier invalidation, and we wait in there too.
> > >>>
> > >>> So if you can't wait in shrinkers, you also can't wait in mmu
> > >>> notifiers (and also not in HMM, which is the same thing). Why do you
> > >>> need this?
> > >> The core concept of HMM is that pages are faulted in on demand and it is
> > >> perfectly valid for one of those pages to be on disk.
> > >>
> > >> So when a page fault happens we might need to be able to allocate memory
> > >> and fetch something from disk to handle that.
> > >>
> > >> When this memory allocation then in turn waits for the GPU which is
> > >> running the HMM process we are pretty much busted.
> > > Yeah you can't do that. That's the entire infinite fences discussions.
> >
> > Yes, exactly.
> >
> > > For HMM to work, we need to stop using dma_fence for userspace sync,
> >
> > I was considering of separating that into a dma_fence and a hmm_fence.
> > Or something like this.
> 
> The trouble is that dma_fence it all its forms is uapi. And on gpus
> without page fault support dma_fence_wait is still required in
> allocation contexts. So creating a new kernel structure doesn't really
> solve anything I think, it needs entire new uapi completely decoupled
> from memory management. Last time we've done new uapi was probably
> modifiers, and that's still not rolled out years later.

With HMM there should not be any fence! You do not need them.
If you feel you need them then you are doing something horribly
wrong. See below for what HMM needs and what it means.


> > > and you can only use the amdkfd style preempt fences. And preempting
> > > while the pagefault is pending is I thought something we require.
> >
> > Yeah, problem is that most hardware can't do that :)
> >
> > Getting page faults to work is hard enough, preempting while waiting for
> > a fault to return is not something which was anticipated :)
> 
> Hm last summer in a thread you said you've blocked that because it
> doesn't work. I agreed, page fault without preempt is rather tough to
> make work.
> 
> > > Iow, the HMM page fault handler must not be a dma-fence critical
> > > section, i.e. it's not allowed to hold up any dma_fence, ever.
> >
> > What do you mean with that?
> 
> dma_fence_signalling_begin/end() annotations essentially, i.e.
> cross-release dependencies. Or the other way round, if you want to be
> able to allocate memory you have to guarantee that you're never
> holding up a dma_fence.

Correct, nothing regarding dma/ttm/gem should creep into the HMM code
path.


For HMM, what you want when handling a GPU fault is to do it without
holding any GPU driver locks, so that the regular page fault handler
code path can go back into the GPU driver (through the shrinker) without
worrying about it.

This is how nouveau does it:
- get the event about the page fault (might hold some GPU lock)
- walk the event buffer to get all faulting addresses
  (might hold some GPU lock)

! DROP ALL GPU/DRIVER LOCKS !

- try to coalesce faults together (adjacent addresses
  trigger a fault for a single range)
- call into the HMM/mmu notifier helpers to handle the fault
- t

Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-22 Thread Jerome Glisse
On Mon, Jun 22, 2020 at 08:46:17AM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 19, 2020 at 04:31:47PM -0400, Jerome Glisse wrote:
> > Not doable as page refcount can change for things unrelated to GUP, with
> > John changes we can identify GUP and we could potentialy copy GUPed page
> > instead of COW but this can potentialy slow down fork() and i am not sure
> > how acceptable this would be. Also this does not solve GUP against page
> > that are already in fork tree ie page P0 is in process A which forks,
> > we now have page P0 in process A and B. Now we have process A which forks
> > again and we have page P0 in A, B, and C. Here B and C are two branches
> > with root in A. B and/or C can keep forking and grow the fork tree.
> 
> For a long time now RDMA has broken COW pages when creating user DMA
> regions.
> 
> The problem has been that fork re-COW's regions that had their COW
> broken.
> 
> So, if you break the COW upon mapping and prevent fork (and others)
> from copying DMA pinned then you'd cover the cases.

I am not sure we want to prevent COW for pinned GUP pages; this would
change the current semantics and potentially break or slow down existing
apps.

Anyway, I think we focus too much on fork/COW. It is just an unfixable,
broken corner case; mmu notifiers allow you to avoid it. Forcing a real
copy on fork would likely be seen as a regression by most people.


> > Semantic was change with 17839856fd588f4ab6b789f482ed3ffd7c403e1f to some
> > what "fix" that but GUP fast is still succeptible to this.
> 
> Ah, so everyone breaks the COW now, not just RDMA..
> 
> What do you mean 'GUP fast is still succeptible to this' ?

Not all GUP fast paths are updated (intentionally); __get_user_pages_fast()
for instance still keeps COW intact. People using GUP should really know
what they are doing.

Cheers,
Jérôme



Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-19 Thread Jerome Glisse
On Fri, Jun 19, 2020 at 10:43:20PM +0200, Daniel Vetter wrote:
> On Fri, Jun 19, 2020 at 10:10 PM Jerome Glisse  wrote:
> >
> > On Fri, Jun 19, 2020 at 03:18:49PM -0300, Jason Gunthorpe wrote:
> > > On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
> > > > On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> > > > > On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> > > > >
> > > > > > The madness is only that device B's mmu notifier might need to wait
> > > > > > for fence_B so that the dma operation finishes. Which in turn has to
> > > > > > wait for device A to finish first.
> > > > >
> > > > > So, it sound, fundamentally you've got this graph of operations across
> > > > > an unknown set of drivers and the kernel cannot insert itself in
> > > > > dma_fence hand offs to re-validate any of the buffers involved?
> > > > > Buffers which by definition cannot be touched by the hardware yet.
> > > > >
> > > > > That really is a pretty horrible place to end up..
> > > > >
> > > > > Pinning really is right answer for this kind of work flow. I think
> > > > > converting pinning to notifers should not be done unless notifier
> > > > > invalidation is relatively bounded.
> > > > >
> > > > > I know people like notifiers because they give a bit nicer performance
> > > > > in some happy cases, but this cripples all the bad cases..
> > > > >
> > > > > If pinning doesn't work for some reason maybe we should address that?
> > > >
> > > > Note that the dma fence is only true for user ptr buffer which predate
> > > > any HMM work and thus were using mmu notifier already. You need the
> > > > mmu notifier there because of fork and other corner cases.
> > >
> > > I wonder if we should try to fix the fork case more directly - RDMA
> > > has this same problem and added MADV_DONTFORK a long time ago as a
> > > hacky way to deal with it.
> > >
> > > Some crazy page pin that resolved COW in a way that always kept the
> > > physical memory with the mm that initiated the pin?
> >
> > Just no way to deal with it easily, i thought about forcing the
> > anon_vma (page->mapping for anonymous page) to the anon_vma that
> > belongs to the vma against which the GUP was done but it would
> > break things if page is already in other branch of a fork tree.
> > Also this forbid fast GUP.
> >
> > Quite frankly the fork was not the main motivating factor. GPU
> > can pin potentialy GBytes of memory thus we wanted to be able
> > to release it but since Michal changes to reclaim code this is
> > no longer effective.
> 
> What where how? My patch to annote reclaim paths with mmu notifier
> possibility just landed in -mm, so if direct reclaim can't reclaim mmu
> notifier'ed stuff anymore we need to know.
> 
> Also this would resolve the entire pain we're discussing in this
> thread about dma_fence_wait deadlocking against anything that's not
> GFP_ATOMIC ...

Sorry, my bad: reclaim still works, only the OOM path skips it. It was a
couple of years ago and I thought that some of the things discussed a
while back did make it upstream.

It is probably a good time to also point out that what I wanted
to do is have all the mmu notifier callbacks provide some kind
of fence (not a dma fence) so that we can split the notification
into steps:
A- schedule notification on all devices/systems and get fences;
   this step should minimize lock dependencies and should
   not have to wait for anything; it is also best if you can avoid
   memory allocation, for instance by pre-allocating what
   you need for the notification.
B- mm can do things like unmap but can not map a new page,
   so it writes a special swap pte to the CPU page table.
C- wait on each fence from A
... resume old code ie replace the pte or finish the unmap ...

The idea here is that at step C the core mm can decide to back
off if any fence returned from A has to wait. This means that
every device is invalidating for nothing, but if we get there
then it might still be a good thing, as next time around maybe
the kernel would be successful without a wait.

This would allow things like reclaim to make forward progress
and skip over, or limit the wait time to a given timeout.

Also I thought to extend this even to the multi-CPU TLB flush so
that devices and CPUs follow the same pattern and we can make
progress on each in parallel.
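
A rough sketch of what the split from steps A-C above might look like as
an API (entirely hypothetical; none of these hooks exist, and the fence
here is a driver-provided completion, not a dma fence):

	struct mmu_notifier_fence;	/* opaque, pre-allocated by the driver */

	struct mmu_notifier_split_ops {
		/*
		 * Step A: start the invalidation and return a fence; must
		 * not block and should avoid memory allocation.
		 */
		struct mmu_notifier_fence *(*invalidate_schedule)(
				struct mmu_notifier *mn,
				const struct mmu_notifier_range *range);
		/*
		 * Step C: core mm waits, possibly with a timeout, so that
		 * reclaim can back off and retry instead of blocking.
		 */
		int (*fence_wait)(struct mmu_notifier_fence *fence,
				  unsigned long timeout);
	};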


Getting to such a scheme is a lot of work. My plan was to first
get the fence as part of the notifier user API and hide it from
the mm inside the notifier common code, then update each core mm path
to the new model and see if there is any benefit from it. Reclaim would
be the first candidate.

Cheers,
Jérôme



Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-19 Thread Jerome Glisse
On Fri, Jun 19, 2020 at 04:55:38PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 19, 2020 at 03:48:49PM -0400, Felix Kuehling wrote:
> > Am 2020-06-19 um 2:18 p.m. schrieb Jason Gunthorpe:
> > > On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
> > >> On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> > >>> On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> > >>>
> > >>>> The madness is only that device B's mmu notifier might need to wait
> > >>>> for fence_B so that the dma operation finishes. Which in turn has to
> > >>>> wait for device A to finish first.
> > >>> So, it sound, fundamentally you've got this graph of operations across
> > >>> an unknown set of drivers and the kernel cannot insert itself in
> > >>> dma_fence hand offs to re-validate any of the buffers involved?
> > >>> Buffers which by definition cannot be touched by the hardware yet.
> > >>>
> > >>> That really is a pretty horrible place to end up..
> > >>>
> > >>> Pinning really is right answer for this kind of work flow. I think
> > >>> converting pinning to notifers should not be done unless notifier
> > >>> invalidation is relatively bounded. 
> > >>>
> > >>> I know people like notifiers because they give a bit nicer performance
> > >>> in some happy cases, but this cripples all the bad cases..
> > >>>
> > >>> If pinning doesn't work for some reason maybe we should address that?
> > >> Note that the dma fence is only true for user ptr buffer which predate
> > >> any HMM work and thus were using mmu notifier already. You need the
> > >> mmu notifier there because of fork and other corner cases.
> > > I wonder if we should try to fix the fork case more directly - RDMA
> > > has this same problem and added MADV_DONTFORK a long time ago as a
> > > hacky way to deal with it.
> > >
> > > Some crazy page pin that resolved COW in a way that always kept the
> > > physical memory with the mm that initiated the pin?
> > >
> > > (isn't this broken for O_DIRECT as well anyhow?)
> > >
> > > How does mmu_notifiers help the fork case anyhow? Block fork from
> > > progressing?
> > 
> > How much the mmu_notifier blocks fork progress depends, on quickly we
> > can preempt GPU jobs accessing affected memory. If we don't have
> > fine-grained preemption capability (graphics), the best we can do is
> > wait for the GPU jobs to complete. We can also delay submission of new
> > GPU jobs to the same memory until the MMU notifier is done. Future jobs
> > would use the new page addresses.
> > 
> > With fine-grained preemption (ROCm compute), we can preempt GPU work on
> > the affected adders space to minimize the delay seen by fork.
> > 
> > With recoverable device page faults, we can invalidate GPU page table
> > entries, so device access to the affected pages stops immediately.
> > 
> > In all cases, the end result is, that the device page table gets updated
> > with the address of the copied pages before the GPU accesses the COW
> > memory again.Without the MMU notifier, we'd end up with the GPU
> > corrupting memory of the other process.
> 
> The model here in fork has been wrong for a long time, and I do wonder
> how O_DIRECT manages to not be broken too.. I guess the time windows
> there are too small to get unlucky.

This was discussed extensively in the GUP work John has been doing.
Yes, O_DIRECT can potentially break, but only if you are writing to
COW pages, you initiated the O_DIRECT right before the fork, and the
GUP happened before fork was able to write-protect the pages.

If you O_DIRECT but use memory as input, ie you are writing the
memory to the file, not reading from the file, then fork is harmless
as you are just reading memory. You can still face the COW uncertainty
(the process against which you did the O_DIRECT gets "new" pages but your
O_DIRECT goes on with the "old" pages), but doing O_DIRECT and fork
concurrently is asking for trouble.

> 
> If you have a write pin on a page then it should not be COW'd into the
> fork'd process but copied with the originating page remaining with the
> original mm.
> 
> I wonder if there is some easy way to achive that - if that is the
> main reason to use notifiers then it would be a better solution.

Not doable, as page refcounts can change for things unrelated to GUP. With
John's changes we can identify

Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-19 Thread Jerome Glisse
On Fri, Jun 19, 2020 at 03:18:49PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 19, 2020 at 02:09:35PM -0400, Jerome Glisse wrote:
> > On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> > > On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> > > 
> > > > The madness is only that device B's mmu notifier might need to wait
> > > > for fence_B so that the dma operation finishes. Which in turn has to
> > > > wait for device A to finish first.
> > > 
> > > So, it sound, fundamentally you've got this graph of operations across
> > > an unknown set of drivers and the kernel cannot insert itself in
> > > dma_fence hand offs to re-validate any of the buffers involved?
> > > Buffers which by definition cannot be touched by the hardware yet.
> > > 
> > > That really is a pretty horrible place to end up..
> > > 
> > > Pinning really is right answer for this kind of work flow. I think
> > > converting pinning to notifers should not be done unless notifier
> > > invalidation is relatively bounded. 
> > > 
> > > I know people like notifiers because they give a bit nicer performance
> > > in some happy cases, but this cripples all the bad cases..
> > > 
> > > If pinning doesn't work for some reason maybe we should address that?
> > 
> > Note that the dma fence is only true for user ptr buffer which predate
> > any HMM work and thus were using mmu notifier already. You need the
> > mmu notifier there because of fork and other corner cases.
> 
> I wonder if we should try to fix the fork case more directly - RDMA
> has this same problem and added MADV_DONTFORK a long time ago as a
> hacky way to deal with it.
>
> Some crazy page pin that resolved COW in a way that always kept the
> physical memory with the mm that initiated the pin?

There is just no way to deal with it easily. I thought about forcing the
anon_vma (page->mapping for anonymous pages) to the anon_vma that
belongs to the vma against which the GUP was done, but it would
break things if the page is already in another branch of a fork tree.
Also this forbids fast GUP.

Quite frankly, fork was not the main motivating factor. A GPU
can pin potentially gigabytes of memory, thus we wanted to be able
to release it, but since Michal's changes to the reclaim code this is
no longer effective.

User buffers should never end up in those weird corner cases. IIRC
the first usage was for xorg EXA texture upload; it was then generalized
to texture upload in mesa and later on to more upload cases
(vertices, ...). At least this is what I remember today. So in
those cases we do not expect fork, splice, mremap, mprotect, ...

Maybe we can audit how userptr buffers are used today and see if
we can define a usage pattern that would allow cutting corners in the
kernel. For instance we could use the mmu notifier just to block CPU
pte updates while we do the GUP and thus never wait on a dma fence.

Then the GPU driver just keeps the GUP pin around until it is done
with the pages. It can also use the mmu notifier to keep a flag
so that the driver knows if it needs to redo a GUP, ie:

The notifier path:
    GPU_mmu_notifier_start_callback(range)
        gpu_lock_cpu_pagetable(range)
        for_each_bo_in(bo, range) {
            bo->need_gup = true;
        }
        gpu_unlock_cpu_pagetable(range)

    GPU_validate_buffer_pages(bo)
        if (!bo->need_gup)
            return;
        put_pages(bo->pages);
        range = bo_vaddr_range(bo)
        gpu_lock_cpu_pagetable(range)
        GUP(bo->pages, range)
        gpu_unlock_cpu_pagetable(range)


Depending on how userptr is used today this could work.


> (isn't this broken for O_DIRECT as well anyhow?)

Yes, it can in theory, if you have an application that does O_DIRECT
and fork concurrently (ie O_DIRECT in one thread and fork in another).
Note that O_DIRECT after fork is fine; it is an issue only if GUP-fast
was able to look up a page with write permission before fork had the
chance to update it to read-only for COW.

But doing O_DIRECT (or anything that uses GUP-fast) in one thread and
fork in another is inherently broken, ie there is no way to fix it.

See 17839856fd588f4ab6b789f482ed3ffd7c403e1f

> 
> How does mmu_notifiers help the fork case anyhow? Block fork from
> progressing?

It enforces ordering between fork and GUP: if fork is first it blocks
GUP, and if fork is last then fork waits on GUP and then the user buffer
gets invalidated.

> 
> > I probably need to warn AMD folks again that using HMM means that you
> > must be able to update the GPU page table asynchronously without
> > fence wait.
> 
> It is kind of unrelated to HMM, it just shouldn't be using mmu
> notifiers to replace page pinning..

Well, my POV is that if you abid

Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-19 Thread Jerome Glisse
On Fri, Jun 19, 2020 at 03:30:32PM -0400, Felix Kuehling wrote:
> 
> Am 2020-06-19 um 3:11 p.m. schrieb Alex Deucher:
> > On Fri, Jun 19, 2020 at 2:09 PM Jerome Glisse  wrote:
> >> On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> >>> On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> >>>
> >>>> The madness is only that device B's mmu notifier might need to wait
> >>>> for fence_B so that the dma operation finishes. Which in turn has to
> >>>> wait for device A to finish first.
> >>> So, it sound, fundamentally you've got this graph of operations across
> >>> an unknown set of drivers and the kernel cannot insert itself in
> >>> dma_fence hand offs to re-validate any of the buffers involved?
> >>> Buffers which by definition cannot be touched by the hardware yet.
> >>>
> >>> That really is a pretty horrible place to end up..
> >>>
> >>> Pinning really is right answer for this kind of work flow. I think
> >>> converting pinning to notifers should not be done unless notifier
> >>> invalidation is relatively bounded.
> >>>
> >>> I know people like notifiers because they give a bit nicer performance
> >>> in some happy cases, but this cripples all the bad cases..
> >>>
> >>> If pinning doesn't work for some reason maybe we should address that?
> >> Note that the dma fence is only true for user ptr buffer which predate
> >> any HMM work and thus were using mmu notifier already. You need the
> >> mmu notifier there because of fork and other corner cases.
> >>
> >> For nouveau the notifier do not need to wait for anything it can update
> >> the GPU page table right away. Modulo needing to write to GPU memory
> >> using dma engine if the GPU page table is in GPU memory that is not
> >> accessible from the CPU but that's never the case for nouveau so far
> >> (but i expect it will be at one point).
> >>
> >>
> >> So i see this as 2 different cases, the user ptr case, which does pin
> >> pages by the way, where things are synchronous. Versus the HMM cases
> >> where everything is asynchronous.
> >>
> >>
> >> I probably need to warn AMD folks again that using HMM means that you
> >> must be able to update the GPU page table asynchronously without
> >> fence wait. The issue for AMD is that they already update their GPU
> >> page table using DMA engine. I believe this is still doable if they
> >> use a kernel only DMA engine context, where only kernel can queue up
> >> jobs so that you do not need to wait for unrelated things and you can
> >> prioritize GPU page table update which should translate in fast GPU
> >> page table update without DMA fence.
> > All devices which support recoverable page faults also have a
> > dedicated paging engine for the kernel driver which the driver already
> > makes use of.  We can also update the GPU page tables with the CPU.
> 
> We have a potential problem with CPU updating page tables while the GPU
> is retrying on page table entries because 64 bit CPU transactions don't
> arrive in device memory atomically.
> 
> We are using SDMA for page table updates. This currently goes through a
> the DRM GPU scheduler to a special SDMA queue that's used by kernel-mode
> only. But since it's based on the DRM GPU scheduler, we do use dma-fence
> to wait for completion.

Yeah, my worry is mostly that some cross dma fence dependency leaks into
it, but it should never really happen; maybe there is a way to catch it
if it does and print a warning.

So yes, you can use dma fences, as long as they do not have
cross-dependencies. Another expectation is that they complete quickly,
and usually page table updates do.

Cheers,
Jérôme



Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-19 Thread Jerome Glisse
On Thu, Jun 11, 2020 at 07:35:35PM -0400, Felix Kuehling wrote:
> Am 2020-06-11 um 10:15 a.m. schrieb Jason Gunthorpe:
> > On Thu, Jun 11, 2020 at 10:34:30AM +0200, Daniel Vetter wrote:
> >>> I still have my doubts about allowing fence waiting from within shrinkers.
> >>> IMO ideally they should use a trywait approach, in order to allow memory
> >>> allocation during command submission for drivers that
> >>> publish fences before command submission. (Since early reservation object
> >>> release requires that).
> >> Yeah it is a bit annoying, e.g. for drm/scheduler I think we'll end up
> >> with a mempool to make sure it can handle it's allocations.
> >>
> >>> But since drivers are already waiting from within shrinkers and I take 
> >>> your
> >>> word for HMM requiring this,
> >> Yeah the big trouble is HMM and mmu notifiers. That's the really awkward
> >> one, the shrinker one is a lot less established.
> > I really question if HW that needs something like DMA fence should
> > even be using mmu notifiers - the best use is HW that can fence the
> > DMA directly without having to get involved with some command stream
> > processing.
> >
> > Or at the very least it should not be a generic DMA fence but a
> > narrowed completion tied only into the same GPU driver's command
> > completion processing which should be able to progress without
> > blocking.
> >
> > The intent of notifiers was never to endlessly block while vast
> > amounts of SW does work.
> >
> > Going around and switching everything in a GPU to GFP_ATOMIC seems
> > like bad idea.
> >
> >> I've pinged a bunch of armsoc gpu driver people and ask them how much this
> >> hurts, so that we have a clear answer. On x86 I don't think we have much
> >> of a choice on this, with userptr in amd and i915 and hmm work in nouveau
> >> (but nouveau I think doesn't use dma_fence in there). 
> 
> Soon nouveau will get company. We're working on a recoverable page fault
> implementation for HMM in amdgpu where we'll need to update page tables
> using the GPUs SDMA engine and wait for corresponding fences in MMU
> notifiers.

Note that HMM mandates, and I stressed this several times in the past,
that all GPU page table updates are asynchronous and do not have to
wait on _anything_.

I understand that you use a DMA engine for GPU page table updates, but
if you want to do so with HMM then you need a page-table-update-only
DMA context that all GPU page table updates go through and where user
space cannot queue up jobs.

It can be for HMM only, but if you want to mix HMM with non-HMM then
everything needs to be on that queue and other command queues will have
to depend on it.
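
A hedged sketch of what the notifier side then looks like under that
constraint (my_gpu_vm, my_paging_queue and my_gpu_queue_pte_clear() are
invented names, not real driver code):

	struct my_gpu_vm {
		struct mmu_interval_notifier notifier;
		struct my_paging_queue *paging_queue;	/* kernel-only ring */
	};

	static bool my_gpu_invalidate(struct mmu_interval_notifier *mni,
				      const struct mmu_notifier_range *range,
				      unsigned long cur_seq)
	{
		struct my_gpu_vm *vm =
			container_of(mni, struct my_gpu_vm, notifier);

		mmu_interval_set_seq(mni, cur_seq);

		/* queue the PTE clear on the kernel-only paging context;
		 * it completes without depending on any user-space job,
		 * so nothing here waits on a user dma_fence */
		my_gpu_queue_pte_clear(vm->paging_queue,
				       range->start, range->end);

		return true;
	}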

Cheers,
Jérôme



Re: [Linaro-mm-sig] [PATCH 04/18] dma-fence: prime lockdep annotations

2020-06-19 Thread Jerome Glisse
On Fri, Jun 19, 2020 at 02:23:08PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 19, 2020 at 06:19:41PM +0200, Daniel Vetter wrote:
> 
> > The madness is only that device B's mmu notifier might need to wait
> > for fence_B so that the dma operation finishes. Which in turn has to
> > wait for device A to finish first.
> 
> So, it sound, fundamentally you've got this graph of operations across
> an unknown set of drivers and the kernel cannot insert itself in
> dma_fence hand offs to re-validate any of the buffers involved?
> Buffers which by definition cannot be touched by the hardware yet.
> 
> That really is a pretty horrible place to end up..
> 
> Pinning really is right answer for this kind of work flow. I think
> converting pinning to notifers should not be done unless notifier
> invalidation is relatively bounded. 
> 
> I know people like notifiers because they give a bit nicer performance
> in some happy cases, but this cripples all the bad cases..
> 
> If pinning doesn't work for some reason maybe we should address that?

Note that the dma fence issue is only true for userptr buffers, which
predate any HMM work and were thus already using mmu notifiers. You
need the mmu notifier there because of fork and other corner cases.

For nouveau the notifier does not need to wait for anything; it can
update the GPU page table right away. Modulo needing to write to GPU
memory using the DMA engine if the GPU page table is in GPU memory that
is not accessible from the CPU, but that has never been the case for
nouveau so far (though I expect it will be at some point).


So I see this as two different cases: the userptr case (which does pin
pages, by the way), where things are synchronous, versus the HMM case,
where everything is asynchronous.


I probably need to warn AMD folks again that using HMM means that you
must be able to update the GPU page table asynchronously, without a
fence wait. The issue for AMD is that they already update their GPU
page table using the DMA engine. I believe this is still doable if they
use a kernel-only DMA engine context, where only the kernel can queue
up jobs, so that you do not need to wait for unrelated things and can
prioritize GPU page table updates. That should translate into fast GPU
page table updates without a DMA fence.
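
To make that concrete, here is a minimal sketch of what such a notifier
callback could look like on top of a kernel-only page table update
queue. The my_gpu_* / my_pt_* names are hypothetical, not an existing
driver API; only the mmu_interval_notifier helpers are the real kernel
interface. The key property is that the callback only queues work on a
queue user space cannot submit to, so it never waits on a user
dma_fence:

struct my_gpu_svm {
        struct mmu_interval_notifier notifier;
        struct mutex pt_lock;           /* also taken by the GPU fault path */
        struct my_pt_queue *pt_queue;   /* kernel-only DMA context for PT jobs */
};

/*
 * Hypothetical: emits a page table invalidation job on the kernel-only
 * context; it may wait for other kernel PT jobs, never for user work.
 */
void my_pt_queue_submit(struct my_pt_queue *q, unsigned long start,
                        unsigned long end);

static bool my_gpu_invalidate(struct mmu_interval_notifier *mni,
                              const struct mmu_notifier_range *range,
                              unsigned long cur_seq)
{
        struct my_gpu_svm *svm = container_of(mni, struct my_gpu_svm, notifier);

        if (!mmu_notifier_range_blockable(range))
                return false;

        mutex_lock(&svm->pt_lock);
        mmu_interval_set_seq(mni, cur_seq);

        /*
         * Queue the GPU page table invalidation and return.  Because user
         * space cannot submit to pt_queue, anything this might wait on is
         * another kernel page table job, never a user dma_fence, so
         * progress is bounded.
         */
        my_pt_queue_submit(svm->pt_queue, range->start, range->end);

        mutex_unlock(&svm->pt_lock);
        return true;
}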


> > Full disclosure: We are aware that we've designed ourselves into an
> > impressive corner here, and there's lots of talks going on about
> > untangling the dma synchronization from the memory management
> > completely. But
> 
> I think the documenting is really important: only GPU should be using
> this stuff and driving notifiers this way. Complete NO for any
> totally-not-a-GPU things in drivers/accel for sure.

Yes, for users that expect HMM they need to be asynchronous. But it is
hard to revert userptr as it was done a long time ago.

Cheers,
Jérôme



Re: [GIT PULL] Please pull hmm changes

2019-12-05 Thread Jerome Glisse
On Tue, Dec 03, 2019 at 02:42:12AM +, Jason Gunthorpe wrote:
> On Sat, Nov 30, 2019 at 10:23:31AM -0800, Linus Torvalds wrote:
> > On Sat, Nov 30, 2019 at 10:03 AM Linus Torvalds
> >  wrote:
> > >
> > > I'll try to figure the code out, but my initial reaction was "yeah,
> > > not in my VM".
> > 
> > Why is it ok to sometimes do
> > 
> > WRITE_ONCE(mni->invalidate_seq, cur_seq);
> > 
> > (to pair with the unlocked READ_ONCE), and sometimes then do
> > 
> > mni->invalidate_seq = mmn_mm->invalidate_seq;
> > 
> > My initial guess was that latter is only done at initialization time,
> > but at least in one case it's done *after* the mni has been added to
> > the mmn_mm (oh, how I despise those names - I can only repeat: WTF?).
> 
> Yes, the only occurrences are in the notifier_insert, under the
> spinlock. The one case where it is out of the natural order was to
> make the manipulation of seq a bit saner, but in all cases since the
> spinlock is held there is no way for another thread to get the pointer
> to the 'mmu_interval_notifier *' to do the unlocked read.
> 
> Regarding the ugly names.. Naming has been really hard here because
> currently everything is a 'mmu notifier' and the natural abberviations
> from there are crummy. Here is the basic summary:
> 
> struct mmu_notifier_mm (ie the mm->mmu_notifier_mm)
>-> mmn_mm
> struct mm_struct 
>-> mm
> struct mmu_notifier (ie the user subscription to the mm_struct)
>-> mn
> struct mmu_interval_notifier (the other kind of user subscription)
>-> mni

What about "interval" the context should already tell people
it is related to mmu notifier and thus a notifier. I would
just remove the notifier suffix, this would match the below
range.

> struct mmu_notifier_range (ie the args to invalidate_range)
>-> range

Yeah, "range" works, as the context should tell you it is related to
mmu notifiers.

> 
> I can send a patch to switch mmn_mm to mmu_notifier_mm, which is the
> only pre-existing name for this value. But IIRC, it is a somewhat ugly
> with long line wrapping. 'mni' is a pain, I have to reflect on that.
> (honesly, I dislike mmu_notififer_mm quite a lot too)
> 
> I think it would be overall nicer with better names for the original
> structs. Perhaps:
> 
>  mmn_* - MMU notifier prefix
>  mmn_state <- struct mmu_notifier_mm
>  mmn_subscription (mmn_sub) <- struct mmu_notifier
>  mmn_range_subscription (mmn_range_sub) <- struct mmu_interval_notifier
>  mmn_invalidate_desc <- struct mmu_notifier_range

This looks good.

> 
> At least this is how I describe them in my mind..  This is a lot of
> churn, and spreads through many drivers. This is why I kept the names
> as-is and we ended up with the also quite bad 'mmu_interval_notifier'
> 
> Maybe just switch mmu_notifier_mm for mmn_state and leave the drivers
> alone?
> 
> Anyone on the CC list have advice?

Maybe we can do a semantic patch to do the conversion, and then Linus
can easily apply the patch by just re-running coccinelle.

Cheers,
Jérôme


Re: [RFC 06/13] drm/i915/svm: Page table mirroring support

2019-12-04 Thread Jerome Glisse
On Tue, Dec 03, 2019 at 11:19:43AM -0800, Niranjan Vishwanathapura wrote:
> On Tue, Nov 26, 2019 at 06:32:52PM +, Jason Gunthorpe wrote:
> > On Mon, Nov 25, 2019 at 11:33:27AM -0500, Jerome Glisse wrote:
> > > On Fri, Nov 22, 2019 at 11:33:12PM +, Jason Gunthorpe wrote:
> > > > On Fri, Nov 22, 2019 at 12:57:27PM -0800, Niranjana Vishwanathapura 
> > > > wrote:
> > > 
> > > [...]
> > > 
> > > > > +static int
> > > > > +i915_range_fault(struct i915_svm *svm, struct hmm_range *range)
> > > > > +{
> > > > > + long ret;
> > > > > +
> > > > > + range->default_flags = 0;
> > > > > + range->pfn_flags_mask = -1UL;
> > > > > +
> > > > > + ret = hmm_range_register(range, &svm->mirror);
> > > > > + if (ret) {
> > > > > + up_read(&svm->mm->mmap_sem);
> > > > > + return (int)ret;
> > > > > + }
> > > >
> > > >
> > > > Using a temporary range is the pattern from nouveau, is it really
> > > > necessary in this driver?
> > > 
> > > Just to comment on this, for GPU the usage model is not application
> > > register range of virtual address it wants to use. It is GPU can
> > > access _any_ CPU valid address just like the CPU would (modulo mmap
> > > of device file).
> > > 
> > > This is because the API you want in userspace is application passing
> > > random pointer to the GPU and GPU being able to chase down any kind
> > > of random pointer chain (assuming all valid ie pointing to valid
> > > virtual address for the process).
> > > 
> > > This is unlike the RDMA case.
> > 
> > No, RDMA has the full address space option as well, it is called
> > 'implicit ODP'
> > 
> > This is implemented by registering ranges at a level in our page
> > table (currently 512G) whenever that level has populated pages below
> > it.
> > 
> > I think is a better strategy than temporary ranges.

HMM's original design did not have ranges and was well suited to
nouveau. Recent changes tie it more to ranges and make it less suited
to nouveau. I would not consider a 512GB implicit range a good thing.
The plan I have is to create implicit ranges and align them on VMAs. I
do not know when I will have time to get to that.
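
As a rough sketch of that plan (struct my_svm_range and my_svm_ops are
hypothetical driver pieces; mmu_interval_notifier_insert() is the real
API): on the first GPU fault that hits a VA with no registered range,
create an implicit range aligned on the faulting VMA:

struct my_svm_range {                   /* hypothetical per-VMA mirror */
        struct mmu_interval_notifier notifier;
};

/* The driver's invalidate callback lives in these ops (not shown). */
static const struct mmu_interval_notifier_ops my_svm_ops;

static int my_svm_create_vma_range(struct mm_struct *mm,
                                   unsigned long fault_addr,
                                   struct my_svm_range **out)
{
        struct vm_area_struct *vma;
        unsigned long start, length;
        struct my_svm_range *range;
        int ret;

        down_read(&mm->mmap_sem);
        vma = find_vma(mm, fault_addr);
        if (!vma || fault_addr < vma->vm_start) {
                up_read(&mm->mmap_sem);
                return -EFAULT;
        }
        start = vma->vm_start;
        length = vma->vm_end - vma->vm_start;
        up_read(&mm->mmap_sem);

        range = kzalloc(sizeof(*range), GFP_KERNEL);
        if (!range)
                return -ENOMEM;

        /*
         * If the VMA changes once mmap_sem is dropped it does not matter:
         * the fault path re-validates everything against the notifier
         * sequence before committing GPU page table entries.
         */
        ret = mmu_interval_notifier_insert(&range->notifier, mm, start,
                                           length, &my_svm_ops);
        if (ret) {
                kfree(range);
                return ret;
        }

        *out = range;
        return 0;
}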

> > 
> > But other GPU drivers like AMD are using BO and TTM objects with fixed
> > VA ranges and the range is tied to the BO/TTM.
> > 
> > I'm not sure if this i915 usage is closer to AMD or closer to nouveau.
> > 
> 
> I don't know the full details of the HMM usecases in amd and nouveau.
> AMD seems to be using it for usrptr objects which is tied to a BO.
> I am not sure if nouveau has any BO tied to these address ranges.

It is closer to nouveau. AMD's usage is the old userptr use case,
where you have a BO tied to a range, while SVM means any valid CPU
address and thus implies that there is no BO tied to it (there is still
a BO use case that must keep working here).

Cheers,
Jérôme


Re: [RFC 06/13] drm/i915/svm: Page table mirroring support

2019-11-25 Thread Jerome Glisse
On Mon, Nov 25, 2019 at 01:24:18PM +, Jason Gunthorpe wrote:
> On Sun, Nov 24, 2019 at 01:12:47PM -0800, Niranjan Vishwanathapura wrote:
> 
> > > > > Using a temporary range is the pattern from nouveau, is it really
> > > > > necessary in this driver?
> > > > 
> > > > Yah, not required. In my local build I tried with proper default_flags
> > > > and set pfn_flags_mask to 0 and it is working fine.
> > > 
> > > Sorry, I ment calling hmm_range_register during fault processing.
> > > 
> > > If your driver works around user space objects that cover a VA then
> > > the range should be created when the object is created.
> > > 
> > 
> > Oh ok. No, there is no user space object here.
> > Binding the user space object to device page table is handled in
> > patch 03 of this series (no HMM there).
> > This is for binding a system allocated (malloc) memory. User calls
> > the bind ioctl with the VA range.
> > 
> > > > > > +   /*
> > > > > > +* No needd to dma map the host pages and later unmap it, as
> > > > > > +* GPU is not allowed to access it with SVM. Hence, no need
> > > > > > +* of any intermediate data strucutre to hold the mappings.
> > > > > > +*/
> > > > > > +   for (i = 0; i < npages; i++) {
> > > > > > +   u64 addr = range->pfns[i] & ~((1UL << range->pfn_shift) 
> > > > > > - 1);
> > > > > > +
> > > > > > +   if (sg && (addr == (sg_dma_address(sg) + sg->length))) {
> > > > > > +   sg->length += PAGE_SIZE;
> > > > > > +   sg_dma_len(sg) += PAGE_SIZE;
> > > > > > +   continue;
> > > > > > +   }
> > > > > > +
> > > > > > +   if (sg)
> > > > > > +   sg_page_sizes |= sg->length;
> > > > > > +
> > > > > > +   sg =  sg ? __sg_next(sg) : st->sgl;
> > > > > > +   sg_dma_address(sg) = addr;
> > > > > > +   sg_dma_len(sg) = PAGE_SIZE;
> > > > > > +   sg->length = PAGE_SIZE;
> > > > > > +   st->nents++;
> > > > >
> > > > > It is odd to build the range into a sgl.
> > > > >
> > > > > IMHO it is not a good idea to use the sg_dma_address like this, that
> > > > > should only be filled in by a dma map. Where does it end up being
> > > > > used?
> > > > 
> > > > The sgl is used to plug into the page table update function in i915.
> > > > 
> > > > For the device memory in discrete card, we don't need dma map which
> > > > is the case here.
> > > 
> > > How did we get to device memory on a card? Isn't range->pfns a CPU PFN
> > > at this point?
> > > 
> > > I'm confused.
> > 
> > Device memory plugin is done through devm_memremap_pages() in patch 07 of
> > this series. In that patch, we convert the CPU PFN to device PFN before
> > building the sgl (this is similar to the nouveau driver).
> 
> But earlier just called hmm_range_fault(), it can return all kinds of
> pages. If these are only allowed to be device pages here then that
> must be checked (under lock)
> 
> And putting the cpu PFN of a ZONE_DEVICE device page into
> sg_dma_address still looks very wrong to me

Yeah, nouveau has a different code path, but that is because the
nouveau driver architecture allows it; I do not see any easy way to
hammer this into the current i915 architecture. I will ponder this a
bit more.

Cheers,
Jérôme


Re: [RFC 06/13] drm/i915/svm: Page table mirroring support

2019-11-25 Thread Jerome Glisse
On Fri, Nov 22, 2019 at 11:33:12PM +, Jason Gunthorpe wrote:
> On Fri, Nov 22, 2019 at 12:57:27PM -0800, Niranjana Vishwanathapura wrote:

[...]

> > +static int
> > +i915_range_fault(struct i915_svm *svm, struct hmm_range *range)
> > +{
> > +   long ret;
> > +
> > +   range->default_flags = 0;
> > +   range->pfn_flags_mask = -1UL;
> > +
> > +   ret = hmm_range_register(range, &svm->mirror);
> > +   if (ret) {
> > +   up_read(&svm->mm->mmap_sem);
> > +   return (int)ret;
> > +   }
> 
> 
> Using a temporary range is the pattern from nouveau, is it really
> necessary in this driver?

Just to comment on this: for GPUs the usage model is not the
application registering the range of virtual addresses it wants to use.
It is the GPU being able to access _any_ valid CPU address just like
the CPU would (modulo mmap of the device file).

This is because the API you want in userspace is the application
passing a random pointer to the GPU, with the GPU able to chase down
any kind of random pointer chain (assuming all pointers are valid, i.e.
pointing to valid virtual addresses for the process).

This is unlike the RDMA case.


That being said, for best performance we still expect a well-behaved
application to provide hints to the kernel so that we know whether a
range of virtual addresses is likely to be used by the GPU or not. But
this is not, and should not be, a requirement.


I have posted patchsets and given talks about this, but long term I
believe we want a common API to manage hints provided by userspace (see
my talk at LPC this year about a new syscall to bind memory to a
device). With such a thing in place we could hang mmu notifier ranges
off it. But the driver will still need to be able to handle the case
where there is no hint provided by userspace and thus no prior
knowledge of which VAs might be accessed.
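
For illustration, below is a minimal sketch of that fallback fault
path, written against the interval-notifier based hmm_range_fault()
interface as it ended up in later kernels (this thread predates its
final form, so treat the exact calls as indicative). struct my_svm and
my_gpu_write_pte() are hypothetical; svm->notifier is assumed to be an
already-inserted mmu_interval_notifier covering the faulting VA, and
svm->pt_lock is the lock also taken by its invalidate() callback:

struct my_svm {                         /* hypothetical driver state */
        struct mmu_interval_notifier notifier;
        struct mutex pt_lock;
};

/* Hypothetical: programs one GPU PTE from an HMM pfn entry. */
void my_gpu_write_pte(struct my_svm *svm, unsigned long addr,
                      unsigned long hmm_pfn);

static int my_svm_mirror_one_page(struct my_svm *svm, struct mm_struct *mm,
                                  unsigned long addr)
{
        unsigned long hmm_pfn;
        struct hmm_range range = {
                .notifier       = &svm->notifier,
                .start          = addr & PAGE_MASK,
                .end            = (addr & PAGE_MASK) + PAGE_SIZE,
                .hmm_pfns       = &hmm_pfn,
                .default_flags  = HMM_PFN_REQ_FAULT,
                .pfn_flags_mask = 0,
        };
        int ret;

again:
        range.notifier_seq = mmu_interval_read_begin(range.notifier);

        mmap_read_lock(mm);
        ret = hmm_range_fault(&range);
        mmap_read_unlock(mm);
        if (ret) {
                if (ret == -EBUSY)
                        goto again;     /* raced with an invalidation */
                return ret;
        }

        mutex_lock(&svm->pt_lock);
        if (mmu_interval_read_retry(range.notifier, range.notifier_seq)) {
                mutex_unlock(&svm->pt_lock);
                goto again;
        }
        /* hmm_pfn is stable under pt_lock: commit the GPU page table. */
        my_gpu_write_pte(svm, addr, hmm_pfn);
        mutex_unlock(&svm->pt_lock);

        return 0;
}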

Cheers,
Jérôme


Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-11-13 Thread Jerome Glisse
On Wed, Nov 13, 2019 at 02:00:06PM -0800, Dan Williams wrote:
> On Wed, Nov 13, 2019 at 11:23 AM Dan Williams  
> wrote:
> >
> > On Tue, Nov 12, 2019 at 8:27 PM John Hubbard  wrote:
> > >
> > > An upcoming patch changes and complicates the refcounting and
> > > especially the "put page" aspects of it. In order to keep
> > > everything clean, refactor the devmap page release routines:
> > >
> > > * Rename put_devmap_managed_page() to page_is_devmap_managed(),
> > >   and limit the functionality to "read only": return a bool,
> > >   with no side effects.
> > >
> > > * Add a new routine, put_devmap_managed_page(), to handle checking
> > >   what kind of page it is, and what kind of refcount handling it
> > >   requires.
> > >
> > > * Rename __put_devmap_managed_page() to free_devmap_managed_page(),
> > >   and limit the functionality to unconditionally freeing a devmap
> > >   page.
> > >
> > > This is originally based on a separate patch by Ira Weiny, which
> > > applied to an early version of the put_user_page() experiments.
> > > Since then, Jérôme Glisse suggested the refactoring described above.
> > >
> > > Suggested-by: Jérôme Glisse 
> > > Signed-off-by: Ira Weiny 
> > > Signed-off-by: John Hubbard 
> > > ---
> > >  include/linux/mm.h | 27 ---
> > >  mm/memremap.c  | 67 --
> > >  2 files changed, 53 insertions(+), 41 deletions(-)
> > >
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index a2adf95b3f9c..96228376139c 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -967,9 +967,10 @@ static inline bool is_zone_device_page(const struct 
> > > page *page)
> > >  #endif
> > >
> > >  #ifdef CONFIG_DEV_PAGEMAP_OPS
> > > -void __put_devmap_managed_page(struct page *page);
> > > +void free_devmap_managed_page(struct page *page);
> > >  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> > > -static inline bool put_devmap_managed_page(struct page *page)
> > > +
> > > +static inline bool page_is_devmap_managed(struct page *page)
> > >  {
> > > if (!static_branch_unlikely(&devmap_managed_key))
> > > return false;
> > > @@ -978,7 +979,6 @@ static inline bool put_devmap_managed_page(struct 
> > > page *page)
> > > switch (page->pgmap->type) {
> > > case MEMORY_DEVICE_PRIVATE:
> > > case MEMORY_DEVICE_FS_DAX:
> > > -   __put_devmap_managed_page(page);
> > > return true;
> > > default:
> > > break;
> > > @@ -986,6 +986,27 @@ static inline bool put_devmap_managed_page(struct 
> > > page *page)
> > > return false;
> > >  }
> > >
> > > +static inline bool put_devmap_managed_page(struct page *page)
> > > +{
> > > +   bool is_devmap = page_is_devmap_managed(page);
> > > +
> > > +   if (is_devmap) {
> > > +   int count = page_ref_dec_return(page);
> > > +
> > > +   /*
> > > +* devmap page refcounts are 1-based, rather than 
> > > 0-based: if
> > > +* refcount is 1, then the page is free and the refcount 
> > > is
> > > +* stable because nobody holds a reference on the page.
> > > +*/
> > > +   if (count == 1)
> > > +   free_devmap_managed_page(page);
> > > +   else if (!count)
> > > +   __put_page(page);
> > > +   }
> > > +
> > > +   return is_devmap;
> > > +}
> > > +
> > >  #else /* CONFIG_DEV_PAGEMAP_OPS */
> > >  static inline bool put_devmap_managed_page(struct page *page)
> > >  {
> > > diff --git a/mm/memremap.c b/mm/memremap.c
> > > index 03ccbdfeb697..bc7e2a27d025 100644
> > > --- a/mm/memremap.c
> > > +++ b/mm/memremap.c
> > > @@ -410,48 +410,39 @@ struct dev_pagemap *get_dev_pagemap(unsigned long 
> > > pfn,
> > >  EXPORT_SYMBOL_GPL(get_dev_pagemap);
> > >
> > >  #ifdef CONFIG_DEV_PAGEMAP_OPS
> > > -void __put_devmap_managed_page(struct page *page)
> > > +void free_devmap_managed_page(struct page *page)
> > >  {
> > > -   int count = page_ref_dec_return(page);
> > > +   /* Clear Active bit in case of parallel mark_page_accessed */
> > > +   __ClearPageActive(page);
> > > +   __ClearPageWaiters(page);
> > > +
> > > +   mem_cgroup_uncharge(page);
> >
> > Ugh, when did all this HMM specific manipulation sneak into the
> > generic ZONE_DEVICE path? It used to be gated by pgmap type with its
> > own put_zone_device_private_page(). For example it's certainly
> > unnecessary and might be broken (would need to check) to call
> > mem_cgroup_uncharge() on a DAX page. ZONE_DEVICE users are not a
> > monolith and the HMM use case leaks pages into code paths that DAX
> > explicitly avoids.
> 
> It's been this way for a while and I did not react previously,
> apologies for that. I think __ClearPageActive, __ClearPageWaiters, and
> mem_cgroup_uncharge, belong behind a device-private conditional. The
> history here is:
> 
> Move some, 

Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-11-13 Thread Jerome Glisse
On Tue, Nov 12, 2019 at 08:26:51PM -0800, John Hubbard wrote:
> An upcoming patch changes and complicates the refcounting and
> especially the "put page" aspects of it. In order to keep
> everything clean, refactor the devmap page release routines:
> 
> * Rename put_devmap_managed_page() to page_is_devmap_managed(),
>   and limit the functionality to "read only": return a bool,
>   with no side effects.
> 
> * Add a new routine, put_devmap_managed_page(), to handle checking
>   what kind of page it is, and what kind of refcount handling it
>   requires.
> 
> * Rename __put_devmap_managed_page() to free_devmap_managed_page(),
>   and limit the functionality to unconditionally freeing a devmap
>   page.
> 
> This is originally based on a separate patch by Ira Weiny, which
> applied to an early version of the put_user_page() experiments.
> Since then, Jérôme Glisse suggested the refactoring described above.
> 
> Suggested-by: Jérôme Glisse 
> Signed-off-by: Ira Weiny 
> Signed-off-by: John Hubbard 

Reviewed-by: Jérôme Glisse 

> ---
>  include/linux/mm.h | 27 ---
>  mm/memremap.c  | 67 --
>  2 files changed, 53 insertions(+), 41 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a2adf95b3f9c..96228376139c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -967,9 +967,10 @@ static inline bool is_zone_device_page(const struct page 
> *page)
>  #endif
>  
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page);
> +void free_devmap_managed_page(struct page *page);
>  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> -static inline bool put_devmap_managed_page(struct page *page)
> +
> +static inline bool page_is_devmap_managed(struct page *page)
>  {
>   if (!static_branch_unlikely(&devmap_managed_key))
>   return false;
> @@ -978,7 +979,6 @@ static inline bool put_devmap_managed_page(struct page 
> *page)
>   switch (page->pgmap->type) {
>   case MEMORY_DEVICE_PRIVATE:
>   case MEMORY_DEVICE_FS_DAX:
> - __put_devmap_managed_page(page);
>   return true;
>   default:
>   break;
> @@ -986,6 +986,27 @@ static inline bool put_devmap_managed_page(struct page 
> *page)
>   return false;
>  }
>  
> +static inline bool put_devmap_managed_page(struct page *page)
> +{
> + bool is_devmap = page_is_devmap_managed(page);
> +
> + if (is_devmap) {
> + int count = page_ref_dec_return(page);
> +
> + /*
> +  * devmap page refcounts are 1-based, rather than 0-based: if
> +  * refcount is 1, then the page is free and the refcount is
> +  * stable because nobody holds a reference on the page.
> +  */
> + if (count == 1)
> + free_devmap_managed_page(page);
> + else if (!count)
> + __put_page(page);
> + }
> +
> + return is_devmap;
> +}
> +
>  #else /* CONFIG_DEV_PAGEMAP_OPS */
>  static inline bool put_devmap_managed_page(struct page *page)
>  {
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 03ccbdfeb697..bc7e2a27d025 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -410,48 +410,39 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  EXPORT_SYMBOL_GPL(get_dev_pagemap);
>  
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page)
> +void free_devmap_managed_page(struct page *page)
>  {
> - int count = page_ref_dec_return(page);
> + /* Clear Active bit in case of parallel mark_page_accessed */
> + __ClearPageActive(page);
> + __ClearPageWaiters(page);
> +
> + mem_cgroup_uncharge(page);
>  
>   /*
> -  * If refcount is 1 then page is freed and refcount is stable as nobody
> -  * holds a reference on the page.
> +  * When a device_private page is freed, the page->mapping field
> +  * may still contain a (stale) mapping value. For example, the
> +  * lower bits of page->mapping may still identify the page as
> +  * an anonymous page. Ultimately, this entire field is just
> +  * stale and wrong, and it will cause errors if not cleared.
> +  * One example is:
> +  *
> +  *  migrate_vma_pages()
> +  *migrate_vma_insert_page()
> +  *  page_add_new_anon_rmap()
> +  *__page_set_anon_rmap()
> +  *  ...checks page->mapping, via PageAnon(page) call,
> +  *and incorrectly concludes that the page is an
> +  *anonymous page. Therefore, it incorrectly,
> +  *silently fails to set up the new anon rmap.
> +  *
> +  * For other types of ZONE_DEVICE pages, migration is either
> +  * handled differently or not done at all, so there is no need
> +  * to clear page->mapping.
>*/
> - if (count == 1) {
> - /* Clear Active bit in case of parallel mark_page_accessed */
> -  

Re: Proposal to report GPU private memory allocations with sysfs nodes [plain text version]

2019-11-12 Thread Jerome Glisse
On Tue, Nov 12, 2019 at 10:17:10AM -0800, Yiwei Zhang wrote:
> Hi folks,
> 
> What do you think about:
> > For the sysfs approach, I'm assuming the upstream vendors still need
> > to provide a pair of UMD and KMD, and this ioctl to label the BO is
> > kept as driver private ioctl. Then will each driver just define their
> > own set of "label"s and the KMD will only consume the corresponding
> > ones so that the sysfs nodes won't change at all? Report zero if
> > there's no allocation or re-use under a particular "label".

To me this looks like a way to abuse the kernel into providing a
specific message-passing API between processes, only for GPUs. It would
be better to use an existing kernel/userspace API to pass messages
between processes than to add a new one just for a special case.

Note that I believe listing GPU allocations for a process might be
useful, but only if it is a generic thing across all GPUs (for upstream
GPU drivers; we do not care about non-upstream ones).

Cheers,
Jérôme


Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier

2019-11-08 Thread Jerome Glisse
On Thu, Nov 07, 2019 at 10:33:02PM -0800, Christoph Hellwig wrote:
> On Thu, Nov 07, 2019 at 08:06:08PM +, Jason Gunthorpe wrote:
> > > 
> > > enum mmu_range_notifier_event {
> > >   MMU_NOTIFY_RELEASE,
> > > };
> > > 
> > > ...assuming that we stay with "mmu_range_notifier" as a core name for 
> > > this 
> > > whole thing.
> > > 
> > > Also, it is best moved down to be next to the new MNR structs, so that 
> > > all the
> > > MNR stuff is in one group.
> > 
> > I agree with Jerome, this enum is part of the 'struct
> > mmu_notifier_range' (ie the description of the invalidation) and it
> > doesn't really matter that only these new notifiers can be called with
> > this type, it is still part of the mmu_notifier_range.
> > 
> > The comment already says it only applies to the mmu_range_notifier
> > scheme..
> 
> In fact the enum is entirely unused.  We might as well just kill it off
> entirely.

I had patches to use it; I need to re-post them. I posted them long
ago and dropped the ball. I will re-spin after this.

Cheers,
Jérôme


Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier

2019-11-07 Thread Jerome Glisse
On Fri, Nov 08, 2019 at 12:32:25AM +, Jason Gunthorpe wrote:
> On Thu, Nov 07, 2019 at 04:04:08PM -0500, Jerome Glisse wrote:
> > On Thu, Nov 07, 2019 at 08:11:06PM +, Jason Gunthorpe wrote:
> > > On Wed, Nov 06, 2019 at 09:08:07PM -0500, Jerome Glisse wrote:
> > > 
> > > > > 
> > > > > Extra credit: IMHO, this clearly deserves to all be in a new 
> > > > > mmu_range_notifier.h
> > > > > header file, but I know that's extra work. Maybe later as a follow-up 
> > > > > patch,
> > > > > if anyone has the time.
> > > > 
> > > > The range notifier should get the event too, it would be a waste, i 
> > > > think it is
> > > > an oversight here. The release event is fine so NAK to you separate 
> > > > event. Event
> > > > is really an helper for notifier i had a set of patch for nouveau to 
> > > > leverage
> > > > this i need to resucite them. So no need to split thing, i would just 
> > > > forward
> > > > the event ie add event to mmu_range_notifier_ops.invalidate() i failed 
> > > > to catch
> > > > that in v1 sorry.
> > > 
> > > I think what you mean is already done?
> > > 
> > > struct mmu_range_notifier_ops {
> > >   bool (*invalidate)(struct mmu_range_notifier *mrn,
> > >  const struct mmu_notifier_range *range,
> > >  unsigned long cur_seq);
> > 
> > Yes it is sorry, i got confuse with mmu_range_notifier and 
> > mmu_notifier_range :)
> > It is almost a palyndrome structure ;)
> 
> Lets change the name then, this is clearly not working. I'll reflow
> everything tomorrow

Here is a semantic patch to do that. Run it from your Linux kernel
directory with your patch applied (you can run it one patch after
another and then git commit -a --fixup HEAD):

spatch --sp-file name-of-the-file-below --dir . --all-includes --in-place

%< --
@@
@@
struct
-mmu_range_notifier
+mmu_interval_notifier

@@
@@
struct
-mmu_range_notifier
+mmu_interval_notifier
{...};

// Change mrn name to mmu_in
@@
struct mmu_interval_notifier *mrn;
@@
-mrn
+mmu_in

@@
identifier fn;
@@
fn(..., 
-struct mmu_interval_notifier *mrn,
+struct mmu_interval_notifier *mmu_in,
...) {...}
-- >%

You need coccinelle (which provides spatch). It is untested, but it
should work. Also, I could not come up with a nicer name to replace
mrn, as "min" is way too confusing; if you have a better name, feel
free to use it.

Oh, and coccinelle is pretty clever about code formatting, so it should
do a good job of keeping things nicely formatted and aligned.
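
Purely as an illustration of what the semantic patch does (the
callback below is hypothetical, not taken from the series):

/* Before running the semantic patch: */
static bool drv_invalidate(struct mmu_range_notifier *mrn,
                           const struct mmu_notifier_range *range,
                           unsigned long cur_seq);

/* After: the struct is renamed and the "mrn" parameter becomes "mmu_in". */
static bool drv_invalidate(struct mmu_interval_notifier *mmu_in,
                           const struct mmu_notifier_range *range,
                           unsigned long cur_seq);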

Cheers,
Jérôme


Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier

2019-11-07 Thread Jerome Glisse
On Thu, Nov 07, 2019 at 08:11:06PM +, Jason Gunthorpe wrote:
> On Wed, Nov 06, 2019 at 09:08:07PM -0500, Jerome Glisse wrote:
> 
> > > 
> > > Extra credit: IMHO, this clearly deserves to all be in a new 
> > > mmu_range_notifier.h
> > > header file, but I know that's extra work. Maybe later as a follow-up 
> > > patch,
> > > if anyone has the time.
> > 
> > The range notifier should get the event too, it would be a waste, i think 
> > it is
> > an oversight here. The release event is fine so NAK to you separate event. 
> > Event
> > is really an helper for notifier i had a set of patch for nouveau to 
> > leverage
> > this i need to resucite them. So no need to split thing, i would just 
> > forward
> > the event ie add event to mmu_range_notifier_ops.invalidate() i failed to 
> > catch
> > that in v1 sorry.
> 
> I think what you mean is already done?
> 
> struct mmu_range_notifier_ops {
>   bool (*invalidate)(struct mmu_range_notifier *mrn,
>  const struct mmu_notifier_range *range,
>  unsigned long cur_seq);

Yes it is, sorry; I got confused between mmu_range_notifier and
mmu_notifier_range :) It is almost a palindrome ;)

> 
> > No it is always odd, you must call mmu_range_set_seq() only from the
> > op->invalidate_range() callback at which point the seq is odd. As well
> > when mrn is added and its seq first set it is set to an odd value
> > always. Maybe the comment, should read:
> > 
> >  * mrn->invalidate_seq is always, yes always, set to an odd value. This 
> > ensures
> > 
> > To stress that it is not an error.
> 
> I went with this:
> 
>   /*
>* mrn->invalidate_seq must always be set to an odd value via
>* mmu_range_set_seq() using the provided cur_seq from
>* mn_itree_inv_start_range(). This ensures that if seq does wrap we
>* will always clear the below sleep in some reasonable time as
>* mmn_mm->invalidate_seq is even in the idle state.
>*/

Yes fine with me.

[...]

> > > > +   might_lock(&mm->mmap_sem);
> > > > +
> > > > +   mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);
> > > 
> > > What does the above pair with? Should have a comment that specifies that.
> > 
> > It was discussed in v1 but maybe a comment of what was said back then would
> > be helpful. Something like:
> > 
> > /*
> >  * We need to insure that all writes to mm->mmu_notifier_mm are visible 
> > before
> >  * any checks we do on mmn_mm below as otherwise CPU might re-order write 
> > done
> >  * by another CPU core to mm->mmu_notifier_mm structure fields after the 
> > read
> >  * belows.
> >  */
> 
> This comment made it, just at the store side:
> 
>   /*
>* Serialize the update against mmu_notifier_unregister. A
>* side note: mmu_notifier_release can't run concurrently with
>* us because we hold the mm_users pin (either implicitly as
>* current->mm or explicitly with get_task_mm() or similar).
>* We can't race against any other mmu notifier method either
>* thanks to mm_take_all_locks().
>*
>* release semantics on the initialization of the mmu_notifier_mm's
>  * contents are provided for unlocked readers.  acquire can only be
>  * used while holding the mmgrab or mmget, and is safe because once
>  * created the mmu_notififer_mm is not freed until the mm is
>  * destroyed.  As above, users holding the mmap_sem or one of the
>  * mm_take_all_locks() do not need to use acquire semantics.
>*/
>   if (mmu_notifier_mm)
>   smp_store_release(&mm->mmu_notifier_mm, mmu_notifier_mm);
> 
> Which I think is really overly belaboring the typical smp
> store/release pattern, but people do seem unfamiliar with them...

Perfect with me. I think people also sometimes forget what the memory
model is and thus what the store/release pattern does; I know I do, and
I need to refresh my memory.
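
As a quick refresher on that pattern (a generic illustration, not the
actual mmu_notifier code): the writer fully initializes the object and
only then publishes the pointer with release semantics, and the
unlocked reader pairs that with an acquire load before dereferencing:

struct foo {
        int data;
};

static struct foo *shared;

static void publisher(void)
{
        struct foo *f = kmalloc(sizeof(*f), GFP_KERNEL);

        if (!f)
                return;
        f->data = 42;                   /* initialize everything first */
        smp_store_release(&shared, f);  /* then publish the pointer */
}

static int reader(void)
{
        struct foo *f = smp_load_acquire(&shared);      /* pairs with release */

        /* If f is visible, f->data == 42 is guaranteed to be visible too. */
        return f ? f->data : -1;
}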

Cheers,
Jérôme


Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier

2019-11-06 Thread Jerome Glisse
On Wed, Nov 06, 2019 at 04:23:21PM -0800, John Hubbard wrote:
> On 10/28/19 1:10 PM, Jason Gunthorpe wrote:

[...]

> >  /**
> >   * enum mmu_notifier_event - reason for the mmu notifier callback
> > @@ -32,6 +34,9 @@ struct mmu_notifier_range;
> >   * access flags). User should soft dirty the page in the end callback to 
> > make
> >   * sure that anyone relying on soft dirtyness catch pages that might be 
> > written
> >   * through non CPU mappings.
> > + *
> > + * @MMU_NOTIFY_RELEASE: used during mmu_range_notifier invalidate to 
> > signal that
> > + * the mm refcount is zero and the range is no longer accessible.
> >   */
> >  enum mmu_notifier_event {
> > MMU_NOTIFY_UNMAP = 0,
> > @@ -39,6 +44,7 @@ enum mmu_notifier_event {
> > MMU_NOTIFY_PROTECTION_VMA,
> > MMU_NOTIFY_PROTECTION_PAGE,
> > MMU_NOTIFY_SOFT_DIRTY,
> > +   MMU_NOTIFY_RELEASE,
> >  };
> 
> 
> OK, let the naming debates begin! ha. Anyway, after careful study of the 
> overall
> patch, and some browsing of the larger patchset, it's clear that:
> 
> * The new "MMU range notifier" that you've created is, approximately, a new
> object. It uses classic mmu notifiers inside, as an implementation detail, and
> it does *similar* things (notifications) as mmn's. But it's certainly not the 
> same
> as mmn's, as shown later when you say the need to an entirely new ops struct, 
> and 
> data struct too.
> 
> Therefore, you need a separate events enum as well. This is important. MMN's
> won't be issuing MMN_NOTIFY_RELEASE events, nor will MNR's be issuing the 
> first
> four prexisting MMU_NOTIFY_* items. So it would be a design mistake to glom 
> them
> together, unless you ultimately decided to merge these MMN and MNR objects 
> (which
> I don't really see any intention of, and that's fine).
> 
> So this should read:
> 
> enum mmu_range_notifier_event {
>   MMU_NOTIFY_RELEASE,
> };
> 
> ...assuming that we stay with "mmu_range_notifier" as a core name for this 
> whole thing.
> 
> Also, it is best moved down to be next to the new MNR structs, so that all the
> MNR stuff is in one group.
> 
> Extra credit: IMHO, this clearly deserves to all be in a new 
> mmu_range_notifier.h
> header file, but I know that's extra work. Maybe later as a follow-up patch,
> if anyone has the time.

The range notifier should get the event too; it would be a waste
otherwise, and I think it is an oversight here. The release event is
fine, so NAK to your separate event. The event is really a helper for
the notifier; I had a set of patches for nouveau to leverage this that
I need to resuscitate. So no need to split things; I would just forward
the event, i.e. add the event to mmu_range_notifier_ops.invalidate(). I
failed to catch that in v1, sorry.
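
Since the struct mmu_notifier_range handed to invalidate() already
carries the event (as pointed out in the follow-up discussion),
forwarding it amounts to the driver simply reading range->event. A
rough sketch, using the names from the series under review
(mmu_range_notifier, mmu_range_set_seq, MMU_NOTIFY_RELEASE) and
hypothetical my_mirror_* driver helpers:

struct my_mirror {                      /* hypothetical driver state */
        struct mmu_range_notifier notifier;
        struct mutex lock;
};

/* Hypothetical helpers that update the device page table / backing. */
void my_mirror_unmap(struct my_mirror *m, unsigned long start,
                     unsigned long end);
void my_mirror_free_backing(struct my_mirror *m, unsigned long start,
                            unsigned long end);

static bool my_mirror_invalidate(struct mmu_range_notifier *mrn,
                                 const struct mmu_notifier_range *range,
                                 unsigned long cur_seq)
{
        struct my_mirror *m = container_of(mrn, struct my_mirror, notifier);

        if (!mmu_notifier_range_blockable(range))
                return false;

        mutex_lock(&m->lock);
        mmu_range_set_seq(mrn, cur_seq);

        switch (range->event) {
        case MMU_NOTIFY_RELEASE:
                /* The mm is going away: drop device-side backing too. */
                my_mirror_free_backing(m, range->start, range->end);
                break;
        default:
                /* Ordinary invalidation: just tear down the device mapping. */
                my_mirror_unmap(m, range->start, range->end);
                break;
        }

        mutex_unlock(&m->lock);
        return true;
}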


[...]

> > +struct mmu_range_notifier_ops {
> > +   bool (*invalidate)(struct mmu_range_notifier *mrn,
> > +  const struct mmu_notifier_range *range,
> > +  unsigned long cur_seq);
> > +};
> > +
> > +struct mmu_range_notifier {
> > +   struct interval_tree_node interval_tree;
> > +   const struct mmu_range_notifier_ops *ops;
> > +   struct hlist_node deferred_item;
> > +   unsigned long invalidate_seq;
> > +   struct mm_struct *mm;
> > +};
> > +
> 
> Again, now we have the new struct mmu_range_notifier, and the old 
> struct mmu_notifier_range, and it's not good.
> 
> Ideas:
> 
> a) Live with it.
> 
> b) (Discarded, too many callers): rename old one. Nope.
> 
> c) Rename new one. Ideas:
> 
> struct mmu_interval_notifier
> struct mmu_range_intersection
> ...other ideas?

I vote for interval_notifier; we do want notifier in the name, but I
am also fine with the current name.

[...]

> > + *
> > + * Note that the core mm creates nested invalidate_range_start()/end() 
> > regions
> > + * within the same thread, and runs invalidate_range_start()/end() in 
> > parallel
> > + * on multiple CPUs. This is designed to not reduce concurrency or block
> > + * progress on the mm side.
> > + *
> > + * As a secondary function, holding the full write side also serves to 
> > prevent
> > + * writers for the itree, this is an optimization to avoid extra locking
> > + * during invalidate_range_start/end notifiers.
> > + *
> > + * The write side has two states, fully excluded:
> > + *  - mm->active_invalidate_ranges != 0
> > + *  - mnn->invalidate_seq & 1 == True
> > + *  - some range on the mm_struct is being invalidated
> > + *  - the itree is not allowed to change
> > + *
> > + * And partially excluded:
> > + *  - mm->active_invalidate_ranges != 0
> 
> I assume this implies mnn->invalidate_seq & 1 == False in this case? If so,
> let's say so. I'm probably getting that wrong, too.

Yes (mnn->invalidate_seq & 1) == 0

> 
> > + *  - some range on the mm_struct is being invalidated
> > + *  - the itree is allowed to change
> > + *
> > + * The later state avoids some expensive work on inv_end in the common 
> > case of
> > + * no mrn monitoring the VA.
> > + */
> > +static bool mn_itree_is_invalidating(struct mmu_notifier_mm

Re: [PATCH v2 12/18] mm/gup: track FOLL_PIN pages

2019-11-04 Thread Jerome Glisse
On Mon, Nov 04, 2019 at 02:49:18PM -0800, John Hubbard wrote:
> On 11/4/19 10:52 AM, Jerome Glisse wrote:
> > On Sun, Nov 03, 2019 at 01:18:07PM -0800, John Hubbard wrote:
> >> Add tracking of pages that were pinned via FOLL_PIN.
> >>
> >> As mentioned in the FOLL_PIN documentation, callers who effectively set
> >> FOLL_PIN are required to ultimately free such pages via put_user_page().
> >> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> >> for DIO and/or RDMA use".
> >>
> >> Pages that have been pinned via FOLL_PIN are identifiable via a
> >> new function call:
> >>
> >>bool page_dma_pinned(struct page *page);
> >>
> >> What to do in response to encountering such a page, is left to later
> >> patchsets. There is discussion about this in [1].
> >>
> >> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> >>
> >> This also has a couple of trivial, non-functional change fixes to
> >> try_get_compound_head(). That function got moved to the top of the
> >> file.
> > 
> > Maybe split that as a separate trivial patch.
> 
> 
> Will do.
> 
> 
> > 
> >>
> >> This includes the following fix from Ira Weiny:
> >>
> >> DAX requires detection of a page crossing to a ref count of 1.  Fix this
> >> for GUP pages by introducing put_devmap_managed_user_page() which
> >> accounts for GUP_PIN_COUNTING_BIAS now used by GUP.
> > 
> > Please do the put_devmap_managed_page() changes in a separate
> > patch, it would be a lot easier to follow, also on that front
> > see comments below.
> 
> 
> Oh! OK. It makes sense when you say it out loud. :)
> 
> 
> ...
> >> +static inline bool put_devmap_managed_page(struct page *page)
> >> +{
> >> +  bool is_devmap = page_is_devmap_managed(page);
> >> +
> >> +  if (is_devmap) {
> >> +  int count = page_ref_dec_return(page);
> >> +
> >> +  __put_devmap_managed_page(page, count);
> >> +  }
> >> +
> >> +  return is_devmap;
> >> +}
> > 
> > I think the __put_devmap_managed_page() should be rename
> > to free_devmap_managed_page() and that the count != 1
> > case move to this inline function ie:
> > 
> > static inline bool put_devmap_managed_page(struct page *page)
> > {
> > bool is_devmap = page_is_devmap_managed(page);
> > 
> > if (is_devmap) {
> > int count = page_ref_dec_return(page);
> > 
> > /*
> >  * If refcount is 1 then page is freed and refcount is stable 
> > as nobody
> >  * holds a reference on the page.
> >  */
> > if (count == 1)
> > free_devmap_managed_page(page, count);
> > else if (!count)
> > __put_page(page);
> > }
> > 
> > return is_devmap;
> > }
> > 
> 
> Thanks, that does look cleaner and easier to read.
> 
> > 
> >> +
> >>  #else /* CONFIG_DEV_PAGEMAP_OPS */
> >>  static inline bool put_devmap_managed_page(struct page *page)
> >>  {
> >> @@ -1038,6 +1051,8 @@ static inline __must_check bool try_get_page(struct 
> >> page *page)
> >>return true;
> >>  }
> >>  
> >> +__must_check bool user_page_ref_inc(struct page *page);
> >> +
> > 
> > What about having it as an inline here as it is pretty small.
> 
> 
> You mean move it to a static inline function in mm.h? It's worse than it 
> looks, though: *everything* that it calls is also a static function, local
> to gup.c. So I'd have to expose both try_get_compound_head() and
> __update_proc_vmstat(). And that also means calling mod_node_page_state() from
> mm.h, and it goes south right about there. :)

Ok fair enough

> ...  
> >> +/**
> >> + * page_dma_pinned() - report if a page is pinned by a call to 
> >> pin_user_pages*()
> >> + * or pin_longterm_pages*()
> >> + * @page: pointer to page to be queried.
> >> + * @Return:   True, if it is likely that the page has been 
> >> "dma-pinned".
> >> + *False, if the page is definitely not dma-pinned.
> >> + */
> > 
> > Maybe add a small comment about wrap around :)
> 
> 
> I don't *think* the count can wrap around, due to the checks in 
> user_page_ref_inc().
> 
> But it's true th

Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jerome Glisse
On Mon, Nov 04, 2019 at 12:33:09PM -0800, David Rientjes wrote:
> 
> 
> On Sun, 3 Nov 2019, John Hubbard wrote:
> 
> > Introduce pin_user_pages*() variations of get_user_pages*() calls,
> > and also pin_longterm_pages*() variations.
> > 
> > These variants all set FOLL_PIN, which is also introduced, and
> > thoroughly documented.
> > 
> > The pin_longterm*() variants also set FOLL_LONGTERM, in addition
> > to FOLL_PIN:
> > 
> > pin_user_pages()
> > pin_user_pages_remote()
> > pin_user_pages_fast()
> > 
> > pin_longterm_pages()
> > pin_longterm_pages_remote()
> > pin_longterm_pages_fast()
> > 
> > All pages that are pinned via the above calls, must be unpinned via
> > put_user_page().
> > 
> 
> Hi John,
> 
> I'm curious what consideration is given to what pageblock migrate types 
> that FOLL_PIN and FOLL_LONGTERM pages originate from, assuming that 
> longterm would want to originate from MIGRATE_UNMOVABLE pageblocks for the 
> purposes of anti-fragmentation?

We do not control the pageblock; GUP can happen on _any_ page that is
mapped inside a process (an anonymous private VMA or a regular
file-backed one).

Cheers,
Jérôme


Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jerome Glisse
On Mon, Nov 04, 2019 at 12:09:05PM -0800, John Hubbard wrote:
> Jason, a question for you at the bottom.
> 
> On 11/4/19 11:52 AM, Jerome Glisse wrote:
> ...
> >> CASE 3: ODP
> >> ---
> >> RDMA hardware with page faulting support. Here, a well-written driver 
> >> doesn't
> > 
> > CASE3: Hardware with page fault support
> > ---
> > 
> > Here, a well-written 
> > 
> 
> Ah, OK. So just drop the first sentence, yes.
> 
> ...
> >>>>>> +   */
> >>>>>> +  gup_flags |= FOLL_REMOTE | FOLL_PIN;
> >>>>>
> >>>>> Wouldn't it be better to not add pin_longterm_pages_remote() until
> >>>>> it can be properly implemented ?
> >>>>>
> >>>>
> >>>> Well, the problem is that I need each call site that requires FOLL_PIN
> >>>> to use a proper wrapper. It's the FOLL_PIN that is the focus here, 
> >>>> because
> >>>> there is a hard, bright rule, which is: if and only if a caller sets
> >>>> FOLL_PIN, then the dma-page tracking happens, and put_user_page() must
> >>>> be called.
> >>>>
> >>>> So this leaves me with only two reasonable choices:
> >>>>
> >>>> a) Convert the call site as above: pin_longterm_pages_remote(), which 
> >>>> sets
> >>>> FOLL_PIN (the key point!), and leaves the FOLL_LONGTERM situation exactly
> >>>> as it has been so far. When the FOLL_LONGTERM situation is fixed, the 
> >>>> call
> >>>> site *might* not need any changes to adopt the working gup.c code.
> >>>>
> >>>> b) Convert the call site to pin_user_pages_remote(), which also sets
> >>>> FOLL_PIN, and also leaves the FOLL_LONGTERM situation exactly as before.
> >>>> There would also be a comment at the call site, to the effect of, "this
> >>>> is the wrong call to make: it really requires FOLL_LONGTERM behavior".
> >>>>
> >>>> When the FOLL_LONGTERM situation is fixed, the call site will need to be
> >>>> changed to pin_longterm_pages_remote().
> >>>>
> >>>> So you can probably see why I picked (a).
> >>>
> >>> But right now nobody has FOLL_LONGTERM and FOLL_REMOTE. So you should
> >>> never have the need for pin_longterm_pages_remote(). My fear is that
> >>> longterm has implication and it would be better to not drop this 
> >>> implication
> >>> by adding a wrapper that does not do what the name says.
> >>>
> >>> So do not introduce pin_longterm_pages_remote() until its first user
> >>> happens. This is option c)
> >>>
> >>
> >> Almost forgot, though: there is already another user: Infiniband:
> >>
> >> drivers/infiniband/core/umem_odp.c:646: npages = 
> >> pin_longterm_pages_remote(owning_process, owning_mm,
> > 
> > odp do not need that, i thought the HMM convertion was already upstream
> > but seems not, in any case odp do not need the longterm case it only
> > so best is to revert that user to gup_fast or something until it get
> > converted to HMM.
> > 
> 
> Note for Jason: the (a) or (b) items are talking about the vfio case, which is
> one of the two call sites that now use pin_longterm_pages_remote(), and the
> other one is infiniband:
> 
> drivers/infiniband/core/umem_odp.c:646: npages = 
> pin_longterm_pages_remote(owning_process, owning_mm,
> drivers/vfio/vfio_iommu_type1.c:353:ret = 
> pin_longterm_pages_remote(NULL, mm, vaddr, 1,

vfio should be reverted until it can be properly implemented. The
issue is that when you fix the implementation you might break existing
vfio users and thus regress the kernel from the user's point of view.
So I would rather have the change to vfio reverted; I believe it was
not well understood when it went upstream. By the way, in my 5.4 tree
it is still gup_remote, not longterm.


> Jerome, Jason: I really don't want to revert the put_page() to 
> put_user_page() 
> conversions that are already throughout the IB driver--pointless churn, right?
> I'd rather either delete them in Jason's tree, or go with what I have here
> while waiting for the deletion.
> 
> Maybe we should just settle on (a) or (b), so that the IB driver ends up with
> the wrapper functions? In fact, if it's getting deleted, then I'd prefer 
> leaving
> it at (a), since that's simple...
> 
> Jason should weigh in on how he wants this to go, with respect to branching
> and merging, since it sounds like that will conflict with the hmm branch 
> (ha, I'm overdue in reviewing his mmu notifier series, that's what I get for
> being late).
> 
> thanks,
> 
> John Hubbard
> NVIDIA


Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jerome Glisse
On Mon, Nov 04, 2019 at 11:30:32AM -0800, John Hubbard wrote:
> On 11/4/19 11:18 AM, Jerome Glisse wrote:
> > On Mon, Nov 04, 2019 at 11:04:38AM -0800, John Hubbard wrote:
> >> On 11/4/19 9:33 AM, Jerome Glisse wrote:
> >> ...
> >>>
> >>> Few nitpick belows, nonetheless:
> >>>
> >>> Reviewed-by: Jérôme Glisse 
> >>> [...]
> >>>> +
> >>>> +CASE 3: ODP
> >>>> +---
> >>>> +(Mellanox/Infiniband On Demand Paging: the hardware supports
> >>>> +replayable page faulting). There are GUP references to pages serving as 
> >>>> DMA
> >>>> +buffers. For ODP, MMU notifiers are used to synchronize with 
> >>>> page_mkclean()
> >>>> +and munmap(). Therefore, normal GUP calls are sufficient, so neither 
> >>>> flag
> >>>> +needs to be set.
> >>>
> >>> I would not include ODP or anything like it here, they do not use
> >>> GUP anymore and i believe it is more confusing here. I would how-
> >>> ever include some text in this documentation explaining that hard-
> >>> ware that support page fault is superior as it does not incur any
> >>> of the issues described here.
> >>
> >> OK, agreed, here's a new write up that I'll put in v3:
> >>
> >>
> >> CASE 3: ODP
> >> ---
> > 
> > ODP is RDMA, maybe Hardware with page fault support instead
> > 
> >> Advanced, but non-CPU (DMA) hardware that supports replayable page faults.
> 
> OK, so:
> 
> "RDMA hardware with page faulting support."
> 
> for the first sentence.

I would drop RDMA completely; RDMA is just one example. There are
GPUs, FPGAs and others in that category. See below.

> 
> 
> >> Here, a well-written driver doesn't normally need to pin pages at all. 
> >> However,
> >> if the driver does choose to do so, it can register MMU notifiers for the 
> >> range,
> >> and will be called back upon invalidation. Either way (avoiding page 
> >> pinning, or
> >> using MMU notifiers to unpin upon request), there is proper 
> >> synchronization with 
> >> both filesystem and mm (page_mkclean(), munmap(), etc).
> >>
> >> Therefore, neither flag needs to be set.
> > 
> > In fact GUP should never be use with those.
> 
> 
> Yes. The next paragraph says that, but maybe not strong enough.
> 
> 
> >>
> >> It's worth mentioning here that pinning pages should not be the first 
> >> design
> >> choice. If page fault capable hardware is available, then the software 
> >> should
> >> be written so that it does not pin pages. This allows mm and filesystems to
> >> operate more efficiently and reliably.
> 
> Here's what we have after the above changes:
> 
> CASE 3: ODP
> ---
> RDMA hardware with page faulting support. Here, a well-written driver doesn't

CASE3: Hardware with page fault support
---

Here, a well-written 


> normally need to pin pages at all. However, if the driver does choose to do 
> so,
> it can register MMU notifiers for the range, and will be called back upon
> invalidation. Either way (avoiding page pinning, or using MMU notifiers to 
> unpin
> upon request), there is proper synchronization with both filesystem and mm
> (page_mkclean(), munmap(), etc).
> 
> Therefore, neither flag needs to be set.
> 
> In this case, ideally, neither get_user_pages() nor pin_user_pages() should 
> be 
> called. Instead, the software should be written so that it does not pin 
> pages. 
> This allows mm and filesystems to operate more efficiently and reliably.
> 
> >>> [...]
> >>>
> >>>> @@ -1014,7 +1018,16 @@ static __always_inline long 
> >>>> __get_user_pages_locked(struct task_struct *tsk,
> >>>>  BUG_ON(*locked != 1);
> >>>>  }
> >>>>  
> >>>> -if (pages)
> >>>> +/*
> >>>> + * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional 
> >>>> behavior
> >>>> + * is to set FOLL_GET if the caller wants pages[] filled in 
> >>>> (but has
> >>>> + * carelessly failed to specify FOLL_GET), so keep doing that, 
> >>>> but only
> >>>> + * for FOLL_GET, not for the newer FOLL_PIN.
> >>>> + *

Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jerome Glisse
On Mon, Nov 04, 2019 at 11:04:38AM -0800, John Hubbard wrote:
> On 11/4/19 9:33 AM, Jerome Glisse wrote:
> ...
> > 
> > Few nitpick belows, nonetheless:
> > 
> > Reviewed-by: Jérôme Glisse 
> > [...]
> >> +
> >> +CASE 3: ODP
> >> +---
> >> +(Mellanox/Infiniband On Demand Paging: the hardware supports
> >> +replayable page faulting). There are GUP references to pages serving as 
> >> DMA
> >> +buffers. For ODP, MMU notifiers are used to synchronize with 
> >> page_mkclean()
> >> +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
> >> +needs to be set.
> > 
> > I would not include ODP or anything like it here, they do not use
> > GUP anymore and i believe it is more confusing here. I would how-
> > ever include some text in this documentation explaining that hard-
> > ware that support page fault is superior as it does not incur any
> > of the issues described here.
> 
> OK, agreed, here's a new write up that I'll put in v3:
> 
> 
> CASE 3: ODP
> ---

ODP is RDMA; maybe "Hardware with page fault support" instead.

> Advanced, but non-CPU (DMA) hardware that supports replayable page faults.
> Here, a well-written driver doesn't normally need to pin pages at all. 
> However,
> if the driver does choose to do so, it can register MMU notifiers for the 
> range,
> and will be called back upon invalidation. Either way (avoiding page pinning, 
> or
> using MMU notifiers to unpin upon request), there is proper synchronization 
> with 
> both filesystem and mm (page_mkclean(), munmap(), etc).
> 
> Therefore, neither flag needs to be set.

In fact GUP should never be used with those.

> 
> It's worth mentioning here that pinning pages should not be the first design
> choice. If page fault capable hardware is available, then the software should
> be written so that it does not pin pages. This allows mm and filesystems to
> operate more efficiently and reliably.
> 
> > [...]
> > 
> >> diff --git a/mm/gup.c b/mm/gup.c
> >> index 199da99e8ffc..1aea48427879 100644
> >> --- a/mm/gup.c
> >> +++ b/mm/gup.c
> > 
> > [...]
> > 
> >> @@ -1014,7 +1018,16 @@ static __always_inline long 
> >> __get_user_pages_locked(struct task_struct *tsk,
> >>BUG_ON(*locked != 1);
> >>}
> >>  
> >> -  if (pages)
> >> +  /*
> >> +   * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
> >> +   * is to set FOLL_GET if the caller wants pages[] filled in (but has
> >> +   * carelessly failed to specify FOLL_GET), so keep doing that, but only
> >> +   * for FOLL_GET, not for the newer FOLL_PIN.
> >> +   *
> >> +   * FOLL_PIN always expects pages to be non-null, but no need to assert
> >> +   * that here, as any failures will be obvious enough.
> >> +   */
> >> +  if (pages && !(flags & FOLL_PIN))
> >>flags |= FOLL_GET;
> > 
> > Did you look at user that have pages and not FOLL_GET set ?
> > I believe it would be better to first fix them to end up
> > with FOLL_GET set and then error out if pages is != NULL but
> > nor FOLL_GET or FOLL_PIN is set.
> > 
> 
> I was perhaps overly cautious, and didn't go there. However, it's probably
> doable, given that there was already the following in __get_user_pages():
> 
> VM_BUG_ON(!!pages != !!(gup_flags & FOLL_GET));
> 
> ...which will have conditioned people and code to set FOLL_GET together with
> pages. So I agree that the time is right.
> 
> In order to make bisecting future failures simpler, I can insert a patch 
> right 
> before this one, that changes the FOLL_GET setting into an assert, like this:
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 8f236a335ae9..be338961e80d 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1014,8 +1014,8 @@ static __always_inline long 
> __get_user_pages_locked(struct task_struct *tsk,
> BUG_ON(*locked != 1);
> }
>  
> -   if (pages)
> -   flags |= FOLL_GET;
> +   if (pages && WARN_ON_ONCE(!(gup_flags & FOLL_GET)))
> +   return -EINVAL;
>  
> pages_done = 0;
> lock_dropped = false;
> 
> 
> ...and then add in FOLL_PIN, with this patch.

Looks good, but double-check that it should not happen; I will try to
check on my side too.

> 
> >>  
> >>pages_done = 0;
> > 
> >> @@ -2373,24 +2402,9 @@ static int __gup_longterm_unlocked(unsigned long 
> >&

Re: [PATCH v2 12/18] mm/gup: track FOLL_PIN pages

2019-11-04 Thread Jerome Glisse
On Sun, Nov 03, 2019 at 01:18:07PM -0800, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via put_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1].
> 
> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> This also has a couple of trivial, non-functional change fixes to
> try_get_compound_head(). That function got moved to the top of the
> file.

Maybe split that as a separate trivial patch.

> 
> This includes the following fix from Ira Weiny:
> 
> DAX requires detection of a page crossing to a ref count of 1.  Fix this
> for GUP pages by introducing put_devmap_managed_user_page() which
> accounts for GUP_PIN_COUNTING_BIAS now used by GUP.

Please do the put_devmap_managed_page() changes in a separate patch;
it would be a lot easier to follow. Also, on that front, see comments
below.

> 
> [1] https://lwn.net/Articles/784574/ "Some slow progress on
> get_user_pages()"
> 
> Suggested-by: Jan Kara 
> Suggested-by: Jérôme Glisse 
> Signed-off-by: Ira Weiny 
> Signed-off-by: John Hubbard 
> ---
>  include/linux/mm.h   |  80 +++
>  include/linux/mmzone.h   |   2 +
>  include/linux/page_ref.h |  10 ++
>  mm/gup.c | 213 +++
>  mm/huge_memory.c |  32 +-
>  mm/hugetlb.c |  28 -
>  mm/memremap.c|   4 +-
>  mm/vmstat.c  |   2 +
>  8 files changed, 300 insertions(+), 71 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index cdfb6fedb271..03b3600843b7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -972,9 +972,10 @@ static inline bool is_zone_device_page(const struct page 
> *page)
>  #endif
>  
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page);
> +void __put_devmap_managed_page(struct page *page, int count);
>  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> -static inline bool put_devmap_managed_page(struct page *page)
> +
> +static inline bool page_is_devmap_managed(struct page *page)
>  {
>   if (!static_branch_unlikely(&devmap_managed_key))
>   return false;
> @@ -983,7 +984,6 @@ static inline bool put_devmap_managed_page(struct page 
> *page)
>   switch (page->pgmap->type) {
>   case MEMORY_DEVICE_PRIVATE:
>   case MEMORY_DEVICE_FS_DAX:
> - __put_devmap_managed_page(page);
>   return true;
>   default:
>   break;
> @@ -991,6 +991,19 @@ static inline bool put_devmap_managed_page(struct page 
> *page)
>   return false;
>  }
>  
> +static inline bool put_devmap_managed_page(struct page *page)
> +{
> + bool is_devmap = page_is_devmap_managed(page);
> +
> + if (is_devmap) {
> + int count = page_ref_dec_return(page);
> +
> + __put_devmap_managed_page(page, count);
> + }
> +
> + return is_devmap;
> +}

I think __put_devmap_managed_page() should be renamed
to free_devmap_managed_page() and the count != 1 case
moved into this inline function, i.e.:

static inline bool put_devmap_managed_page(struct page *page)
{
	bool is_devmap = page_is_devmap_managed(page);

	if (is_devmap) {
		int count = page_ref_dec_return(page);

		/*
		 * If refcount is 1 then page is freed and refcount is stable
		 * as nobody holds a reference on the page.
		 */
		if (count == 1)
			free_devmap_managed_page(page, count);
		else if (!count)
			__put_page(page);
	}

	return is_devmap;
}


> +
>  #else /* CONFIG_DEV_PAGEMAP_OPS */
>  static inline bool put_devmap_managed_page(struct page *page)
>  {
> @@ -1038,6 +1051,8 @@ static inline __must_check bool try_get_page(struct 
> page *page)
>   return true;
>  }
>  
> +__must_check bool user_page_ref_inc(struct page *page);
> +

What about having it as an inline here, as it is pretty small?
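
Something along these lines is what I have in mind -- a rough, untested
sketch only, assuming the helper simply applies GUP_PIN_COUNTING_BIAS on
top of a try_get_page()-style sanity check (the real implementation in
this patch may well differ):

	static inline __must_check bool user_page_ref_inc(struct page *page)
	{
		/* refuse to pin a page whose refcount already dropped to zero */
		if (WARN_ON_ONCE(page_ref_count(page) <= 0))
			return false;
		/* one pin is accounted as GUP_PIN_COUNTING_BIAS references */
		page_ref_add(page, GUP_PIN_COUNTING_BIAS);
		return true;
	}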


>  static inline void put_page(struct page *page)
>  {
>   page = compound_head(page);
> @@ -1055,31 +1070,56 @@ static inline void put_page(struct page *page)
>   __put_page(page);
>  }
>  
> -/**
> - * put_user_page() - release a gup-pinned page
> - * @page:pointer to page to be released
> +/*
> + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
> + * the page's refcount so that two separate items are tracked: the original 
> page
> + * reference count, and also a new count of how many

Re: [PATCH v2 09/18] drm/via: set FOLL_PIN via pin_user_pages_fast()

2019-11-04 Thread Jerome Glisse
On Sun, Nov 03, 2019 at 01:18:04PM -0800, John Hubbard wrote:
> Convert drm/via to use the new pin_user_pages_fast() call, which sets
> FOLL_PIN. Setting FOLL_PIN is now required for code that requires
> tracking of pinned pages, and therefore for any code that calls
> put_user_page().
> 
> Reviewed-by: Ira Weiny 
> Signed-off-by: John Hubbard 

Please be more explicit that via_dmablit.c is already using put_user_page(),
as I am expecting that any conversion to pin_user_pages*() must be paired with
a put_user_page(). I find the above commit message a bit unclear from that POV.
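
To illustrate the pairing I have in mind (a minimal sketch, not the
actual via_dmablit.c code):

	/* pin the user pages for the DMA */
	ret = pin_user_pages_fast(addr, nr_pages, FOLL_WRITE, pages);
	if (ret <= 0)
		return ret;

	/* ... program and run the blit ... */

	/* and unpin every page that was successfully pinned */
	for (i = 0; i < ret; i++)
		put_user_page(pages[i]);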

Reviewed-by: Jérôme Glisse 


> ---
>  drivers/gpu/drm/via/via_dmablit.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/via/via_dmablit.c 
> b/drivers/gpu/drm/via/via_dmablit.c
> index 3db000aacd26..37c5e572993a 100644
> --- a/drivers/gpu/drm/via/via_dmablit.c
> +++ b/drivers/gpu/drm/via/via_dmablit.c
> @@ -239,7 +239,7 @@ via_lock_all_dma_pages(drm_via_sg_info_t *vsg,  
> drm_via_dmablit_t *xfer)
>   vsg->pages = vzalloc(array_size(sizeof(struct page *), vsg->num_pages));
>   if (NULL == vsg->pages)
>   return -ENOMEM;
> - ret = get_user_pages_fast((unsigned long)xfer->mem_addr,
> + ret = pin_user_pages_fast((unsigned long)xfer->mem_addr,
>   vsg->num_pages,
>   vsg->direction == DMA_FROM_DEVICE ? FOLL_WRITE : 0,
>   vsg->pages);
> -- 
> 2.23.0
> 


Re: [PATCH v2 08/18] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()

2019-11-04 Thread Jerome Glisse
On Sun, Nov 03, 2019 at 01:18:03PM -0800, John Hubbard wrote:
> Convert process_vm_access to use the new pin_user_pages_remote()
> call, which sets FOLL_PIN. Setting FOLL_PIN is now required for
> code that requires tracking of pinned pages.
> 
> Also, release the pages via put_user_page*().
> 
> Also, rename "pages" to "pinned_pages", as this makes for
> easier reading of process_vm_rw_single_vec().
> 
> Reviewed-by: Ira Weiny 
> Signed-off-by: John Hubbard 

Reviewed-by: Jérôme Glisse 

> ---
>  mm/process_vm_access.c | 28 +++-
>  1 file changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
> index 357aa7bef6c0..fd20ab675b85 100644
> --- a/mm/process_vm_access.c
> +++ b/mm/process_vm_access.c
> @@ -42,12 +42,11 @@ static int process_vm_rw_pages(struct page **pages,
>   if (copy > len)
>   copy = len;
>  
> - if (vm_write) {
> + if (vm_write)
>   copied = copy_page_from_iter(page, offset, copy, iter);
> - set_page_dirty_lock(page);
> - } else {
> + else
>   copied = copy_page_to_iter(page, offset, copy, iter);
> - }
> +
>   len -= copied;
>   if (copied < copy && iov_iter_count(iter))
>   return -EFAULT;
> @@ -96,7 +95,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
>   flags |= FOLL_WRITE;
>  
>   while (!rc && nr_pages && iov_iter_count(iter)) {
> - int pages = min(nr_pages, max_pages_per_loop);
> + int pinned_pages = min(nr_pages, max_pages_per_loop);
>   int locked = 1;
>   size_t bytes;
>  
> @@ -106,14 +105,15 @@ static int process_vm_rw_single_vec(unsigned long addr,
>* current/current->mm
>*/
>   down_read(&mm->mmap_sem);
> - pages = get_user_pages_remote(task, mm, pa, pages, flags,
> -   process_pages, NULL, &locked);
> + pinned_pages = pin_user_pages_remote(task, mm, pa, pinned_pages,
> +  flags, process_pages,
> +  NULL, &locked);
>   if (locked)
>   up_read(&mm->mmap_sem);
> - if (pages <= 0)
> + if (pinned_pages <= 0)
>   return -EFAULT;
>  
> - bytes = pages * PAGE_SIZE - start_offset;
> + bytes = pinned_pages * PAGE_SIZE - start_offset;
>   if (bytes > len)
>   bytes = len;
>  
> @@ -122,10 +122,12 @@ static int process_vm_rw_single_vec(unsigned long addr,
>vm_write);
>   len -= bytes;
>   start_offset = 0;
> - nr_pages -= pages;
> - pa += pages * PAGE_SIZE;
> - while (pages)
> - put_page(process_pages[--pages]);
> + nr_pages -= pinned_pages;
> + pa += pinned_pages * PAGE_SIZE;
> +
> + /* If vm_write is set, the pages need to be made dirty: */
> + put_user_pages_dirty_lock(process_pages, pinned_pages,
> +   vm_write);
>   }
>  
>   return rc;
> -- 
> 2.23.0
> 


Re: [PATCH v2 05/18] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-04 Thread Jerome Glisse
On Sun, Nov 03, 2019 at 01:18:00PM -0800, John Hubbard wrote:
> Introduce pin_user_pages*() variations of get_user_pages*() calls,
> and also pin_longterm_pages*() variations.
> 
> These variants all set FOLL_PIN, which is also introduced, and
> thoroughly documented.
> 
> The pin_longterm*() variants also set FOLL_LONGTERM, in addition
> to FOLL_PIN:
> 
> pin_user_pages()
> pin_user_pages_remote()
> pin_user_pages_fast()
> 
> pin_longterm_pages()
> pin_longterm_pages_remote()
> pin_longterm_pages_fast()
> 
> All pages that are pinned via the above calls, must be unpinned via
> put_user_page().
> 
> The underlying rules are:
> 
> * These are gup-internal flags, so the call sites should not directly
> set FOLL_PIN nor FOLL_LONGTERM. That behavior is enforced with
> assertions, for the new FOLL_PIN flag. However, for the pre-existing
> FOLL_LONGTERM flag, which has some call sites that still directly
> set FOLL_LONGTERM, there is no assertion yet.
> 
> * Call sites that want to indicate that they are going to do DirectIO
>   ("DIO") or something with similar characteristics, should call a
>   get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
>   will:
> * Start with "pin_user_pages" instead of "get_user_pages". That
>   makes it easy to find and audit the call sites.
> * Set FOLL_PIN
> 
> * For pages that are received via FOLL_PIN, those pages must be returned
>   via put_user_page().
> 
> Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
> in this documentation. (I've reworded it and expanded on it slightly.)
> 
> Cc: Jonathan Corbet 
> Cc: Ira Weiny 
> Signed-off-by: John Hubbard 

A few nitpicks below; nonetheless:

Reviewed-by: Jérôme Glisse 

> ---
>  Documentation/vm/index.rst  |   1 +
>  Documentation/vm/pin_user_pages.rst | 212 ++
>  include/linux/mm.h  |  62 ++-
>  mm/gup.c| 265 +---
>  4 files changed, 514 insertions(+), 26 deletions(-)
>  create mode 100644 Documentation/vm/pin_user_pages.rst
> 

[...]

> diff --git a/Documentation/vm/pin_user_pages.rst 
> b/Documentation/vm/pin_user_pages.rst
> new file mode 100644
> index ..3910f49ca98c
> --- /dev/null
> +++ b/Documentation/vm/pin_user_pages.rst

[...]

> +
> +FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
> +==
> +
> +Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for 
> describing
> +these categories:
> +
> +CASE 1: Direct IO (DIO)
> +---
> +There are GUP references to pages that are serving
> +as DIO buffers. These buffers are needed for a relatively short time (so they
> +are not "long term"). No special synchronization with page_mkclean() or
> +munmap() is provided. Therefore, flags to set at the call site are: ::
> +
> +FOLL_PIN
> +
> +...but rather than setting FOLL_PIN directly, call sites should use one of
> +the pin_user_pages*() routines that set FOLL_PIN.
> +
> +CASE 2: RDMA
> +
> +There are GUP references to pages that are serving as DMA
> +buffers. These buffers are needed for a long time ("long term"). No special
> +synchronization with page_mkclean() or munmap() is provided. Therefore, flags
> +to set at the call site are: ::
> +
> +FOLL_PIN | FOLL_LONGTERM
> +
> +NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. 
> That's
> +because DAX pages do not have a separate page cache, and so "pinning" implies
> +locking down file system blocks, which is not (yet) supported in that way.
> +
> +CASE 3: ODP
> +---
> +(Mellanox/Infiniband On Demand Paging: the hardware supports
> +replayable page faulting). There are GUP references to pages serving as DMA
> +buffers. For ODP, MMU notifiers are used to synchronize with page_mkclean()
> +and munmap(). Therefore, normal GUP calls are sufficient, so neither flag
> +needs to be set.

I would not include ODP or anything like it here; they do not use
GUP anymore, and I believe it is more confusing here. I would
however include some text in this documentation explaining that
hardware that supports page faulting is superior, as it does not
incur any of the issues described here.

> +
> +CASE 4: Pinning for struct page manipulation only
> +-
> +Here, normal GUP calls are sufficient, so neither flag needs to be set.
> +

[...]

> diff --git a/mm/gup.c b/mm/gup.c
> index 199da99e8ffc..1aea48427879 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c

[...]

> @@ -1014,7 +1018,16 @@ static __always_inline long 
> __get_user_pages_locked(struct task_struct *tsk,
>   BUG_ON(*locked != 1);
>   }
>  
> - if (pages)
> + /*
> +  * FOLL_PIN and FOLL_GET are mutually exclusive. Traditional behavior
> +  * is to set FOLL_GET if the caller wants pages[] filled in (but has
> +  * carelessly failed t

Re: [PATCH v2 03/18] goldish_pipe: rename local pin_user_pages() routine

2019-11-04 Thread Jerome Glisse
On Sun, Nov 03, 2019 at 01:17:58PM -0800, John Hubbard wrote:
> 1. Avoid naming conflicts: rename local static function from
> "pin_user_pages()" to "pin_goldfish_pages()".
> 
> An upcoming patch will introduce a global pin_user_pages()
> function.
> 
> Reviewed-by: Ira Weiny 
> Signed-off-by: John Hubbard 

Reviewed-by: Jérôme Glisse 

> ---
>  drivers/platform/goldfish/goldfish_pipe.c | 18 +-
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/platform/goldfish/goldfish_pipe.c 
> b/drivers/platform/goldfish/goldfish_pipe.c
> index cef0133aa47a..7ed2a21a0bac 100644
> --- a/drivers/platform/goldfish/goldfish_pipe.c
> +++ b/drivers/platform/goldfish/goldfish_pipe.c
> @@ -257,12 +257,12 @@ static int goldfish_pipe_error_convert(int status)
>   }
>  }
>  
> -static int pin_user_pages(unsigned long first_page,
> -   unsigned long last_page,
> -   unsigned int last_page_size,
> -   int is_write,
> -   struct page *pages[MAX_BUFFERS_PER_COMMAND],
> -   unsigned int *iter_last_page_size)
> +static int pin_goldfish_pages(unsigned long first_page,
> +   unsigned long last_page,
> +   unsigned int last_page_size,
> +   int is_write,
> +   struct page *pages[MAX_BUFFERS_PER_COMMAND],
> +   unsigned int *iter_last_page_size)
>  {
>   int ret;
>   int requested_pages = ((last_page - first_page) >> PAGE_SHIFT) + 1;
> @@ -354,9 +354,9 @@ static int transfer_max_buffers(struct goldfish_pipe 
> *pipe,
>   if (mutex_lock_interruptible(&pipe->lock))
>   return -ERESTARTSYS;
>  
> - pages_count = pin_user_pages(first_page, last_page,
> -  last_page_size, is_write,
> -  pipe->pages, &iter_last_page_size);
> + pages_count = pin_goldfish_pages(first_page, last_page,
> +  last_page_size, is_write,
> +  pipe->pages, &iter_last_page_size);
>   if (pages_count < 0) {
>   mutex_unlock(&pipe->lock);
>   return pages_count;
> -- 
> 2.23.0
> 


Re: [PATCH v2 02/18] mm/gup: factor out duplicate code from four routines

2019-11-04 Thread Jerome Glisse
On Sun, Nov 03, 2019 at 01:17:57PM -0800, John Hubbard wrote:
> There are four locations in gup.c that have a fair amount of code
> duplication. This means that changing one requires making the same
> changes in four places, not to mention reading the same code four
> times, and wondering if there are subtle differences.
> 
> Factor out the common code into static functions, thus reducing the
> overall line count and the code's complexity.
> 
> Also, take the opportunity to slightly improve the efficiency of the
> error cases, by doing a mass subtraction of the refcount, surrounded
> by get_page()/put_page().
> 
> Also, further simplify (slightly), by waiting until the the successful
> end of each routine, to increment *nr.
> 
> Cc: Ira Weiny 
> Cc: Christoph Hellwig 
> Cc: Aneesh Kumar K.V 
> Signed-off-by: John Hubbard 

Good cleanup.

Reviewed-by: Jérôme Glisse 

> ---
>  mm/gup.c | 104 ---
>  1 file changed, 45 insertions(+), 59 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 85caf76b3012..199da99e8ffc 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1969,6 +1969,34 @@ static int __gup_device_huge_pud(pud_t pud, pud_t 
> *pudp, unsigned long addr,
>  }
>  #endif
>  
> +static int __record_subpages(struct page *page, unsigned long addr,
> +  unsigned long end, struct page **pages, int nr)
> +{
> + int nr_recorded_pages = 0;
> +
> + do {
> + pages[nr] = page;
> + nr++;
> + page++;
> + nr_recorded_pages++;
> + } while (addr += PAGE_SIZE, addr != end);
> + return nr_recorded_pages;
> +}
> +
> +static void put_compound_head(struct page *page, int refs)
> +{
> + /* Do a get_page() first, in case refs == page->_refcount */
> + get_page(page);
> + page_ref_sub(page, refs);
> + put_page(page);
> +}
> +
> +static void __huge_pt_done(struct page *head, int nr_recorded_pages, int *nr)
> +{
> + *nr += nr_recorded_pages;
> + SetPageReferenced(head);
> +}
> +
>  #ifdef CONFIG_ARCH_HAS_HUGEPD
>  static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
> unsigned long sz)
> @@ -1998,33 +2026,20 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, 
> unsigned long addr,
>   /* hugepages are never "special" */
>   VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>  
> - refs = 0;
>   head = pte_page(pte);
> -
>   page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
> - do {
> - VM_BUG_ON(compound_head(page) != head);
> - pages[*nr] = page;
> - (*nr)++;
> - page++;
> - refs++;
> - } while (addr += PAGE_SIZE, addr != end);
> + refs = __record_subpages(page, addr, end, pages, *nr);
>  
>   head = try_get_compound_head(head, refs);
> - if (!head) {
> - *nr -= refs;
> + if (!head)
>   return 0;
> - }
>  
>   if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> - /* Could be optimized better */
> - *nr -= refs;
> - while (refs--)
> - put_page(head);
> + put_compound_head(head, refs);
>   return 0;
>   }
>  
> - SetPageReferenced(head);
> + __huge_pt_done(head, refs, nr);
>   return 1;
>  }
>  
> @@ -2071,29 +2086,19 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
> unsigned long addr,
>pages, nr);
>   }
>  
> - refs = 0;
>   page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> - do {
> - pages[*nr] = page;
> - (*nr)++;
> - page++;
> - refs++;
> - } while (addr += PAGE_SIZE, addr != end);
> + refs = __record_subpages(page, addr, end, pages, *nr);
>  
>   head = try_get_compound_head(pmd_page(orig), refs);
> - if (!head) {
> - *nr -= refs;
> + if (!head)
>   return 0;
> - }
>  
>   if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> - *nr -= refs;
> - while (refs--)
> - put_page(head);
> + put_compound_head(head, refs);
>   return 0;
>   }
>  
> - SetPageReferenced(head);
> + __huge_pt_done(head, refs, nr);
>   return 1;
>  }
>  
> @@ -2114,29 +2119,19 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, 
> unsigned long addr,
>pages, nr);
>   }
>  
> - refs = 0;
>   page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> - do {
> - pages[*nr] = page;
> - (*nr)++;
> - page++;
> - refs++;
> - } while (addr += PAGE_SIZE, addr != end);
> + refs = __record_subpages(page, addr, end, pages, *nr);
>  
>   head = try_get_compound_head(pud_page(orig), refs);
> - if (!head) {
> - *nr -= refs;
> + if (!head)
>

Re: [PATCH v2 01/18] mm/gup: pass flags arg to __gup_device_* functions

2019-11-04 Thread Jerome Glisse
On Sun, Nov 03, 2019 at 01:17:56PM -0800, John Hubbard wrote:
> A subsequent patch requires access to gup flags, so
> pass the flags argument through to the __gup_device_*
> functions.
> 
> Also placate checkpatch.pl by shortening a nearby line.
> 
> Reviewed-by: Ira Weiny 
> Cc: Kirill A. Shutemov 
> Signed-off-by: John Hubbard 

Reviewed-by: Jérôme Glisse 

> ---
>  mm/gup.c | 28 ++--
>  1 file changed, 18 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 8f236a335ae9..85caf76b3012 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1890,7 +1890,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
> unsigned long end,
>  
>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && 
> defined(CONFIG_TRANSPARENT_HUGEPAGE)
>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   int nr_start = *nr;
>   struct dev_pagemap *pgmap = NULL;
> @@ -1916,13 +1917,14 @@ static int __gup_device_huge(unsigned long pfn, 
> unsigned long addr,
>  }
>  
>  static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   unsigned long fault_pfn;
>   int nr_start = *nr;
>  
>   fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> - if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
> + if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
>   return 0;
>  
>   if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> @@ -1933,13 +1935,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t 
> *pmdp, unsigned long addr,
>  }
>  
>  static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   unsigned long fault_pfn;
>   int nr_start = *nr;
>  
>   fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> - if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
> + if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
>   return 0;
>  
>   if (unlikely(pud_val(orig) != pud_val(*pudp))) {
> @@ -1950,14 +1953,16 @@ static int __gup_device_huge_pud(pud_t orig, pud_t 
> *pudp, unsigned long addr,
>  }
>  #else
>  static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   BUILD_BUG();
>   return 0;
>  }
>  
>  static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   BUILD_BUG();
>   return 0;
> @@ -2062,7 +2067,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
> unsigned long addr,
>   if (pmd_devmap(orig)) {
>   if (unlikely(flags & FOLL_LONGTERM))
>   return 0;
> - return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
> + return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
> +  pages, nr);
>   }
>  
>   refs = 0;
> @@ -2092,7 +2098,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
> unsigned long addr,
>  }
>  
>  static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> - unsigned long end, unsigned int flags, struct page **pages, int 
> *nr)
> + unsigned long end, unsigned int flags,
> + struct page **pages, int *nr)
>  {
>   struct page *head, *page;
>   int refs;
> @@ -2103,7 +2110,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, 
> unsigned long addr,
>   if (pud_devmap(orig)) {
>   if (unlikely(flags & FOLL_LONGTERM))
>   return 0;
> - return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
> + return __gup_device_huge_pud(orig, pudp, addr, end, flags,
> +  pages, nr);
>   }
>  
>   refs = 0;
> -- 
> 2.23.0
> 


Re: Proposal to report GPU private memory allocations with sysfs nodes [plain text version]

2019-10-28 Thread Jerome Glisse
On Fri, Oct 25, 2019 at 11:35:32AM -0700, Yiwei Zhang wrote:
> Hi folks,
> 
> This is the plain text version of the previous email in case that was
> considered as spam.
> 
> --- Background ---
> On the downstream Android, vendors used to report GPU private memory
> allocations with debugfs nodes in their own formats. However, debugfs nodes
> are getting deprecated in the next Android release.

Maybe explain first why it is useful?

> 
> --- Proposal ---
> We are taking the chance to unify all the vendors to migrate their existing
> debugfs nodes into a standardized sysfs node structure. Then the platform
> is able to do a bunch of useful things: memory profiling, system health
> coverage, field metrics, local shell dump, in-app api, etc. This proposal
> is better served upstream as all GPU vendors can standardize a gpu memory
> structure and reduce fragmentation across Android and Linux that clients
> can rely on.
> 
> --- Detailed design ---
> The sysfs node structure looks like below:
> /sys/devices/<root_name>/<pid>/<type_name>
> e.g. "/sys/devices/mali0/gpu_mem/606/gl_buffer" and the gl_buffer is a node
> having the comma separated size values: "4096,81920,...,4096".

How does the kernel know what API the allocation is used for? With the
open source drivers you never specify what API is creating a GEM object
(OpenGL, Vulkan, ...) nor what purpose it serves (transient, shader, ...).
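
For illustration, this is roughly all the information a GEM allocation
ioctl carries in the open drivers (a simplified, hypothetical struct,
not the exact UAPI of any particular driver):

	struct drm_gem_create {
		__u64 size;	/* bytes to allocate */
		__u32 flags;	/* placement/caching hints, nothing API specific */
		__u32 handle;	/* returned GEM handle */
	};

There is no field saying "this is a shader" or "this came from Vulkan",
so the kernel cannot classify the allocation by itself.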


> For the top level root, vendors can choose their own names based on the
> value of ro.gfx.sysfs.0 the vendors set. (1) For the multiple gpu driver
> cases, we can use ro.gfx.sysfs.1, ro.gfx.sysfs.2 for the 2nd and 3rd KMDs.
> (2) It's also allowed to put some sub-dir for example "kgsl/gpu_mem" or
> "mali0/gpu_mem" in the ro.gfx.sysfs. property if the root name
> under /sys/devices/ is already created and used for other purposes.

On one side you want to standardize; on the other you want to give
complete freedom on the top level naming scheme. I would rather see a
consistent naming scheme (i.e. something more constrained, with little
room for interpretation by individual drivers).

> For the 2nd level "pid", there are usually just a couple of them per
> snapshot, since we only takes snapshot for the active ones.

? I do not understand here: you can have any number of applications with
GPU objects, and thus there is no bound on the number of PIDs. Please
consider desktop too; I do not know what kind of limitations Android
imposes.

> For the 3rd level "type_name", the type name will be one of the GPU memory
> object types in lower case, and the value will be a comma separated
> sequence of size values for all the allocations under that specific type.
> 
> We especially would like some comments on this part. For the GPU memory
> object types, we defined 9 different types for Android:
> (1) UNKNOWN // not accounted for in any other category
> (2) SHADER // shader binaries
> (3) COMMAND // allocations which have a lifetime similar to a
> VkCommandBuffer
> (4) VULKAN // backing for VkDeviceMemory
> (5) GL_TEXTURE // GL Texture and RenderBuffer
> (6) GL_BUFFER // GL Buffer
> (7) QUERY // backing for query
> (8) DESCRIPTOR // allocations which have a lifetime similar to a
> VkDescriptorSet
> (9) TRANSIENT // random transient things that the driver needs
>
> We are wondering if those type enumerations make sense to the upstream side
> as well, or maybe we just deal with our own different type sets. Cuz on the
> Android side, we'll just read those nodes named after the types we defined
> in the sysfs node structure.

See my point above about the open source drivers and the kernel being
unaware of the allocation's purpose and use.

Cheers,
Jérôme


Re: [PATCH hmm 00/15] Consolidate the mmu notifier interval_tree and locking

2019-10-23 Thread Jerome Glisse
On Mon, Oct 21, 2019 at 07:06:00PM +, Jason Gunthorpe wrote:
> On Mon, Oct 21, 2019 at 02:40:41PM -0400, Jerome Glisse wrote:
> > On Tue, Oct 15, 2019 at 03:12:27PM -0300, Jason Gunthorpe wrote:
> > > From: Jason Gunthorpe 
> > > 
> > > 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, hfi1,
> > > scif_dma, vhost, gntdev, hmm) drivers are using a common pattern where
> > > they only use invalidate_range_start/end and immediately check the
> > > invalidating range against some driver data structure to tell if the
> > > driver is interested. Half of them use an interval_tree, the others are
> > > simple linear search lists.
> > > 
> > > Of the ones I checked they largely seem to have various kinds of races,
> > > bugs and poor implementation. This is a result of the complexity in how
> > > the notifier interacts with get_user_pages(). It is extremely difficult to
> > > use it correctly.
> > > 
> > > Consolidate all of this code together into the core mmu_notifier and
> > > provide a locking scheme similar to hmm_mirror that allows the user to
> > > safely use get_user_pages() and reliably know if the page list still
> > > matches the mm.
> > > 
> > > This new arrangment plays nicely with the !blockable mode for
> > > OOM. Scanning the interval tree is done such that the intersection test
> > > will always succeed, and since there is no invalidate_range_end exposed to
> > > drivers the scheme safely allows multiple drivers to be subscribed.
> > > 
> > > Four places are converted as an example of how the new API is used.
> > > Four are left for future patches:
> > >  - i915_gem has complex locking around destruction of a registration,
> > >needs more study
> > >  - hfi1 (2nd user) needs access to the rbtree
> > >  - scif_dma has a complicated logic flow
> > >  - vhost's mmu notifiers are already being rewritten
> > > 
> > > This is still being tested, but I figured to send it to start getting help
> > > from the xen, amd and hfi drivers which I cannot test here.
> > 
> > It might be a good oportunity to also switch those users to
> > hmm_range_fault() instead of GUP as GUP is pointless for those
> > users. In fact the GUP is an impediment to normal mm operations.
> 
> I think vhost can use hmm_range_fault
> 
> hfi1 does actually need to have the page pin, it doesn't fence DMA
> during invalidate.
> 
> i915_gem feels alot like amdgpu, so probably it would benefit
> 
> No idea about scif_dma
> 
> > I will test on nouveau.
> 
> Thanks, hopefully it still works, I think Ralph was able to do some
> basic checks. But it is a pretty complicated series, I probably made
> some mistakes.

So it seems to work OK with nouveau; I will let the tests run in a loop,
though they are not very advanced tests.

> 
> FWIW, I know that nouveau gets a lockdep splat now from Daniel
> Vetter's recent changes, it tries to do GFP_KERENEL allocations under
> a lock also held by the invalidate_range_start path.

I have not seen any splat so far; does it require some new kernel config?

Cheers,
Jérôme


Re: [PATCH hmm 00/15] Consolidate the mmu notifier interval_tree and locking

2019-10-23 Thread Jerome Glisse
On Wed, Oct 23, 2019 at 11:32:16AM +0200, Christian König wrote:
> Am 23.10.19 um 11:08 schrieb Daniel Vetter:
> > On Tue, Oct 22, 2019 at 03:01:13PM +, Jason Gunthorpe wrote:
> > > On Tue, Oct 22, 2019 at 09:57:35AM +0200, Daniel Vetter wrote:
> > > 
> > > > > The unusual bit in all of this is using a lock's critical region to
> > > > > 'protect' data for read, but updating that same data before the lock's
> > > > > critical secion. ie relying on the unlock barrier to 'release' program
> > > > > ordered stores done before the lock's own critical region, and the
> > > > > lock side barrier to 'acquire' those stores.
> > > > I think this unusual use of locks as barriers for other unlocked 
> > > > accesses
> > > > deserves comments even more than just normal barriers. Can you pls add
> > > > them? I think the design seeems sound ...
> > > > 
> > > > Also the comment on the driver's lock hopefully prevents driver
> > > > maintainers from moving the driver_lock around in a way that would very
> > > > subtle break the scheme, so I think having the acquire barrier commented
> > > > in each place would be really good.
> > > There is already a lot of documentation, I think it would be helpful
> > > if you could suggest some specific places where you think an addition
> > > would help? I think the perspective of someone less familiar with this
> > > design would really improve the documentation
> > Hm I just meant the usual recommendation that "barriers must have comments
> > explaining what they order, and where the other side of the barrier is".
> > Using unlock/lock as a barrier imo just makes that an even better idea.
> > Usually what I do is something like "we need to order $this against $that
> > below, and the other side of this barrier is in function()." With maybe a
> > bit more if it's not obvious how things go wrong if the orderin is broken.
> > 
> > Ofc seqlock.h itself skimps on that rule and doesn't bother explaining its
> > barriers :-/
> > 
> > > I've been tempted to force the driver to store the seq number directly
> > > under the driver lock - this makes the scheme much clearer, ie
> > > something like this:
> > > 
> > > diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
> > > b/drivers/gpu/drm/nouveau/nouveau_svm.c
> > > index 712c99918551bc..738fa670dcfb19 100644
> > > --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
> > > +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
> > > @@ -488,7 +488,8 @@ struct svm_notifier {
> > >   };
> > >   static bool nouveau_svm_range_invalidate(struct mmu_range_notifier *mrn,
> > > -const struct mmu_notifier_range 
> > > *range)
> > > +const struct mmu_notifier_range 
> > > *range,
> > > +unsigned long seq)
> > >   {
> > >  struct svm_notifier *sn =
> > >  container_of(mrn, struct svm_notifier, notifier);
> > > @@ -504,6 +505,7 @@ static bool nouveau_svm_range_invalidate(struct 
> > > mmu_range_notifier *mrn,
> > >  mutex_lock(&sn->svmm->mutex);
> > >  else if (!mutex_trylock(&sn->svmm->mutex))
> > >  return false;
> > > +   mmu_range_notifier_update_seq(mrn, seq);
> > >  mutex_unlock(&sn->svmm->mutex);
> > >  return true;
> > >   }
> > > 
> > > 
> > > At the cost of making the driver a bit more complex, what do you
> > > think?
> > Hm, spinning this further ... could we initialize the mmu range notifier
> > with a pointer to the driver lock, so that we could put a
> > lockdep_assert_held into mmu_range_notifier_update_seq? I think that would
> > make this scheme substantially more driver-hacker proof :-)
> 
> Going another step further what hinders us to put the lock into the mmu
> range notifier itself and have _lock()/_unlock() helpers?
> 
> I mean having the lock in the driver only makes sense when the driver would
> be using the same lock for multiple things, e.g. multiple MMU range
> notifiers under the same lock. But I really don't see that use case here.

I actually do: nouveau uses one lock to protect the page table, and that's the
lock that matters. You can have multiple ranges for a single page table, the
idea being that only a subset of the process address space is ever accessed by
the GPU, and thus it is better to focus on this subset and track invalidations
at a finer grain.
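
Roughly what I mean, as an illustrative sketch only (not actual nouveau
code; the struct and array size are made up):

	struct drv_mirror {
		/* the one lock that protects the device page table; it is
		 * also taken in every invalidate() callback */
		struct mutex pt_lock;
		/* several notifiers, one per GPU-accessed sub-range of the
		 * same process address space */
		struct mmu_range_notifier ranges[8];
	};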

Cheers,
Jérôme


Re: [PATCH hmm 02/15] mm/mmu_notifier: add an interval tree notifier

2019-10-21 Thread Jerome Glisse
On Mon, Oct 21, 2019 at 07:24:53PM +, Jason Gunthorpe wrote:
> On Mon, Oct 21, 2019 at 03:11:57PM -0400, Jerome Glisse wrote:
> > > Since that reader is not locked we need release semantics here to
> > > ensure the unlocked reader sees a fully initinalized mmu_notifier_mm
> > > structure when it observes the pointer.
> > 
> > I thought the mm_take_all_locks() would have had a barrier and thus
> > that you could not see mmu_notifier struct partialy initialized. 
> 
> Not sure, usually a lock acquire doesn't have a store barrier?

Yeah likely.

> 
> Even if it did, we would still need some pairing read barrier..
> 
> > having the acquire/release as safety net does not hurt. Maybe add a
> > comment about the struct initialization needing to be visible before
> > pointer is set.
> 
> Is this clear?
> 
>  * release semantics on the initialization of the
>  * mmu_notifier_mm's contents are provided for unlocked readers.
>* acquire can only be used while holding the
>  * mmgrab or mmget, and is safe because once created the
>  * mmu_notififer_mm is not freed until the mm is destroyed.
>  * Users holding the mmap_sem or one of the
>* mm_take_all_locks() do not need to use acquire semantics.
> 
> It also helps explain why there is no locking around the other
> readers, which has puzzled me in the past at least.

Perfect.

Jérôme


Re: [PATCH hmm 15/15] mm/hmm: remove hmm_mirror and related

2019-10-21 Thread Jerome Glisse
On Mon, Oct 21, 2019 at 06:57:42PM +, Jason Gunthorpe wrote:
> On Mon, Oct 21, 2019 at 02:38:24PM -0400, Jerome Glisse wrote:
> > On Tue, Oct 15, 2019 at 03:12:42PM -0300, Jason Gunthorpe wrote:
> > > From: Jason Gunthorpe 
> > > 
> > > The only two users of this are now converted to use mmu_range_notifier,
> > > delete all the code and update hmm.rst.
> > 
> > I guess i should point out that the reasons for hmm_mirror and hmm
> > was for:
> > 1) Maybe define a common API for userspace to provide memory
> >placement hints (NUMA for GPU)
> 
> Do you think this needs special code in the notifiers?

I just need a place to hang userspace policy hints; the hmm_range
was the prime suspect. I need to revisit this once the nouveau user
space is in better shape.

> 
> > 2) multi-devices sharing same mirror page table
> 
> Oh neat, but I think this just means the GPU driver has to register a
> single notifier for multiple GPUs??

Yes, that was the idea: a single notifier with a shared page table. But
at this time this is non-existent code, so there is no need to hinder the
change just for the sake of it.

> 
> > But support for multi-GPU in nouveau is way behind and i guess such
> > optimization will have to re-materialize what is necessary once that
> > happens.
> 
> Sure, it will be easier to understand what is needed with a bit of
> code!
> 
> > Note this patch should also update kernel/fork.c and the mm_struct
> > definition AFAICT. With those changes you can add my:
> 
> Can you please elaborate what updates you mean? I'm not sure. 
> 
> Maybe I already got the things you are thinking of with the get/put
> changes?

Oh, I forgot this was already taken care of by this. So yes, all is
fine:

Reviewed-by: Jérôme Glisse 


Re: [PATCH hmm 02/15] mm/mmu_notifier: add an interval tree notifier

2019-10-21 Thread Jerome Glisse
On Mon, Oct 21, 2019 at 06:54:25PM +, Jason Gunthorpe wrote:
> On Mon, Oct 21, 2019 at 02:30:56PM -0400, Jerome Glisse wrote:
> 
> > > +/**
> > > + * mmu_range_read_retry - End a read side critical section against a VA 
> > > range
> > > + * mrn: The range under lock
> > > + * seq: The return of the paired mmu_range_read_begin()
> > > + *
> > > + * This MUST be called under a user provided lock that is also held
> > > + * unconditionally by op->invalidate(). That lock provides the required 
> > > SMP
> > > + * barrier for handling invalidate_seq.
> > > + *
> > > + * Each call should be paired with a single mmu_range_read_begin() and
> > > + * should be used to conclude the read side.
> > > + *
> > > + * Returns true if an invalidation collided with this critical section, 
> > > and
> > > + * the caller should retry.
> > > + */
> > > +static inline bool mmu_range_read_retry(struct mmu_range_notifier *mrn,
> > > + unsigned long seq)
> > > +{
> > > + return READ_ONCE(mrn->invalidate_seq) != seq;
> > > +}
> > 
> > What about calling this mmu_range_read_end() instead ? To match
> > with the mmu_range_read_begin().
> 
> _end make some sense too, but I picked _retry for symmetry with the
> seqcount_* family of functions which used retry.
> 
> I think retry makes it clearer that it is expected to fail and retry
> is required.

Fair enough.

> 
> > > + /*
> > > +  * The inv_end incorporates a deferred mechanism like rtnl. Adds and
> > 
> > The rtnl reference is lost on people unfamiliar with the network :)
> > code maybe like rtnl_lock()/rtnl_unlock() so people have a chance to
> > grep the right function. Assuming i am myself getting the right
> > reference :)
> 
> Yep, you got it, I will update
> 
> > > + /*
> > > +  * mrn->invalidate_seq is always set to an odd value. This ensures
> > > +  * that if seq does wrap we will always clear the below sleep in some
> > > +  * reasonable time as mmn_mm->invalidate_seq is even in the idle
> > > +  * state.
> > 
> > I think this comment should be with the struct mmu_range_notifier
> > definition and you should just point to it from here as the same
> > comment would be useful down below.
> 
> I had it here because it is critical to understanding the wait_event
> and why it doesn't just block indefinitely, but yes this property
> comes up below too which refers back here.
> 
> Fundamentally this wait event is why this approach to keep an odd
> value in the mrn is used.

The comment is fine; it is just that I read the patch out of order, and
in the insert function I was pondering why the value must be odd while
the explanation was here. It is more a matter of taste: I prefer comments
like this to be part of the struct definition's comments, so that
multiple places can refer to the same struct definition. That is more
resilient to code change, as the struct definition is always easy to
find, and thus references to it can be sprinkled wherever they are
necessary.


> 
> > > -int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range 
> > > *range)
> > > +static int mn_itree_invalidate(struct mmu_notifier_mm *mmn_mm,
> > > +  const struct mmu_notifier_range *range)
> > > +{
> > > + struct mmu_range_notifier *mrn;
> > > + unsigned long cur_seq;
> > > +
> > > + for (mrn = mn_itree_inv_start_range(mmn_mm, range, &cur_seq); mrn;
> > > +  mrn = mn_itree_inv_next(mrn, range)) {
> > > + bool ret;
> > > +
> > > + WRITE_ONCE(mrn->invalidate_seq, cur_seq);
> > > + ret = mrn->ops->invalidate(mrn, range);
> > > + if (!ret && !WARN_ON(mmu_notifier_range_blockable(range)))
> > 
> > Isn't the logic wrong here ? We want to warn if the range
> > was mark as blockable and invalidate returned false. Also
> > we went to backoff no matter what if the invalidate return
> > false ie:
> 
> If invalidate returned false and the caller is blockable then we do
> not want to return, we must continue processing other ranges - to try
> to cope with the defective driver.
> 
> Callers in blocking mode ignore the return value and go ahead to
> invalidate..
> 
> Would it be clearer as 
> 
> if (!ret) {
>if (WARN_ON(mmu_notifier_range_blockable(range)))
>continue;
>goto out_would_block;
> }
> 
> ?

Yes, that looks clearer to me at least.

> 
> > > @@ -284,21 +5

Re: [PATCH hmm 00/15] Consolidate the mmu notifier interval_tree and locking

2019-10-21 Thread Jerome Glisse
On Tue, Oct 15, 2019 at 03:12:27PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe 
> 
> 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, hfi1,
> scif_dma, vhost, gntdev, hmm) drivers are using a common pattern where
> they only use invalidate_range_start/end and immediately check the
> invalidating range against some driver data structure to tell if the
> driver is interested. Half of them use an interval_tree, the others are
> simple linear search lists.
> 
> Of the ones I checked they largely seem to have various kinds of races,
> bugs and poor implementation. This is a result of the complexity in how
> the notifier interacts with get_user_pages(). It is extremely difficult to
> use it correctly.
> 
> Consolidate all of this code together into the core mmu_notifier and
> provide a locking scheme similar to hmm_mirror that allows the user to
> safely use get_user_pages() and reliably know if the page list still
> matches the mm.
> 
> This new arrangment plays nicely with the !blockable mode for
> OOM. Scanning the interval tree is done such that the intersection test
> will always succeed, and since there is no invalidate_range_end exposed to
> drivers the scheme safely allows multiple drivers to be subscribed.
> 
> Four places are converted as an example of how the new API is used.
> Four are left for future patches:
>  - i915_gem has complex locking around destruction of a registration,
>needs more study
>  - hfi1 (2nd user) needs access to the rbtree
>  - scif_dma has a complicated logic flow
>  - vhost's mmu notifiers are already being rewritten
> 
> This is still being tested, but I figured to send it to start getting help
> from the xen, amd and hfi drivers which I cannot test here.

It might be a good opportunity to also switch those users to
hmm_range_fault() instead of GUP, as GUP is pointless for those
users. In fact GUP is an impediment to normal mm operations.
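
For those users the pattern would look roughly like this (a condensed
sketch of the hmm.rst usage example from this series; range/notifier
setup and most error handling are elided, and take_driver_lock()/
release_driver_lock() stand in for whatever lock the driver already
holds in its invalidate() callback):

again:
	range.notifier_seq = mmu_range_read_begin(&mrn);
	down_read(&mm->mmap_sem);
	ret = hmm_range_fault(&range, 0);
	up_read(&mm->mmap_sem);
	if (ret == -EBUSY)
		goto again;

	take_driver_lock();
	if (mmu_range_read_retry(&mrn, range.notifier_seq)) {
		release_driver_lock();
		goto again;
	}
	/* range.pfns[] is now stable, update the device page table */
	release_driver_lock();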

I will test on nouveau.

Cheers,
Jérôme


Re: [PATCH hmm 15/15] mm/hmm: remove hmm_mirror and related

2019-10-21 Thread Jerome Glisse
On Tue, Oct 15, 2019 at 03:12:42PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe 
> 
> The only two users of this are now converted to use mmu_range_notifier,
> delete all the code and update hmm.rst.

I guess I should point out that the reasons for hmm_mirror and hmm
were:
1) Maybe define a common API for userspace to provide memory
   placement hints (NUMA for GPU)
2) Multiple devices sharing the same mirror page table

But support for multi-GPU in nouveau is way behind, and I guess such
optimizations will have to re-materialize whatever is necessary once that
happens.

Note this patch should also update kernel/fork.c and the mm_struct
definition AFAICT. With those changes you can add my:

Reviewed-by: Jérôme Glisse 

> 
> Signed-off-by: Jason Gunthorpe 
> ---
>  Documentation/vm/hmm.rst | 105 ---
>  include/linux/hmm.h  | 183 +
>  mm/Kconfig   |   1 -
>  mm/hmm.c | 284 +--
>  4 files changed, 33 insertions(+), 540 deletions(-)
> 
> diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
> index 0a5960beccf76d..a247643035c4e2 100644
> --- a/Documentation/vm/hmm.rst
> +++ b/Documentation/vm/hmm.rst
> @@ -147,49 +147,16 @@ Address space mirroring implementation and API
>  Address space mirroring's main objective is to allow duplication of a range 
> of
>  CPU page table into a device page table; HMM helps keep both synchronized. A
>  device driver that wants to mirror a process address space must start with 
> the
> -registration of an hmm_mirror struct::
> -
> - int hmm_mirror_register(struct hmm_mirror *mirror,
> - struct mm_struct *mm);
> -
> -The mirror struct has a set of callbacks that are used
> -to propagate CPU page tables::
> -
> - struct hmm_mirror_ops {
> - /* release() - release hmm_mirror
> -  *
> -  * @mirror: pointer to struct hmm_mirror
> -  *
> -  * This is called when the mm_struct is being released.  The callback
> -  * must ensure that all access to any pages obtained from this mirror
> -  * is halted before the callback returns. All future access should
> -  * fault.
> -  */
> - void (*release)(struct hmm_mirror *mirror);
> -
> - /* sync_cpu_device_pagetables() - synchronize page tables
> -  *
> -  * @mirror: pointer to struct hmm_mirror
> -  * @update: update information (see struct mmu_notifier_range)
> -  * Return: -EAGAIN if update.blockable false and callback need to
> -  * block, 0 otherwise.
> -  *
> -  * This callback ultimately originates from mmu_notifiers when the CPU
> -  * page table is updated. The device driver must update its page table
> -  * in response to this callback. The update argument tells what action
> -  * to perform.
> -  *
> -  * The device driver must not return from this callback until the device
> -  * page tables are completely updated (TLBs flushed, etc); this is a
> -  * synchronous call.
> -  */
> - int (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
> -   const struct hmm_update *update);
> - };
> -
> -The device driver must perform the update action to the range (mark range
> -read only, or fully unmap, etc.). The device must complete the update before
> -the driver callback returns.
> +registration of a mmu_range_notifier::
> +
> + mrn->ops = &driver_ops;
> + int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
> +   unsigned long start, unsigned long length,
> +   struct mm_struct *mm);
> +
> +During the driver_ops->invalidate() callback the device driver must perform
> +the update action to the range (mark range read only, or fully unmap,
> +etc.). The device must complete the update before the driver callback 
> returns.
>  
>  When the device driver wants to populate a range of virtual addresses, it can
>  use::
> @@ -216,70 +183,46 @@ The usage pattern is::
>struct hmm_range range;
>...
>  
> +  range.notifier = &mrn;
>range.start = ...;
>range.end = ...;
>range.pfns = ...;
>range.flags = ...;
>range.values = ...;
>range.pfn_shift = ...;
> -  hmm_range_register(&range, mirror);
>  
> -  /*
> -   * Just wait for range to be valid, safe to ignore return value as we
> -   * will use the return value of hmm_range_fault() below under the
> -   * mmap_sem to ascertain the validity of the range.
> -   */
> -  hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
> +  if (!mmget_not_zero(mrn->notifier.mm))
> +  return -EFAULT;
>  
>   again:
> +  range.notifier_seq = mmu_range_read_begin(&mrn);
>down_read(&mm->mmap_sem);
>ret = hmm_range_fault(&range, HMM_RANGE_SNAPSHOT);
>if (ret) {
>up_read(&mm->mmap_sem);
> -  if (ret == -EBUSY) {
> -  

Re: [PATCH hmm 03/15] mm/hmm: allow hmm_range to be used with a mmu_range_notifier or hmm_mirror

2019-10-21 Thread Jerome Glisse
On Tue, Oct 15, 2019 at 03:12:30PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe 
> 
> hmm_mirror's handling of ranges does not use a sequence count which
> results in this bug:
> 
>  CPU0   CPU1
>  hmm_range_wait_until_valid(range)
>  valid == true
>  hmm_range_fault(range)
> hmm_invalidate_range_start()
>range->valid = false
> hmm_invalidate_range_end()
>range->valid = true
>  hmm_range_valid(range)
>   valid == true
> 
> Where the hmm_range_valid should not have succeeded.
> 
> Adding the required sequence count would make it nearly identical to the
> new mmu_range_notifier. Instead replace the hmm_mirror stuff with
> mmu_range_notifier.
> 
> Co-existence of the two APIs is the first step.
> 
> Signed-off-by: Jason Gunthorpe 

Reviewed-by: Jérôme Glisse 

> ---
>  include/linux/hmm.h |  5 +
>  mm/hmm.c| 25 +++--
>  2 files changed, 24 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 3fec513b9c00f1..8ac1fd6a81af8f 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -145,6 +145,9 @@ enum hmm_pfn_value_e {
>  /*
>   * struct hmm_range - track invalidation lock on virtual address range
>   *
> + * @notifier: an optional mmu_range_notifier
> + * @notifier_seq: when notifier is used this is the result of
> + *mmu_range_read_begin()
>   * @hmm: the core HMM structure this range is active against
>   * @vma: the vm area struct for the range
>   * @list: all range lock are on a list
> @@ -159,6 +162,8 @@ enum hmm_pfn_value_e {
>   * @valid: pfns array did not change since it has been fill by an HMM 
> function
>   */
>  struct hmm_range {
> + struct mmu_range_notifier *notifier;
> + unsigned long   notifier_seq;
>   struct hmm  *hmm;
>   struct list_headlist;
>   unsigned long   start;
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 902f5fa6bf93ad..22ac3595771feb 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -852,6 +852,14 @@ void hmm_range_unregister(struct hmm_range *range)
>  }
>  EXPORT_SYMBOL(hmm_range_unregister);
>  
> +static bool needs_retry(struct hmm_range *range)
> +{
> + if (range->notifier)
> + return mmu_range_check_retry(range->notifier,
> +  range->notifier_seq);
> + return !range->valid;
> +}
> +
>  static const struct mm_walk_ops hmm_walk_ops = {
>   .pud_entry  = hmm_vma_walk_pud,
>   .pmd_entry  = hmm_vma_walk_pmd,
> @@ -892,18 +900,23 @@ long hmm_range_fault(struct hmm_range *range, unsigned 
> int flags)
>   const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
>   unsigned long start = range->start, end;
>   struct hmm_vma_walk hmm_vma_walk;
> - struct hmm *hmm = range->hmm;
> + struct mm_struct *mm;
>   struct vm_area_struct *vma;
>   int ret;
>  
> - lockdep_assert_held(&hmm->mmu_notifier.mm->mmap_sem);
> + if (range->notifier)
> + mm = range->notifier->mm;
> + else
> + mm = range->hmm->mmu_notifier.mm;
> +
> + lockdep_assert_held(&mm->mmap_sem);
>  
>   do {
>   /* If range is no longer valid force retry. */
> - if (!range->valid)
> + if (needs_retry(range))
>   return -EBUSY;
>  
> - vma = find_vma(hmm->mmu_notifier.mm, start);
> + vma = find_vma(mm, start);
>   if (vma == NULL || (vma->vm_flags & device_vma))
>   return -EFAULT;
>  
> @@ -933,7 +946,7 @@ long hmm_range_fault(struct hmm_range *range, unsigned 
> int flags)
>   start = hmm_vma_walk.last;
>  
>   /* Keep trying while the range is valid. */
> - } while (ret == -EBUSY && range->valid);
> + } while (ret == -EBUSY && !needs_retry(range));
>  
>   if (ret) {
>   unsigned long i;
> @@ -991,7 +1004,7 @@ long hmm_range_dma_map(struct hmm_range *range, struct 
> device *device,
>   continue;
>  
>   /* Check if range is being invalidated */
> - if (!range->valid) {
> + if (needs_retry(range)) {
>   ret = -EBUSY;
>   goto unmap;
>   }
> -- 
> 2.23.0
> 


Re: [PATCH hmm 01/15] mm/mmu_notifier: define the header pre-processor parts even if disabled

2019-10-21 Thread Jerome Glisse
On Tue, Oct 15, 2019 at 03:12:28PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe 
> 
> Now that we have KERNEL_HEADER_TEST all headers are generally compile
> tested, so relying on makefile tricks to avoid compiling code that depends
> on CONFIG_MMU_NOTIFIER is more annoying.
> 
> Instead follow the usual pattern and provide most of the header with only
> the functions stubbed out when CONFIG_MMU_NOTIFIER is disabled. This
> ensures code compiles no matter what the config setting is.
> 
> While here, struct mmu_notifier_mm is private to mmu_notifier.c, move it.
> 
> Signed-off-by: Jason Gunthorpe 

Reviewed-by: Jérôme Glisse 

> ---
>  include/linux/mmu_notifier.h | 46 +---
>  mm/mmu_notifier.c| 13 ++
>  2 files changed, 30 insertions(+), 29 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 1bd8e6a09a3c27..12bd603d318ce7 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -7,8 +7,9 @@
>  #include 
>  #include 
>  
> +struct mmu_notifier_mm;
>  struct mmu_notifier;
> -struct mmu_notifier_ops;
> +struct mmu_notifier_range;
>  
>  /**
>   * enum mmu_notifier_event - reason for the mmu notifier callback
> @@ -40,36 +41,8 @@ enum mmu_notifier_event {
>   MMU_NOTIFY_SOFT_DIRTY,
>  };
>  
> -#ifdef CONFIG_MMU_NOTIFIER
> -
> -#ifdef CONFIG_LOCKDEP
> -extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;
> -#endif
> -
> -/*
> - * The mmu notifier_mm structure is allocated and installed in
> - * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
> - * critical section and it's released only when mm_count reaches zero
> - * in mmdrop().
> - */
> -struct mmu_notifier_mm {
> - /* all mmu notifiers registerd in this mm are queued in this list */
> - struct hlist_head list;
> - /* to serialize the list modifications and hlist_unhashed */
> - spinlock_t lock;
> -};
> -
>  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
>  
> -struct mmu_notifier_range {
> - struct vm_area_struct *vma;
> - struct mm_struct *mm;
> - unsigned long start;
> - unsigned long end;
> - unsigned flags;
> - enum mmu_notifier_event event;
> -};
> -
>  struct mmu_notifier_ops {
>   /*
>* Called either by mmu_notifier_unregister or when the mm is
> @@ -249,6 +222,21 @@ struct mmu_notifier {
>   unsigned int users;
>  };
>  
> +#ifdef CONFIG_MMU_NOTIFIER
> +
> +#ifdef CONFIG_LOCKDEP
> +extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;
> +#endif
> +
> +struct mmu_notifier_range {
> + struct vm_area_struct *vma;
> + struct mm_struct *mm;
> + unsigned long start;
> + unsigned long end;
> + unsigned flags;
> + enum mmu_notifier_event event;
> +};
> +
>  static inline int mm_has_notifiers(struct mm_struct *mm)
>  {
>   return unlikely(mm->mmu_notifier_mm);
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 7fde88695f35d6..367670cfd02b7b 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -27,6 +27,19 @@ struct lockdep_map 
> __mmu_notifier_invalidate_range_start_map = {
>  };
>  #endif
>  
> +/*
> + * The mmu notifier_mm structure is allocated and installed in
> + * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
> + * critical section and it's released only when mm_count reaches zero
> + * in mmdrop().
> + */
> +struct mmu_notifier_mm {
> + /* all mmu notifiers registered in this mm are queued in this list */
> + struct hlist_head list;
> + /* to serialize the list modifications and hlist_unhashed */
> + spinlock_t lock;
> +};
> +
>  /*
>   * This function can't run concurrently against mmu_notifier_register
>   * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
> -- 
> 2.23.0
> 


Re: [PATCH hmm 04/15] mm/hmm: define the pre-processor related parts of hmm.h even if disabled

2019-10-21 Thread Jerome Glisse
On Tue, Oct 15, 2019 at 03:12:31PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe 
> 
> Only the function calls are stubbed out with static inlines that always
> fail. This is the standard way to write a header for an optional component
> and makes it easier for drivers that only optionally need HMM_MIRROR.
> 
> Signed-off-by: Jason Gunthorpe 

Reviewed-by: Jérôme Glisse 

> ---
>  include/linux/hmm.h | 59 -
>  kernel/fork.c   |  1 -
>  2 files changed, 47 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 8ac1fd6a81af8f..2666eb08a40615 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -62,8 +62,6 @@
>  #include 
>  #include 
>  
> -#ifdef CONFIG_HMM_MIRROR
> -
>  #include 
>  #include 
>  #include 
> @@ -374,6 +372,15 @@ struct hmm_mirror {
>   struct list_headlist;
>  };
>  
> +/*
> + * Retry fault if non-blocking, drop mmap_sem and return -EAGAIN in that 
> case.
> + */
> +#define HMM_FAULT_ALLOW_RETRY(1 << 0)
> +
> +/* Don't fault in missing PTEs, just snapshot the current state. */
> +#define HMM_FAULT_SNAPSHOT   (1 << 1)
> +
> +#ifdef CONFIG_HMM_MIRROR
>  int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
>  void hmm_mirror_unregister(struct hmm_mirror *mirror);
>  
> @@ -383,14 +390,6 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
>  int hmm_range_register(struct hmm_range *range, struct hmm_mirror *mirror);
>  void hmm_range_unregister(struct hmm_range *range);
>  
> -/*
> - * Retry fault if non-blocking, drop mmap_sem and return -EAGAIN in that 
> case.
> - */
> -#define HMM_FAULT_ALLOW_RETRY(1 << 0)
> -
> -/* Don't fault in missing PTEs, just snapshot the current state. */
> -#define HMM_FAULT_SNAPSHOT   (1 << 1)
> -
>  long hmm_range_fault(struct hmm_range *range, unsigned int flags);
>  
>  long hmm_range_dma_map(struct hmm_range *range,
> @@ -401,6 +400,44 @@ long hmm_range_dma_unmap(struct hmm_range *range,
>struct device *device,
>dma_addr_t *daddrs,
>bool dirty);
> +#else
> +int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +void hmm_mirror_unregister(struct hmm_mirror *mirror)
> +{
> +}
> +
> +int hmm_range_register(struct hmm_range *range, struct hmm_mirror *mirror)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +void hmm_range_unregister(struct hmm_range *range)
> +{
> +}
> +
> +static inline long hmm_range_fault(struct hmm_range *range, unsigned int 
> flags)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +static inline long hmm_range_dma_map(struct hmm_range *range,
> +  struct device *device, dma_addr_t *daddrs,
> +  unsigned int flags)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +static inline long hmm_range_dma_unmap(struct hmm_range *range,
> +struct device *device,
> +dma_addr_t *daddrs, bool dirty)
> +{
> + return -EOPNOTSUPP;
> +}
> +#endif
>  
>  /*
>   * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
> @@ -411,6 +448,4 @@ long hmm_range_dma_unmap(struct hmm_range *range,
>   */
>  #define HMM_RANGE_DEFAULT_TIMEOUT 1000
>  
> -#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
> -
>  #endif /* LINUX_HMM_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f9572f41612628..4561a65d19db88 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -40,7 +40,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> -- 
> 2.23.0
> 


Re: [PATCH hmm 02/15] mm/mmu_notifier: add an interval tree notifier

2019-10-21 Thread Jerome Glisse
On Tue, Oct 15, 2019 at 03:12:29PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe 
> 
> Of the 13 users of mmu_notifiers, 8 of them use only
> invalidate_range_start/end() and immediately intersect the
> mmu_notifier_range with some kind of internal list of VAs.  4 use an
> interval tree (i915_gem, radeon_mn, umem_odp, hfi1). 4 use a linked list
> of some kind (scif_dma, vhost, gntdev, hmm)
> 
> And the remaining 5 either don't use invalidate_range_start() or do some
> special thing with it.
> 
> It turns out that building a correct scheme with an interval tree is
> pretty complicated, particularly if the use case is synchronizing against
> another thread doing get_user_pages().  Many of these implementations have
> various subtle and difficult to fix races.
> 
> This approach puts the interval tree as common code at the top of the mmu
> notifier call tree and implements a shareable locking scheme.
> 
> It includes:
>  - An interval tree tracking VA ranges, with per-range callbacks
>  - A read/write locking scheme for the interval tree that avoids
>sleeping in the notifier path (for OOM killer)
>  - A sequence counter based collision-retry locking scheme to tell
>device page fault that a VA range is being concurrently invalidated.
> 
> This is based on various ideas:
> - hmm accumulates invalidated VA ranges and releases them when all
>   invalidates are done, via active_invalidate_ranges count.
>   This approach avoids having to intersect the interval tree twice (as
>   umem_odp does) at the potential cost of a longer device page fault.
> 
> - kvm/umem_odp use a sequence counter to drive the collision retry,
>   via invalidate_seq
> 
> - a deferred work todo list on unlock scheme like RTNL, via deferred_list.
>   This makes adding/removing interval tree members more deterministic
> 
> - seqlock, except this version makes the seqlock idea multi-holder on the
>   write side by protecting it with active_invalidate_ranges and a spinlock
> 
> To minimize MM overhead when only the interval tree is being used, the
> entire SRCU and hlist overheads are dropped using some simple
> branches. Similarly the interval tree overhead is dropped when in hlist
> mode.
> 
> The overhead from the mandatory spinlock is broadly the same as most of
> existing users which already had a lock (or two) of some sort on the
> invalidation path.
> 
> Cc: Andrea Arcangeli 
> Cc: Michal Hocko 
> Signed-off-by: Jason Gunthorpe 
> ---
>  include/linux/mmu_notifier.h |  78 ++
>  mm/Kconfig   |   1 +
>  mm/mmu_notifier.c| 529 +--
>  3 files changed, 583 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 12bd603d318ce7..bc2b12483de127 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -6,10 +6,12 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  struct mmu_notifier_mm;
>  struct mmu_notifier;
>  struct mmu_notifier_range;
> +struct mmu_range_notifier;
>  
>  /**
>   * enum mmu_notifier_event - reason for the mmu notifier callback
> @@ -32,6 +34,9 @@ struct mmu_notifier_range;
>   * access flags). User should soft dirty the page in the end callback to make
>   * sure that anyone relying on soft dirtyness catch pages that might be 
> written
>   * through non CPU mappings.
> + *
> + * @MMU_NOTIFY_RELEASE: used during mmu_range_notifier invalidate to signal 
> that
> + * the mm refcount is zero and the range is no longer accessible.
>   */
>  enum mmu_notifier_event {
>   MMU_NOTIFY_UNMAP = 0,
> @@ -39,6 +44,7 @@ enum mmu_notifier_event {
>   MMU_NOTIFY_PROTECTION_VMA,
>   MMU_NOTIFY_PROTECTION_PAGE,
>   MMU_NOTIFY_SOFT_DIRTY,
> + MMU_NOTIFY_RELEASE,
>  };
>  
>  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> @@ -222,6 +228,25 @@ struct mmu_notifier {
>   unsigned int users;
>  };
>  
> +/**
> + * struct mmu_range_notifier_ops
> + * @invalidate: Upon return the caller must stop using any SPTEs within this
> + *  range, this function can sleep. Return false if blocking was
> + *  required but range is non-blocking
> + */
> +struct mmu_range_notifier_ops {
> + bool (*invalidate)(struct mmu_range_notifier *mrn,
> +const struct mmu_notifier_range *range);
> +};
> +
> +struct mmu_range_notifier {
> + struct interval_tree_node interval_tree;
> + const struct mmu_range_notifier_ops *ops;
> + struct hlist_node deferred_item;
> + unsigned long invalidate_seq;
> + struct mm_struct *mm;
> +};
> +
>  #ifdef CONFIG_MMU_NOTIFIER
>  
>  #ifdef CONFIG_LOCKDEP
> @@ -263,6 +288,59 @@ extern int __mmu_notifier_register(struct mmu_notifier 
> *mn,
>  struct mm_struct *mm);
>  extern void mmu_notifier_unregister(struct mmu_notifier *mn,
>   struct mm_struct *mm);
> +
> +unsigned long mmu_range_read_begin(struct mmu_ran

Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-16 Thread Jerome Glisse
On Fri, Aug 16, 2019 at 11:31:45AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 16, 2019 at 02:26:25PM +0200, Michal Hocko wrote:
> > On Fri 16-08-19 09:19:06, Jason Gunthorpe wrote:
> > > On Fri, Aug 16, 2019 at 10:10:29AM +0200, Michal Hocko wrote:
> > > > On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote:
> > > > > On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:

[...]

> > > I would like to inject it into the notifier path as this is very
> > > difficult for driver authors to discover and know about, but I'm
> > > worried about your false positive remark.
> > > 
> > > I think I understand we can use only GFP_ATOMIC in the notifiers, but
> > > we need a strategy to handle OOM to guarentee forward progress.
> > 
> > Your example is from the notifier registration IIUC. 
> 
> Yes, that is where this commit hit it.. Triggering this under an
> actual notifier to get a lockdep report is hard.
> 
> > Can you pre-allocate before taking locks? Could you point me to some
> > examples when the allocation is necessary in the range notifier
> > callback?
> 
> Hmm. I took a careful look, I only found mlx5 as obviously allocating
> memory:
> 
>  mlx5_ib_invalidate_range()
>   mlx5_ib_update_xlt()
>__get_free_pages(gfp, get_order(size));
> 
> However, I think this could be changed to fall back to some small
> buffer if allocation fails. The existing scheme looks sketchy
> 
> nouveau does:
> 
>  nouveau_svmm_invalidate
>   nvif_object_mthd
>kmalloc(GFP_KERNEL)
> 
> But I think it reliably uses a stack buffer here
> 
> i915 I think Daniel said he audited.
> 
> amd_mn.. The actual invalidate_range_start does not allocate memory,
> but it is entangled with so many locks it would need careful analysis
> to be sure.
> 
> The others look generally OK, which is good, better than I hoped :)

It is on my TODO list to get rid of allocation in the notifier callback
(iirc nouveau already uses the stack, unless that was lost in all the
revisions it went through). Anyway, I do not think we need allocation
in notifiers.
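
For illustration, a rough sketch of the idea (the mydrv_* names are made
up, not real driver code): batch the invalidation through an on-stack
buffer so the callback never has to allocate.

#include <linux/kernel.h>
#include <linux/mmu_notifier.h>

#define MYDRV_INVAL_BATCH 16

static int mydrv_invalidate_range_start(struct mmu_notifier *mn,
                                        const struct mmu_notifier_range *range)
{
        u64 batch[MYDRV_INVAL_BATCH];   /* on-stack, no kmalloc() here */
        unsigned long addr = range->start;

        while (addr < range->end) {
                unsigned long n = min_t(unsigned long, MYDRV_INVAL_BATCH,
                                        (range->end - addr) >> PAGE_SHIFT);

                /* hypothetical helpers: encode and push one batch of
                 * invalidations to the hardware without allocating */
                mydrv_encode_invalidate(batch, addr, n);
                mydrv_push_to_hw(batch, n);
                addr += n << PAGE_SHIFT;
        }
        return 0;
}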

Cheers,
Jérôme


Re: [PATCH] mm/hmm: hmm_range_fault handle pages swapped out

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 08:52:56PM +, Yang, Philip wrote:
> hmm_range_fault may return NULL pages because some of pfns are equal to
> HMM_PFN_NONE. This happens randomly under memory pressure. The reason is
> for swapped out page pte path, hmm_vma_handle_pte doesn't update fault
> variable from cpu_flags, so it failed to call hmm_vam_do_fault to swap
> the page in.
> 
> The fix is to call hmm_pte_need_fault to update fault variable.
> 
> Change-Id: I2e8611485563d11d938881c18b7935fa1e7c91ee
> Signed-off-by: Philip Yang 

Reviewed-by: Jérôme Glisse 

> ---
>  mm/hmm.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 9f22562e2c43..7ca4fb39d3d8 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -544,6 +544,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, 
> unsigned long addr,
>   swp_entry_t entry = pte_to_swp_entry(pte);
>  
>   if (!non_swap_entry(entry)) {
> + cpu_flags = pte_to_hmm_pfn_flags(range, pte);
> + hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags,
> +&fault, &write_fault);
>   if (fault || write_fault)
>   goto fault;
>   return 0;
> -- 
> 2.17.1
> 

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 08:41:33PM +, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 04:33:06PM -0400, Jerome Glisse wrote:
> 
> > So nor HMM nor driver should dereference the struct page (i do not
> > think any iommu driver would either),
> 
> Er, they do technically deref the struct page:
> 
> nouveau_dmem_convert_pfn(struct nouveau_drm *drm,
>struct hmm_range *range)
>   struct page *page;
>   page = hmm_pfn_to_page(range, range->pfns[i]);
>   if (!nouveau_dmem_page(drm, page)) {
> 
> 
> nouveau_dmem_page(struct nouveau_drm *drm, struct page *page)
> {
>   return is_device_private_page(page) && drm->dmem == page_to_dmem(page)
> 
> 
> Which does touch 'page->pgmap'
> 
> Is this OK without having a get_dev_pagemap() ?
>
> Noting that the collision-retry scheme doesn't protect anything here
> as we can have a concurrent invalidation while doing the above deref.

Uh? How so? We are not reading the same code, I think.

My read is that this function is called while holding the device
lock, which excludes any racing mmu notifier from making
forward progress, and it is also protected by the range, so
at the time this happens it is safe to dereference the
struct page. In any case we can update nouveau_dmem_page()
to check that page->pgmap == the expected pgmap.
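
Something along these lines (just a sketch; the exact field that holds
the struct dev_pagemap in drm->dmem is an assumption here):

static inline bool nouveau_dmem_page(struct nouveau_drm *drm,
                                     struct page *page)
{
        /* must be a device private page *and* belong to our pagemap */
        return is_device_private_page(page) &&
               page->pgmap == &drm->dmem->pagemap; /* field name assumed */
}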

Cheers,
Jérôme

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 01:12:22PM -0700, Dan Williams wrote:
> On Thu, Aug 15, 2019 at 12:44 PM Jerome Glisse  wrote:
> >
> > On Thu, Aug 15, 2019 at 12:36:58PM -0700, Dan Williams wrote:
> > > On Thu, Aug 15, 2019 at 11:07 AM Jerome Glisse  wrote:
> > > >
> > > > On Wed, Aug 14, 2019 at 07:48:28AM -0700, Dan Williams wrote:
> > > > > On Wed, Aug 14, 2019 at 6:28 AM Jason Gunthorpe  
> > > > > wrote:
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 09:38:54AM +0200, Christoph Hellwig wrote:
> > > > > > > On Tue, Aug 13, 2019 at 06:36:33PM -0700, Dan Williams wrote:
> > > > > > > > Section alignment constraints somewhat save us here. The only 
> > > > > > > > example
> > > > > > > > I can think of a PMD not containing a uniform pgmap association 
> > > > > > > > for
> > > > > > > > each pte is the case when the pgmap overlaps normal dram, i.e. 
> > > > > > > > shares
> > > > > > > > the same 'struct memory_section' for a given span. Otherwise, 
> > > > > > > > distinct
> > > > > > > > pgmaps arrange to manage their own exclusive sections (and now
> > > > > > > > subsections as of v5.3). Otherwise the implementation could not
> > > > > > > > guarantee different mapping lifetimes.
> > > > > > > >
> > > > > > > > That said, this seems to want a better mechanism to determine 
> > > > > > > > "pfn is
> > > > > > > > ZONE_DEVICE".
> > > > > > >
> > > > > > > So I guess this patch is fine for now, and once you provide a 
> > > > > > > better
> > > > > > > mechanism we can switch over to it?
> > > > > >
> > > > > > What about the version I sent to just get rid of all the strange
> > > > > > put_dev_pagemaps while scanning? Odds are good we will work with 
> > > > > > only
> > > > > > a single pagemap, so it makes some sense to cache it once we find 
> > > > > > it?
> > > > >
> > > > > Yes, if the scan is over a single pmd then caching it makes sense.
> > > >
> > > > Quite frankly an easier an better solution is to remove the pagemap
> > > > lookup as HMM user abide by mmu notifier it means we will not make
> > > > use or dereference the struct page so that we are safe from any
> > > > racing hotunplug of dax memory (as long as device driver using hmm
> > > > do not have a bug).
> > >
> > > Yes, as long as the driver remove is synchronized against HMM
> > > operations via another mechanism then there is no need to take pagemap
> > > references. Can you briefly describe what that other mechanism is?
> >
> > So if you hotunplug some dax memory i assume that this can only
> > happens once all the pages are unmapped (as it must have the
> > zero refcount, well 1 because of the bias) and any unmap will
> > trigger a mmu notifier callback. User of hmm mirror abiding by
> > the API will never make use of information they get through the
> > fault or snapshot function until checking for racing notifier
> > under lock.
> 
> Hmm that first assumption is not guaranteed by the dev_pagemap core.
> The dev_pagemap end of life model is "disable, invalidate, drain" so
> it's possible to call devm_munmap_pages() while pages are still mapped
> it just won't complete the teardown of the pagemap until the last
> reference is dropped. New references are blocked during this teardown.
> 
> However, if the driver is validating the liveness of the mapping in
> the mmu-notifier path and blocking new references it sounds like it
> should be ok. Might there be GPU driver unit tests that cover this
> racing teardown case?

So neither HMM nor the driver should dereference the struct page (I do
not think any iommu driver would either), they only care about the pfn.
So even if we race with a teardown, as soon as we get the mmu notifier
callback to invalidate the mapping we will do so. The pattern is:

mydriver_populate_vaddr_range(start, end) {
    hmm_range_register(range, start, end)
again:
    ret = hmm_range_fault(start, end)
    if (ret < 0)
        return ret;

    take_driver_page_table_lock();
    if (range.valid) {
        populate_device_page_table();
        release_dr

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 12:36:58PM -0700, Dan Williams wrote:
> On Thu, Aug 15, 2019 at 11:07 AM Jerome Glisse  wrote:
> >
> > On Wed, Aug 14, 2019 at 07:48:28AM -0700, Dan Williams wrote:
> > > On Wed, Aug 14, 2019 at 6:28 AM Jason Gunthorpe  wrote:
> > > >
> > > > On Wed, Aug 14, 2019 at 09:38:54AM +0200, Christoph Hellwig wrote:
> > > > > On Tue, Aug 13, 2019 at 06:36:33PM -0700, Dan Williams wrote:
> > > > > > Section alignment constraints somewhat save us here. The only 
> > > > > > example
> > > > > > I can think of a PMD not containing a uniform pgmap association for
> > > > > > each pte is the case when the pgmap overlaps normal dram, i.e. 
> > > > > > shares
> > > > > > the same 'struct memory_section' for a given span. Otherwise, 
> > > > > > distinct
> > > > > > pgmaps arrange to manage their own exclusive sections (and now
> > > > > > subsections as of v5.3). Otherwise the implementation could not
> > > > > > guarantee different mapping lifetimes.
> > > > > >
> > > > > > That said, this seems to want a better mechanism to determine "pfn 
> > > > > > is
> > > > > > ZONE_DEVICE".
> > > > >
> > > > > So I guess this patch is fine for now, and once you provide a better
> > > > > mechanism we can switch over to it?
> > > >
> > > > What about the version I sent to just get rid of all the strange
> > > > put_dev_pagemaps while scanning? Odds are good we will work with only
> > > > a single pagemap, so it makes some sense to cache it once we find it?
> > >
> > > Yes, if the scan is over a single pmd then caching it makes sense.
> >
> > Quite frankly an easier an better solution is to remove the pagemap
> > lookup as HMM user abide by mmu notifier it means we will not make
> > use or dereference the struct page so that we are safe from any
> > racing hotunplug of dax memory (as long as device driver using hmm
> > do not have a bug).
> 
> Yes, as long as the driver remove is synchronized against HMM
> operations via another mechanism then there is no need to take pagemap
> references. Can you briefly describe what that other mechanism is?

So if you hotunplug some dax memory, I assume that this can only
happen once all the pages are unmapped (as they must have a zero
refcount, well 1 because of the bias), and any unmap will trigger
an mmu notifier callback. Users of hmm mirror abiding by the API
will never make use of information they get through the fault or
snapshot functions until checking for a racing notifier under
lock.

Cheers,
Jérôme

Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 03:01:59PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 01:39:22PM -0400, Jerome Glisse wrote:
> > On Thu, Aug 15, 2019 at 02:35:57PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote:
> > > 
> > > > I'm not really well versed in the details of our userptr, but both
> > > > amdgpu and i915 wait for the gpu to complete from
> > > > invalidate_range_start. Jerome has at least looked a lot at the amdgpu
> > > > one, so maybe he can explain what exactly it is we're doing ...
> > > 
> > > amdgpu is (wrongly) using hmm for something, I can't really tell what
> > > it is trying to do. The calls to dma_fence under the
> > > invalidate_range_start do not give me a good feeling.
> > > 
> > > However, i915 shows all the signs of trying to follow the registration
> > > cache model, it even has a nice comment in
> > > i915_gem_userptr_get_pages() explaining that the races it has don't
> > > matter because it is a user space bug to change the VA mapping in the
> > > first place. That just screams registration cache to me.
> > > 
> > > So it is fine to run HW that way, but if you do, there is no reason to
> > > fence inside the invalidate_range end. Just orphan the DMA buffer and
> > > clean it up & release the page pins when all DMA buffer refs go to
> > > zero. The next access to that VA should get a new DMA buffer with the
> > > right mapping.
> > > 
> > > In other words the invalidation should be very simple without
> > > complicated locking, or wait_event's. Look at hfi1 for example.
> > 
> > This would break the today usage model of uptr and it will
> > break userspace expectation ie if GPU is writting to that
> > memory and that memory then the userspace want to make sure
> > that it will see what the GPU write.
> 
> How exactly? This is holding the page pin, so the only way the VA
> mapping can be changed is via explicit user action.
> 
> ie:
> 
>gpu_write_something(va, size)
>mmap(.., va, size, MMAP_FIXED);
>gpu_wait_done()
> 
> This is racy and indeterminate with both models.
> 
> Based on the comment in i915 it appears to be going on the model that
> changes to the mmap by userspace when the GPU is working on it is a
> programming bug. This is reasonable, lots of systems use this kind of
> consistency model.

Well, a userspace process doing munmap(), mremap(), fork() and things like
that is a bug from the point of view of the i915 kernel/userspace contract.

But things like migration or reclaim are not covered by that contract,
and for those the expectation is that CPU access to the same virtual
address should return whatever was last written to it, either by the GPU
or the CPU.

> 
> Since the driver seems to rely on a dma_fence to block DMA access, it
> looks to me like the kernel has full visibility to the
> 'gpu_write_something()' and 'gpu_wait_done()' markers.

So let's only consider the case where the GPU wants to write to the memory
(the read-only case is obviously fine and in fact does not need any
notifier) and, like above, the only thing we care about is reclaim or
migration (for instance because of some NUMA compaction), as the rest is
covered by the i915 userspace contract.

So in the write case we do GUP-fast(write=true), which will be "safe" from
any concurrent CPU page table update, i.e. if GUP-fast gets a reference on
the page then any racing CPU page table update will not be able to migrate
or reclaim the page, and thus the virtual address to page association will
be preserved.

This is only true because of GUP-fast(); if the fast path fails it falls
back to the slow GUP case, which makes the same thing safe by taking
the page table lock.
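
In code that is just the usual GUP-fast call (a sketch with the gup_flags
variant of get_user_pages_fast(); names and error handling are
illustrative only, not the actual i915 code):

static int mydrv_pin_userptr(unsigned long userptr, int npages,
                             struct page **pages)
{
        int n;

        /* FOLL_WRITE: we need the pages writable; the fast path falls
         * back to slow GUP (taking the page table lock) as needed. */
        n = get_user_pages_fast(userptr, npages, FOLL_WRITE, pages);
        if (n < 0)
                return n;
        if (n < npages) {
                release_pages(pages, n);
                return -EFAULT;
        }
        return 0;
}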

Because of the reference on the page the i915 driver can forgo the mmu
notifier end callback. The thing here is that taking a page reference
is pointless if we have better synchronization and tracking of mmu
notifiers. Hence converting to hmm mirror allows avoiding taking a ref
on the page while still keeping the same functionality as today.


> I think trying to use hmm_range_fault on HW that can't do HW page
> faulting and HW 'TLB shootdown' is a very, very bad idea. I fear that
> is what amd gpu is trying to do.
> 
> I haven't yet seen anything that looks like 'TLB shootdown' in i915??

GPU drivers have complex usage patterns; the TLB shootdown is implicit.
Once the GEM object associated with the userptr is invalidated, the
next time userspace submits a command against that GEM object it will
have to be re-validated, which means re-programming the GPU page tables

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-15 Thread Jerome Glisse
On Wed, Aug 14, 2019 at 07:48:28AM -0700, Dan Williams wrote:
> On Wed, Aug 14, 2019 at 6:28 AM Jason Gunthorpe  wrote:
> >
> > On Wed, Aug 14, 2019 at 09:38:54AM +0200, Christoph Hellwig wrote:
> > > On Tue, Aug 13, 2019 at 06:36:33PM -0700, Dan Williams wrote:
> > > > Section alignment constraints somewhat save us here. The only example
> > > > I can think of a PMD not containing a uniform pgmap association for
> > > > each pte is the case when the pgmap overlaps normal dram, i.e. shares
> > > > the same 'struct memory_section' for a given span. Otherwise, distinct
> > > > pgmaps arrange to manage their own exclusive sections (and now
> > > > subsections as of v5.3). Otherwise the implementation could not
> > > > guarantee different mapping lifetimes.
> > > >
> > > > That said, this seems to want a better mechanism to determine "pfn is
> > > > ZONE_DEVICE".
> > >
> > > So I guess this patch is fine for now, and once you provide a better
> > > mechanism we can switch over to it?
> >
> > What about the version I sent to just get rid of all the strange
> > put_dev_pagemaps while scanning? Odds are good we will work with only
> > a single pagemap, so it makes some sense to cache it once we find it?
> 
> Yes, if the scan is over a single pmd then caching it makes sense.

Quite frankly an easier and better solution is to remove the pagemap
lookup: as HMM users abide by mmu notifiers, we will not make use of
or dereference the struct page, so we are safe from any racing
hotunplug of dax memory (as long as device drivers using hmm do not
have a bug).

Cheers,
Jérôme

Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 07:42:07PM +0200, Michal Hocko wrote:
> On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> > 
> > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > > __GFP_DIRECT_RECLAIM..
> > > >
> > > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > > allocations during OOM, then I think fs_reclaim already matches what
> > > > you described?
> > > 
> > > No GFP_NOFS is equally bad. Please read my other email explaining what
> > > the oom_reaper actually requires. In short no blocking on direct or
> > > indirect dependecy on memory allocation that might sleep.
> > 
> > It is much easier to follow with some hints on code, so the true
> > requirement is that the OOM repear not block on GFP_FS and GFP_IO
> > allocations, great, that constraint is now clear.
> 
> I still do not get why do you put FS/IO into the picture. This is really
> about __GFP_DIRECT_RECLAIM.
> 
> > 
> > > If you can express that in the existing lockdep machinery. All
> > > fine. But then consider deployments where lockdep is no-no because
> > > of the overhead.
> > 
> > This is all for driver debugging. The point of lockdep is to find all
> > these paths without have to hit them as actual races, using debug
> > kernels.
> > 
> > I don't think we need this kind of debugging on production kernels?
> 
> Again, the primary motivation was a simple debugging aid that could be
> used without worrying about overhead. So lockdep is very often out of
> the question.
> 
> > > > The best we got was drivers tested the VA range and returned success
> > > > if they had no interest. Which is a big win to be sure, but it looks
> > > > like getting any more is not really posssible.
> > > 
> > > And that is already a great win! Because many notifiers only do care
> > > about particular mappings. Please note that backing off unconditioanlly
> > > will simply cause that the oom reaper will have to back off not doing
> > > any tear down anything.
> > 
> > Well, I'm working to propose that we do the VA range test under core
> > mmu notifier code that cannot block and then we simply remove the idea
> > of blockable from drivers using this new 'range notifier'. 
> > 
> > I think this pretty much solves the concern?
> 
> Well, my idea was that a range check and early bail out was a first step
> and then each specific notifier would be able to do a more specific
> check. I was not able to do the second step because that requires a deep
> understanding of the respective subsystem.
> 
> Really all I do care about is to reclaim as much memory from the
> oom_reaper context as possible. And that cannot really be an unbounded
> process. Quite contrary it should be as swift as possible. From my
> cursory look some notifiers are able to achieve their task without
> blocking or depending on memory just fine. So bailing out
> unconditionally on the range of interest would just put us back.

Agree, OOM is just asking the question: can I unmap that page quickly,
so that it (OOM) can swap it out? In many cases the driver needs to
look something up to see whether the memory is simply not in use at
the time and can be reclaimed right away. So drivers should have a
path to optimistically update their state to allow quick reclaim.
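
A sketch of such a path (hypothetical driver state, just to show the
shape; the point is that the common case answers without blocking):

static int mydrv_invalidate_range_start(struct mmu_notifier *mn,
                                        const struct mmu_notifier_range *range)
{
        struct mydrv_mirror *m = container_of(mn, struct mydrv_mirror, mn);

        spin_lock(&m->lock);            /* short, non-sleeping lock */
        if (!mydrv_range_in_use(m, range->start, range->end)) {
                /* fast path for OOM: nothing maps this range on the GPU,
                 * just mark it invalid and let reclaim proceed */
                m->notifier_seq++;
                spin_unlock(&m->lock);
                return 0;
        }
        spin_unlock(&m->lock);

        if (!mmu_notifier_range_blockable(range))
                return -EAGAIN;         /* OOM path: refuse rather than sleep */

        return mydrv_invalidate_sleeping(m, range);
}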


> > > > However, we could (probably even should) make the drivers fs_reclaim
> > > > safe.
> > > > 
> > > > If that is enough to guarantee progress of OOM, then lets consider
> > > > something like using current_gfp_context() to force PF_MEMALLOC_NOFS
> > > > allocation behavior on the driver callback and lockdep to try and keep
> > > > pushing on the the debugging, and dropping !blocking.
> > > 
> > > How are you going to enforce indirect dependency? E.g. a lock that is
> > > also used in other context which depend on sleepable memory allocation
> > > to move forward.
> > 
> > You mean like this:
> > 
> >CPU0 CPU1
> > mutex_lock()
> > kmalloc(GFP_KERNEL)
> 
> no I mean __GFP_DIRECT_RECLAIM here.
> 
> > mutex_unlock()
> >   fs_reclaim_acquire()
> >   mutex_lock() <- illegal: lock dep assertion
> 
> I cannot really comment on how that is achieveable by lockdep. I managed
> to forget details about FS/IO reclaim recursion tracking already and I
> do not have time to learn it again. It was quite a hack. Anyway, let me
> repeat that the primary motivation was a simple aid. Not something as
> poverful as lockdep.

I feel that fs_reclaim_acquire() is just too heavyweight here. I do
think that Daniel's patches improve the debugging situation without
burdening anything, so I am in favor of merging them.

I do not think we should devote too much time to fs_reclaim(); our
time would be better spent improving the 

Re: [Nouveau] [PATCH] nouveau/hmm: map pages after migration

2019-08-15 Thread Jerome Glisse
On Tue, Aug 13, 2019 at 05:58:52PM -0400, Jerome Glisse wrote:
> On Wed, Aug 07, 2019 at 08:02:14AM -0700, Ralph Campbell wrote:
> > When memory is migrated to the GPU it is likely to be accessed by GPU
> > code soon afterwards. Instead of waiting for a GPU fault, map the
> > migrated memory into the GPU page tables with the same access permissions
> > as the source CPU page table entries. This preserves copy on write
> > semantics.
> > 
> > Signed-off-by: Ralph Campbell 
> > Cc: Christoph Hellwig 
> > Cc: Jason Gunthorpe 
> > Cc: "Jérôme Glisse" 
> > Cc: Ben Skeggs 
> 
> Sorry for delay i am swamp, couple issues:
> - nouveau_pfns_map() is never call, it should be call after
>   the dma copy is done (iirc it is lacking proper fencing
>   so that would need to be implemented first)
> 
> - the migrate ioctl is disconnected from the svm part and
>   thus we would need first to implement svm reference counting
>   and take a reference at begining of migration process and
>   release it at the end ie struct nouveau_svmm needs refcounting
>   of some sort. I let Ben decides what he likes best for that.

Thinking more about that, the svm lifetime is bound to the file
descriptor of the device driver file held by the process. So
when you call the migrate ioctl the svm should not go away, because
you are in an ioctl against that file descriptor. I need to double
check all that with respect to processes that open the device file
multiple times with different file descriptors (or fork and the
like).
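
If explicit refcounting does turn out to be needed, a minimal sketch of
what it could look like (the ref field and helpers below are made up,
not existing nouveau code):

struct nouveau_svmm {
        /* ... existing fields ... */
        struct kref ref;        /* hypothetical: pin svmm across migration */
};

static struct nouveau_svmm *nouveau_svmm_get(struct nouveau_svmm *svmm)
{
        kref_get(&svmm->ref);
        return svmm;
}

static void nouveau_svmm_release(struct kref *ref)
{
        struct nouveau_svmm *svmm = container_of(ref, struct nouveau_svmm, ref);

        /* tear down mirror state and free svmm here */
}

static void nouveau_svmm_put(struct nouveau_svmm *svmm)
{
        kref_put(&svmm->ref, nouveau_svmm_release);
}

The migrate ioctl would then take a reference at the beginning of the
migration and drop it at the end, independently of the file descriptor
lifetime.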


> - i rather not have an extra pfns array, i am pretty sure we
>   can directly feed what we get from the dma array to the svm
>   code to update the GPU page table
> 
> Observation that can be delayed to latter patches:
> - i do not think we want to map directly if the dma engine
>   is queue up and thus if the fence will take time to signal
> 
>   This is why i did not implement this in the first place.
>   Maybe using a workqueue to schedule a pre-fill of the GPU
>   page table and wakeup the workqueue with the fence notify
>   event.
> 
> 
> > ---
> > 
> > This patch is based on top of Christoph Hellwig's 9 patch series
> > https://lore.kernel.org/linux-mm/20190729234611.gc7...@redhat.com/T/#u
> > "turn the hmm migrate_vma upside down" but without patch 9
> > "mm: remove the unused MIGRATE_PFN_WRITE" and adds a use for the flag.
> > 
> > 
> >  drivers/gpu/drm/nouveau/nouveau_dmem.c | 45 +-
> >  drivers/gpu/drm/nouveau/nouveau_svm.c  | 86 ++
> >  drivers/gpu/drm/nouveau/nouveau_svm.h  | 19 ++
> >  3 files changed, 133 insertions(+), 17 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
> > b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> > index ef9de82b0744..c83e6f118817 100644
> > --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> > +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> > @@ -25,11 +25,13 @@
> >  #include "nouveau_dma.h"
> >  #include "nouveau_mem.h"
> >  #include "nouveau_bo.h"
> > +#include "nouveau_svm.h"
> >  
> >  #include 
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -560,11 +562,12 @@ nouveau_dmem_init(struct nouveau_drm *drm)
> >  }
> >  
> >  static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
> > -   struct vm_area_struct *vma, unsigned long addr,
> > -   unsigned long src, dma_addr_t *dma_addr)
> > +   struct vm_area_struct *vma, unsigned long src,
> > +   dma_addr_t *dma_addr, u64 *pfn)
> >  {
> > struct device *dev = drm->dev->dev;
> > struct page *dpage, *spage;
> > +   unsigned long paddr;
> >  
> > spage = migrate_pfn_to_page(src);
> > if (!spage || !(src & MIGRATE_PFN_MIGRATE))
> > @@ -572,17 +575,21 @@ static unsigned long 
> > nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
> >  
> > dpage = nouveau_dmem_page_alloc_locked(drm);
> > if (!dpage)
> > -   return 0;
> > +   goto out;
> >  
> > *dma_addr = dma_map_page(dev, spage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
> > if (dma_mapping_error(dev, *dma_addr))
> > goto out_free_page;
> >  
> > +   paddr = nouveau_dmem_page_addr(dpage);
> > if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_VRAM,
> > -   nouveau_dmem_page_addr(d

Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 02:35:57PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote:
> 
> > I'm not really well versed in the details of our userptr, but both
> > amdgpu and i915 wait for the gpu to complete from
> > invalidate_range_start. Jerome has at least looked a lot at the amdgpu
> > one, so maybe he can explain what exactly it is we're doing ...
> 
> amdgpu is (wrongly) using hmm for something, I can't really tell what
> it is trying to do. The calls to dma_fence under the
> invalidate_range_start do not give me a good feeling.
> 
> However, i915 shows all the signs of trying to follow the registration
> cache model, it even has a nice comment in
> i915_gem_userptr_get_pages() explaining that the races it has don't
> matter because it is a user space bug to change the VA mapping in the
> first place. That just screams registration cache to me.
> 
> So it is fine to run HW that way, but if you do, there is no reason to
> fence inside the invalidate_range end. Just orphan the DMA buffer and
> clean it up & release the page pins when all DMA buffer refs go to
> zero. The next access to that VA should get a new DMA buffer with the
> right mapping.
> 
> In other words the invalidation should be very simple without
> complicated locking, or wait_event's. Look at hfi1 for example.

This would break today's usage model of userptr and it would
break userspace expectations, i.e. if the GPU is writing to
that memory then userspace wants to make sure that it will
see what the GPU wrote.

Yes, i915 is broken with respect to not having an end notifier
and not tracking active invalidations for a range, but the GUP
side of things kind of hides this bug and shrinks the window
for something bad to happen to something so small that I doubt
anyone could ever hit it (still a bug though).
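
For completeness, a sketch of what tracking active invalidations with a
start/end pair means in practice (hypothetical driver code, essentially
the pattern hmm mirror implements for you):

static int mydrv_invalidate_range_start(struct mmu_notifier *mn,
                                        const struct mmu_notifier_range *range)
{
        struct mydrv_mirror *m = container_of(mn, struct mydrv_mirror, mn);

        spin_lock(&m->lock);
        m->active_invalidates++;        /* range is unstable from now on */
        m->seq++;                       /* concurrent faults must retry */
        spin_unlock(&m->lock);
        return 0;
}

static void mydrv_invalidate_range_end(struct mmu_notifier *mn,
                                       const struct mmu_notifier_range *range)
{
        struct mydrv_mirror *m = container_of(mn, struct mydrv_mirror, mn);

        spin_lock(&m->lock);
        if (!--m->active_invalidates)
                wake_up_all(&m->wq);    /* faults waiting for stability */
        spin_unlock(&m->lock);
}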

Cheers,
Jérôme

Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 07:21:47PM +0200, Daniel Vetter wrote:
> On Thu, Aug 15, 2019 at 7:16 PM Jason Gunthorpe  wrote:
> >
> > On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote:
> > > On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> > > >
> > > > > You have to wait for the gpu to finnish current processing in
> > > > > invalidate_range_start. Otherwise there's no point to any of this
> > > > > really. So the wait_event/dma_fence_wait are unavoidable really.
> > > >
> > > > I don't envy your task :|
> > > >
> > > > But, what you describe sure sounds like a 'registration cache' model,
> > > > not the 'shadow pte' model of coherency.
> > > >
> > > > The key difference is that a regirstationcache is allowed to become
> > > > incoherent with the VMA's because it holds page pins. It is a
> > > > programming bug in userspace to change VA mappings via mmap/munmap/etc
> > > > while the device is working on that VA, but it does not harm system
> > > > integrity because of the page pin.
> > > >
> > > > The cache ensures that each initiated operation sees a DMA setup that
> > > > matches the current VA map when the operation is initiated and allows
> > > > expensive device DMA setups to be re-used.
> > > >
> > > > A 'shadow pte' model (ie hmm) *really* needs device support to
> > > > directly block DMA access - ie trigger 'device page fault'. ie the
> > > > invalidate_start should inform the device to enter a fault mode and
> > > > that is it.  If the device can't do that, then the driver probably
> > > > shouldn't persue this level of coherency. The driver would quickly get
> > > > into the messy locking problems like dma_fence_wait from a notifier.
> > >
> > > I think here we do not agree on the hardware requirement. For GPU
> > > we will always need to be able to wait for some GPU fence from inside
> > > the notifier callback, there is just no way around that for many of
> > > the GPUs today (i do not see any indication of that changing).
> >
> > I didn't say you couldn't wait, I was trying to say that the wait
> > should only be contigent on the HW itself. Ie you can wait on a GPU
> > page table lock, and you can wait on a GPU page table flush completion
> > via IRQ.
> >
> > What is troubling is to wait till some other thread gets a GPU command
> > completion and decr's a kref on the DMA buffer - which kinda looks
> > like what this dma_fence() stuff is all about. A driver like that
> > would have to be super careful to ensure consistent forward progress
> > toward dma ref == 0 when the system is under reclaim.
> >
> > ie by running it's entire IRQ flow under fs_reclaim locking.
> 
> This is correct. At least for i915 it's already a required due to our
> shrinker also having to do the same. I think amdgpu isn't bothering
> with that since they have vram for most of the stuff, and just limit
> system memory usage to half of all and forgo the shrinker. Probably
> not the nicest approach. Anyway, both do the same mmu_notifier dance,
> just want to explain that we've been living with this for longer
> already.
> 
> So yeah writing a gpu driver is not easy.
> 
> > > associated with the mm_struct. In all GPU driver so far it is a short
> > > lived lock and nothing blocking is done while holding it (it is just
> > > about updating page table directory really wether it is filling it or
> > > clearing it).
> >
> > The main blocking I expect in a shadow PTE flow is waiting for the HW
> > to complete invalidations of its PTE cache.
> >
> > > > It is important to identify what model you are going for as defining a
> > > > 'registration cache' coherence expectation allows the driver to skip
> > > > blocking in invalidate_range_start. All it does is invalidate the
> > > > cache so that future operations pick up the new VA mapping.
> > > >
> > > > Intel's HFI RDMA driver uses this model extensively, and I think it is
> > > > well proven, within some limitations of course.
> > > >
> > > > At least, 'registration cache' is the only use model I know of where
> > > > it is acceptable to skip invalidate_range_end.
> &

Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 01:56:31PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> 
> > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > __GFP_DIRECT_RECLAIM..
> > >
> > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > allocations during OOM, then I think fs_reclaim already matches what
> > > you described?
> > 
> > No GFP_NOFS is equally bad. Please read my other email explaining what
> > the oom_reaper actually requires. In short no blocking on direct or
> > indirect dependecy on memory allocation that might sleep.
> 
> It is much easier to follow with some hints on code, so the true
> requirement is that the OOM repear not block on GFP_FS and GFP_IO
> allocations, great, that constraint is now clear.
> 
> > If you can express that in the existing lockdep machinery. All
> > fine. But then consider deployments where lockdep is no-no because
> > of the overhead.
> 
> This is all for driver debugging. The point of lockdep is to find all
> these paths without have to hit them as actual races, using debug
> kernels.
> 
> I don't think we need this kind of debugging on production kernels?
> 
> > > The best we got was drivers tested the VA range and returned success
> > > if they had no interest. Which is a big win to be sure, but it looks
> > > like getting any more is not really posssible.
> > 
> > And that is already a great win! Because many notifiers only do care
> > about particular mappings. Please note that backing off unconditioanlly
> > will simply cause that the oom reaper will have to back off not doing
> > any tear down anything.
> 
> Well, I'm working to propose that we do the VA range test under core
> mmu notifier code that cannot block and then we simply remove the idea
> of blockable from drivers using this new 'range notifier'. 
> 
> I think this pretty much solves the concern?

I am not sure I follow what you propose here. Like I pointed out in
another email, for GPUs we do need to be able to sleep (we might get
lucky and not need to, but that is a runtime thing) within the notifier
range_start callback. This has been allowed by the notifiers since
they were introduced in the kernel.

Cheers,
Jérôme

Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> 
> > You have to wait for the gpu to finnish current processing in
> > invalidate_range_start. Otherwise there's no point to any of this
> > really. So the wait_event/dma_fence_wait are unavoidable really.
> 
> I don't envy your task :|
> 
> But, what you describe sure sounds like a 'registration cache' model,
> not the 'shadow pte' model of coherency.
> 
> The key difference is that a regirstationcache is allowed to become
> incoherent with the VMA's because it holds page pins. It is a
> programming bug in userspace to change VA mappings via mmap/munmap/etc
> while the device is working on that VA, but it does not harm system
> integrity because of the page pin.
> 
> The cache ensures that each initiated operation sees a DMA setup that
> matches the current VA map when the operation is initiated and allows
> expensive device DMA setups to be re-used.
> 
> A 'shadow pte' model (ie hmm) *really* needs device support to
> directly block DMA access - ie trigger 'device page fault'. ie the
> invalidate_start should inform the device to enter a fault mode and
> that is it.  If the device can't do that, then the driver probably
> shouldn't persue this level of coherency. The driver would quickly get
> into the messy locking problems like dma_fence_wait from a notifier.

I think here we do not agree on the hardware requirement. For GPUs
we will always need to be able to wait for some GPU fence from inside
the notifier callback; there is just no way around that for many of
the GPUs today (I do not see any indication of that changing).

Drivers should avoid lock complexity by using a wait queue so that the
driver notifier callback can wait without having to hold some driver
lock. However there will be at least one lock needed to update the
internal driver state for the range being invalidated. That lock is
just the device driver page table lock for the GPU page table
associated with the mm_struct. In all GPU drivers so far it is a
short-lived lock and nothing blocking is done while holding it (it is
really just about updating the page table directory, whether filling
it or clearing it).
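
Roughly like this (a sketch with invented names, not any specific
driver): clear the GPU page table under the short lock, then sleep
outside it until the hardware signals the flush fence.

static int mydrv_invalidate_range_start(struct mmu_notifier *mn,
                                        const struct mmu_notifier_range *range)
{
        struct mydrv_svmm *svmm = container_of(mn, struct mydrv_svmm, notifier);
        u64 seq;

        /* short-lived lock: only clear the GPU page table entries */
        spin_lock(&svmm->pt_lock);
        mydrv_gpu_pt_clear(svmm, range->start, range->end);
        seq = mydrv_gpu_queue_tlb_flush(svmm);  /* async, returns fence seq */
        spin_unlock(&svmm->pt_lock);

        /* sleep outside the lock until the GPU signals the flush fence */
        wait_event(svmm->fence_wq, mydrv_fence_done(svmm, seq));
        return 0;
}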


> 
> It is important to identify what model you are going for as defining a
> 'registration cache' coherence expectation allows the driver to skip
> blocking in invalidate_range_start. All it does is invalidate the
> cache so that future operations pick up the new VA mapping.
> 
> Intel's HFI RDMA driver uses this model extensively, and I think it is
> well proven, within some limitations of course.
> 
> At least, 'registration cache' is the only use model I know of where
> it is acceptable to skip invalidate_range_end.

Here GPUs are not in the registration cache model. I know it might look
like it because of GUP, but GUP was used just because hmm did not exist
at the time.

Cheers,
Jérôme

Re: [PATCH] nouveau/hmm: map pages after migration

2019-08-13 Thread Jerome Glisse
On Wed, Aug 07, 2019 at 08:02:14AM -0700, Ralph Campbell wrote:
> When memory is migrated to the GPU it is likely to be accessed by GPU
> code soon afterwards. Instead of waiting for a GPU fault, map the
> migrated memory into the GPU page tables with the same access permissions
> as the source CPU page table entries. This preserves copy on write
> semantics.
> 
> Signed-off-by: Ralph Campbell 
> Cc: Christoph Hellwig 
> Cc: Jason Gunthorpe 
> Cc: "Jérôme Glisse" 
> Cc: Ben Skeggs 

Sorry for the delay, I am swamped. A couple of issues:
- nouveau_pfns_map() is never called; it should be called after
  the dma copy is done (iirc it is lacking proper fencing,
  so that would need to be implemented first)

- the migrate ioctl is disconnected from the svm part and
  thus we would first need to implement svm reference counting
  and take a reference at the beginning of the migration process
  and release it at the end, i.e. struct nouveau_svmm needs
  refcounting of some sort. I let Ben decide what he likes best
  for that.

- I would rather not have an extra pfns array; I am pretty sure we
  can directly feed what we get from the dma array to the svm
  code to update the GPU page table

Observations that can be delayed to later patches:
- I do not think we want to map directly if the dma engine
  is queued up and thus the fence will take time to signal

  This is why I did not implement this in the first place.
  Maybe use a workqueue to schedule a pre-fill of the GPU
  page table and wake up the workqueue with the fence notify
  event (see the sketch below).
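
A rough sketch of that workqueue idea (all names below are invented for
illustration; the real nouveau fence API differs):

struct mydrv_prefill {
        struct work_struct work;
        struct dma_fence_cb cb;
        struct mydrv_svmm *svmm;        /* hypothetical */
        unsigned long start;
        unsigned long npages;
        u64 *pfns;
};

static void mydrv_prefill_work(struct work_struct *work)
{
        struct mydrv_prefill *p = container_of(work, struct mydrv_prefill, work);

        /* fill the GPU page table now that the copy has completed */
        mydrv_gpu_pt_fill(p->svmm, p->start, p->npages, p->pfns);
        kfree(p);
}

static void mydrv_copy_fence_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
{
        struct mydrv_prefill *p = container_of(cb, struct mydrv_prefill, cb);

        /* fence callbacks run in atomic context, defer to a worker */
        queue_work(system_unbound_wq, &p->work);
}

/* after queuing the dma copy: */
INIT_WORK(&p->work, mydrv_prefill_work);
if (dma_fence_add_callback(fence, &p->cb, mydrv_copy_fence_cb))
        queue_work(system_unbound_wq, &p->work); /* fence already signaled */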


> ---
> 
> This patch is based on top of Christoph Hellwig's 9 patch series
> https://lore.kernel.org/linux-mm/20190729234611.gc7...@redhat.com/T/#u
> "turn the hmm migrate_vma upside down" but without patch 9
> "mm: remove the unused MIGRATE_PFN_WRITE" and adds a use for the flag.
> 
> 
>  drivers/gpu/drm/nouveau/nouveau_dmem.c | 45 +-
>  drivers/gpu/drm/nouveau/nouveau_svm.c  | 86 ++
>  drivers/gpu/drm/nouveau/nouveau_svm.h  | 19 ++
>  3 files changed, 133 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
> b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index ef9de82b0744..c83e6f118817 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -25,11 +25,13 @@
>  #include "nouveau_dma.h"
>  #include "nouveau_mem.h"
>  #include "nouveau_bo.h"
> +#include "nouveau_svm.h"
>  
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -560,11 +562,12 @@ nouveau_dmem_init(struct nouveau_drm *drm)
>  }
>  
>  static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
> - struct vm_area_struct *vma, unsigned long addr,
> - unsigned long src, dma_addr_t *dma_addr)
> + struct vm_area_struct *vma, unsigned long src,
> + dma_addr_t *dma_addr, u64 *pfn)
>  {
>   struct device *dev = drm->dev->dev;
>   struct page *dpage, *spage;
> + unsigned long paddr;
>  
>   spage = migrate_pfn_to_page(src);
>   if (!spage || !(src & MIGRATE_PFN_MIGRATE))
> @@ -572,17 +575,21 @@ static unsigned long 
> nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>  
>   dpage = nouveau_dmem_page_alloc_locked(drm);
>   if (!dpage)
> - return 0;
> + goto out;
>  
>   *dma_addr = dma_map_page(dev, spage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
>   if (dma_mapping_error(dev, *dma_addr))
>   goto out_free_page;
>  
> + paddr = nouveau_dmem_page_addr(dpage);
>   if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_VRAM,
> - nouveau_dmem_page_addr(dpage), NOUVEAU_APER_HOST,
> - *dma_addr))
> + paddr, NOUVEAU_APER_HOST, *dma_addr))
>   goto out_dma_unmap;
>  
> + *pfn = NVIF_VMM_PFNMAP_V0_V | NVIF_VMM_PFNMAP_V0_VRAM |
> + ((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
> + if (src & MIGRATE_PFN_WRITE)
> + *pfn |= NVIF_VMM_PFNMAP_V0_W;
>   return migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
>  
>  out_dma_unmap:
> @@ -590,18 +597,19 @@ static unsigned long 
> nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>  out_free_page:
>   nouveau_dmem_page_free_locked(drm, dpage);
>  out:
> + *pfn = NVIF_VMM_PFNMAP_V0_NONE;
>   return 0;
>  }
>  
>  static void nouveau_dmem_migrate_chunk(struct migrate_vma *args,
> - struct nouveau_drm *drm, dma_addr_t *dma_addrs)
> + struct nouveau_drm *drm, dma_addr_t *dma_addrs, u64 *pfns)
>  {
>   struct nouveau_fence *fence;
>   unsigned long addr = args->start, nr_dma = 0, i;
>  
>   for (i = 0; addr < args->end; i++) {
>   args->dst[i] = nouveau_dmem_migrate_copy_one(drm, args->vma,
> - addr, args->src[i], &dma_addrs[nr_dma]);
> + 

Re: [PATCH 9/9] mm: remove the MIGRATE_PFN_WRITE flag

2019-07-30 Thread Jerome Glisse
On Tue, Jul 30, 2019 at 07:46:33AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 29, 2019 at 07:30:44PM -0400, Jerome Glisse wrote:
> > On Mon, Jul 29, 2019 at 05:28:43PM +0300, Christoph Hellwig wrote:
> > > The MIGRATE_PFN_WRITE is only used locally in migrate_vma_collect_pmd,
> > > where it can be replaced with a simple boolean local variable.
> > > 
> > > Signed-off-by: Christoph Hellwig 
> > 
> > NAK that flag is useful, for instance a anonymous vma might have
> > some of its page read only even if the vma has write permission.
> > 
> > It seems that the code in nouveau is wrong (probably lost that
> > in various rebase/rework) as this flag should be use to decide
> > wether to map the device memory with write permission or not.
> > 
> > I am traveling right now, i will investigate what happened to
> > nouveau code.
> 
> We can add it back when needed pretty easily.  Much of this has bitrotted
> way to fast, and the pending ppc kvmhmm code doesn't need it either.

Not using it is a serious bug; I will investigate this Friday.

Cheers,
Jérôme

Re: [PATCH 9/9] mm: remove the MIGRATE_PFN_WRITE flag

2019-07-29 Thread Jerome Glisse
On Mon, Jul 29, 2019 at 04:42:01PM -0700, Ralph Campbell wrote:
> 
> On 7/29/19 7:28 AM, Christoph Hellwig wrote:
> > The MIGRATE_PFN_WRITE is only used locally in migrate_vma_collect_pmd,
> > where it can be replaced with a simple boolean local variable.
> > 
> > Signed-off-by: Christoph Hellwig 
> 
> Reviewed-by: Ralph Campbell 
> 
> > ---
> >   include/linux/migrate.h | 1 -
> >   mm/migrate.c| 9 +
> >   2 files changed, 5 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index 8b46cfdb1a0e..ba74ef5a7702 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -165,7 +165,6 @@ static inline int 
> > migrate_misplaced_transhuge_page(struct mm_struct *mm,
> >   #define MIGRATE_PFN_VALID (1UL << 0)
> >   #define MIGRATE_PFN_MIGRATE   (1UL << 1)
> >   #define MIGRATE_PFN_LOCKED(1UL << 2)
> > -#define MIGRATE_PFN_WRITE  (1UL << 3)
> >   #define MIGRATE_PFN_SHIFT 6
> >   static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 74735256e260..724f92dcc31b 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -2212,6 +2212,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > unsigned long mpfn, pfn;
> > struct page *page;
> > swp_entry_t entry;
> > +   bool writable = false;
> > pte_t pte;
> > pte = *ptep;
> > @@ -2240,7 +2241,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > mpfn = migrate_pfn(page_to_pfn(page)) |
> > MIGRATE_PFN_MIGRATE;
> > if (is_write_device_private_entry(entry))
> > -   mpfn |= MIGRATE_PFN_WRITE;
> > +   writable = true;
> > } else {
> > if (is_zero_pfn(pfn)) {
> > mpfn = MIGRATE_PFN_MIGRATE;
> > @@ -2250,7 +2251,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > }
> > page = vm_normal_page(migrate->vma, addr, pte);
> > mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
> > -   mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
> > +   if (pte_write(pte))
> > +   writable = true;
> > }
> > /* FIXME support THP */
> > @@ -2284,8 +2286,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > ptep_get_and_clear(mm, addr, ptep);
> > /* Setup special migration page table entry */
> > -   entry = make_migration_entry(page, mpfn &
> > -MIGRATE_PFN_WRITE);
> > +   entry = make_migration_entry(page, writable);
> > swp_pte = swp_entry_to_pte(entry);
> > if (pte_soft_dirty(pte))
> > swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > 
> 
> MIGRATE_PFN_WRITE may mot being used but that seems like a bug to me.
> If a page is migrated to device memory, it could be mapped at the same
> time to avoid a device page fault but it would need the flag to know
> whether to map it RW or RO. But I suppose that could be inferred from
> the vma->vm_flags.

It is a bug that it is not being used right now. I will have to dig through
my git repo to see when that got killed. I will look into it once I get back.

The vma->vm_flags is of no use here. A page can be write-protected
inside a writable vma for various reasons.
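
To make the point concrete, below is a minimal sketch of the driver code
that needs the per-pte write bit; drv_gpu_map(), the DRV_MAP_* flags and
struct drv_dev are made-up placeholders, only MIGRATE_PFN_WRITE comes from
the existing migrate.h:

    /*
     * Sketch only: map the freshly migrated device page with the same
     * write permission the CPU pte had.  vma->vm_flags cannot tell us
     * this; only the per-entry flag collected at migrate time can.
     */
    static void drv_map_migrated_page(struct drv_dev *dev, unsigned long addr,
                                      unsigned long src_mpfn, unsigned long dst_pfn)
    {
            bool writable = src_mpfn & MIGRATE_PFN_WRITE;   /* per-pte bit */

            /* drv_gpu_map() and DRV_MAP_* are hypothetical helpers. */
            drv_gpu_map(dev, addr, dst_pfn, writable ? DRV_MAP_RW : DRV_MAP_RO);
    }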

Cheers,
Jérôme

Re: [PATCH 1/9] mm: turn migrate_vma upside down

2019-07-29 Thread Jerome Glisse
On Mon, Jul 29, 2019 at 05:28:35PM +0300, Christoph Hellwig wrote:
> There isn't any good reason to pass callbacks to migrate_vma.  Instead
> we can just export the three steps done by this function to drivers and
> let them sequence the operation without callbacks.  This removes a lot
> of boilerplate code as-is, and will allow the drivers to drastically
> improve code flow and error handling further on.
> 
> Signed-off-by: Christoph Hellwig 


I haven't finished the review, especially the nouveau code; I will look
into this once I get back. In the meantime, below are a few corrections.

> ---
>  Documentation/vm/hmm.rst   |  55 +-
>  drivers/gpu/drm/nouveau/nouveau_dmem.c | 122 +++--
>  include/linux/migrate.h| 118 ++--
>  mm/migrate.c   | 242 +++--
>  4 files changed, 193 insertions(+), 344 deletions(-)
> 

[...]

> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8992741f10aa..dc4e60a496f2 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2118,16 +2118,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct 
> *mm,
>  #endif /* CONFIG_NUMA */
>  
>  #if defined(CONFIG_MIGRATE_VMA_HELPER)
> -struct migrate_vma {
> - struct vm_area_struct   *vma;
> - unsigned long   *dst;
> - unsigned long   *src;
> - unsigned long   cpages;
> - unsigned long   npages;
> - unsigned long   start;
> - unsigned long   end;
> -};
> -
>  static int migrate_vma_collect_hole(unsigned long start,
>   unsigned long end,
>   struct mm_walk *walk)
> @@ -2578,6 +2568,108 @@ static void migrate_vma_unmap(struct migrate_vma 
> *migrate)
>   }
>  }
>  
> +/**
> + * migrate_vma_setup() - prepare to migrate a range of memory
> + * @args: contains the vma, start, and and pfns arrays for the migration
> + *
> + * Returns: negative errno on failures, 0 when 0 or more pages were migrated
> + * without an error.
> + *
> + * Prepare to migrate a range of memory virtual address range by collecting 
> all
> + * the pages backing each virtual address in the range, saving them inside 
> the
> + * src array.  Then lock those pages and unmap them. Once the pages are 
> locked
> + * and unmapped, check whether each page is pinned or not.  Pages that aren't
> + * pinned have the MIGRATE_PFN_MIGRATE flag set (by this function) in the
> + * corresponding src array entry.  Then restores any pages that are pinned, 
> by
> + * remapping and unlocking those pages.
> + *
> + * The caller should then allocate destination memory and copy source memory 
> to
> + * it for all those entries (ie with MIGRATE_PFN_VALID and 
> MIGRATE_PFN_MIGRATE
> + * flag set).  Once these are allocated and copied, the caller must update 
> each
> + * corresponding entry in the dst array with the pfn value of the destination
> + * page and with the MIGRATE_PFN_VALID and MIGRATE_PFN_LOCKED flags set
> + * (destination pages must have their struct pages locked, via lock_page()).
> + *
> + * Note that the caller does not have to migrate all the pages that are 
> marked
> + * with MIGRATE_PFN_MIGRATE flag in src array unless this is a migration from
> + * device memory to system memory.  If the caller cannot migrate a device 
> page
> + * back to system memory, then it must return VM_FAULT_SIGBUS, which will
> + * might have severe consequences for the userspace process, so it should 
> best

   ^s/might//  ^s/should best/must/

> + * be avoided if possible.
 ^s/if possible//

Maybe add something about failing only on an unrecoverable device error.
The only reason we allow migration failure here is because GPU devices can
go into a bad state (GPU lockup), and when that happens the GPU memory
might be corrupted (power to the GPU memory might be cut by the GPU driver
to recover the GPU).

So failing migration back to main memory is only a last-resort event.
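
For what it is worth, the driver-side flow I picture from the kernel-doc
above looks roughly like the sketch below; the migrate_vma_pages() and
migrate_vma_finalize() names are taken from the rest of the series, and
drv_alloc_copy_and_lock() is a made-up placeholder:

    /* Rough sketch only, error handling omitted. */
    static int drv_migrate_range(struct vm_area_struct *vma,
                                 unsigned long start, unsigned long end)
    {
            /* src[]/dst[] must cover (end - start) >> PAGE_SHIFT entries. */
            unsigned long src[64], dst[64];
            struct migrate_vma args = {
                    .vma   = vma,
                    .start = start,
                    .end   = end,
                    .src   = src,
                    .dst   = dst,
            };
            int ret;

            ret = migrate_vma_setup(&args); /* collect, lock and unmap sources */
            if (ret)
                    return ret;

            /*
             * Allocate device pages and copy each MIGRATE_PFN_MIGRATE entry,
             * filling dst[] with MIGRATE_PFN_VALID | MIGRATE_PFN_LOCKED pfns.
             * drv_alloc_copy_and_lock() is a hypothetical helper.
             */
            drv_alloc_copy_and_lock(&args);

            migrate_vma_pages(&args);       /* switch the CPU page table over */
            migrate_vma_finalize(&args);    /* unlock and release the pages */
            return 0;
    }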


> + *
> + * For empty entries inside CPU page table (pte_none() or pmd_none() is 
> true) we
> + * do set MIGRATE_PFN_MIGRATE flag inside the corresponding source array thus
> + * allowing the caller to allocate device memory for those unback virtual
> + * address.  For this the caller simply havs to allocate device memory and
   ^ haves


Re: [PATCH 9/9] mm: remove the MIGRATE_PFN_WRITE flag

2019-07-29 Thread Jerome Glisse
On Mon, Jul 29, 2019 at 05:28:43PM +0300, Christoph Hellwig wrote:
> The MIGRATE_PFN_WRITE is only used locally in migrate_vma_collect_pmd,
> where it can be replaced with a simple boolean local variable.
> 
> Signed-off-by: Christoph Hellwig 

NAK, that flag is useful; for instance, an anonymous vma might have
some of its pages read-only even if the vma has write permission.

It seems that the code in nouveau is wrong (probably lost that
in various rebases/reworks), as this flag should be used to decide
whether to map the device memory with write permission or not.

I am traveling right now, I will investigate what happened to the
nouveau code.

Cheers,
Jérôme

> ---
>  include/linux/migrate.h | 1 -
>  mm/migrate.c| 9 +
>  2 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 8b46cfdb1a0e..ba74ef5a7702 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -165,7 +165,6 @@ static inline int migrate_misplaced_transhuge_page(struct 
> mm_struct *mm,
>  #define MIGRATE_PFN_VALID(1UL << 0)
>  #define MIGRATE_PFN_MIGRATE  (1UL << 1)
>  #define MIGRATE_PFN_LOCKED   (1UL << 2)
> -#define MIGRATE_PFN_WRITE(1UL << 3)
>  #define MIGRATE_PFN_SHIFT6
>  
>  static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 74735256e260..724f92dcc31b 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2212,6 +2212,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   unsigned long mpfn, pfn;
>   struct page *page;
>   swp_entry_t entry;
> + bool writable = false;
>   pte_t pte;
>  
>   pte = *ptep;
> @@ -2240,7 +2241,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   mpfn = migrate_pfn(page_to_pfn(page)) |
>   MIGRATE_PFN_MIGRATE;
>   if (is_write_device_private_entry(entry))
> - mpfn |= MIGRATE_PFN_WRITE;
> + writable = true;
>   } else {
>   if (is_zero_pfn(pfn)) {
>   mpfn = MIGRATE_PFN_MIGRATE;
> @@ -2250,7 +2251,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   }
>   page = vm_normal_page(migrate->vma, addr, pte);
>   mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
> - mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
> + if (pte_write(pte))
> + writable = true;
>   }
>  
>   /* FIXME support THP */
> @@ -2284,8 +2286,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   ptep_get_and_clear(mm, addr, ptep);
>  
>   /* Setup special migration page table entry */
> - entry = make_migration_entry(page, mpfn &
> -  MIGRATE_PFN_WRITE);
> + entry = make_migration_entry(page, writable);
>   swp_pte = swp_entry_to_pte(entry);
>   if (pte_soft_dirty(pte))
>   swp_pte = pte_swp_mksoft_dirty(swp_pte);
> -- 
> 2.20.1
> 

Re: [PATCH] drm/nouveau/svm: Convert to use hmm_range_fault()

2019-06-17 Thread Jerome Glisse
On Sat, Jun 08, 2019 at 12:14:50AM +0530, Souptick Joarder wrote:
> Hi Jason,
> 
> On Tue, May 21, 2019 at 12:27 AM Souptick Joarder  
> wrote:
> >
> > Convert to use hmm_range_fault().
> >
> > Signed-off-by: Souptick Joarder 
> 
> Would you like to take it through your new hmm tree or do I
> need to resend it ?

This patch is wrong, as the API is different between the two; look at
hmm.h to see the differences between hmm_vma_fault() and hmm_range_fault().
A simple rename breaks things.

> 
> > ---
> >  drivers/gpu/drm/nouveau/nouveau_svm.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
> > b/drivers/gpu/drm/nouveau/nouveau_svm.c
> > index 93ed43c..8d56bd6 100644
> > --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
> > +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
> > @@ -649,7 +649,7 @@ struct nouveau_svmm {
> > range.values = nouveau_svm_pfn_values;
> > range.pfn_shift = NVIF_VMM_PFNMAP_V0_ADDR_SHIFT;
> >  again:
> > -   ret = hmm_vma_fault(&range, true);
> > +   ret = hmm_range_fault(&range, true);
> > if (ret == 0) {
> > mutex_lock(&svmm->mutex);
> > if (!hmm_vma_range_done(&range)) {
> > --
> > 1.9.1
> >

Re: [PATCH 4/4] mm, notifier: Add a lockdep map for invalidate_range_start

2019-05-21 Thread Jerome Glisse
On Tue, May 21, 2019 at 06:00:36PM +0200, Daniel Vetter wrote:
> On Tue, May 21, 2019 at 5:41 PM Jerome Glisse  wrote:
> >
> > On Mon, May 20, 2019 at 11:39:45PM +0200, Daniel Vetter wrote:
> > > This is a similar idea to the fs_reclaim fake lockdep lock. It's
> > > fairly easy to provoke a specific notifier to be run on a specific
> > > range: Just prep it, and then munmap() it.
> > >
> > > A bit harder, but still doable, is to provoke the mmu notifiers for
> > > all the various callchains that might lead to them. But both at the
> > > same time is really hard to reliable hit, especially when you want to
> > > exercise paths like direct reclaim or compaction, where it's not
> > > easy to control what exactly will be unmapped.
> > >
> > > By introducing a lockdep map to tie them all together we allow lockdep
> > > to see a lot more dependencies, without having to actually hit them
> > > in a single challchain while testing.
> > >
> > > Aside: Since I typed this to test i915 mmu notifiers I've only rolled
> > > this out for the invaliate_range_start callback. If there's
> > > interest, we should probably roll this out to all of them. But my
> > > undestanding of core mm is seriously lacking, and I'm not clear on
> > > whether we need a lockdep map for each callback, or whether some can
> > > be shared.
> >
> > I need to read more on lockdep, but it is legal to have mmu notifier
> > invalidations nested within each other. For instance, when you munmap you
> > might split a huge pmd, and that will trigger a second invalidate range
> > while the munmap one is not done yet. Would that trigger the lockdep
> > here?
> 
> Depends how it's nesting. I'm wrapping the annotation only just around
> the individual mmu notifier callback, so if the nesting is just
> - munmap starts
> - invalidate_range_start #1
> - we noticed that there's a huge pmd we need to split
> - invalidate_range_start #2
> - invalidate_reange_end #2
> - invalidate_range_end #1
> - munmap is done

Yeah, this is how it looks. All the callbacks from range_start #1 would
happen before range_start #2 happens, so we should be fine.
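
Just to check I am reading the idea right, here is a minimal sketch of the
annotation as I picture it (one global map, same trick as fs_reclaim; this
is not the actual patch):

    static struct lockdep_map __mmu_nr_start_map =
            STATIC_LOCKDEP_MAP_INIT("mmu_notifier_invalidate_range_start",
                                    &__mmu_nr_start_map);

    /* ... inside __mmu_notifier_invalidate_range_start() ... */
            if (mn->ops->invalidate_range_start) {
                    /* Tie every callback invocation to the same fake lock. */
                    lock_map_acquire(&__mmu_nr_start_map);
                    _ret = mn->ops->invalidate_range_start(mn, range);
                    lock_map_release(&__mmu_nr_start_map);
            }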

> 
> But if otoh it's ok to trigger the 2nd invalidate range from within an
> mmu_notifier->invalidate_range_start callback, then lockdep will be
> pissed about that.

No, it would be illegal for a callback to do that. There is no existing
callback that does that, at least AFAIK. So we can just say that it
is illegal; I do not see the point in allowing it.

> 
> > The worst case I can think of is two invalidate_range_start chains one
> > after the other. I don't think you can trigger three levels of nesting,
> > but maybe.
> 
> Lockdep has special nesting annotations. I think it'd be more an issue
> of getting those funneled through the entire call chain, assuming we
> really need that.

I think we are fine. So this patch looks good.

Reviewed-by: Jérôme Glisse 

Re: [PATCH 1/4] mm: Check if mmu notifier callbacks are allowed to fail

2019-05-21 Thread Jerome Glisse
On Mon, May 20, 2019 at 11:39:42PM +0200, Daniel Vetter wrote:
> Just a bit of paranoia, since if we start pushing this deep into
> callchains it's hard to spot all places where an mmu notifier
> implementation might fail when it's not allowed to.
> 
> Inspired by some confusion we had discussing i915 mmu notifiers and
> whether we could use the newly-introduced return value to handle some
> corner cases. Until we realized that these are only for when a task
> has been killed by the oom reaper.
> 
> An alternative approach would be to split the callback into two
> versions, one with the int return value, and the other with void
> return value like in older kernels. But that's a lot more churn for
> fairly little gain I think.
> 
> Summary from the m-l discussion on why we want something at warning
> level: This allows automated tooling in CI to catch bugs without
> humans having to look at everything. If we just upgrade the existing
> pr_info to a pr_warn, then we'll have false positives. And as-is, no
> one will ever spot the problem since it's lost in the massive amounts
> of overall dmesg noise.
> 
> v2: Drop the full WARN_ON backtrace in favour of just a pr_warn for
> the problematic case (Michal Hocko).
> 
> v3: Rebase on top of Glisse's arg rework.
> 
> v4: More rebase on top of Glisse reworking everything.
> 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: "Christian König" 
> Cc: David Rientjes 
> Cc: Daniel Vetter 
> Cc: "Jérôme Glisse" 
> Cc: linux...@kvack.org
> Cc: Paolo Bonzini 
> Reviewed-by: Christian König 
> Signed-off-by: Daniel Vetter 

Reviewed-by: Jérôme Glisse 

> ---
>  mm/mmu_notifier.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index ee36068077b6..c05e406a7cd7 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -181,6 +181,9 @@ int __mmu_notifier_invalidate_range_start(struct 
> mmu_notifier_range *range)
>   pr_info("%pS callback failed with %d in 
> %sblockable context.\n",
>   mn->ops->invalidate_range_start, _ret,
>   !mmu_notifier_range_blockable(range) ? 
> "non-" : "");
> + if (!mmu_notifier_range_blockable(range))
> + pr_warn("%pS callback failure not 
> allowed\n",
> + 
> mn->ops->invalidate_range_start);
>   ret = _ret;
>   }
>   }
> -- 
> 2.20.1
> 


Re: [PATCH 4/4] mm, notifier: Add a lockdep map for invalidate_range_start

2019-05-21 Thread Jerome Glisse
On Mon, May 20, 2019 at 11:39:45PM +0200, Daniel Vetter wrote:
> This is a similar idea to the fs_reclaim fake lockdep lock. It's
> fairly easy to provoke a specific notifier to be run on a specific
> range: Just prep it, and then munmap() it.
> 
> A bit harder, but still doable, is to provoke the mmu notifiers for
> all the various callchains that might lead to them. But both at the
> same time is really hard to reliable hit, especially when you want to
> exercise paths like direct reclaim or compaction, where it's not
> easy to control what exactly will be unmapped.
> 
> By introducing a lockdep map to tie them all together we allow lockdep
> to see a lot more dependencies, without having to actually hit them
> in a single challchain while testing.
> 
> Aside: Since I typed this to test i915 mmu notifiers I've only rolled
> this out for the invaliate_range_start callback. If there's
> interest, we should probably roll this out to all of them. But my
> undestanding of core mm is seriously lacking, and I'm not clear on
> whether we need a lockdep map for each callback, or whether some can
> be shared.

I need to read more on lockdep, but it is legal to have mmu notifier
invalidations nested within each other. For instance, when you munmap you
might split a huge pmd, and that will trigger a second invalidate range
while the munmap one is not done yet. Would that trigger the lockdep
here?

The worst case I can think of is two invalidate_range_start chains one
after the other. I don't think you can trigger three levels of nesting,
but maybe.

Cheers,
Jérôme

Re: [PATCH 3/4] mm, notifier: Catch sleeping/blocking for !blockable

2019-05-21 Thread Jerome Glisse
On Mon, May 20, 2019 at 11:39:44PM +0200, Daniel Vetter wrote:
> We need to make sure implementations don't cheat and don't have a
> possible schedule/blocking point deeply burried where review can't
> catch it.
> 
> I'm not sure whether this is the best way to make sure all the
> might_sleep() callsites trigger, and it's a bit ugly in the code flow.
> But it gets the job done.
> 
> Inspired by an i915 patch series which did exactly that, because the
> rules haven't been entirely clear to us.
> 
> v2: Use the shiny new non_block_start/end annotations instead of
> abusing preempt_disable/enable.
> 
> v3: Rebase on top of Glisse's arg rework.
> 
> v4: Rebase on top of more Glisse rework.
> 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: David Rientjes 
> Cc: "Christian König" 
> Cc: Daniel Vetter 
> Cc: "Jérôme Glisse" 
> Cc: linux...@kvack.org
> Reviewed-by: Christian König 
> Signed-off-by: Daniel Vetter 
> ---
>  mm/mmu_notifier.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index c05e406a7cd7..a09e737711d5 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -176,7 +176,13 @@ int __mmu_notifier_invalidate_range_start(struct 
> mmu_notifier_range *range)
>   id = srcu_read_lock(&srcu);
>   hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
>   if (mn->ops->invalidate_range_start) {
> - int _ret = mn->ops->invalidate_range_start(mn, range);
> + int _ret;
> +
> + if (!mmu_notifier_range_blockable(range))
> + non_block_start();
> + _ret = mn->ops->invalidate_range_start(mn, range);
> + if (!mmu_notifier_range_blockable(range))
> + non_block_end();

This is a matter of taste, so feel free to ignore it, as others may well
dislike what I prefer even more:

+           if (!mmu_notifier_range_blockable(range)) {
+                   non_block_start();
+                   _ret = mn->ops->invalidate_range_start(mn, range);
+                   non_block_end();
+           } else
+                   _ret = mn->ops->invalidate_range_start(mn, range);

If only we had predication on the CPU like on the GPU :)

In any case:

Reviewed-by: Jérôme Glisse 


>   if (_ret) {
>   pr_info("%pS callback failed with %d in 
> %sblockable context.\n",
>   mn->ops->invalidate_range_start, _ret,
> -- 
> 2.20.1
> 

Re: [PATCH 1/2] mm/hmm: support automatic NUMA balancing

2019-05-13 Thread Jerome Glisse
On Mon, May 13, 2019 at 02:27:20PM -0700, Andrew Morton wrote:
> On Fri, 10 May 2019 19:53:23 + "Kuehling, Felix"  
> wrote:
> 
> > From: Philip Yang 
> > 
> > While the page is migrating by NUMA balancing, HMM failed to detect this
> > condition and still return the old page. Application will use the new
> > page migrated, but driver pass the old page physical address to GPU,
> > this crash the application later.
> > 
> > Use pte_protnone(pte) to return this condition and then hmm_vma_do_fault
> > will allocate new page.
> > 
> > Signed-off-by: Philip Yang 
> 
> This should have included your signed-off-by:, since you were on the
> patch delivery path.  I'll make that change to my copy of the patch,
> OK?

Yes it should have included that.

Re: [PATCH 2/2] mm/hmm: Only set FAULT_FLAG_ALLOW_RETRY for non-blocking

2019-05-13 Thread Jerome Glisse
Andrew, can we get these two fixes lined up for 5.2?

On Mon, May 13, 2019 at 07:36:44PM +, Kuehling, Felix wrote:
> Hi Jerome,
> 
> Do you want me to push the patches to your branch? Or are you going to 
> apply them yourself?
> 
> Is your hmm-5.2-v3 branch going to make it into Linux 5.2? If so, do you 
> know when? I'd like to coordinate with Dave Airlie so that we can also 
> get that update into a drm-next branch soon.
> 
> I see that Linus merged Dave's pull request for Linux 5.2, which 
> includes the first changes in amdgpu using HMM. They're currently broken 
> without these two patches.

HMM patches do not go through any git branch; they go through the mmotm
collection. So it is not something you can easily coordinate with a drm
branch.

By broken, I expect you mean that it breaks if NUMA balancing happens?
Or that it might sleep when you are not expecting it to?

Cheers,
Jérôme

> 
> Thanks,
>    Felix
> 
> On 2019-05-10 4:14 p.m., Jerome Glisse wrote:
> > [CAUTION: External Email]
> >
> > On Fri, May 10, 2019 at 07:53:24PM +, Kuehling, Felix wrote:
> >> Don't set this flag by default in hmm_vma_do_fault. It is set
> >> conditionally just a few lines below. Setting it unconditionally
> >> can lead to handle_mm_fault doing a non-blocking fault, returning
> >> -EBUSY and unlocking mmap_sem unexpectedly.
> >>
> >> Signed-off-by: Felix Kuehling 
> > Reviewed-by: Jérôme Glisse 
> >
> >> ---
> >>   mm/hmm.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/mm/hmm.c b/mm/hmm.c
> >> index b65c27d5c119..3c4f1d62202f 100644
> >> --- a/mm/hmm.c
> >> +++ b/mm/hmm.c
> >> @@ -339,7 +339,7 @@ struct hmm_vma_walk {
> >>   static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
> >>bool write_fault, uint64_t *pfn)
> >>   {
> >> - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
> >> + unsigned int flags = FAULT_FLAG_REMOTE;
> >>struct hmm_vma_walk *hmm_vma_walk = walk->private;
> >>struct hmm_range *range = hmm_vma_walk->range;
> >>struct vm_area_struct *vma = walk->vma;
> >> --
> >> 2.17.1
> >>

Re: [PATCH 1/2] mm/hmm: support automatic NUMA balancing

2019-05-10 Thread Jerome Glisse
On Fri, May 10, 2019 at 07:53:23PM +, Kuehling, Felix wrote:
> From: Philip Yang 
> 
> While the page is migrating by NUMA balancing, HMM failed to detect this
> condition and still return the old page. Application will use the new
> page migrated, but driver pass the old page physical address to GPU,
> this crash the application later.
> 
> Use pte_protnone(pte) to return this condition and then hmm_vma_do_fault
> will allocate new page.
> 
> Signed-off-by: Philip Yang 

Reviewed-by: Jérôme Glisse 

> ---
>  mm/hmm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 75d2ea906efb..b65c27d5c119 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -554,7 +554,7 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
>  
>  static inline uint64_t pte_to_hmm_pfn_flags(struct hmm_range *range, pte_t 
> pte)
>  {
> - if (pte_none(pte) || !pte_present(pte))
> + if (pte_none(pte) || !pte_present(pte) || pte_protnone(pte))
>   return 0;
>   return pte_write(pte) ? range->flags[HMM_PFN_VALID] |
>   range->flags[HMM_PFN_WRITE] :
> -- 
> 2.17.1
> 

Re: [PATCH 2/2] mm/hmm: Only set FAULT_FLAG_ALLOW_RETRY for non-blocking

2019-05-10 Thread Jerome Glisse
On Fri, May 10, 2019 at 07:53:24PM +, Kuehling, Felix wrote:
> Don't set this flag by default in hmm_vma_do_fault. It is set
> conditionally just a few lines below. Setting it unconditionally
> can lead to handle_mm_fault doing a non-blocking fault, returning
> -EBUSY and unlocking mmap_sem unexpectedly.
> 
> Signed-off-by: Felix Kuehling 

Reviewed-by: Jérôme Glisse 

> ---
>  mm/hmm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index b65c27d5c119..3c4f1d62202f 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -339,7 +339,7 @@ struct hmm_vma_walk {
>  static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
>   bool write_fault, uint64_t *pfn)
>  {
> - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
> + unsigned int flags = FAULT_FLAG_REMOTE;
>   struct hmm_vma_walk *hmm_vma_walk = walk->private;
>   struct hmm_range *range = hmm_vma_walk->range;
>   struct vm_area_struct *vma = walk->vma;
> -- 
> 2.17.1
> 

Re: [PATCH] drm/nouveau: Fix DEVICE_PRIVATE dependencies

2019-04-17 Thread Jerome Glisse
On Wed, Apr 17, 2019 at 10:26:32PM +0800, Yue Haibing wrote:
> From: YueHaibing 
> 
> During randconfig builds, I occasionally run into an invalid configuration
> 
> WARNING: unmet direct dependencies detected for DEVICE_PRIVATE
>   Depends on [n]: ARCH_HAS_HMM_DEVICE [=n] && ZONE_DEVICE [=n]
>   Selected by [y]:
>   - DRM_NOUVEAU_SVM [=y] && HAS_IOMEM [=y] && ARCH_HAS_HMM [=y] && 
> DRM_NOUVEAU [=y] && STAGING [=y]
> 
> mm/memory.o: In function `do_swap_page':
> memory.c:(.text+0x2754): undefined reference to `device_private_entry_fault'
> 
> commit 5da25090ab04 ("mm/hmm: kconfig split HMM address space mirroring from 
> device memory")
> split CONFIG_DEVICE_PRIVATE dependencies from
> ARCH_HAS_HMM to ARCH_HAS_HMM_DEVICE and ZONE_DEVICE,
> so enable DRM_NOUVEAU_SVM will trigger this warning,
> cause building failed.
> 
> Reported-by: Hulk Robot 
> Fixes: 5da25090ab04 ("mm/hmm: kconfig split HMM address space mirroring from 
> device memory")
> Signed-off-by: YueHaibing 

Reviewed-by: Jérôme Glisse 

> ---
>  drivers/gpu/drm/nouveau/Kconfig | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/Kconfig b/drivers/gpu/drm/nouveau/Kconfig
> index 00cd9ab..99e30c1 100644
> --- a/drivers/gpu/drm/nouveau/Kconfig
> +++ b/drivers/gpu/drm/nouveau/Kconfig
> @@ -74,7 +74,8 @@ config DRM_NOUVEAU_BACKLIGHT
>  
>  config DRM_NOUVEAU_SVM
>   bool "(EXPERIMENTAL) Enable SVM (Shared Virtual Memory) support"
> - depends on ARCH_HAS_HMM
> + depends on ARCH_HAS_HMM_DEVICE
> + depends on ZONE_DEVICE
>   depends on DRM_NOUVEAU
>   depends on STAGING
>   select HMM_MIRROR
> -- 
> 2.7.4
> 
> 

Re: [PATCH 2/9] mm: Add an apply_to_pfn_range interface

2019-04-17 Thread Jerome Glisse
On Wed, Apr 17, 2019 at 09:15:52AM +, Thomas Hellstrom wrote:
> On Tue, 2019-04-16 at 10:46 -0400, Jerome Glisse wrote:
> > On Sat, Apr 13, 2019 at 08:34:02AM +, Thomas Hellstrom wrote:
> > > Hi, Jérôme
> > > 
> > > On Fri, 2019-04-12 at 17:07 -0400, Jerome Glisse wrote:
> > > > On Fri, Apr 12, 2019 at 04:04:18PM +, Thomas Hellstrom wrote:

[...]

> > > > > -/*
> > > > > - * Scan a region of virtual memory, filling in page tables as
> > > > > necessary
> > > > > - * and calling a provided function on each leaf page table.
> > > > > +/**
> > > > > + * apply_to_pfn_range - Scan a region of virtual memory,
> > > > > calling a
> > > > > provided
> > > > > + * function on each leaf page table entry
> > > > > + * @closure: Details about how to scan and what function to
> > > > > apply
> > > > > + * @addr: Start virtual address
> > > > > + * @size: Size of the region
> > > > > + *
> > > > > + * If @closure->alloc is set to 1, the function will fill in
> > > > > the
> > > > > page table
> > > > > + * as necessary. Otherwise it will skip non-present parts.
> > > > > + * Note: The caller must ensure that the range does not
> > > > > contain
> > > > > huge pages.
> > > > > + * The caller must also assure that the proper mmu_notifier
> > > > > functions are
> > > > > + * called. Either in the pte leaf function or before and after
> > > > > the
> > > > > call to
> > > > > + * apply_to_pfn_range.
> > > > 
> > > > This is wrong there should be a big FAT warning that this can
> > > > only be
> > > > use
> > > > against mmap of device file. The page table walking above is
> > > > broken
> > > > for
> > > > various thing you might find in any other vma like THP, device
> > > > pte,
> > > > hugetlbfs,
> > > 
> > > I was figuring since we didn't export the function anymore, the
> > > warning
> > > and checks could be left to its users, assuming that any other
> > > future
> > > usage of this function would require mm people audit anyway. But I
> > > can
> > > of course add that warning also to this function if you still want
> > > that?
> > 
> > Yeah more warning are better, people might start using this, i know
> > some poeple use unexported symbol and then report bugs while they
> > just were doing something illegal.
> > 
> > > > ...
> > > > 
> > > > Also the mmu notifier can not be call from the pfn callback as
> > > > that
> > > > callback
> > > > happens under page table lock (the change_pte notifier callback
> > > > is
> > > > useless
> > > > and not enough). So it _must_ happen around the call to
> > > > apply_to_pfn_range
> > > 
> > > In the comments I was having in mind usage of, for example
> > > ptep_clear_flush_notify(). But you're the mmu_notifier expert here.
> > > Are
> > > you saying that function by itself would not be sufficient?
> > > In that case, should I just scratch the text mentioning the pte
> > > leaf
> > > function?
> > 
> > ptep_clear_flush_notify() is useless ... i have posted patches to
> > either
> > restore it or remove it. In any case you must call mmu notifier range
> > and
> > they can not happen under lock. You usage looked fine (in the next
> > patch)
> > but i would rather have a bit of comment here to make sure people are
> > also
> > aware of that.
> > 
> > While we can hope that people would cc mm when using mm function, it
> > is
> > not always the case. So i rather be cautious and warn in comment as
> > much
> > as possible.
> > 
> 
> OK. Understood. All this actually makes me tend to want to try a bit
> harder using a slight modification to the pagewalk code instead. Don't
> really want to encourage two parallel code paths doing essentially the
> same thing; one good and one bad.
> 
> One thing that confuses me a bit with the pagewalk code is that callers
> (for example softdirty) typically call
> mmu_notifier_invalidate_range_start() around the pagewalk, but then if
> it ends up splitting a pmd, mmu_notifier_invalidate_range is called
> again, within the f

Re: [PATCH 2/9] mm: Add an apply_to_pfn_range interface

2019-04-16 Thread Jerome Glisse
On Sat, Apr 13, 2019 at 08:34:02AM +, Thomas Hellstrom wrote:
> Hi, Jérôme
> 
> On Fri, 2019-04-12 at 17:07 -0400, Jerome Glisse wrote:
> > On Fri, Apr 12, 2019 at 04:04:18PM +, Thomas Hellstrom wrote:
> > > This is basically apply_to_page_range with added functionality:
> > > Allocating missing parts of the page table becomes optional, which
> > > means that the function can be guaranteed not to error if
> > > allocation
> > > is disabled. Also passing of the closure struct and callback
> > > function
> > > becomes different and more in line with how things are done
> > > elsewhere.
> > > 
> > > Finally we keep apply_to_page_range as a wrapper around
> > > apply_to_pfn_range
> > > 
> > > The reason for not using the page-walk code is that we want to
> > > perform
> > > the page-walk on vmas pointing to an address space without
> > > requiring the
> > > mmap_sem to be held rather thand on vmas belonging to a process
> > > with the
> > > mmap_sem held.
> > > 
> > > Notable changes since RFC:
> > > Don't export apply_to_pfn range.
> > > 
> > > Cc: Andrew Morton 
> > > Cc: Matthew Wilcox 
> > > Cc: Will Deacon 
> > > Cc: Peter Zijlstra 
> > > Cc: Rik van Riel 
> > > Cc: Minchan Kim 
> > > Cc: Michal Hocko 
> > > Cc: Huang Ying 
> > > Cc: Souptick Joarder 
> > > Cc: "Jérôme Glisse" 
> > > Cc: linux...@kvack.org
> > > Cc: linux-ker...@vger.kernel.org
> > > Signed-off-by: Thomas Hellstrom 
> > > ---
> > >  include/linux/mm.h |  10 
> > >  mm/memory.c| 130 ++---
> > > 
> > >  2 files changed, 108 insertions(+), 32 deletions(-)
> > > 
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 80bb6408fe73..b7dd4ddd6efb 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -2632,6 +2632,16 @@ typedef int (*pte_fn_t)(pte_t *pte,
> > > pgtable_t token, unsigned long addr,
> > >  extern int apply_to_page_range(struct mm_struct *mm, unsigned long
> > > address,
> > >  unsigned long size, pte_fn_t fn, void
> > > *data);
> > >  
> > > +struct pfn_range_apply;
> > > +typedef int (*pter_fn_t)(pte_t *pte, pgtable_t token, unsigned
> > > long addr,
> > > +  struct pfn_range_apply *closure);
> > > +struct pfn_range_apply {
> > > + struct mm_struct *mm;
> > > + pter_fn_t ptefn;
> > > + unsigned int alloc;
> > > +};
> > > +extern int apply_to_pfn_range(struct pfn_range_apply *closure,
> > > +   unsigned long address, unsigned long
> > > size);
> > >  
> > >  #ifdef CONFIG_PAGE_POISONING
> > >  extern bool page_poisoning_enabled(void);
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index a95b4a3b1ae2..60d67158964f 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1938,18 +1938,17 @@ int vm_iomap_memory(struct vm_area_struct
> > > *vma, phys_addr_t start, unsigned long
> > >  }
> > >  EXPORT_SYMBOL(vm_iomap_memory);
> > >  
> > > -static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> > > -  unsigned long addr, unsigned long
> > > end,
> > > -  pte_fn_t fn, void *data)
> > > +static int apply_to_pte_range(struct pfn_range_apply *closure,
> > > pmd_t *pmd,
> > > +   unsigned long addr, unsigned long end)
> > >  {
> > >   pte_t *pte;
> > >   int err;
> > >   pgtable_t token;
> > >   spinlock_t *uninitialized_var(ptl);
> > >  
> > > - pte = (mm == &init_mm) ?
> > > + pte = (closure->mm == &init_mm) ?
> > >   pte_alloc_kernel(pmd, addr) :
> > > - pte_alloc_map_lock(mm, pmd, addr, &ptl);
> > > + pte_alloc_map_lock(closure->mm, pmd, addr, &ptl);
> > >   if (!pte)
> > >   return -ENOMEM;
> > >  
> > > @@ -1960,86 +1959,107 @@ static int apply_to_pte_range(struct
> > > mm_struct *mm, pmd_t *pmd,
> > >   token = pmd_pgtable(*pmd);
> > >  
> > >   do {
> > > - err = fn(pte++, token, addr, data);
> > > + err = closure->ptefn(pte++, t

Re: [PATCH 2/9] mm: Add an apply_to_pfn_range interface

2019-04-12 Thread Jerome Glisse
On Fri, Apr 12, 2019 at 04:04:18PM +, Thomas Hellstrom wrote:
> This is basically apply_to_page_range with added functionality:
> Allocating missing parts of the page table becomes optional, which
> means that the function can be guaranteed not to error if allocation
> is disabled. Also passing of the closure struct and callback function
> becomes different and more in line with how things are done elsewhere.
> 
> Finally we keep apply_to_page_range as a wrapper around apply_to_pfn_range
> 
> The reason for not using the page-walk code is that we want to perform
> the page-walk on vmas pointing to an address space without requiring the
> mmap_sem to be held rather thand on vmas belonging to a process with the
> mmap_sem held.
> 
> Notable changes since RFC:
> Don't export apply_to_pfn range.
> 
> Cc: Andrew Morton 
> Cc: Matthew Wilcox 
> Cc: Will Deacon 
> Cc: Peter Zijlstra 
> Cc: Rik van Riel 
> Cc: Minchan Kim 
> Cc: Michal Hocko 
> Cc: Huang Ying 
> Cc: Souptick Joarder 
> Cc: "Jérôme Glisse" 
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Thomas Hellstrom 
> ---
>  include/linux/mm.h |  10 
>  mm/memory.c| 130 ++---
>  2 files changed, 108 insertions(+), 32 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 80bb6408fe73..b7dd4ddd6efb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2632,6 +2632,16 @@ typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, 
> unsigned long addr,
>  extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
>  unsigned long size, pte_fn_t fn, void *data);
>  
> +struct pfn_range_apply;
> +typedef int (*pter_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
> +  struct pfn_range_apply *closure);
> +struct pfn_range_apply {
> + struct mm_struct *mm;
> + pter_fn_t ptefn;
> + unsigned int alloc;
> +};
> +extern int apply_to_pfn_range(struct pfn_range_apply *closure,
> +   unsigned long address, unsigned long size);
>  
>  #ifdef CONFIG_PAGE_POISONING
>  extern bool page_poisoning_enabled(void);
> diff --git a/mm/memory.c b/mm/memory.c
> index a95b4a3b1ae2..60d67158964f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1938,18 +1938,17 @@ int vm_iomap_memory(struct vm_area_struct *vma, 
> phys_addr_t start, unsigned long
>  }
>  EXPORT_SYMBOL(vm_iomap_memory);
>  
> -static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> -  unsigned long addr, unsigned long end,
> -  pte_fn_t fn, void *data)
> +static int apply_to_pte_range(struct pfn_range_apply *closure, pmd_t *pmd,
> +   unsigned long addr, unsigned long end)
>  {
>   pte_t *pte;
>   int err;
>   pgtable_t token;
>   spinlock_t *uninitialized_var(ptl);
>  
> - pte = (mm == &init_mm) ?
> + pte = (closure->mm == &init_mm) ?
>   pte_alloc_kernel(pmd, addr) :
> - pte_alloc_map_lock(mm, pmd, addr, &ptl);
> + pte_alloc_map_lock(closure->mm, pmd, addr, &ptl);
>   if (!pte)
>   return -ENOMEM;
>  
> @@ -1960,86 +1959,107 @@ static int apply_to_pte_range(struct mm_struct *mm, 
> pmd_t *pmd,
>   token = pmd_pgtable(*pmd);
>  
>   do {
> - err = fn(pte++, token, addr, data);
> + err = closure->ptefn(pte++, token, addr, closure);
>   if (err)
>   break;
>   } while (addr += PAGE_SIZE, addr != end);
>  
>   arch_leave_lazy_mmu_mode();
>  
> - if (mm != &init_mm)
> + if (closure->mm != &init_mm)
>   pte_unmap_unlock(pte-1, ptl);
>   return err;
>  }
>  
> -static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
> -  unsigned long addr, unsigned long end,
> -  pte_fn_t fn, void *data)
> +static int apply_to_pmd_range(struct pfn_range_apply *closure, pud_t *pud,
> +   unsigned long addr, unsigned long end)
>  {
>   pmd_t *pmd;
>   unsigned long next;
> - int err;
> + int err = 0;
>  
>   BUG_ON(pud_huge(*pud));
>  
> - pmd = pmd_alloc(mm, pud, addr);
> + pmd = pmd_alloc(closure->mm, pud, addr);
>   if (!pmd)
>   return -ENOMEM;
> +
>   do {
>   next = pmd_addr_end(addr, end);
> - err = apply_to_pte_range(mm, pmd, addr, next, fn, data);
> + if (!closure->alloc && pmd_none_or_clear_bad(pmd))
> + continue;
> + err = apply_to_pte_range(closure, pmd, addr, next);
>   if (err)
>   break;
>   } while (pmd++, addr = next, addr != end);
>   return err;
>  }
>  
> -static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
> -  unsigned long addr, unsigned long en

Re: [PATCH v6 7/8] mm/mmu_notifier: pass down vma and reasons why mmu notifier is happening v2

2019-04-11 Thread Jerome Glisse
On Thu, Apr 11, 2019 at 03:21:08PM +, Weiny, Ira wrote:
> > On Wed, Apr 10, 2019 at 04:41:57PM -0700, Ira Weiny wrote:
> > > On Tue, Mar 26, 2019 at 12:47:46PM -0400, Jerome Glisse wrote:
> > > > From: Jérôme Glisse 
> > > >
> > > > CPU page table update can happens for many reasons, not only as a
> > > > result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...)
> > > > but also as a result of kernel activities (memory compression,
> > > > reclaim, migration, ...).
> > > >
> > > > Users of mmu notifier API track changes to the CPU page table and
> > > > take specific action for them. While current API only provide range
> > > > of virtual address affected by the change, not why the changes is
> > > > happening
> > > >
> > > > This patch is just passing down the new informations by adding it to
> > > > the mmu_notifier_range structure.
> > > >
> > > > Changes since v1:
> > > > - Initialize flags field from mmu_notifier_range_init()
> > > > arguments
> > > >
> > > > Signed-off-by: Jérôme Glisse 
> > > > Cc: Andrew Morton 
> > > > Cc: linux...@kvack.org
> > > > Cc: Christian König 
> > > > Cc: Joonas Lahtinen 
> > > > Cc: Jani Nikula 
> > > > Cc: Rodrigo Vivi 
> > > > Cc: Jan Kara 
> > > > Cc: Andrea Arcangeli 
> > > > Cc: Peter Xu 
> > > > Cc: Felix Kuehling 
> > > > Cc: Jason Gunthorpe 
> > > > Cc: Ross Zwisler 
> > > > Cc: Dan Williams 
> > > > Cc: Paolo Bonzini 
> > > > Cc: Radim Krčmář 
> > > > Cc: Michal Hocko 
> > > > Cc: Christian Koenig 
> > > > Cc: Ralph Campbell 
> > > > Cc: John Hubbard 
> > > > Cc: k...@vger.kernel.org
> > > > Cc: dri-devel@lists.freedesktop.org
> > > > Cc: linux-r...@vger.kernel.org
> > > > Cc: Arnd Bergmann 
> > > > ---
> > > >  include/linux/mmu_notifier.h | 6 +-
> > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/include/linux/mmu_notifier.h
> > > > b/include/linux/mmu_notifier.h index 62f94cd85455..0379956fff23
> > > > 100644
> > > > --- a/include/linux/mmu_notifier.h
> > > > +++ b/include/linux/mmu_notifier.h
> > > > @@ -58,10 +58,12 @@ struct mmu_notifier_mm {  #define
> > > > MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> > > >
> > > >  struct mmu_notifier_range {
> > > > +   struct vm_area_struct *vma;
> > > > struct mm_struct *mm;
> > > > unsigned long start;
> > > > unsigned long end;
> > > > unsigned flags;
> > > > +   enum mmu_notifier_event event;
> > > >  };
> > > >
> > > >  struct mmu_notifier_ops {
> > > > @@ -363,10 +365,12 @@ static inline void
> > mmu_notifier_range_init (struct mmu_notifier_range *range,
> > > >unsigned long start,
> > > >unsigned long end)
> > > >  {
> > > > +   range->vma = vma;
> > > > +   range->event = event;
> > > > range->mm = mm;
> > > > range->start = start;
> > > > range->end = end;
> > > > -   range->flags = 0;
> > > > +   range->flags = flags;
> > >
> > > Which of the "user patch sets" uses the new flags?
> > >
> > > I'm not seeing that user yet.  In general I don't see anything wrong
> > > with the series and I like the idea of telling drivers why the invalidate 
> > > has
> > fired.
> > >
> > > But is the flags a future feature?
> > >
> > 
> > I believe the link were in the cover:
> > 
> > https://lkml.org/lkml/2019/1/23/833
> > https://lkml.org/lkml/2019/1/23/834
> > https://lkml.org/lkml/2019/1/23/832
> > https://lkml.org/lkml/2019/1/23/831
> > 
> > I have more coming for HMM but i am waiting after 5.2 once amdgpu HMM
> > patch are merge upstream as it will change what is passed down to driver
> > and it would conflict with non merged HMM driver (like amdgpu today).
> > 
> 
> Unfortunately this does not answer my question.  Yes I saw the links to the 
> patches which use this

Re: [PATCH v6 7/8] mm/mmu_notifier: pass down vma and reasons why mmu notifier is happening v2

2019-04-11 Thread Jerome Glisse
On Wed, Apr 10, 2019 at 04:41:57PM -0700, Ira Weiny wrote:
> On Tue, Mar 26, 2019 at 12:47:46PM -0400, Jerome Glisse wrote:
> > From: Jérôme Glisse 
> > 
> > CPU page table update can happens for many reasons, not only as a result
> > of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also
> > as a result of kernel activities (memory compression, reclaim, migration,
> > ...).
> > 
> > Users of mmu notifier API track changes to the CPU page table and take
> > specific action for them. While current API only provide range of virtual
> > address affected by the change, not why the changes is happening
> > 
> > This patch is just passing down the new informations by adding it to the
> > mmu_notifier_range structure.
> > 
> > Changes since v1:
> > - Initialize flags field from mmu_notifier_range_init() arguments
> > 
> > Signed-off-by: Jérôme Glisse 
> > Cc: Andrew Morton 
> > Cc: linux...@kvack.org
> > Cc: Christian König 
> > Cc: Joonas Lahtinen 
> > Cc: Jani Nikula 
> > Cc: Rodrigo Vivi 
> > Cc: Jan Kara 
> > Cc: Andrea Arcangeli 
> > Cc: Peter Xu 
> > Cc: Felix Kuehling 
> > Cc: Jason Gunthorpe 
> > Cc: Ross Zwisler 
> > Cc: Dan Williams 
> > Cc: Paolo Bonzini 
> > Cc: Radim Krčmář 
> > Cc: Michal Hocko 
> > Cc: Christian Koenig 
> > Cc: Ralph Campbell 
> > Cc: John Hubbard 
> > Cc: k...@vger.kernel.org
> > Cc: dri-devel@lists.freedesktop.org
> > Cc: linux-r...@vger.kernel.org
> > Cc: Arnd Bergmann 
> > ---
> >  include/linux/mmu_notifier.h | 6 +-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 62f94cd85455..0379956fff23 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -58,10 +58,12 @@ struct mmu_notifier_mm {
> >  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> >  
> >  struct mmu_notifier_range {
> > +   struct vm_area_struct *vma;
> > struct mm_struct *mm;
> > unsigned long start;
> > unsigned long end;
> > unsigned flags;
> > +   enum mmu_notifier_event event;
> >  };
> >  
> >  struct mmu_notifier_ops {
> > @@ -363,10 +365,12 @@ static inline void mmu_notifier_range_init(struct 
> > mmu_notifier_range *range,
> >unsigned long start,
> >unsigned long end)
> >  {
> > +   range->vma = vma;
> > +   range->event = event;
> > range->mm = mm;
> > range->start = start;
> > range->end = end;
> > -   range->flags = 0;
> > +   range->flags = flags;
> 
> Which of the "user patch sets" uses the new flags?
> 
> I'm not seeing that user yet.  In general I don't see anything wrong with the
> series and I like the idea of telling drivers why the invalidate has fired.
> 
> But is the flags a future feature?
> 

I believe the links were in the cover letter:

https://lkml.org/lkml/2019/1/23/833
https://lkml.org/lkml/2019/1/23/834
https://lkml.org/lkml/2019/1/23/832
https://lkml.org/lkml/2019/1/23/831

I have more coming for HMM, but I am waiting until after 5.2, once the
amdgpu HMM patches are merged upstream, as it will change what is passed
down to drivers and would conflict with non-merged HMM drivers (like
amdgpu today).
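
As an illustration of the kind of driver-side filtering this enables, the
sketch below is roughly what I have in mind; the MMU_NOTIFY_UNMAP spelling
is assumed from the series, and struct drv_mirror plus the drv_* helpers
are made-up placeholders:

    /* Sketch only: use the new context to avoid trashing driver state. */
    static int drv_invalidate_range_start(struct mmu_notifier *mn,
                                          const struct mmu_notifier_range *range)
    {
            struct drv_mirror *mirror = container_of(mn, struct drv_mirror, mn);

            if (range->event == MMU_NOTIFY_UNMAP)
                    /* Range is going away for good: drop the tracking structure. */
                    drv_free_range(mirror, range->start, range->end);
            else
                    /* Otherwise just invalidate the device page table entries. */
                    drv_invalidate_range(mirror, range->start, range->end);

            return 0;
    }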

Cheers,
Jérôme

Re: [PATCH v6 0/8] mmu notifier provide context informations

2019-04-10 Thread Jerome Glisse
On Tue, Apr 09, 2019 at 03:08:55PM -0700, Andrew Morton wrote:
> On Tue, 26 Mar 2019 12:47:39 -0400 jgli...@redhat.com wrote:
> 
> > From: Jérôme Glisse 
> > 
> > (Andrew this apply on top of my HMM patchset as otherwise you will have
> >  conflict with changes to mm/hmm.c)
> > 
> > Changes since v5:
> > - drop KVM bits waiting for KVM people to express interest if they
> >   do not then i will post patchset to remove change_pte_notify as
> >   without the changes in v5 change_pte_notify is just useless (it
> >   it is useless today upstream it is just wasting cpu cycles)
> > - rebase on top of lastest Linus tree
> > 
> > Previous cover letter with minor update:
> > 
> > 
> > Here i am not posting users of this, they already have been posted to
> > appropriate mailing list [6] and will be merge through the appropriate
> > tree once this patchset is upstream.
> > 
> > Note that this serie does not change any behavior for any existing
> > code. It just pass down more information to mmu notifier listener.
> > 
> > The rational for this patchset:
> > 
> > CPU page table update can happens for many reasons, not only as a
> > result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...)
> > but also as a result of kernel activities (memory compression, reclaim,
> > migration, ...).
> > 
> > This patch introduce a set of enums that can be associated with each
> > of the events triggering a mmu notifier:
> > 
> > - UNMAP: munmap() or mremap()
> > - CLEAR: page table is cleared (migration, compaction, reclaim, ...)
> > - PROTECTION_VMA: change in access protections for the range
> > - PROTECTION_PAGE: change in access protections for page in the range
> > - SOFT_DIRTY: soft dirtyness tracking
> > 
> > Being able to identify munmap() and mremap() from other reasons why the
> > page table is cleared is important to allow user of mmu notifier to
> > update their own internal tracking structure accordingly (on munmap or
> > mremap it is not longer needed to track range of virtual address as it
> > becomes invalid). Without this serie, driver are force to assume that
> > every notification is an munmap which triggers useless trashing within
> > drivers that associate structure with range of virtual address. Each
> > driver is force to free up its tracking structure and then restore it
> > on next device page fault. With this serie we can also optimize device
> > page table update [6].
> > 
> > More over this can also be use to optimize out some page table updates
> > like for KVM where we can update the secondary MMU directly from the
> > callback instead of clearing it.
> 
> We seem to be rather short of review input on this patchset.  ie: there
> is none.

I forgot to update the review tags, but Ralph did review v5:
https://lkml.org/lkml/2019/2/22/564
https://lkml.org/lkml/2019/2/22/561
https://lkml.org/lkml/2019/2/22/558
https://lkml.org/lkml/2019/2/22/710
https://lkml.org/lkml/2019/2/22/711
https://lkml.org/lkml/2019/2/22/695
https://lkml.org/lkml/2019/2/22/738
https://lkml.org/lkml/2019/2/22/757

and since this v6 is just a rebase with better comments here and there,
I believe those reviews still hold.

> 
> > ACKS AMD/RADEON https://lkml.org/lkml/2019/2/1/395
> 
> OK, kind of ackish, but not a review.
> 
> > ACKS RDMA https://lkml.org/lkml/2018/12/6/1473
> 
> This actually acks the infiniband part of a patch which isn't in this
> series.

This is to show that there are end users and that those end users want
this. Also, obviously, I will be using this within HMM, and thus it will
be used by mlx5, nouveau and amdgpu (which are all the HMM users that are
either upstream or queued up for 5.2 or 5.3).

> So we have some work to do, please.  Who would be suitable reviewers?

Anyone willing to review mmu notifier code. I believe this patchset is
not that hard to review: it is about giving contextual information on why
mmu notifier invalidations are happening, and it does not change the logic
of anything. There are no maintainers for the mmu notifier code, so I don't
have a person I can single out for review, though given that I have been
the one doing most changes in that area it could fall on me ...

Cheers,
Jérôme

Re: [PATCH v6 0/8] mmu notifier provide context informations

2019-04-09 Thread Jerome Glisse
Andrew, is anything blocking this for 5.2? Should I ask people (i.e. the
end users of this) to re-ack v6 (it is the same as the previous version,
just rebased and with the KVM bits dropped)?



On Tue, Mar 26, 2019 at 12:47:39PM -0400, jgli...@redhat.com wrote:
> From: Jérôme Glisse 
> 
> (Andrew this apply on top of my HMM patchset as otherwise you will have
>  conflict with changes to mm/hmm.c)
> 
> Changes since v5:
> - drop KVM bits waiting for KVM people to express interest if they
>   do not then i will post patchset to remove change_pte_notify as
>   without the changes in v5 change_pte_notify is just useless (it
>   it is useless today upstream it is just wasting cpu cycles)
> - rebase on top of lastest Linus tree
> 
> Previous cover letter with minor update:
> 
> 
> Here i am not posting users of this, they already have been posted to
> appropriate mailing list [6] and will be merge through the appropriate
> tree once this patchset is upstream.
> 
> Note that this serie does not change any behavior for any existing
> code. It just pass down more information to mmu notifier listener.
> 
> The rational for this patchset:
> 
> CPU page table update can happens for many reasons, not only as a
> result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...)
> but also as a result of kernel activities (memory compression, reclaim,
> migration, ...).
> 
> This patch introduce a set of enums that can be associated with each
> of the events triggering a mmu notifier:
> 
> - UNMAP: munmap() or mremap()
> - CLEAR: page table is cleared (migration, compaction, reclaim, ...)
> - PROTECTION_VMA: change in access protections for the range
> - PROTECTION_PAGE: change in access protections for page in the range
> - SOFT_DIRTY: soft dirtyness tracking
> 
> Being able to identify munmap() and mremap() from other reasons why the
> page table is cleared is important to allow user of mmu notifier to
> update their own internal tracking structure accordingly (on munmap or
> mremap it is not longer needed to track range of virtual address as it
> becomes invalid). Without this serie, driver are force to assume that
> every notification is an munmap which triggers useless trashing within
> drivers that associate structure with range of virtual address. Each
> driver is force to free up its tracking structure and then restore it
> on next device page fault. With this serie we can also optimize device
> page table update [6].
> 
> More over this can also be use to optimize out some page table updates
> like for KVM where we can update the secondary MMU directly from the
> callback instead of clearing it.
> 
> ACKS AMD/RADEON https://lkml.org/lkml/2019/2/1/395
> ACKS RDMA https://lkml.org/lkml/2018/12/6/1473
> 
> Cheers,
> Jérôme
> 
> [1] v1 https://lkml.org/lkml/2018/3/23/1049
> [2] v2 https://lkml.org/lkml/2018/12/5/10
> [3] v3 https://lkml.org/lkml/2018/12/13/620
> [4] v4 https://lkml.org/lkml/2019/1/23/838
> [5] v5 https://lkml.org/lkml/2019/2/19/752
> [6] patches to use this:
> https://lkml.org/lkml/2019/1/23/833
> https://lkml.org/lkml/2019/1/23/834
> https://lkml.org/lkml/2019/1/23/832
> https://lkml.org/lkml/2019/1/23/831
> 
> Cc: Andrew Morton 
> Cc: linux...@kvack.org
> Cc: Christian König 
> Cc: Joonas Lahtinen 
> Cc: Jani Nikula 
> Cc: Rodrigo Vivi 
> Cc: Jan Kara 
> Cc: Andrea Arcangeli 
> Cc: Peter Xu 
> Cc: Felix Kuehling 
> Cc: Jason Gunthorpe 
> Cc: Ross Zwisler 
> Cc: Dan Williams 
> Cc: Paolo Bonzini 
> Cc: Alex Deucher 
> Cc: Radim Krčmář 
> Cc: Michal Hocko 
> Cc: Christian Koenig 
> Cc: Ben Skeggs 
> Cc: Ralph Campbell 
> Cc: John Hubbard 
> Cc: k...@vger.kernel.org
> Cc: dri-devel@lists.freedesktop.org
> Cc: linux-r...@vger.kernel.org
> Cc: Arnd Bergmann 
> 
> Jérôme Glisse (8):
>   mm/mmu_notifier: helper to test if a range invalidation is blockable
>   mm/mmu_notifier: convert user range->blockable to helper function
>   mm/mmu_notifier: convert mmu_notifier_range->blockable to a flags
>   mm/mmu_notifier: contextual information for event enums
>   mm/mmu_notifier: contextual information for event triggering
> invalidation v2
>   mm/mmu_notifier: use correct mmu_notifier events for each invalidation
>   mm/mmu_notifier: pass down vma and reasons why mmu notifier is
> happening v2
>   mm/mmu_notifier: mmu_notifier_range_update_to_read_only() helper
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  |  8 ++--
>  drivers/gpu/drm/i915/i915_gem_userptr.c |  2 +-
>  drivers/gpu/drm/radeon/radeon_mn.c  |  4 +-
>  drivers/infiniband/core/umem_odp.c  |  5 +-
>  drivers/xen/gntdev.c|  6 +--
>  fs/proc/task_mmu.c  |  3 +-
>  include/linux/mmu_notifier.h| 63 +++--
>  kernel/events/uprobes.c |  3 +-
>  mm/hmm.c|  6 +--
>  mm/huge_memory.c| 14 +++---
>  mm/hugetlb.c| 12 +++--
>  mm/khugepaged.c

Re: [RFC PATCH Xilinx Alveo 0/6] Xilinx PCIe accelerator driver

2019-04-03 Thread Jerome Glisse
On Fri, Mar 29, 2019 at 06:09:18PM -0700, Ronan KERYELL wrote:
> I am adding linux-f...@vger.kernel.org, since this is why I missed this
> thread in the first place...
> > On Fri, 29 Mar 2019 14:56:17 +1000, Dave Airlie  
> > said:
> Dave> On Thu, 28 Mar 2019 at 10:14, Sonal Santan  
> wrote:
> >>> From: Daniel Vetter [mailto:daniel.vet...@ffwll.ch]

[...]

> Long answer:
> 
> - processors, GPU and other digital circuits are designed from a lot of
>   elementary transistors, wires, capacitors, resistors... using some
>   very complex (and expensive) tools from some EDA companies but at the
>   end, after months of work, they come often with a "simple" public
>   interface, the... instruction set! So it is rather "easy" at the end
>   to generate some instructions with a compiler such as LLVM from a
>   description of this ISA or some reverse engineering. Note that even if
>   the ISA is public, it is very difficult to make another efficient
>   processor from scratch just from this ISA, so there is often no
>   concern about making this ISA public to develop the ecosystem ;
> 
> - FPGA are field-programmable gate arrays, made also from a lot of
>   elementary transistors, wires, capacitors, resistors... but organized
>   in billions of very low-level elementary gates, memory elements, DSP
>   blocks, I/O blocks, clock generators, specific
>   accelerators... directly exposed to the user and that can be
>   programmed according to a configuration memory (the bitstream) that
>   details how to connect each part, routing element, configuring each
>   elemental piece of hardware.  So instead of just writing instructions
>   like on a CPU or a GPU, you need to configure each bit of the
>   architecture in such a way it does something interesting for
>   you. Concretely, you write some programs in RTL languages (Verilog,
>   VHDL) or higher-level (C/C++, OpenCL, SYCL...)  and you use some very
>   complex (and expensive) tools from some EDA companies to generate the
>   bitstream implementing an equivalent circuit with the same
>   semantics. Since the architecture is so low level, there is a direct
>   mapping between the configuration memory (bitstream) and the hardware
>   architecture itself, so if it is public then it is easy to duplicate
>   the FPGA itself and to start a new FPGA company. That is unfortunately
>   something the existing FPGA companies do not want... ;-)

This is a completely bogus argument. All FPGA documentation I have seen so far
_extensively_ describes _each_ basic block within the FPGA; this includes the
excellent documentation Xilinx provides on the inner workings and layout of
Xilinx FPGAs. The same applies to Altera, Atmel, Lattice, ...

The extensive public documentation is enough for anyone with the money and
half-decent engineers to produce an FPGA.

The real know-how of an FPGA vendor is how to produce big chips on a small
process node, capable of sustaining high clocks with the best possible power
consumption. This is the part where each company's years of experience pay off.
The cost for anyone to enter the market is in the hundreds of millions just in
setup costs and in catching up with established vendors on the hardware side,
without any guarantee of revenue at the end.

The bitstream only gives away which bits correspond to which wire and where the
LUT boolean table is stored ... Bitstreams that have been reverse engineered
never revealed anything of value that was not already publicly documented.


So no, the bitstream has _no_ value; please prove me wrong with the Lattice bitstream
for instance. If anything, the fact that Lattice has a reverse-engineered bitstream
has made that FPGA popular with the maker community, as it allows people to do
experiments for which the closed source tools are an impediment. So I would argue
that an open bitstream is actually beneficial.


The only valid reason I have ever seen for hiding the bitstream is to protect
the IP of the customer, i.e. those customers that pour quite a lot of money into
designing something with an FPGA and then want to keep the VHDL/Verilog
protected and "safe" from reverse engineering.

But this is security by obscurity, and FPGA companies would be better off providing
strong bitstream encryption (most already do, but I have seen some papers on
how to break them).


I would rather not see any bogus argument used to try to justify something that is not
justifiable.


Daniel already stressed that we need to know what the bitstream can do, and it
is even more important with FPGAs where, on some parts, AFAICT the bitstream can
have total control over the PCIE bus and thus can be used to attack either main
memory or other PCIE devices.

For instance with ATS/PASID you can have the device send pre-translated requests
to the IOMMU and access any memory despite the IOMMU.

So without total confidence in what the bitstream can and cannot do, and thus
without knowledge of the bitstream format and how it maps to LUTs, switches, cross-
bars, clocks, fixed blocks (PCIE, 

Re: [RFC PATCH RESEND 3/3] mm: Add write-protect and clean utilities for address space ranges

2019-03-21 Thread Jerome Glisse
On Thu, Mar 21, 2019 at 08:29:31PM +, Thomas Hellstrom wrote:
> On Thu, 2019-03-21 at 10:12 -0400, Jerome Glisse wrote:
> > On Thu, Mar 21, 2019 at 01:22:41PM +, Thomas Hellstrom wrote:
> > > Add two utilities to a) write-protect and b) clean all ptes
> > > pointing into
> > > a range of an address space
> > > The utilities are intended to aid in tracking dirty pages (either
> > > driver-allocated system memory or pci device memory).
> > > The write-protect utility should be used in conjunction with
> > > page_mkwrite() and pfn_mkwrite() to trigger write page-faults on
> > > page
> > > accesses. Typically one would want to use this on sparse accesses
> > > into
> > > large memory regions. The clean utility should be used to utilize
> > > hardware dirtying functionality and avoid the overhead of page-
> > > faults,
> > > typically on large accesses into small memory regions.
> > 
> > Again this does not use mmu notifier and there is no scary comment to
> > explain the very limited use case it should be use for ie mmap of a
> > device file and only by the device driver.
> 
> Scary comment and asserts will be added.
> 
> > 
> > Using it ouside of this would break softdirty or trigger false COW or
> > other scary thing.
> 
> This is something that should clearly be avoided if at all possible.
> False COWs could be avoided by asserting that VMAs are shared. I need
> to look deaper into softdirty, but note that the __mkwrite / dirty /
> clean pattern is already used in a very similar way in
> drivers/video/fb_defio.c although it operates only on real pages one at
> a time.

It should just be allowed only for mappings of device files for which none
of the above apply (softdirty, COW, ...).

> 
> > 
> > > Cc: Andrew Morton 
> > > Cc: Matthew Wilcox 
> > > Cc: Will Deacon 
> > > Cc: Peter Zijlstra 
> > > Cc: Rik van Riel 
> > > Cc: Minchan Kim 
> > > Cc: Michal Hocko 
> > > Cc: Huang Ying 
> > > Cc: Souptick Joarder 
> > > Cc: "Jérôme Glisse" 
> > > Cc: linux...@kvack.org
> > > Cc: linux-ker...@vger.kernel.org
> > > Signed-off-by: Thomas Hellstrom 
> > > ---
> > >  include/linux/mm.h  |   9 +-
> > >  mm/Makefile |   2 +-
> > >  mm/apply_as_range.c | 257
> > > 
> > >  3 files changed, 266 insertions(+), 2 deletions(-)
> > >  create mode 100644 mm/apply_as_range.c
> > > 
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index b7dd4ddd6efb..62f24dd0bfa0 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -2642,7 +2642,14 @@ struct pfn_range_apply {
> > >  };
> > >  extern int apply_to_pfn_range(struct pfn_range_apply *closure,
> > > unsigned long address, unsigned long
> > > size);
> > > -
> > > +unsigned long apply_as_wrprotect(struct address_space *mapping,
> > > +  pgoff_t first_index, pgoff_t nr);
> > > +unsigned long apply_as_clean(struct address_space *mapping,
> > > +  pgoff_t first_index, pgoff_t nr,
> > > +  pgoff_t bitmap_pgoff,
> > > +  unsigned long *bitmap,
> > > +  pgoff_t *start,
> > > +  pgoff_t *end);
> > >  #ifdef CONFIG_PAGE_POISONING
> > >  extern bool page_poisoning_enabled(void);
> > >  extern void kernel_poison_pages(struct page *page, int numpages,
> > > int enable);
> > > diff --git a/mm/Makefile b/mm/Makefile
> > > index d210cc9d6f80..a94b78f12692 100644
> > > --- a/mm/Makefile
> > > +++ b/mm/Makefile
> > > @@ -39,7 +39,7 @@ obj-y   := filemap.o mempool.o
> > > oom_kill.o fadvise.o \
> > >  mm_init.o mmu_context.o percpu.o
> > > slab_common.o \
> > >  compaction.o vmacache.o \
> > >  interval_tree.o list_lru.o workingset.o \
> > > -debug.o $(mmu-y)
> > > +debug.o apply_as_range.o $(mmu-y)
> > >  
> > >  obj-y += init-mm.o
> > >  obj-y += memblock.o
> > > diff --git a/mm/apply_as_range.c b/mm/apply_as_range.c
> > > new file mode 100644
> > > index ..9f03e272ebd0
> > > --- /dev/null
> > > +++ b/mm/apply_as_range.c
> > > @@ -0,0 +1,257 @@

Re: [RFC PATCH RESEND 0/3] mm modifications / helpers for emulated GPU coherent memory

2019-03-21 Thread Jerome Glisse
On Thu, Mar 21, 2019 at 07:51:16PM +, Thomas Hellstrom wrote:
> Hi, Jérôme,
> 
> Thanks for commenting. I have a couple of questions / clarifications
> below.
> 
> On Thu, 2019-03-21 at 09:46 -0400, Jerome Glisse wrote:
> > On Thu, Mar 21, 2019 at 01:22:22PM +, Thomas Hellstrom wrote:
> > > Resending since last series was sent through a mis-configured SMTP
> > > server.
> > > 
> > > Hi,
> > > This is an early RFC to make sure I don't go too far in the wrong
> > > direction.
> > > 
> > > Non-coherent GPUs that can't directly see contents in CPU-visible
> > > memory,
> > > like VMWare's SVGA device, run into trouble when trying to
> > > implement
> > > coherent memory requirements of modern graphics APIs. Examples are
> > > Vulkan and OpenGL 4.4's ARB_buffer_storage.
> > > 
> > > To remedy, we need to emulate coherent memory. Typically when it's
> > > detected
> > > that a buffer object is about to be accessed by the GPU, we need to
> > > gather the ranges that have been dirtied by the CPU since the last
> > > operation,
> > > apply an operation to make the content visible to the GPU and clear
> > > the
> > > the dirty tracking.
> > > 
> > > Depending on the size of the buffer object and the access pattern
> > > there are
> > > two major possibilities:
> > > 
> > > 1) Use page_mkwrite() and pfn_mkwrite(). (GPU buffer objects are
> > > backed
> > > either by PCI device memory or by driver-alloced pages).
> > > The dirty-tracking needs to be reset by write-protecting the
> > > affected ptes
> > > and flush tlb. This has a complexity of O(num_dirty_pages), but the
> > > write page-fault is of course costly.
> > > 
> > > 2) Use hardware dirty-flags in the ptes. The dirty-tracking needs
> > > to be reset
> > > by clearing the dirty bits and flush tlb. This has a complexity of
> > > O(num_buffer_object_pages) and dirty bits need to be scanned in
> > > full before
> > > each gpu-access.
> > > 
> > > So in practice the two methods need to be interleaved for best
> > > performance.
> > > 
> > > So to facilitate this, I propose two new helpers,
> > > apply_as_wrprotect() and
> > > apply_as_clean() ("as" stands for address-space) both inspired by
> > > unmap_mapping_range(). Users of these helpers are in the making,
> > > but needs
> > > some cleaning-up.
> > 
> > To be clear this should _only be use_ for mmap of device file ? If so
> > the API should try to enforce that as much as possible for instance
> > by
> > mandating the file as argument so that the function can check it is
> > only use in that case. Also big scary comment to make sure no one
> > just
> > start using those outside this very limited frame.
> 
> Fine with me. Perhaps we could BUG() / WARN() on certain VMA flags 
> instead of mandating the file as argument. That can make sure we
> don't accidently hit pages we shouldn't hit.

You already provide the mapping as an argument, so it should not be hard to
check that it is a mapping of a device file, as the vma flags alone will not be
enough to identify this case.
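
As a rough illustration of the kind of check being suggested (this is not from
the patch; the helper name is hypothetical, and treating "backed by a character
device" as the test for a device file is an assumption, although DRM nodes are
character devices):

#include <linux/fs.h>

/*
 * Hypothetical sanity check: only accept an address_space that is backed
 * by a character-device file (e.g. a DRM render node), since the vma
 * flags alone cannot tell us this.
 */
static bool apply_as_is_device_mapping(struct address_space *mapping)
{
	struct inode *inode = mapping->host;

	return inode && S_ISCHR(inode->i_mode);
}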

> 
> > 
> > > There's also a change to x_mkwrite() to allow dropping the mmap_sem
> > > while
> > > waiting.
> > 
> > This will most likely conflict with userfaultfd write protection. 
> 
> Are you referring to the x_mkwrite() usage itself or the mmap_sem
> dropping facilitation?

Both, I believe; however, I have not tried to apply your patches on top of
the userfaultfd patchset.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC PATCH RESEND 2/3] mm: Add an apply_to_pfn_range interface

2019-03-21 Thread Jerome Glisse
On Thu, Mar 21, 2019 at 07:59:35PM +, Thomas Hellstrom wrote:
> On Thu, 2019-03-21 at 09:52 -0400, Jerome Glisse wrote:
> > On Thu, Mar 21, 2019 at 01:22:35PM +, Thomas Hellstrom wrote:
> > > This is basically apply_to_page_range with added functionality:
> > > Allocating missing parts of the page table becomes optional, which
> > > means that the function can be guaranteed not to error if
> > > allocation
> > > is disabled. Also passing of the closure struct and callback
> > > function
> > > becomes different and more in line with how things are done
> > > elsewhere.
> > > 
> > > Finally we keep apply_to_page_range as a wrapper around
> > > apply_to_pfn_range
> > 
> > The apply_to_page_range() is dangerous API it does not follow other
> > mm patterns like mmu notifier. It is suppose to be use in arch code
> > or vmalloc or similar thing but not in regular driver code. I see
> > it has crept out of this and is being use by few device driver. I am
> > not sure we should encourage that.
> 
> I can certainly remove the EXPORT of the new apply_to_pfn_range() which
> will make sure its use stays within the mm code. I don't expect any
> additional usage except for the two address-space utilities.
> 
> I'm looking for examples to see how it could be more in line with the
> rest of the mm code. The main difference from the pattern in, for
> example, page_mkclean() seems to be that it's lacking the
> mmu_notifier_invalidate_start() and mmu_notifier_invalidate_end()?
> Perhaps the intention is to have the pte leaf functions notify on pte
> updates? How does this relate to arch_enter_lazy_mmu() which is called
> outside of the page table locks? The documentation appears a bit
> scarce...

Best is to use something like walk_page_range() and have a proper mmu
notifier in the callback. apply_to_page_range() is broken for
huge pages (THP) and other things like that, though you should not
have THP within an mmap of a device file (at least I do not think any
driver does that).
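
For reference, a rough sketch of that pattern (not from any posted patch: the
function names are made up, the page-table locking is elided, and the
mmu_notifier_range_init()/walk_page_range() calls use the v5.0-era signatures,
which later kernels changed; treat this as pseudo-code for the structure, not a
drop-in implementation):

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

static int wp_pte_entry(pte_t *pte, unsigned long addr,
			unsigned long next, struct mm_walk *walk)
{
	struct vm_area_struct *vma = walk->vma;

	if (pte_present(*pte) && pte_write(*pte)) {
		/* Clear and TLB-flush before re-installing read-only so no
		 * CPU keeps a stale writable entry. */
		pte_t old = ptep_clear_flush(vma, addr, pte);

		set_pte_at(walk->mm, addr, pte, pte_wrprotect(old));
	}
	return 0;
}

/* Caller holds mmap_sem; [start, end) lies within one device-file vma. */
static void wrprotect_range_with_notifier(struct vm_area_struct *vma,
					  unsigned long start,
					  unsigned long end)
{
	struct mmu_notifier_range range;
	struct mm_walk walk = {
		.pte_entry = wp_pte_entry,
		.mm = vma->vm_mm,
	};

	/* Let secondary MMUs (KVM, GPU, RDMA, ...) drop their mappings. */
	mmu_notifier_range_init(&range, vma->vm_mm, start, end);
	mmu_notifier_invalidate_range_start(&range);
	walk_page_range(start, end, &walk);
	mmu_notifier_invalidate_range_end(&range);
}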

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC PATCH RESEND 3/3] mm: Add write-protect and clean utilities for address space ranges

2019-03-21 Thread Jerome Glisse
On Thu, Mar 21, 2019 at 01:22:41PM +, Thomas Hellstrom wrote:
> Add two utilities to a) write-protect and b) clean all ptes pointing into
> a range of an address space
> The utilities are intended to aid in tracking dirty pages (either
> driver-allocated system memory or pci device memory).
> The write-protect utility should be used in conjunction with
> page_mkwrite() and pfn_mkwrite() to trigger write page-faults on page
> accesses. Typically one would want to use this on sparse accesses into
> large memory regions. The clean utility should be used to utilize
> hardware dirtying functionality and avoid the overhead of page-faults,
> typically on large accesses into small memory regions.


Again this does not use mmu notifiers, and there is no scary comment to
explain the very limited use case it should be used for, i.e. mmap of a
device file and only by the device driver.

Using it outside of this would break softdirty or trigger false COW or
other scary things.

> 
> Cc: Andrew Morton 
> Cc: Matthew Wilcox 
> Cc: Will Deacon 
> Cc: Peter Zijlstra 
> Cc: Rik van Riel 
> Cc: Minchan Kim 
> Cc: Michal Hocko 
> Cc: Huang Ying 
> Cc: Souptick Joarder 
> Cc: "Jérôme Glisse" 
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Thomas Hellstrom 
> ---
>  include/linux/mm.h  |   9 +-
>  mm/Makefile |   2 +-
>  mm/apply_as_range.c | 257 
>  3 files changed, 266 insertions(+), 2 deletions(-)
>  create mode 100644 mm/apply_as_range.c
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b7dd4ddd6efb..62f24dd0bfa0 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2642,7 +2642,14 @@ struct pfn_range_apply {
>  };
>  extern int apply_to_pfn_range(struct pfn_range_apply *closure,
> unsigned long address, unsigned long size);
> -
> +unsigned long apply_as_wrprotect(struct address_space *mapping,
> +  pgoff_t first_index, pgoff_t nr);
> +unsigned long apply_as_clean(struct address_space *mapping,
> +  pgoff_t first_index, pgoff_t nr,
> +  pgoff_t bitmap_pgoff,
> +  unsigned long *bitmap,
> +  pgoff_t *start,
> +  pgoff_t *end);
>  #ifdef CONFIG_PAGE_POISONING
>  extern bool page_poisoning_enabled(void);
>  extern void kernel_poison_pages(struct page *page, int numpages, int enable);
> diff --git a/mm/Makefile b/mm/Makefile
> index d210cc9d6f80..a94b78f12692 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -39,7 +39,7 @@ obj-y   := filemap.o mempool.o 
> oom_kill.o fadvise.o \
>  mm_init.o mmu_context.o percpu.o slab_common.o \
>  compaction.o vmacache.o \
>  interval_tree.o list_lru.o workingset.o \
> -debug.o $(mmu-y)
> +debug.o apply_as_range.o $(mmu-y)
>  
>  obj-y += init-mm.o
>  obj-y += memblock.o
> diff --git a/mm/apply_as_range.c b/mm/apply_as_range.c
> new file mode 100644
> index ..9f03e272ebd0
> --- /dev/null
> +++ b/mm/apply_as_range.c
> @@ -0,0 +1,257 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/**
> + * struct apply_as - Closure structure for apply_as_range
> + * @base: struct pfn_range_apply we derive from
> + * @start: Address of first modified pte
> + * @end: Address of last modified pte + 1
> + * @total: Total number of modified ptes
> + * @vma: Pointer to the struct vm_area_struct we're currently operating on
> + * @flush_cache: Whether to call a cache flush before modifying a pte
> + * @flush_tlb: Whether to flush the tlb after modifying a pte
> + */
> +struct apply_as {
> + struct pfn_range_apply base;
> + unsigned long start, end;
> + unsigned long total;
> + const struct vm_area_struct *vma;
> + u32 flush_cache : 1;
> + u32 flush_tlb : 1;
> +};
> +
> +/**
> + * apply_pt_wrprotect - Leaf pte callback to write-protect a pte
> + * @pte: Pointer to the pte
> + * @token: Page table token, see apply_to_pfn_range()
> + * @addr: The virtual page address
> + * @closure: Pointer to a struct pfn_range_apply embedded in a
> + * struct apply_as
> + *
> + * The function write-protects a pte and records the range in
> + * virtual address space of touched ptes for efficient TLB flushes.
> + *
> + * Return: Always zero.
> + */
> +static int apply_pt_wrprotect(pte_t *pte, pgtable_t token,
> +   unsigned long addr,
> +   struct pfn_range_apply *closure)
> +{
> + struct apply_as *aas = container_of(closure, typeof(*aas), base);
> +
> + if (pte_write(*pte)) {
> + set_pte_at(closure->mm, addr, pte, pte_wrprotect(*pte));

So there is no flushing here; even for x86 this is wrong. It
should be something like:
ptep_clear_fl
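
(The line above is cut short by the archive; presumably it refers to
ptep_clear_flush(). A hedged sketch of the flushing variant being suggested,
reusing 'aas', 'closure', 'addr' and 'pte' from the quoted patch; note that
aas->vma is declared const there and would need to lose that qualifier:)

	if (pte_write(*pte)) {
		pte_t old_pte;

		if (aas->flush_cache)
			flush_cache_page(aas->vma, addr, pte_pfn(*pte));
		/* Clear the pte and flush the TLB before installing the
		 * write-protected value, rather than a bare set_pte_at(). */
		old_pte = ptep_clear_flush(aas->vma, addr, pte);
		set_pte_at(closure->mm, addr, pte, pte_wrprotect(old_pte));
	}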

Re: [RFC PATCH RESEND 2/3] mm: Add an apply_to_pfn_range interface

2019-03-21 Thread Jerome Glisse
On Thu, Mar 21, 2019 at 01:22:35PM +, Thomas Hellstrom wrote:
> This is basically apply_to_page_range with added functionality:
> Allocating missing parts of the page table becomes optional, which
> means that the function can be guaranteed not to error if allocation
> is disabled. Also passing of the closure struct and callback function
> becomes different and more in line with how things are done elsewhere.
> 
> Finally we keep apply_to_page_range as a wrapper around apply_to_pfn_range

apply_to_page_range() is a dangerous API; it does not follow other
mm patterns like mmu notifiers. It is supposed to be used in arch code
or vmalloc or similar things, but not in regular driver code. I see
it has crept out of this and is being used by a few device drivers. I am
not sure we should encourage that.

> 
> Cc: Andrew Morton 
> Cc: Matthew Wilcox 
> Cc: Will Deacon 
> Cc: Peter Zijlstra 
> Cc: Rik van Riel 
> Cc: Minchan Kim 
> Cc: Michal Hocko 
> Cc: Huang Ying 
> Cc: Souptick Joarder 
> Cc: "Jérôme Glisse" 
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Thomas Hellstrom 
> ---
>  include/linux/mm.h |  10 
>  mm/memory.c| 121 +
>  2 files changed, 99 insertions(+), 32 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 80bb6408fe73..b7dd4ddd6efb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2632,6 +2632,16 @@ typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, 
> unsigned long addr,
>  extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
>  unsigned long size, pte_fn_t fn, void *data);
>  
> +struct pfn_range_apply;
> +typedef int (*pter_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
> +  struct pfn_range_apply *closure);
> +struct pfn_range_apply {
> + struct mm_struct *mm;
> + pter_fn_t ptefn;
> + unsigned int alloc;
> +};
> +extern int apply_to_pfn_range(struct pfn_range_apply *closure,
> +   unsigned long address, unsigned long size);
>  
>  #ifdef CONFIG_PAGE_POISONING
>  extern bool page_poisoning_enabled(void);
> diff --git a/mm/memory.c b/mm/memory.c
> index dcd80313cf10..0feb7191c2d2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1938,18 +1938,17 @@ int vm_iomap_memory(struct vm_area_struct *vma, 
> phys_addr_t start, unsigned long
>  }
>  EXPORT_SYMBOL(vm_iomap_memory);
>  
> -static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> -  unsigned long addr, unsigned long end,
> -  pte_fn_t fn, void *data)
> +static int apply_to_pte_range(struct pfn_range_apply *closure, pmd_t *pmd,
> +   unsigned long addr, unsigned long end)
>  {
>   pte_t *pte;
>   int err;
>   pgtable_t token;
>   spinlock_t *uninitialized_var(ptl);
>  
> - pte = (mm == &init_mm) ?
> + pte = (closure->mm == &init_mm) ?
>   pte_alloc_kernel(pmd, addr) :
> - pte_alloc_map_lock(mm, pmd, addr, &ptl);
> + pte_alloc_map_lock(closure->mm, pmd, addr, &ptl);
>   if (!pte)
>   return -ENOMEM;
>  
> @@ -1960,86 +1959,103 @@ static int apply_to_pte_range(struct mm_struct *mm, 
> pmd_t *pmd,
>   token = pmd_pgtable(*pmd);
>  
>   do {
> - err = fn(pte++, token, addr, data);
> + err = closure->ptefn(pte++, token, addr, closure);
>   if (err)
>   break;
>   } while (addr += PAGE_SIZE, addr != end);
>  
>   arch_leave_lazy_mmu_mode();
>  
> - if (mm != &init_mm)
> + if (closure->mm != &init_mm)
>   pte_unmap_unlock(pte-1, ptl);
>   return err;
>  }
>  
> -static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
> -  unsigned long addr, unsigned long end,
> -  pte_fn_t fn, void *data)
> +static int apply_to_pmd_range(struct pfn_range_apply *closure, pud_t *pud,
> +   unsigned long addr, unsigned long end)
>  {
>   pmd_t *pmd;
>   unsigned long next;
> - int err;
> + int err = 0;
>  
>   BUG_ON(pud_huge(*pud));
>  
> - pmd = pmd_alloc(mm, pud, addr);
> + pmd = pmd_alloc(closure->mm, pud, addr);
>   if (!pmd)
>   return -ENOMEM;
> +
>   do {
>   next = pmd_addr_end(addr, end);
> - err = apply_to_pte_range(mm, pmd, addr, next, fn, data);
> + if (!closure->alloc && pmd_none_or_clear_bad(pmd))
> + continue;
> + err = apply_to_pte_range(closure, pmd, addr, next);
>   if (err)
>   break;
>   } while (pmd++, addr = next, addr != end);
>   return err;
>  }
>  
> -static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
> -  unsigned long addr, unsigned long end,

Re: [RFC PATCH RESEND 0/3] mm modifications / helpers for emulated GPU coherent memory

2019-03-21 Thread Jerome Glisse
On Thu, Mar 21, 2019 at 01:22:22PM +, Thomas Hellstrom wrote:
> Resending since last series was sent through a mis-configured SMTP server.
> 
> Hi,
> This is an early RFC to make sure I don't go too far in the wrong direction.
> 
> Non-coherent GPUs that can't directly see contents in CPU-visible memory,
> like VMWare's SVGA device, run into trouble when trying to implement
> coherent memory requirements of modern graphics APIs. Examples are
> Vulkan and OpenGL 4.4's ARB_buffer_storage.
> 
> To remedy, we need to emulate coherent memory. Typically when it's detected
> that a buffer object is about to be accessed by the GPU, we need to
> gather the ranges that have been dirtied by the CPU since the last operation,
> apply an operation to make the content visible to the GPU and clear the
> the dirty tracking.
> 
> Depending on the size of the buffer object and the access pattern there are
> two major possibilities:
> 
> 1) Use page_mkwrite() and pfn_mkwrite(). (GPU buffer objects are backed
> either by PCI device memory or by driver-alloced pages).
> The dirty-tracking needs to be reset by write-protecting the affected ptes
> and flush tlb. This has a complexity of O(num_dirty_pages), but the
> write page-fault is of course costly.
> 
> 2) Use hardware dirty-flags in the ptes. The dirty-tracking needs to be reset
> by clearing the dirty bits and flush tlb. This has a complexity of
> O(num_buffer_object_pages) and dirty bits need to be scanned in full before
> each gpu-access.
> 
> So in practice the two methods need to be interleaved for best performance.
> 
> So to facilitate this, I propose two new helpers, apply_as_wrprotect() and
> apply_as_clean() ("as" stands for address-space) both inspired by
> unmap_mapping_range(). Users of these helpers are in the making, but needs
> some cleaning-up.

To be clear, this should _only be used_ for mmap of a device file? If so,
the API should try to enforce that as much as possible, for instance by
mandating the file as an argument so that the function can check it is
only used in that case. Also add a big scary comment to make sure no one just
starts using these outside this very limited frame.
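
As a sketch of the intended consumer side under that restriction -- driver code
only, operating on the address_space of its own device file. The buffer-object
structure and field names below are made up; only the apply_as_wrprotect() /
apply_as_clean() prototypes come from the series, and the exact start/end
semantics are an assumption:

#include <linux/mm.h>

struct my_bo {
	struct address_space *dev_mapping;	/* mapping of the device file */
	pgoff_t pgoff;				/* offset of the bo in that mapping */
	pgoff_t num_pages;
	unsigned long *dirty_bitmap;		/* one bit per bo page */
};

/* Scheme 2: harvest hardware dirty bits before handing the bo to the GPU. */
static void my_bo_sync_for_device(struct my_bo *bo)
{
	pgoff_t start = bo->pgoff, end = bo->pgoff + bo->num_pages;

	apply_as_clean(bo->dev_mapping, bo->pgoff, bo->num_pages,
		       bo->pgoff, bo->dirty_bitmap, &start, &end);
	/* ...make the pages whose bits are set visible to the GPU... */
}

/* Scheme 1: re-arm write page-faults so future CPU writes are tracked. */
static void my_bo_arm_write_tracking(struct my_bo *bo)
{
	apply_as_wrprotect(bo->dev_mapping, bo->pgoff, bo->num_pages);
}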

> 
> There's also a change to x_mkwrite() to allow dropping the mmap_sem while
> waiting.

This will most likely conflict with userfaultfd write protection. Maybe
building your thing on top of that would be better.

https://lwn.net/Articles/783571/

I will take a cursory look at the patches.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [RFC][PATCH 0/5 v2] DMA-BUF Heaps (destaging ION)

2019-03-15 Thread Jerome Glisse
On Tue, Mar 05, 2019 at 12:54:28PM -0800, John Stultz wrote:
> Here is a initial RFC of the dma-buf heaps patchset Andrew and I
> have been working on which tries to destage a fair chunk of ION
> functionality.
> 
> The patchset implements per-heap devices which can be opened
> directly and then an ioctl is used to allocate a dmabuf from the
> heap.
> 
> The interface is similar, but much simpler then IONs, only
> providing an ALLOC ioctl.
> 
> Also, I've provided simple system and cma heaps. The system
> heap in particular is missing the page-pool optimizations ION
> had, but works well enough to validate the interface.
> 
> I've booted and tested these patches with AOSP on the HiKey960
> using the kernel tree here:
>   
> https://git.linaro.org/people/john.stultz/android-dev.git/log/?h=dev/dma-buf-heap
> 
> And the userspace changes here:
>   https://android-review.googlesource.com/c/device/linaro/hikey/+/909436

What upstream driver will use this eventually? And why is it
needed?

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v5 0/9] mmu notifier provide context informations

2019-02-19 Thread Jerome Glisse
On Tue, Feb 19, 2019 at 01:19:09PM -0800, Dan Williams wrote:
> On Tue, Feb 19, 2019 at 12:58 PM Jerome Glisse  wrote:
> >
> > On Tue, Feb 19, 2019 at 12:40:37PM -0800, Dan Williams wrote:
> > > On Tue, Feb 19, 2019 at 12:30 PM Jerome Glisse  wrote:
> > > >
> > > > On Tue, Feb 19, 2019 at 12:15:55PM -0800, Dan Williams wrote:
> > > > > On Tue, Feb 19, 2019 at 12:04 PM  wrote:
> > > > > >
> > > > > > From: Jérôme Glisse 
> > > > > >
> > > > > > Since last version [4] i added the extra bits needed for the 
> > > > > > change_pte
> > > > > > optimization (which is a KSM thing). Here i am not posting users of
> > > > > > this, they will be posted to the appropriate sub-systems (KVM, GPU,
> > > > > > RDMA, ...) once this serie get upstream. If you want to look at 
> > > > > > users
> > > > > > of this see [5] [6]. If this gets in 5.1 then i will be submitting
> > > > > > those users for 5.2 (including KVM if KVM folks feel comfortable 
> > > > > > with
> > > > > > it).
> > > > >
> > > > > The users look small and straightforward. Why not await acks and
> > > > > reviewed-by's for the users like a typical upstream submission and
> > > > > merge them together? Is all of the functionality of this
> > > > > infrastructure consumed by the proposed users? Last time I checked it
> > > > > was only a subset.
> > > >
> > > > Yes pretty much all is use, the unuse case is SOFT_DIRTY and CLEAR
> > > > vs UNMAP. Both of which i intend to use. The RDMA folks already ack
> > > > the patches IIRC, so did radeon and amdgpu. I believe the i915 folks
> > > > were ok with it too. I do not want to merge things through Andrew
> > > > for all of this we discussed that in the past, merge mm bits through
> > > > Andrew in one release and bits that use things in the next release.
> > >
> > > Ok, I was trying to find the links to the acks on the mailing list,
> > > those references would address my concerns. I see no reason to rush
> > > SOFT_DIRTY and CLEAR ahead of the upstream user.
> >
> > I intend to post user for those in next couple weeks for 5.2 HMM bits.
> > So user for this (CLEAR/UNMAP/SOFTDIRTY) will definitly materialize in
> > time for 5.2.
> >
> > ACKS AMD/RADEON https://lkml.org/lkml/2019/2/1/395
> > ACKS RDMA https://lkml.org/lkml/2018/12/6/1473
> 
> Nice, thanks!
> 
> > For KVM Andrea Arcangeli seems to like the whole idea to restore the
> > change_pte optimization but i have not got ACK from Radim or Paolo,
> > however given the small performance improvement figure i get with it
> > i do not see while they would not ACK.
> 
> Sure, but no need to push ahead without that confirmation, right? At
> least for the piece that KVM cares about, maybe that's already covered
> in the infrastructure RDMA and RADEON are using?

The change_pte() for KVM is just one bit flag on top of the rest, so
I don't see much value in holding back this last patch. I will be working
with the KVM folks to merge the KVM bits in 5.2. If they do not want that, then
removing that extra flag is not much work.

But if you prefer, then Andrew can drop the last patch in the series.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v5 0/9] mmu notifier provide context informations

2019-02-19 Thread Jerome Glisse
On Tue, Feb 19, 2019 at 12:40:37PM -0800, Dan Williams wrote:
> On Tue, Feb 19, 2019 at 12:30 PM Jerome Glisse  wrote:
> >
> > On Tue, Feb 19, 2019 at 12:15:55PM -0800, Dan Williams wrote:
> > > On Tue, Feb 19, 2019 at 12:04 PM  wrote:
> > > >
> > > > From: Jérôme Glisse 
> > > >
> > > > Since last version [4] i added the extra bits needed for the change_pte
> > > > optimization (which is a KSM thing). Here i am not posting users of
> > > > this, they will be posted to the appropriate sub-systems (KVM, GPU,
> > > > RDMA, ...) once this serie get upstream. If you want to look at users
> > > > of this see [5] [6]. If this gets in 5.1 then i will be submitting
> > > > those users for 5.2 (including KVM if KVM folks feel comfortable with
> > > > it).
> > >
> > > The users look small and straightforward. Why not await acks and
> > > reviewed-by's for the users like a typical upstream submission and
> > > merge them together? Is all of the functionality of this
> > > infrastructure consumed by the proposed users? Last time I checked it
> > > was only a subset.
> >
> > Yes pretty much all is use, the unuse case is SOFT_DIRTY and CLEAR
> > vs UNMAP. Both of which i intend to use. The RDMA folks already ack
> > the patches IIRC, so did radeon and amdgpu. I believe the i915 folks
> > were ok with it too. I do not want to merge things through Andrew
> > for all of this we discussed that in the past, merge mm bits through
> > Andrew in one release and bits that use things in the next release.
> 
> Ok, I was trying to find the links to the acks on the mailing list,
> those references would address my concerns. I see no reason to rush
> SOFT_DIRTY and CLEAR ahead of the upstream user.

I intend to post users for those in the next couple of weeks for the 5.2 HMM bits.
So users for this (CLEAR/UNMAP/SOFTDIRTY) will definitely materialize in
time for 5.2.

ACKS AMD/RADEON https://lkml.org/lkml/2019/2/1/395
ACKS RDMA https://lkml.org/lkml/2018/12/6/1473

For KVM, Andrea Arcangeli seems to like the whole idea of restoring the
change_pte optimization, but I have not got an ACK from Radim or Paolo;
however, given the small performance improvement figure I get with it,
I do not see why they would not ACK.

https://lkml.org/lkml/2019/2/18/1530

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v5 0/9] mmu notifier provide context informations

2019-02-19 Thread Jerome Glisse
On Tue, Feb 19, 2019 at 12:15:55PM -0800, Dan Williams wrote:
> On Tue, Feb 19, 2019 at 12:04 PM  wrote:
> >
> > From: Jérôme Glisse 
> >
> > Since last version [4] i added the extra bits needed for the change_pte
> > optimization (which is a KSM thing). Here i am not posting users of
> > this, they will be posted to the appropriate sub-systems (KVM, GPU,
> > RDMA, ...) once this serie get upstream. If you want to look at users
> > of this see [5] [6]. If this gets in 5.1 then i will be submitting
> > those users for 5.2 (including KVM if KVM folks feel comfortable with
> > it).
> 
> The users look small and straightforward. Why not await acks and
> reviewed-by's for the users like a typical upstream submission and
> merge them together? Is all of the functionality of this
> infrastructure consumed by the proposed users? Last time I checked it
> was only a subset.

Yes, pretty much all of it is used; the unused cases are SOFT_DIRTY and CLEAR
vs UNMAP, both of which I intend to use. The RDMA folks already acked
the patches IIRC, and so did radeon and amdgpu. I believe the i915 folks
were OK with it too. I do not want to merge the users through Andrew;
we discussed that in the past: merge the mm bits through
Andrew in one release and the bits that use them in the next release.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v4 0/9] mmu notifier provide context informations

2019-02-11 Thread Jerome Glisse
On Fri, Feb 01, 2019 at 10:02:30PM +0100, Jan Kara wrote:
> On Thu 31-01-19 11:10:06, Jerome Glisse wrote:
> > 
> > Andrew what is your plan for this ? I had a discussion with Peter Xu
> > and Andrea about change_pte() and kvm. Today the change_pte() kvm
> > optimization is effectively disabled because of invalidate_range
> > calls. With a minimal couple lines patch on top of this patchset
> > we can bring back the kvm change_pte optimization and we can also
> > optimize some other cases like for instance when write protecting
> > after fork (but i am not sure this is something qemu does often so
> > it might not help for real kvm workload).
> > 
> > I will be posting a the extra patch as an RFC, but in the meantime
> > i wanted to know what was the status for this.
> > 
> > Jan, Christian does your previous ACK still holds for this ?
> 
> Yes, I still think the approach makes sense. Dan's concern about in tree
> users is valid but it seems you have those just not merged yet, right?

(Catching up on email)

This version included some of the first users for this, but I do not
want to merge them through Andrew; rather through the individual driver
project trees. Also, in the meantime I found a use for this with KVM,
and I expect a few other users of mmu notifiers will leverage this
extra information.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling

2019-02-11 Thread Jerome Glisse
On Sun, Feb 10, 2019 at 12:09:08PM +0100, Krzysztof Grygiencz wrote:
> Dear Sir,
> 
> I'm using ArchLinux distribution. After kernel upgrade form 4.19.14 to
> 4.19.15 my X environment stopped working. I have AMD HD3300 (RS780D)
> graphics card. I have bisected kernel and found a failing commit:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v4.19.20&id=ec5471c92fb29ad848c81875840478be201eeb3f

This is a false positive, you should skip that commit. It will not impact
the GPU driver for your specific GPUs. My advice is to first bisect on
drivers/gpu/drm/radeon only.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

