Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-16 Thread Jerome Glisse
On Fri, Aug 16, 2019 at 11:31:45AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 16, 2019 at 02:26:25PM +0200, Michal Hocko wrote:
> > On Fri 16-08-19 09:19:06, Jason Gunthorpe wrote:
> > > On Fri, Aug 16, 2019 at 10:10:29AM +0200, Michal Hocko wrote:
> > > > On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote:
> > > > > On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:

[...]

> > > I would like to inject it into the notifier path as this is very
> > > difficult for driver authors to discover and know about, but I'm
> > > worried about your false positive remark.
> > > 
> > > I think I understand we can use only GFP_ATOMIC in the notifiers, but
> > we need a strategy to handle OOM to guarantee forward progress.
> > 
> > Your example is from the notifier registration IIUC. 
> 
> Yes, that is where this commit hit it.. Triggering this under an
> actual notifier to get a lockdep report is hard.
> 
> > Can you pre-allocate before taking locks? Could you point me to some
> > examples when the allocation is necessary in the range notifier
> > callback?
> 
> Hmm. I took a careful look, I only found mlx5 as obviously allocating
> memory:
> 
>  mlx5_ib_invalidate_range()
>   mlx5_ib_update_xlt()
>__get_free_pages(gfp, get_order(size));
> 
> However, I think this could be changed to fall back to some small
> buffer if allocation fails. The existing scheme looks sketchy
> 
> nouveau does:
> 
>  nouveau_svmm_invalidate
>   nvif_object_mthd
>kmalloc(GFP_KERNEL)
> 
> But I think it reliably uses a stack buffer here
> 
> i915 I think Daniel said he audited.
> 
> amd_mn.. The actual invalidate_range_start does not allocate memory,
> but it is entangled with so many locks it would need careful analysis
> to be sure.
> 
> The others look generally OK, which is good, better than I hoped :)

It is on my TODO list to get rid of allocation in the notifier callback
(iirc nouveau already uses the stack, unless that was lost in all the
revisions it went through). Anyway I do not think we need allocation
in notifiers.
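
As a rough illustration of the fallback Jason describes above, the notifier
path could try a non-sleeping allocation and fall back to a buffer
preallocated at registration time. All names and structures below are
hypothetical, not the actual mlx5 code:

    #include <linux/gfp.h>
    #include <linux/slab.h>

    /* Preallocated at notifier registration, sized for the worst-case chunk. */
    struct my_mirror {
            void *scratch;
            size_t scratch_size;
    };

    /* Sketch only: never sleep or recurse into reclaim from a notifier. */
    static void *invalidate_get_buffer(struct my_mirror *m, size_t size)
    {
            void *buf = kmalloc(size, GFP_NOWAIT | __GFP_NOWARN);

            if (buf)
                    return buf;
            /* Allocation failed under pressure: use the small fallback. */
            return size <= m->scratch_size ? m->scratch : NULL;
    }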

Cheers,
Jérôme


Re: [PATCH] mm/hmm: hmm_range_fault handle pages swapped out

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 08:52:56PM +, Yang, Philip wrote:
> hmm_range_fault may return NULL pages because some of the pfns are equal to
> HMM_PFN_NONE. This happens randomly under memory pressure. The reason is
> that in the swapped-out page pte path, hmm_vma_handle_pte doesn't update the
> fault variable from cpu_flags, so it fails to call hmm_vma_do_fault to swap
> the page in.
> 
> The fix is to call hmm_pte_need_fault to update fault variable.
> 
> Change-Id: I2e8611485563d11d938881c18b7935fa1e7c91ee
> Signed-off-by: Philip Yang 

Reviewed-by: Jérôme Glisse 

> ---
>  mm/hmm.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 9f22562e2c43..7ca4fb39d3d8 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -544,6 +544,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>   swp_entry_t entry = pte_to_swp_entry(pte);
>  
>   if (!non_swap_entry(entry)) {
> + cpu_flags = pte_to_hmm_pfn_flags(range, pte);
> + hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags,
> +    &fault, &write_fault);
>   if (fault || write_fault)
>   goto fault;
>   return 0;
> -- 
> 2.17.1
> 

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 08:41:33PM +, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 04:33:06PM -0400, Jerome Glisse wrote:
> 
> > So nor HMM nor driver should dereference the struct page (i do not
> > think any iommu driver would either),
> 
> Er, they do technically deref the struct page:
> 
> nouveau_dmem_convert_pfn(struct nouveau_drm *drm,
>struct hmm_range *range)
>   struct page *page;
>   page = hmm_pfn_to_page(range, range->pfns[i]);
>   if (!nouveau_dmem_page(drm, page)) {
> 
> 
> nouveau_dmem_page(struct nouveau_drm *drm, struct page *page)
> {
>   return is_device_private_page(page) && drm->dmem == page_to_dmem(page)
> 
> 
> Which does touch 'page->pgmap'
> 
> Is this OK without having a get_dev_pagemap() ?
>
> Noting that the collision-retry scheme doesn't protect anything here
> as we can have a concurrent invalidation while doing the above deref.

Uh? How so? We are not reading the same code, I think.

My read is that this function is called while holding the device
lock, which excludes any racing mmu notifier from making
forward progress, and it is also protected by the range, so
at the time this happens it is safe to dereference the
struct page. In any case we can update
nouveau_dmem_page() to check that page->pgmap == the
expected pgmap.
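
A minimal sketch of that extra check (assuming the device's dev_pagemap is
reachable from drm->dmem; the field name is illustrative, not necessarily
nouveau's actual layout):

    static inline bool
    nouveau_dmem_page(struct nouveau_drm *drm, struct page *page)
    {
            /* Also reject pages belonging to some other device's pagemap. */
            return is_device_private_page(page) &&
                   drm->dmem == page_to_dmem(page) &&
                   page->pgmap == &drm->dmem->pagemap;
    }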

Cheers,
Jérôme

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 01:12:22PM -0700, Dan Williams wrote:
> On Thu, Aug 15, 2019 at 12:44 PM Jerome Glisse  wrote:
> >
> > On Thu, Aug 15, 2019 at 12:36:58PM -0700, Dan Williams wrote:
> > > On Thu, Aug 15, 2019 at 11:07 AM Jerome Glisse  wrote:
> > > >
> > > > On Wed, Aug 14, 2019 at 07:48:28AM -0700, Dan Williams wrote:
> > > > > On Wed, Aug 14, 2019 at 6:28 AM Jason Gunthorpe  
> > > > > wrote:
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 09:38:54AM +0200, Christoph Hellwig wrote:
> > > > > > > On Tue, Aug 13, 2019 at 06:36:33PM -0700, Dan Williams wrote:
> > > > > > > > Section alignment constraints somewhat save us here. The only 
> > > > > > > > example
> > > > > > > > I can think of a PMD not containing a uniform pgmap association 
> > > > > > > > for
> > > > > > > > each pte is the case when the pgmap overlaps normal dram, i.e. 
> > > > > > > > shares
> > > > > > > > the same 'struct memory_section' for a given span. Otherwise, 
> > > > > > > > distinct
> > > > > > > > pgmaps arrange to manage their own exclusive sections (and now
> > > > > > > > subsections as of v5.3). Otherwise the implementation could not
> > > > > > > > guarantee different mapping lifetimes.
> > > > > > > >
> > > > > > > > That said, this seems to want a better mechanism to determine 
> > > > > > > > "pfn is
> > > > > > > > ZONE_DEVICE".
> > > > > > >
> > > > > > > So I guess this patch is fine for now, and once you provide a 
> > > > > > > better
> > > > > > > mechanism we can switch over to it?
> > > > > >
> > > > > > What about the version I sent to just get rid of all the strange
> > > > > > put_dev_pagemaps while scanning? Odds are good we will work with 
> > > > > > only
> > > > > > a single pagemap, so it makes some sense to cache it once we find 
> > > > > > it?
> > > > >
> > > > > Yes, if the scan is over a single pmd then caching it makes sense.
> > > >
> > > > Quite frankly an easier an better solution is to remove the pagemap
> > > > lookup as HMM user abide by mmu notifier it means we will not make
> > > > use or dereference the struct page so that we are safe from any
> > > > racing hotunplug of dax memory (as long as device driver using hmm
> > > > do not have a bug).
> > >
> > > Yes, as long as the driver remove is synchronized against HMM
> > > operations via another mechanism then there is no need to take pagemap
> > > references. Can you briefly describe what that other mechanism is?
> >
> > So if you hotunplug some dax memory i assume that this can only
> > happens once all the pages are unmapped (as it must have the
> > zero refcount, well 1 because of the bias) and any unmap will
> > trigger a mmu notifier callback. User of hmm mirror abiding by
> > the API will never make use of information they get through the
> > fault or snapshot function until checking for racing notifier
> > under lock.
> 
> Hmm that first assumption is not guaranteed by the dev_pagemap core.
> The dev_pagemap end of life model is "disable, invalidate, drain" so
> it's possible to call devm_memunmap_pages() while pages are still mapped
> it just won't complete the teardown of the pagemap until the last
> reference is dropped. New references are blocked during this teardown.
> 
> However, if the driver is validating the liveness of the mapping in
> the mmu-notifier path and blocking new references it sounds like it
> should be ok. Might there be GPU driver unit tests that cover this
> racing teardown case?

So neither HMM nor the driver should dereference the struct page (I do not
think any iommu driver would either); they only care about the pfn.
So even if we race with a teardown, as soon as we get the mmu notifier
callback to invalidate the mapping we will do so. The pattern is:

mydriver_populate_vaddr_range(start, end) {
    hmm_range_register(range, start, end)
again:
    ret = hmm_range_fault(start, end)
    if (ret < 0)
        return ret;

    take_driver_page_table_lock();
    if (range.valid) {
        populate_device_page_table();
        release_driver_page_table_lock();
    } else {
        release_driver_page_table_lock();
        goto again;
    }
    hmm_range_unregister(range);
}

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 12:36:58PM -0700, Dan Williams wrote:
> On Thu, Aug 15, 2019 at 11:07 AM Jerome Glisse  wrote:
> >
> > On Wed, Aug 14, 2019 at 07:48:28AM -0700, Dan Williams wrote:
> > > On Wed, Aug 14, 2019 at 6:28 AM Jason Gunthorpe  wrote:
> > > >
> > > > On Wed, Aug 14, 2019 at 09:38:54AM +0200, Christoph Hellwig wrote:
> > > > > On Tue, Aug 13, 2019 at 06:36:33PM -0700, Dan Williams wrote:
> > > > > > Section alignment constraints somewhat save us here. The only 
> > > > > > example
> > > > > > I can think of a PMD not containing a uniform pgmap association for
> > > > > > each pte is the case when the pgmap overlaps normal dram, i.e. 
> > > > > > shares
> > > > > > the same 'struct memory_section' for a given span. Otherwise, 
> > > > > > distinct
> > > > > > pgmaps arrange to manage their own exclusive sections (and now
> > > > > > subsections as of v5.3). Otherwise the implementation could not
> > > > > > guarantee different mapping lifetimes.
> > > > > >
> > > > > > That said, this seems to want a better mechanism to determine "pfn 
> > > > > > is
> > > > > > ZONE_DEVICE".
> > > > >
> > > > > So I guess this patch is fine for now, and once you provide a better
> > > > > mechanism we can switch over to it?
> > > >
> > > > What about the version I sent to just get rid of all the strange
> > > > put_dev_pagemaps while scanning? Odds are good we will work with only
> > > > a single pagemap, so it makes some sense to cache it once we find it?
> > >
> > > Yes, if the scan is over a single pmd then caching it makes sense.
> >
> > Quite frankly an easier an better solution is to remove the pagemap
> > lookup as HMM user abide by mmu notifier it means we will not make
> > use or dereference the struct page so that we are safe from any
> > racing hotunplug of dax memory (as long as device driver using hmm
> > do not have a bug).
> 
> Yes, as long as the driver remove is synchronized against HMM
> operations via another mechanism then there is no need to take pagemap
> references. Can you briefly describe what that other mechanism is?

So if you hotunplug some dax memory I assume that this can only
happen once all the pages are unmapped (as they must have a
zero refcount, well 1 because of the bias) and any unmap will
trigger a mmu notifier callback. Users of hmm mirror abiding by
the API will never make use of information they get through the
fault or snapshot functions until checking for a racing notifier
under lock.

Cheers,
Jérôme

Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 03:01:59PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 01:39:22PM -0400, Jerome Glisse wrote:
> > On Thu, Aug 15, 2019 at 02:35:57PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote:
> > > 
> > > > I'm not really well versed in the details of our userptr, but both
> > > > amdgpu and i915 wait for the gpu to complete from
> > > > invalidate_range_start. Jerome has at least looked a lot at the amdgpu
> > > > one, so maybe he can explain what exactly it is we're doing ...
> > > 
> > > amdgpu is (wrongly) using hmm for something, I can't really tell what
> > > it is trying to do. The calls to dma_fence under the
> > > invalidate_range_start do not give me a good feeling.
> > > 
> > > However, i915 shows all the signs of trying to follow the registration
> > > cache model, it even has a nice comment in
> > > i915_gem_userptr_get_pages() explaining that the races it has don't
> > > matter because it is a user space bug to change the VA mapping in the
> > > first place. That just screams registration cache to me.
> > > 
> > > So it is fine to run HW that way, but if you do, there is no reason to
> > > fence inside the invalidate_range end. Just orphan the DMA buffer and
> > > clean it up & release the page pins when all DMA buffer refs go to
> > > zero. The next access to that VA should get a new DMA buffer with the
> > > right mapping.
> > > 
> > > In other words the invalidation should be very simple without
> > > complicated locking, or wait_event's. Look at hfi1 for example.
> > 
> > This would break the today usage model of uptr and it will
> > break userspace expectation ie if GPU is writting to that
> > memory and that memory then the userspace want to make sure
> > that it will see what the GPU write.
> 
> How exactly? This is holding the page pin, so the only way the VA
> mapping can be changed is via explicit user action.
> 
> ie:
> 
>gpu_write_something(va, size)
>mmap(.., va, size, MMAP_FIXED);
>gpu_wait_done()
> 
> This is racy and indeterminate with both models.
> 
> Based on the comment in i915 it appears to be going on the model that
> changes to the mmap by userspace when the GPU is working on it is a
> programming bug. This is reasonable, lots of systems use this kind of
> consistency model.

Well, a userspace process doing munmap(), mremap(), fork() and things like
that is a bug from the i915 kernel/userspace contract point of view.

But things like migration or reclaim are not covered by that contract,
and for those the expectation is that CPU access to the same virtual address
should return what was last written to it, either by the GPU or the
CPU.

> 
> Since the driver seems to rely on a dma_fence to block DMA access, it
> looks to me like the kernel has full visibility to the
> 'gpu_write_something()' and 'gpu_wait_done()' markers.

So let's only consider the case where the GPU wants to write to the memory
(the read-only case is obviously fine and in fact does not need any notifier)
and, as above, the only thing we care about is reclaim or migration
(for instance because of some NUMA compaction), as the rest is covered by the
i915 userspace contract.

So in the write case we do GUPfast(write=true), which will be "safe" from
any concurrent CPU page table update, i.e. if GUPfast gets a reference on
the page then any racing CPU page table update will not be able to migrate
or reclaim the page, and thus the virtual address to page association will
be preserved.

This is only true because of GUPfast(); if GUPfast() fails it will
fall back to the slow GUP case, which makes the same thing safe by taking
the page table lock.

Because of the reference on the page the i915 driver can forego the mmu
notifier end callback. The thing here is that taking a page reference
is pointless if we have better synchronization and tracking of mmu
notifiers. Hence converting to hmm mirror allows avoiding taking a ref
on the page while still keeping the same functionality as today.
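
A minimal sketch of that write-mode pin, assuming the gup_flags variant of
get_user_pages_fast(); this is a hypothetical helper, not the actual i915
path, and the slow-path fallback is left to the caller:

    #include <linux/mm.h>

    static long userptr_pin_for_gpu_write(unsigned long start, int npages,
                                          struct page **pages)
    {
            /*
             * FOLL_WRITE: once we hold the reference, a racing CPU page
             * table update cannot migrate or reclaim these pages, so the
             * virtual address to page association is preserved.
             */
            return get_user_pages_fast(start, npages, FOLL_WRITE, pages);
    }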


> I think trying to use hmm_range_fault on HW that can't do HW page
> faulting and HW 'TLB shootdown' is a very, very bad idea. I fear that
> is what amd gpu is trying to do.
> 
> I haven't yet seen anything that looks like 'TLB shootdown' in i915??

GPU drivers have complex usage patterns; the TLB shootdown is implicit.
Once the GEM object associated with the uptr is invalidated, the
next time userspace submits a command against that GEM object it will
have to be re-validated, which means re-programming the GPU page table to
point to the proper addresses (and re-calling GUP).
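
A rough sketch of that implicit shootdown (all structure and function names
hypothetical): the notifier only marks the userptr object invalid, and the
next submission revalidates it:

    #include <linux/mmu_notifier.h>
    #include <linux/spinlock.h>

    struct my_userptr {
            struct mmu_notifier mn;
            spinlock_t lock;
            bool valid;
    };

    static int my_userptr_invalidate_range_start(struct mmu_notifier *mn,
                            const struct mmu_notifier_range *range)
    {
            struct my_userptr *up = container_of(mn, struct my_userptr, mn);

            spin_lock(&up->lock);
            /* Next submit must re-run GUP and re-program the GPU page table. */
            up->valid = false;
            spin_unlock(&up->lock);
            return 0;
    }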

So the whole GPU page tabl

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-15 Thread Jerome Glisse
On Wed, Aug 14, 2019 at 07:48:28AM -0700, Dan Williams wrote:
> On Wed, Aug 14, 2019 at 6:28 AM Jason Gunthorpe  wrote:
> >
> > On Wed, Aug 14, 2019 at 09:38:54AM +0200, Christoph Hellwig wrote:
> > > On Tue, Aug 13, 2019 at 06:36:33PM -0700, Dan Williams wrote:
> > > > Section alignment constraints somewhat save us here. The only example
> > > > I can think of a PMD not containing a uniform pgmap association for
> > > > each pte is the case when the pgmap overlaps normal dram, i.e. shares
> > > > the same 'struct memory_section' for a given span. Otherwise, distinct
> > > > pgmaps arrange to manage their own exclusive sections (and now
> > > > subsections as of v5.3). Otherwise the implementation could not
> > > > guarantee different mapping lifetimes.
> > > >
> > > > That said, this seems to want a better mechanism to determine "pfn is
> > > > ZONE_DEVICE".
> > >
> > > So I guess this patch is fine for now, and once you provide a better
> > > mechanism we can switch over to it?
> >
> > What about the version I sent to just get rid of all the strange
> > put_dev_pagemaps while scanning? Odds are good we will work with only
> > a single pagemap, so it makes some sense to cache it once we find it?
> 
> Yes, if the scan is over a single pmd then caching it makes sense.

Quite frankly an easier and better solution is to remove the pagemap
lookup. As HMM users abide by mmu notifiers, it means we will not make
use of or dereference the struct page, so we are safe from any
racing hotunplug of dax memory (as long as the device driver using hmm
does not have a bug).

Cheers,
Jérôme

Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 07:42:07PM +0200, Michal Hocko wrote:
> On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> > 
> > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > > __GFP_DIRECT_RECLAIM..
> > > >
> > > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > > allocations during OOM, then I think fs_reclaim already matches what
> > > > you described?
> > > 
> > > No GFP_NOFS is equally bad. Please read my other email explaining what
> > > the oom_reaper actually requires. In short no blocking on direct or
> > > indirect dependecy on memory allocation that might sleep.
> > 
> > It is much easier to follow with some hints on code, so the true
> > requirement is that the OOM repear not block on GFP_FS and GFP_IO
> > allocations, great, that constraint is now clear.
> 
> I still do not get why do you put FS/IO into the picture. This is really
> about __GFP_DIRECT_RECLAIM.
> 
> > 
> > > If you can express that in the existing lockdep machinery. All
> > > fine. But then consider deployments where lockdep is no-no because
> > > of the overhead.
> > 
> > This is all for driver debugging. The point of lockdep is to find all
> > these paths without have to hit them as actual races, using debug
> > kernels.
> > 
> > I don't think we need this kind of debugging on production kernels?
> 
> Again, the primary motivation was a simple debugging aid that could be
> used without worrying about overhead. So lockdep is very often out of
> the question.
> 
> > > > The best we got was drivers tested the VA range and returned success
> > > > if they had no interest. Which is a big win to be sure, but it looks
> > > > like getting any more is not really posssible.
> > > 
> > > And that is already a great win! Because many notifiers only do care
> > > about particular mappings. Please note that backing off unconditioanlly
> > > will simply cause that the oom reaper will have to back off not doing
> > > any tear down anything.
> > 
> > Well, I'm working to propose that we do the VA range test under core
> > mmu notifier code that cannot block and then we simply remove the idea
> > of blockable from drivers using this new 'range notifier'. 
> > 
> > I think this pretty much solves the concern?
> 
> Well, my idea was that a range check and early bail out was a first step
> and then each specific notifier would be able to do a more specific
> check. I was not able to do the second step because that requires a deep
> understanding of the respective subsystem.
> 
> Really all I do care about is to reclaim as much memory from the
> oom_reaper context as possible. And that cannot really be an unbounded
> process. Quite contrary it should be as swift as possible. From my
> cursory look some notifiers are able to achieve their task without
> blocking or depending on memory just fine. So bailing out
> unconditionally on the range of interest would just put us back.

Agreed, OOM is just asking the question: can I unmap that page quickly?
so that it (OOM) can swap it out. In many cases the driver needs to
look something up to see whether, at that time, the memory is simply not in
use and can be reclaimed right away. So the driver should have a path to
optimistically update its state to allow quick reclaim.
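
A sketch of such an optimistic path (hypothetical driver structure, using
the mmu_notifier_range_blockable() hint):

    #include <linux/mmu_notifier.h>
    #include <linux/spinlock.h>

    struct mydrv_mirror {
            struct mmu_notifier mn;
            spinlock_t pt_lock;     /* protects the GPU page table state */
    };

    /* Driver specific: clears GPU PTEs covering [start, end). */
    void mydrv_clear_gpu_ptes(struct mydrv_mirror *m, unsigned long start,
                              unsigned long end);

    static int mydrv_invalidate_range_start(struct mmu_notifier *mn,
                            const struct mmu_notifier_range *range)
    {
            struct mydrv_mirror *m = container_of(mn, struct mydrv_mirror, mn);

            if (!mmu_notifier_range_blockable(range)) {
                    /* OOM reaper context: only proceed if it is quick. */
                    if (!spin_trylock(&m->pt_lock))
                            return -EAGAIN;
            } else {
                    spin_lock(&m->pt_lock);
            }
            mydrv_clear_gpu_ptes(m, range->start, range->end);
            spin_unlock(&m->pt_lock);
            return 0;
    }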


> > > > However, we could (probably even should) make the drivers fs_reclaim
> > > > safe.
> > > > 
> > > > If that is enough to guarantee progress of OOM, then lets consider
> > > > something like using current_gfp_context() to force PF_MEMALLOC_NOFS
> > > > allocation behavior on the driver callback and lockdep to try and keep
> > > > pushing on the the debugging, and dropping !blocking.
> > > 
> > > How are you going to enforce indirect dependency? E.g. a lock that is
> > > also used in other context which depend on sleepable memory allocation
> > > to move forward.
> > 
> > You mean like this:
> > 
> >CPU0 CPU1
> > mutex_lock()
> > kmalloc(GFP_KERNEL)
> 
> no I mean __GFP_DIRECT_RECLAIM here.
> 
> > mutex_unlock()
> >   fs_reclaim_acquire()
> >   mutex_lock() <- illegal: lock dep assertion
> 
> I cannot really comment on how that is achieveable by lockdep. I managed
> to forget details about FS/IO reclaim recursion tracking already and I
> do not have time to learn it again. It was quite a hack. Anyway, let me
> repeat that the primary motivation was a simple aid. Not something as
> poverful as lockdep.

I feel that fs_reclaim_acquire() is just too heavyweight here. I
do think that Daniel's patches improve the debugging situation without
burdening anything, so I am in favor of merging them.

I do not think we should devote too much time to fs_reclaim(); our
time would be better spent improving the

Re: [Nouveau] [PATCH] nouveau/hmm: map pages after migration

2019-08-15 Thread Jerome Glisse
On Tue, Aug 13, 2019 at 05:58:52PM -0400, Jerome Glisse wrote:
> On Wed, Aug 07, 2019 at 08:02:14AM -0700, Ralph Campbell wrote:
> > When memory is migrated to the GPU it is likely to be accessed by GPU
> > code soon afterwards. Instead of waiting for a GPU fault, map the
> > migrated memory into the GPU page tables with the same access permissions
> > as the source CPU page table entries. This preserves copy on write
> > semantics.
> > 
> > Signed-off-by: Ralph Campbell 
> > Cc: Christoph Hellwig 
> > Cc: Jason Gunthorpe 
> > Cc: "Jérôme Glisse" 
> > Cc: Ben Skeggs 
> 
> Sorry for delay i am swamp, couple issues:
> - nouveau_pfns_map() is never call, it should be call after
>   the dma copy is done (iirc it is lacking proper fencing
>   so that would need to be implemented first)
> 
> - the migrate ioctl is disconnected from the svm part and
>   thus we would need first to implement svm reference counting
>   and take a reference at begining of migration process and
>   release it at the end ie struct nouveau_svmm needs refcounting
>   of some sort. I let Ben decides what he likes best for that.

Thinking more about that, the svm lifetime is bound to the file
descriptor on the device driver file held by the process. So
when you call the migrate ioctl the svm should not go away, because
you are in an ioctl against that file descriptor. I need to double
check all of that with respect to processes that open the device file
multiple times with different file descriptors (or fork and so
on).


> - i rather not have an extra pfns array, i am pretty sure we
>   can directly feed what we get from the dma array to the svm
>   code to update the GPU page table
> 
> Observation that can be delayed to latter patches:
> - i do not think we want to map directly if the dma engine
>   is queue up and thus if the fence will take time to signal
> 
>   This is why i did not implement this in the first place.
>   Maybe using a workqueue to schedule a pre-fill of the GPU
>   page table and wakeup the workqueue with the fence notify
>   event.
> 
> 
> > ---
> > 
> > This patch is based on top of Christoph Hellwig's 9 patch series
> > https://lore.kernel.org/linux-mm/20190729234611.gc7...@redhat.com/T/#u
> > "turn the hmm migrate_vma upside down" but without patch 9
> > "mm: remove the unused MIGRATE_PFN_WRITE" and adds a use for the flag.
> > 
> > 
> >  drivers/gpu/drm/nouveau/nouveau_dmem.c | 45 +-
> >  drivers/gpu/drm/nouveau/nouveau_svm.c  | 86 ++
> >  drivers/gpu/drm/nouveau/nouveau_svm.h  | 19 ++
> >  3 files changed, 133 insertions(+), 17 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
> > b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> > index ef9de82b0744..c83e6f118817 100644
> > --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> > +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> > @@ -25,11 +25,13 @@
> >  #include "nouveau_dma.h"
> >  #include "nouveau_mem.h"
> >  #include "nouveau_bo.h"
> > +#include "nouveau_svm.h"
> >  
> >  #include 
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -560,11 +562,12 @@ nouveau_dmem_init(struct nouveau_drm *drm)
> >  }
> >  
> >  static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
> > -   struct vm_area_struct *vma, unsigned long addr,
> > -   unsigned long src, dma_addr_t *dma_addr)
> > +   struct vm_area_struct *vma, unsigned long src,
> > +   dma_addr_t *dma_addr, u64 *pfn)
> >  {
> > struct device *dev = drm->dev->dev;
> > struct page *dpage, *spage;
> > +   unsigned long paddr;
> >  
> > spage = migrate_pfn_to_page(src);
> > if (!spage || !(src & MIGRATE_PFN_MIGRATE))
> > @@ -572,17 +575,21 @@ static unsigned long 
> > nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
> >  
> > dpage = nouveau_dmem_page_alloc_locked(drm);
> > if (!dpage)
> > -   return 0;
> > +   goto out;
> >  
> > *dma_addr = dma_map_page(dev, spage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
> > if (dma_mapping_error(dev, *dma_addr))
> > goto out_free_page;
> >  
> > +   paddr = nouveau_dmem_page_addr(dpage);
> > if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_VRAM,
> > -   nouveau_dmem_page_addr(dpage)

Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 02:35:57PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote:
> 
> > I'm not really well versed in the details of our userptr, but both
> > amdgpu and i915 wait for the gpu to complete from
> > invalidate_range_start. Jerome has at least looked a lot at the amdgpu
> > one, so maybe he can explain what exactly it is we're doing ...
> 
> amdgpu is (wrongly) using hmm for something, I can't really tell what
> it is trying to do. The calls to dma_fence under the
> invalidate_range_start do not give me a good feeling.
> 
> However, i915 shows all the signs of trying to follow the registration
> cache model, it even has a nice comment in
> i915_gem_userptr_get_pages() explaining that the races it has don't
> matter because it is a user space bug to change the VA mapping in the
> first place. That just screams registration cache to me.
> 
> So it is fine to run HW that way, but if you do, there is no reason to
> fence inside the invalidate_range end. Just orphan the DMA buffer and
> clean it up & release the page pins when all DMA buffer refs go to
> zero. The next access to that VA should get a new DMA buffer with the
> right mapping.
> 
> In other words the invalidation should be very simple without
> complicated locking, or wait_event's. Look at hfi1 for example.

This would break the current usage model of userptr and it would
break userspace expectations, i.e. if the GPU is writing to that
memory then userspace wants to make sure
that it will see what the GPU wrote.

Yes, i915 is broken in that it has no end notifier
and does not track active invalidations for a range, but the GUP
side of things kind of hides this bug and shrinks the window
for something bad to happen to something so small that I doubt anyone
could ever hit it (still a bug though).

Cheers,
Jérôme

Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 07:21:47PM +0200, Daniel Vetter wrote:
> On Thu, Aug 15, 2019 at 7:16 PM Jason Gunthorpe  wrote:
> >
> > On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote:
> > > On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> > > >
> > > > > You have to wait for the gpu to finnish current processing in
> > > > > invalidate_range_start. Otherwise there's no point to any of this
> > > > > really. So the wait_event/dma_fence_wait are unavoidable really.
> > > >
> > > > I don't envy your task :|
> > > >
> > > > But, what you describe sure sounds like a 'registration cache' model,
> > > > not the 'shadow pte' model of coherency.
> > > >
> > > > The key difference is that a regirstationcache is allowed to become
> > > > incoherent with the VMA's because it holds page pins. It is a
> > > > programming bug in userspace to change VA mappings via mmap/munmap/etc
> > > > while the device is working on that VA, but it does not harm system
> > > > integrity because of the page pin.
> > > >
> > > > The cache ensures that each initiated operation sees a DMA setup that
> > > > matches the current VA map when the operation is initiated and allows
> > > > expensive device DMA setups to be re-used.
> > > >
> > > > A 'shadow pte' model (ie hmm) *really* needs device support to
> > > > directly block DMA access - ie trigger 'device page fault'. ie the
> > > > invalidate_start should inform the device to enter a fault mode and
> > > > that is it.  If the device can't do that, then the driver probably
> > > > shouldn't persue this level of coherency. The driver would quickly get
> > > > into the messy locking problems like dma_fence_wait from a notifier.
> > >
> > > I think here we do not agree on the hardware requirement. For GPU
> > > we will always need to be able to wait for some GPU fence from inside
> > > the notifier callback, there is just no way around that for many of
> > > the GPUs today (i do not see any indication of that changing).
> >
> > I didn't say you couldn't wait, I was trying to say that the wait
> > should only be contigent on the HW itself. Ie you can wait on a GPU
> > page table lock, and you can wait on a GPU page table flush completion
> > via IRQ.
> >
> > What is troubling is to wait till some other thread gets a GPU command
> > completion and decr's a kref on the DMA buffer - which kinda looks
> > like what this dma_fence() stuff is all about. A driver like that
> > would have to be super careful to ensure consistent forward progress
> > toward dma ref == 0 when the system is under reclaim.
> >
> > ie by running it's entire IRQ flow under fs_reclaim locking.
> 
> This is correct. At least for i915 it's already a required due to our
> shrinker also having to do the same. I think amdgpu isn't bothering
> with that since they have vram for most of the stuff, and just limit
> system memory usage to half of all and forgo the shrinker. Probably
> not the nicest approach. Anyway, both do the same mmu_notifier dance,
> just want to explain that we've been living with this for longer
> already.
> 
> So yeah writing a gpu driver is not easy.
> 
> > > associated with the mm_struct. In all GPU driver so far it is a short
> > > lived lock and nothing blocking is done while holding it (it is just
> > > about updating page table directory really wether it is filling it or
> > > clearing it).
> >
> > The main blocking I expect in a shadow PTE flow is waiting for the HW
> > to complete invalidations of its PTE cache.
> >
> > > > It is important to identify what model you are going for as defining a
> > > > 'registration cache' coherence expectation allows the driver to skip
> > > > blocking in invalidate_range_start. All it does is invalidate the
> > > > cache so that future operations pick up the new VA mapping.
> > > >
> > > > Intel's HFI RDMA driver uses this model extensively, and I think it is
> > > > well proven, within some limitations of course.
> > > >
> > > > At least, 'registration cache' is the only use model I know of where
> > > > it is acceptable to skip invalidate_range_end.
> > >
> > > Here GPU are not in the registration cache model, i know it might looks
> > > like it becaus

Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 01:56:31PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> 
> > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > __GFP_DIRECT_RECLAIM..
> > >
> > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > allocations during OOM, then I think fs_reclaim already matches what
> > > you described?
> > 
> > No GFP_NOFS is equally bad. Please read my other email explaining what
> > the oom_reaper actually requires. In short no blocking on direct or
> > indirect dependecy on memory allocation that might sleep.
> 
> It is much easier to follow with some hints on code, so the true
> requirement is that the OOM repear not block on GFP_FS and GFP_IO
> allocations, great, that constraint is now clear.
> 
> > If you can express that in the existing lockdep machinery. All
> > fine. But then consider deployments where lockdep is no-no because
> > of the overhead.
> 
> This is all for driver debugging. The point of lockdep is to find all
> these paths without have to hit them as actual races, using debug
> kernels.
> 
> I don't think we need this kind of debugging on production kernels?
> 
> > > The best we got was drivers tested the VA range and returned success
> > > if they had no interest. Which is a big win to be sure, but it looks
> > > like getting any more is not really posssible.
> > 
> > And that is already a great win! Because many notifiers only do care
> > about particular mappings. Please note that backing off unconditioanlly
> > will simply cause that the oom reaper will have to back off not doing
> > any tear down anything.
> 
> Well, I'm working to propose that we do the VA range test under core
> mmu notifier code that cannot block and then we simply remove the idea
> of blockable from drivers using this new 'range notifier'. 
> 
> I think this pretty much solves the concern?

I am not sure I follow what you propose here. As I pointed out in
another email, for GPUs we do need to be able to sleep (we might get
lucky and not need to, but that is a runtime thing) within the notifier
range_start callback. This has been allowed by notifiers since
they were introduced in the kernel.

Cheers,
Jérôme

Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jerome Glisse
On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> 
> > You have to wait for the gpu to finnish current processing in
> > invalidate_range_start. Otherwise there's no point to any of this
> > really. So the wait_event/dma_fence_wait are unavoidable really.
> 
> I don't envy your task :|
> 
> But, what you describe sure sounds like a 'registration cache' model,
> not the 'shadow pte' model of coherency.
> 
> The key difference is that a regirstationcache is allowed to become
> incoherent with the VMA's because it holds page pins. It is a
> programming bug in userspace to change VA mappings via mmap/munmap/etc
> while the device is working on that VA, but it does not harm system
> integrity because of the page pin.
> 
> The cache ensures that each initiated operation sees a DMA setup that
> matches the current VA map when the operation is initiated and allows
> expensive device DMA setups to be re-used.
> 
> A 'shadow pte' model (ie hmm) *really* needs device support to
> directly block DMA access - ie trigger 'device page fault'. ie the
> invalidate_start should inform the device to enter a fault mode and
> that is it.  If the device can't do that, then the driver probably
> shouldn't persue this level of coherency. The driver would quickly get
> into the messy locking problems like dma_fence_wait from a notifier.

I think here we do not agree on the hardware requirement. For GPUs
we will always need to be able to wait for some GPU fence from inside
the notifier callback; there is just no way around that for many of
the GPUs today (I do not see any indication of that changing).

Drivers should avoid lock complexity by using a wait queue so that the
driver notifier callback can wait without having to hold some driver
lock. However there will be at least one lock needed to update the
internal driver state for the range being invalidated. That lock is
just the device driver page table lock for the GPU page table
associated with the mm_struct. In all GPU drivers so far it is a
short-lived lock and nothing blocking is done while holding it (it is
really just about updating the page table directory, whether filling it or
clearing it).
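
A sketch of that shape (hypothetical names; a real driver must also honor
the non-blockable case discussed elsewhere in this thread):

    #include <linux/mmu_notifier.h>
    #include <linux/spinlock.h>
    #include <linux/wait.h>

    struct mydrv_svmm {
            struct mmu_notifier notifier;
            wait_queue_head_t idle_wq;      /* woken when GPU work completes */
            spinlock_t pt_lock;             /* short-lived GPU page table lock */
    };

    /* Driver specific: is GPU work still touching [start, end)? */
    bool mydrv_range_busy(struct mydrv_svmm *svmm, unsigned long start,
                          unsigned long end);
    /* Driver specific: clears GPU PTEs covering [start, end). */
    void mydrv_clear_gpu_ptes(struct mydrv_svmm *svmm, unsigned long start,
                              unsigned long end);

    static int mydrv_invalidate_range_start(struct mmu_notifier *mn,
                            const struct mmu_notifier_range *range)
    {
            struct mydrv_svmm *svmm =
                    container_of(mn, struct mydrv_svmm, notifier);

            /* Wait for the GPU without holding any driver lock. */
            wait_event(svmm->idle_wq,
                       !mydrv_range_busy(svmm, range->start, range->end));

            /* Only the short-lived page table lock is held while clearing. */
            spin_lock(&svmm->pt_lock);
            mydrv_clear_gpu_ptes(svmm, range->start, range->end);
            spin_unlock(&svmm->pt_lock);
            return 0;
    }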


> 
> It is important to identify what model you are going for as defining a
> 'registration cache' coherence expectation allows the driver to skip
> blocking in invalidate_range_start. All it does is invalidate the
> cache so that future operations pick up the new VA mapping.
> 
> Intel's HFI RDMA driver uses this model extensively, and I think it is
> well proven, within some limitations of course.
> 
> At least, 'registration cache' is the only use model I know of where
> it is acceptable to skip invalidate_range_end.

Here GPUs are not in the registration cache model; I know it might look
like it because of GUP, but GUP was used just because hmm did not exist
at the time.

Cheers,
Jérôme

Re: [PATCH] nouveau/hmm: map pages after migration

2019-08-13 Thread Jerome Glisse
On Wed, Aug 07, 2019 at 08:02:14AM -0700, Ralph Campbell wrote:
> When memory is migrated to the GPU it is likely to be accessed by GPU
> code soon afterwards. Instead of waiting for a GPU fault, map the
> migrated memory into the GPU page tables with the same access permissions
> as the source CPU page table entries. This preserves copy on write
> semantics.
> 
> Signed-off-by: Ralph Campbell 
> Cc: Christoph Hellwig 
> Cc: Jason Gunthorpe 
> Cc: "Jérôme Glisse" 
> Cc: Ben Skeggs 

Sorry for the delay, I am swamped. A couple of issues:
- nouveau_pfns_map() is never called; it should be called after
  the dma copy is done (iirc it is lacking proper fencing,
  so that would need to be implemented first)

- the migrate ioctl is disconnected from the svm part and
  thus we would first need to implement svm reference counting
  and take a reference at the beginning of the migration process and
  release it at the end, i.e. struct nouveau_svmm needs refcounting
  of some sort. I let Ben decide what he likes best for that.

- I would rather not have an extra pfns array; I am pretty sure we
  can directly feed what we get from the dma array to the svm
  code to update the GPU page table

Observation that can be delayed to later patches:
- I do not think we want to map directly if the dma engine
  is queued up and thus the fence will take time to signal

  This is why I did not implement this in the first place.
  Maybe use a workqueue to schedule a pre-fill of the GPU
  page table and wake up the workqueue with the fence notify
  event (see the sketch below).
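
A rough sketch of that workqueue idea (all names hypothetical, not nouveau's
actual code): the DMA-copy fence callback just kicks a worker that fills the
GPU page table once the copy has signaled:

    #include <linux/dma-fence.h>
    #include <linux/slab.h>
    #include <linux/workqueue.h>

    struct prefill_work {
            struct work_struct work;
            struct dma_fence_cb cb;
            /* the migrated range and dst pfns to map would be stored here */
    };

    static void prefill_worker(struct work_struct *work)
    {
            struct prefill_work *p = container_of(work, struct prefill_work, work);

            /* walk the saved pfns and write the GPU page table entries here */
            kfree(p);
    }

    static void prefill_fence_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
    {
            struct prefill_work *p = container_of(cb, struct prefill_work, cb);

            schedule_work(&p->work);        /* copy fence signaled */
    }

    static int prefill_schedule(struct dma_fence *copy_fence)
    {
            struct prefill_work *p = kzalloc(sizeof(*p), GFP_KERNEL);

            if (!p)
                    return -ENOMEM;
            INIT_WORK(&p->work, prefill_worker);
            if (dma_fence_add_callback(copy_fence, &p->cb, prefill_fence_cb))
                    schedule_work(&p->work);        /* fence already signaled */
            return 0;
    }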


> ---
> 
> This patch is based on top of Christoph Hellwig's 9 patch series
> https://lore.kernel.org/linux-mm/20190729234611.gc7...@redhat.com/T/#u
> "turn the hmm migrate_vma upside down" but without patch 9
> "mm: remove the unused MIGRATE_PFN_WRITE" and adds a use for the flag.
> 
> 
>  drivers/gpu/drm/nouveau/nouveau_dmem.c | 45 +-
>  drivers/gpu/drm/nouveau/nouveau_svm.c  | 86 ++
>  drivers/gpu/drm/nouveau/nouveau_svm.h  | 19 ++
>  3 files changed, 133 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
> b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index ef9de82b0744..c83e6f118817 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -25,11 +25,13 @@
>  #include "nouveau_dma.h"
>  #include "nouveau_mem.h"
>  #include "nouveau_bo.h"
> +#include "nouveau_svm.h"
>  
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -560,11 +562,12 @@ nouveau_dmem_init(struct nouveau_drm *drm)
>  }
>  
>  static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
> - struct vm_area_struct *vma, unsigned long addr,
> - unsigned long src, dma_addr_t *dma_addr)
> + struct vm_area_struct *vma, unsigned long src,
> + dma_addr_t *dma_addr, u64 *pfn)
>  {
>   struct device *dev = drm->dev->dev;
>   struct page *dpage, *spage;
> + unsigned long paddr;
>  
>   spage = migrate_pfn_to_page(src);
>   if (!spage || !(src & MIGRATE_PFN_MIGRATE))
> @@ -572,17 +575,21 @@ static unsigned long 
> nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>  
>   dpage = nouveau_dmem_page_alloc_locked(drm);
>   if (!dpage)
> - return 0;
> + goto out;
>  
>   *dma_addr = dma_map_page(dev, spage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
>   if (dma_mapping_error(dev, *dma_addr))
>   goto out_free_page;
>  
> + paddr = nouveau_dmem_page_addr(dpage);
>   if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_VRAM,
> - nouveau_dmem_page_addr(dpage), NOUVEAU_APER_HOST,
> - *dma_addr))
> + paddr, NOUVEAU_APER_HOST, *dma_addr))
>   goto out_dma_unmap;
>  
> + *pfn = NVIF_VMM_PFNMAP_V0_V | NVIF_VMM_PFNMAP_V0_VRAM |
> + ((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
> + if (src & MIGRATE_PFN_WRITE)
> + *pfn |= NVIF_VMM_PFNMAP_V0_W;
>   return migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
>  
>  out_dma_unmap:
> @@ -590,18 +597,19 @@ static unsigned long 
> nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
>  out_free_page:
>   nouveau_dmem_page_free_locked(drm, dpage);
>  out:
> + *pfn = NVIF_VMM_PFNMAP_V0_NONE;
>   return 0;
>  }
>  
>  static void nouveau_dmem_migrate_chunk(struct migrate_vma *args,
> - struct nouveau_drm *drm, dma_addr_t *dma_addrs)
> + struct nouveau_drm *drm, dma_addr_t *dma_addrs, u64 *pfns)
>  {
>   struct nouveau_fence *fence;
>   unsigned long addr = args->start, nr_dma = 0, i;
>  
>   for (i = 0; addr < args->end; i++) {
>   args->dst[i] = nouveau_dmem_migrate_copy_one(drm, args->vma,
> - addr, args->src[i], _addrs[nr_dma]);
> + 

Re: [PATCH 9/9] mm: remove the MIGRATE_PFN_WRITE flag

2019-07-30 Thread Jerome Glisse
On Tue, Jul 30, 2019 at 07:46:33AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 29, 2019 at 07:30:44PM -0400, Jerome Glisse wrote:
> > On Mon, Jul 29, 2019 at 05:28:43PM +0300, Christoph Hellwig wrote:
> > > The MIGRATE_PFN_WRITE is only used locally in migrate_vma_collect_pmd,
> > > where it can be replaced with a simple boolean local variable.
> > > 
> > > Signed-off-by: Christoph Hellwig 
> > 
> > NAK that flag is useful, for instance a anonymous vma might have
> > some of its page read only even if the vma has write permission.
> > 
> > It seems that the code in nouveau is wrong (probably lost that
> > in various rebase/rework) as this flag should be use to decide
> > wether to map the device memory with write permission or not.
> > 
> > I am traveling right now, i will investigate what happened to
> > nouveau code.
> 
> We can add it back when needed pretty easily.  Much of this has bitrotted
> way to fast, and the pending ppc kvmhmm code doesn't need it either.

Not using it is a serious bug; I will investigate this Friday.

Cheers,
Jérôme

Re: [PATCH 9/9] mm: remove the MIGRATE_PFN_WRITE flag

2019-07-29 Thread Jerome Glisse
On Mon, Jul 29, 2019 at 04:42:01PM -0700, Ralph Campbell wrote:
> 
> On 7/29/19 7:28 AM, Christoph Hellwig wrote:
> > The MIGRATE_PFN_WRITE is only used locally in migrate_vma_collect_pmd,
> > where it can be replaced with a simple boolean local variable.
> > 
> > Signed-off-by: Christoph Hellwig 
> 
> Reviewed-by: Ralph Campbell 
> 
> > ---
> >   include/linux/migrate.h | 1 -
> >   mm/migrate.c| 9 +
> >   2 files changed, 5 insertions(+), 5 deletions(-)
> > 
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index 8b46cfdb1a0e..ba74ef5a7702 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -165,7 +165,6 @@ static inline int 
> > migrate_misplaced_transhuge_page(struct mm_struct *mm,
> >   #define MIGRATE_PFN_VALID (1UL << 0)
> >   #define MIGRATE_PFN_MIGRATE   (1UL << 1)
> >   #define MIGRATE_PFN_LOCKED(1UL << 2)
> > -#define MIGRATE_PFN_WRITE  (1UL << 3)
> >   #define MIGRATE_PFN_SHIFT 6
> >   static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 74735256e260..724f92dcc31b 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -2212,6 +2212,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > unsigned long mpfn, pfn;
> > struct page *page;
> > swp_entry_t entry;
> > +   bool writable = false;
> > pte_t pte;
> > pte = *ptep;
> > @@ -2240,7 +2241,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > mpfn = migrate_pfn(page_to_pfn(page)) |
> > MIGRATE_PFN_MIGRATE;
> > if (is_write_device_private_entry(entry))
> > -   mpfn |= MIGRATE_PFN_WRITE;
> > +   writable = true;
> > } else {
> > if (is_zero_pfn(pfn)) {
> > mpfn = MIGRATE_PFN_MIGRATE;
> > @@ -2250,7 +2251,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > }
> > page = vm_normal_page(migrate->vma, addr, pte);
> > mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
> > -   mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
> > +   if (pte_write(pte))
> > +   writable = true;
> > }
> > /* FIXME support THP */
> > @@ -2284,8 +2286,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> > ptep_get_and_clear(mm, addr, ptep);
> > /* Setup special migration page table entry */
> > -   entry = make_migration_entry(page, mpfn &
> > -MIGRATE_PFN_WRITE);
> > +   entry = make_migration_entry(page, writable);
> > swp_pte = swp_entry_to_pte(entry);
> > if (pte_soft_dirty(pte))
> > swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > 
> 
> MIGRATE_PFN_WRITE may mot being used but that seems like a bug to me.
> If a page is migrated to device memory, it could be mapped at the same
> time to avoid a device page fault but it would need the flag to know
> whether to map it RW or RO. But I suppose that could be inferred from
> the vma->vm_flags.

It is a bug that it is not being used right now. I will have to dig through my
git repo to see when that got killed. I will look into it once I get back.

The vma->vm_flags is of no use here. A page can be write protected
inside a writable vma for various reasons.
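
For illustration, a minimal sketch of the intended driver-side use of the
flag (hypothetical helper, mirroring the kind of check the nouveau migrate
path does):

    #include <linux/migrate.h>

    /* Map the destination page writable only if the source entry was. */
    static bool dst_should_be_writable(unsigned long src_mpfn)
    {
            return (src_mpfn & MIGRATE_PFN_MIGRATE) &&
                   (src_mpfn & MIGRATE_PFN_WRITE);
    }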

Cheers,
Jérôme

Re: [PATCH 1/9] mm: turn migrate_vma upside down

2019-07-29 Thread Jerome Glisse
On Mon, Jul 29, 2019 at 05:28:35PM +0300, Christoph Hellwig wrote:
> There isn't any good reason to pass callbacks to migrate_vma.  Instead
> we can just export the three steps done by this function to drivers and
> let them sequence the operation without callbacks.  This removes a lot
> of boilerplate code as-is, and will allow the drivers to drastically
> improve code flow and error handling further on.
> 
> Signed-off-by: Christoph Hellwig 


I haven't finished the review, especially the nouveau code; I will look
into it once I get back. In the meantime below are a few corrections.

> ---
>  Documentation/vm/hmm.rst   |  55 +-
>  drivers/gpu/drm/nouveau/nouveau_dmem.c | 122 +++--
>  include/linux/migrate.h| 118 ++--
>  mm/migrate.c   | 242 +++--
>  4 files changed, 193 insertions(+), 344 deletions(-)
> 

[...]

> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8992741f10aa..dc4e60a496f2 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2118,16 +2118,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct 
> *mm,
>  #endif /* CONFIG_NUMA */
>  
>  #if defined(CONFIG_MIGRATE_VMA_HELPER)
> -struct migrate_vma {
> - struct vm_area_struct   *vma;
> - unsigned long   *dst;
> - unsigned long   *src;
> - unsigned long   cpages;
> - unsigned long   npages;
> - unsigned long   start;
> - unsigned long   end;
> -};
> -
>  static int migrate_vma_collect_hole(unsigned long start,
>   unsigned long end,
>   struct mm_walk *walk)
> @@ -2578,6 +2568,108 @@ static void migrate_vma_unmap(struct migrate_vma 
> *migrate)
>   }
>  }
>  
> +/**
> + * migrate_vma_setup() - prepare to migrate a range of memory
> + * @args: contains the vma, start, and and pfns arrays for the migration
> + *
> + * Returns: negative errno on failures, 0 when 0 or more pages were migrated
> + * without an error.
> + *
> + * Prepare to migrate a range of memory virtual address range by collecting 
> all
> + * the pages backing each virtual address in the range, saving them inside 
> the
> + * src array.  Then lock those pages and unmap them. Once the pages are 
> locked
> + * and unmapped, check whether each page is pinned or not.  Pages that aren't
> + * pinned have the MIGRATE_PFN_MIGRATE flag set (by this function) in the
> + * corresponding src array entry.  Then restores any pages that are pinned, 
> by
> + * remapping and unlocking those pages.
> + *
> + * The caller should then allocate destination memory and copy source memory 
> to
> + * it for all those entries (ie with MIGRATE_PFN_VALID and 
> MIGRATE_PFN_MIGRATE
> + * flag set).  Once these are allocated and copied, the caller must update 
> each
> + * corresponding entry in the dst array with the pfn value of the destination
> + * page and with the MIGRATE_PFN_VALID and MIGRATE_PFN_LOCKED flags set
> + * (destination pages must have their struct pages locked, via lock_page()).
> + *
> + * Note that the caller does not have to migrate all the pages that are 
> marked
> + * with MIGRATE_PFN_MIGRATE flag in src array unless this is a migration from
> + * device memory to system memory.  If the caller cannot migrate a device 
> page
> + * back to system memory, then it must return VM_FAULT_SIGBUS, which will
> + * might have severe consequences for the userspace process, so it should best

  ^s/might//  ^s/should best/must/

> + * be avoided if possible.
 ^s/if possible//

Maybe add something about failing only on an unrecoverable device error. The
only reason we allow failure for migration here is because GPU devices can
go into a bad state (GPU lockup) and when that happens the GPU memory might be
corrupted (power to GPU memory might be cut by the GPU driver to recover the
GPU).

So failing migration back to main memory is only a last-resort event.


> + *
> + * For empty entries inside CPU page table (pte_none() or pmd_none() is 
> true) we
> + * do set MIGRATE_PFN_MIGRATE flag inside the corresponding source array thus
> + * allowing the caller to allocate device memory for those unback virtual
> + * address.  For this the caller simply havs to allocate device memory and
   ^ haves


Re: [PATCH 9/9] mm: remove the MIGRATE_PFN_WRITE flag

2019-07-29 Thread Jerome Glisse
On Mon, Jul 29, 2019 at 05:28:43PM +0300, Christoph Hellwig wrote:
> The MIGRATE_PFN_WRITE is only used locally in migrate_vma_collect_pmd,
> where it can be replaced with a simple boolean local variable.
> 
> Signed-off-by: Christoph Hellwig 

NAK, that flag is useful; for instance an anonymous vma might have
some of its pages read-only even if the vma has write permission.

It seems that the code in nouveau is wrong (probably lost
in various rebases/reworks) as this flag should be used to decide
whether to map the device memory with write permission or not.

I am traveling right now; I will investigate what happened to the
nouveau code.

Cheers,
Jérôme

> ---
>  include/linux/migrate.h | 1 -
>  mm/migrate.c| 9 +
>  2 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 8b46cfdb1a0e..ba74ef5a7702 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -165,7 +165,6 @@ static inline int migrate_misplaced_transhuge_page(struct 
> mm_struct *mm,
>  #define MIGRATE_PFN_VALID(1UL << 0)
>  #define MIGRATE_PFN_MIGRATE  (1UL << 1)
>  #define MIGRATE_PFN_LOCKED   (1UL << 2)
> -#define MIGRATE_PFN_WRITE(1UL << 3)
>  #define MIGRATE_PFN_SHIFT6
>  
>  static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 74735256e260..724f92dcc31b 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2212,6 +2212,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   unsigned long mpfn, pfn;
>   struct page *page;
>   swp_entry_t entry;
> + bool writable = false;
>   pte_t pte;
>  
>   pte = *ptep;
> @@ -2240,7 +2241,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   mpfn = migrate_pfn(page_to_pfn(page)) |
>   MIGRATE_PFN_MIGRATE;
>   if (is_write_device_private_entry(entry))
> - mpfn |= MIGRATE_PFN_WRITE;
> + writable = true;
>   } else {
>   if (is_zero_pfn(pfn)) {
>   mpfn = MIGRATE_PFN_MIGRATE;
> @@ -2250,7 +2251,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   }
>   page = vm_normal_page(migrate->vma, addr, pte);
>   mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
> - mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
> + if (pte_write(pte))
> + writable = true;
>   }
>  
>   /* FIXME support THP */
> @@ -2284,8 +2286,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   ptep_get_and_clear(mm, addr, ptep);
>  
>   /* Setup special migration page table entry */
> - entry = make_migration_entry(page, mpfn &
> -  MIGRATE_PFN_WRITE);
> + entry = make_migration_entry(page, writable);
>   swp_pte = swp_entry_to_pte(entry);
>   if (pte_soft_dirty(pte))
>   swp_pte = pte_swp_mksoft_dirty(swp_pte);
> -- 
> 2.20.1
> 

Re: [PATCH] drm/nouveau/svm: Convert to use hmm_range_fault()

2019-06-17 Thread Jerome Glisse
On Sat, Jun 08, 2019 at 12:14:50AM +0530, Souptick Joarder wrote:
> Hi Jason,
> 
> On Tue, May 21, 2019 at 12:27 AM Souptick Joarder  
> wrote:
> >
> > Convert to use hmm_range_fault().
> >
> > Signed-off-by: Souptick Joarder 
> 
> Would you like to take it through your new hmm tree or do I
> need to resend it ?

This patch is wrong as the API is different between the two; look at
hmm.h to see the differences between hmm_vma_fault() and hmm_range_fault().
A simple rename breaks things.

> 
> > ---
> >  drivers/gpu/drm/nouveau/nouveau_svm.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
> > b/drivers/gpu/drm/nouveau/nouveau_svm.c
> > index 93ed43c..8d56bd6 100644
> > --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
> > +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
> > @@ -649,7 +649,7 @@ struct nouveau_svmm {
> > range.values = nouveau_svm_pfn_values;
> > range.pfn_shift = NVIF_VMM_PFNMAP_V0_ADDR_SHIFT;
> >  again:
> > -   ret = hmm_vma_fault(, true);
> > +   ret = hmm_range_fault(, true);
> > if (ret == 0) {
> > mutex_lock(>mutex);
> > if (!hmm_vma_range_done()) {
> > --
> > 1.9.1
> >

Re: [PATCH 4/4] mm, notifier: Add a lockdep map for invalidate_range_start

2019-05-21 Thread Jerome Glisse
On Tue, May 21, 2019 at 06:00:36PM +0200, Daniel Vetter wrote:
> On Tue, May 21, 2019 at 5:41 PM Jerome Glisse  wrote:
> >
> > On Mon, May 20, 2019 at 11:39:45PM +0200, Daniel Vetter wrote:
> > > This is a similar idea to the fs_reclaim fake lockdep lock. It's
> > > fairly easy to provoke a specific notifier to be run on a specific
> > > range: Just prep it, and then munmap() it.
> > >
> > > A bit harder, but still doable, is to provoke the mmu notifiers for
> > > all the various callchains that might lead to them. But both at the
> > > same time is really hard to reliable hit, especially when you want to
> > > exercise paths like direct reclaim or compaction, where it's not
> > > easy to control what exactly will be unmapped.
> > >
> > > By introducing a lockdep map to tie them all together we allow lockdep
> > > to see a lot more dependencies, without having to actually hit them
> > > in a single challchain while testing.
> > >
> > > Aside: Since I typed this to test i915 mmu notifiers I've only rolled
> > > this out for the invaliate_range_start callback. If there's
> > > interest, we should probably roll this out to all of them. But my
> > > undestanding of core mm is seriously lacking, and I'm not clear on
> > > whether we need a lockdep map for each callback, or whether some can
> > > be shared.
> >
> > I need to read more on lockdep but it is legal to have mmu notifier
> > invalidation within each other. For instance when you munmap you
> > might split a huge pmd and it will trigger a second invalidate range
> > while the munmap one is not done yet. Would that trigger the lockdep
> > here ?
> 
> Depends how it's nesting. I'm wrapping the annotation only just around
> the individual mmu notifier callback, so if the nesting is just
> - munmap starts
> - invalidate_range_start #1
> - we noticed that there's a huge pmd we need to split
> - invalidate_range_start #2
> - invalidate_reange_end #2
> - invalidate_range_end #1
> - munmap is done

Yeah, this is how it looks. All the callbacks from range_start #1 would
happen before range_start #2 happens, so we should be fine.

> 
> But if otoh it's ok to trigger the 2nd invalidate range from within an
> mmu_notifier->invalidate_range_start callback, then lockdep will be
> pissed about that.

No, it would be illegal for a callback to do that. There is no existing
callback that does that, at least AFAIK. So we can just say that it
is illegal; I do not see the point of allowing it.

> 
> > Worst case i can think of is 2 invalidate_range_start chain one after
> > the other. I don't think you can triggers a 3 levels nesting but maybe.
> 
> Lockdep has special nesting annotations. I think it'd be more an issue
> of getting those funneled through the entire call chain, assuming we
> really need that.

I think we are fine. So this patch looks good.

Reviewed-by: Jérôme Glisse 

Re: [PATCH 1/4] mm: Check if mmu notifier callbacks are allowed to fail

2019-05-21 Thread Jerome Glisse
On Mon, May 20, 2019 at 11:39:42PM +0200, Daniel Vetter wrote:
> Just a bit of paranoia, since if we start pushing this deep into
> callchains it's hard to spot all places where an mmu notifier
> implementation might fail when it's not allowed to.
> 
> Inspired by some confusion we had discussing i915 mmu notifiers and
> whether we could use the newly-introduced return value to handle some
> corner cases. Until we realized that these are only for when a task
> has been killed by the oom reaper.
> 
> An alternative approach would be to split the callback into two
> versions, one with the int return value, and the other with void
> return value like in older kernels. But that's a lot more churn for
> fairly little gain I think.
> 
> Summary from the m-l discussion on why we want something at warning
> level: This allows automated tooling in CI to catch bugs without
> humans having to look at everything. If we just upgrade the existing
> pr_info to a pr_warn, then we'll have false positives. And as-is, no
> one will ever spot the problem since it's lost in the massive amounts
> of overall dmesg noise.
> 
> v2: Drop the full WARN_ON backtrace in favour of just a pr_warn for
> the problematic case (Michal Hocko).
> 
> v3: Rebase on top of Glisse's arg rework.
> 
> v4: More rebase on top of Glisse reworking everything.
> 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: "Christian König" 
> Cc: David Rientjes 
> Cc: Daniel Vetter 
> Cc: "Jérôme Glisse" 
> Cc: linux...@kvack.org
> Cc: Paolo Bonzini 
> Reviewed-by: Christian König 
> Signed-off-by: Daniel Vetter 

Reviewed-by: Jérôme Glisse 

> ---
>  mm/mmu_notifier.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index ee36068077b6..c05e406a7cd7 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -181,6 +181,9 @@ int __mmu_notifier_invalidate_range_start(struct 
> mmu_notifier_range *range)
>   pr_info("%pS callback failed with %d in 
> %sblockable context.\n",
>   mn->ops->invalidate_range_start, _ret,
>   !mmu_notifier_range_blockable(range) ? 
> "non-" : "");
> + if (!mmu_notifier_range_blockable(range))
> + pr_warn("%pS callback failure not 
> allowed\n",
> + 
> mn->ops->invalidate_range_start);
>   ret = _ret;
>   }
>   }
> -- 
> 2.20.1
> 


Re: [PATCH 4/4] mm, notifier: Add a lockdep map for invalidate_range_start

2019-05-21 Thread Jerome Glisse
On Mon, May 20, 2019 at 11:39:45PM +0200, Daniel Vetter wrote:
> This is a similar idea to the fs_reclaim fake lockdep lock. It's
> fairly easy to provoke a specific notifier to be run on a specific
> range: Just prep it, and then munmap() it.
> 
> A bit harder, but still doable, is to provoke the mmu notifiers for
> all the various callchains that might lead to them. But both at the
> same time is really hard to reliable hit, especially when you want to
> exercise paths like direct reclaim or compaction, where it's not
> easy to control what exactly will be unmapped.
> 
> By introducing a lockdep map to tie them all together we allow lockdep
> to see a lot more dependencies, without having to actually hit them
> in a single challchain while testing.
> 
> Aside: Since I typed this to test i915 mmu notifiers I've only rolled
> this out for the invaliate_range_start callback. If there's
> interest, we should probably roll this out to all of them. But my
> undestanding of core mm is seriously lacking, and I'm not clear on
> whether we need a lockdep map for each callback, or whether some can
> be shared.

I need to read more on lockdep, but it is legal to have mmu notifier
invalidations nested within each other. For instance, when you munmap you
might split a huge pmd and that will trigger a second invalidate range
while the munmap one is not done yet. Would that trigger the lockdep
here?

Worst case I can think of is two invalidate_range_start chains one after
the other. I don't think you can trigger a 3-level nesting, but maybe.
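
(For readers without the patch body in front of them, the fs_reclaim-style
pattern being proposed is roughly the following sketch; the map name is a
placeholder, not the actual patch:)

	static struct lockdep_map mn_invalidate_map = {
		.name = "mmu_notifier_invalidate_range_start",
	};
	...
		lock_map_acquire(&mn_invalidate_map);
		_ret = mn->ops->invalidate_range_start(mn, range);
		lock_map_release(&mn_invalidate_map);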

Cheers,
Jérôme

Re: [PATCH 3/4] mm, notifier: Catch sleeping/blocking for !blockable

2019-05-21 Thread Jerome Glisse
On Mon, May 20, 2019 at 11:39:44PM +0200, Daniel Vetter wrote:
> We need to make sure implementations don't cheat and don't have a
> possible schedule/blocking point deeply burried where review can't
> catch it.
> 
> I'm not sure whether this is the best way to make sure all the
> might_sleep() callsites trigger, and it's a bit ugly in the code flow.
> But it gets the job done.
> 
> Inspired by an i915 patch series which did exactly that, because the
> rules haven't been entirely clear to us.
> 
> v2: Use the shiny new non_block_start/end annotations instead of
> abusing preempt_disable/enable.
> 
> v3: Rebase on top of Glisse's arg rework.
> 
> v4: Rebase on top of more Glisse rework.
> 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: David Rientjes 
> Cc: "Christian König" 
> Cc: Daniel Vetter 
> Cc: "Jérôme Glisse" 
> Cc: linux...@kvack.org
> Reviewed-by: Christian König 
> Signed-off-by: Daniel Vetter 
> ---
>  mm/mmu_notifier.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index c05e406a7cd7..a09e737711d5 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -176,7 +176,13 @@ int __mmu_notifier_invalidate_range_start(struct 
> mmu_notifier_range *range)
>   id = srcu_read_lock();
>   hlist_for_each_entry_rcu(mn, >mm->mmu_notifier_mm->list, hlist) {
>   if (mn->ops->invalidate_range_start) {
> - int _ret = mn->ops->invalidate_range_start(mn, range);
> + int _ret;
> +
> + if (!mmu_notifier_range_blockable(range))
> + non_block_start();
> + _ret = mn->ops->invalidate_range_start(mn, range);
> + if (!mmu_notifier_range_blockable(range))
> + non_block_end();

This is a taste thing so feel free to ignore it, as maybe others
will dislike more what I prefer:

+		if (!mmu_notifier_range_blockable(range)) {
+			non_block_start();
+			_ret = mn->ops->invalidate_range_start(mn, range);
+			non_block_end();
+		} else
+			_ret = mn->ops->invalidate_range_start(mn, range);

If only we had predicates on the CPU like on the GPU :)

In any case:

Reviewed-by: Jérôme Glisse 


>   if (_ret) {
>   pr_info("%pS callback failed with %d in 
> %sblockable context.\n",
>   mn->ops->invalidate_range_start, _ret,
> -- 
> 2.20.1
> 

Re: [PATCH 1/2] mm/hmm: support automatic NUMA balancing

2019-05-13 Thread Jerome Glisse
On Mon, May 13, 2019 at 02:27:20PM -0700, Andrew Morton wrote:
> On Fri, 10 May 2019 19:53:23 + "Kuehling, Felix"  
> wrote:
> 
> > From: Philip Yang 
> > 
> > While the page is migrating by NUMA balancing, HMM failed to detect this
> > condition and still return the old page. Application will use the new
> > page migrated, but driver pass the old page physical address to GPU,
> > this crash the application later.
> > 
> > Use pte_protnone(pte) to return this condition and then hmm_vma_do_fault
> > will allocate new page.
> > 
> > Signed-off-by: Philip Yang 
> 
> This should have included your signed-off-by:, since you were on the
> patch delivery path.  I'll make that change to my copy of the patch,
> OK?

Yes it should have included that.

Re: [PATCH 2/2] mm/hmm: Only set FAULT_FLAG_ALLOW_RETRY for non-blocking

2019-05-13 Thread Jerome Glisse
Andrew, can we get these 2 fixes lined up for 5.2?

On Mon, May 13, 2019 at 07:36:44PM +, Kuehling, Felix wrote:
> Hi Jerome,
> 
> Do you want me to push the patches to your branch? Or are you going to 
> apply them yourself?
> 
> Is your hmm-5.2-v3 branch going to make it into Linux 5.2? If so, do you 
> know when? I'd like to coordinate with Dave Airlie so that we can also 
> get that update into a drm-next branch soon.
> 
> I see that Linus merged Dave's pull request for Linux 5.2, which 
> includes the first changes in amdgpu using HMM. They're currently broken 
> without these two patches.

HMM patches do not go through any git branch, they go through the mmotm
collection. So it is not something you can easily coordinate with a drm
branch.

By broken I expect you mean that it breaks if NUMA balancing happens?
Or that it might sleep when you are not expecting it to?

Cheers,
Jérôme

> 
> Thanks,
>    Felix
> 
> On 2019-05-10 4:14 p.m., Jerome Glisse wrote:
> > [CAUTION: External Email]
> >
> > On Fri, May 10, 2019 at 07:53:24PM +, Kuehling, Felix wrote:
> >> Don't set this flag by default in hmm_vma_do_fault. It is set
> >> conditionally just a few lines below. Setting it unconditionally
> >> can lead to handle_mm_fault doing a non-blocking fault, returning
> >> -EBUSY and unlocking mmap_sem unexpectedly.
> >>
> >> Signed-off-by: Felix Kuehling 
> > Reviewed-by: Jérôme Glisse 
> >
> >> ---
> >>   mm/hmm.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/mm/hmm.c b/mm/hmm.c
> >> index b65c27d5c119..3c4f1d62202f 100644
> >> --- a/mm/hmm.c
> >> +++ b/mm/hmm.c
> >> @@ -339,7 +339,7 @@ struct hmm_vma_walk {
> >>   static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
> >>bool write_fault, uint64_t *pfn)
> >>   {
> >> - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
> >> + unsigned int flags = FAULT_FLAG_REMOTE;
> >>struct hmm_vma_walk *hmm_vma_walk = walk->private;
> >>struct hmm_range *range = hmm_vma_walk->range;
> >>struct vm_area_struct *vma = walk->vma;
> >> --
> >> 2.17.1
> >>

Re: [PATCH 1/2] mm/hmm: support automatic NUMA balancing

2019-05-10 Thread Jerome Glisse
On Fri, May 10, 2019 at 07:53:23PM +, Kuehling, Felix wrote:
> From: Philip Yang 
> 
> While the page is migrating by NUMA balancing, HMM failed to detect this
> condition and still return the old page. Application will use the new
> page migrated, but driver pass the old page physical address to GPU,
> this crash the application later.
> 
> Use pte_protnone(pte) to return this condition and then hmm_vma_do_fault
> will allocate new page.
> 
> Signed-off-by: Philip Yang 

Reviewed-by: Jérôme Glisse 

> ---
>  mm/hmm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 75d2ea906efb..b65c27d5c119 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -554,7 +554,7 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
>  
>  static inline uint64_t pte_to_hmm_pfn_flags(struct hmm_range *range, pte_t 
> pte)
>  {
> - if (pte_none(pte) || !pte_present(pte))
> + if (pte_none(pte) || !pte_present(pte) || pte_protnone(pte))
>   return 0;
>   return pte_write(pte) ? range->flags[HMM_PFN_VALID] |
>   range->flags[HMM_PFN_WRITE] :
> -- 
> 2.17.1
> 

Re: [PATCH 2/2] mm/hmm: Only set FAULT_FLAG_ALLOW_RETRY for non-blocking

2019-05-10 Thread Jerome Glisse
On Fri, May 10, 2019 at 07:53:24PM +, Kuehling, Felix wrote:
> Don't set this flag by default in hmm_vma_do_fault. It is set
> conditionally just a few lines below. Setting it unconditionally
> can lead to handle_mm_fault doing a non-blocking fault, returning
> -EBUSY and unlocking mmap_sem unexpectedly.
> 
> Signed-off-by: Felix Kuehling 

Reviewed-by: Jérôme Glisse 

> ---
>  mm/hmm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index b65c27d5c119..3c4f1d62202f 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -339,7 +339,7 @@ struct hmm_vma_walk {
>  static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
>   bool write_fault, uint64_t *pfn)
>  {
> - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_REMOTE;
> + unsigned int flags = FAULT_FLAG_REMOTE;
>   struct hmm_vma_walk *hmm_vma_walk = walk->private;
>   struct hmm_range *range = hmm_vma_walk->range;
>   struct vm_area_struct *vma = walk->vma;
> -- 
> 2.17.1
> 

Re: [PATCH] drm/nouveau: Fix DEVICE_PRIVATE dependencies

2019-04-17 Thread Jerome Glisse
On Wed, Apr 17, 2019 at 10:26:32PM +0800, Yue Haibing wrote:
> From: YueHaibing 
> 
> During randconfig builds, I occasionally run into an invalid configuration
> 
> WARNING: unmet direct dependencies detected for DEVICE_PRIVATE
>   Depends on [n]: ARCH_HAS_HMM_DEVICE [=n] && ZONE_DEVICE [=n]
>   Selected by [y]:
>   - DRM_NOUVEAU_SVM [=y] && HAS_IOMEM [=y] && ARCH_HAS_HMM [=y] && 
> DRM_NOUVEAU [=y] && STAGING [=y]
> 
> mm/memory.o: In function `do_swap_page':
> memory.c:(.text+0x2754): undefined reference to `device_private_entry_fault'
> 
> commit 5da25090ab04 ("mm/hmm: kconfig split HMM address space mirroring from 
> device memory")
> split CONFIG_DEVICE_PRIVATE dependencies from
> ARCH_HAS_HMM to ARCH_HAS_HMM_DEVICE and ZONE_DEVICE,
> so enable DRM_NOUVEAU_SVM will trigger this warning,
> cause building failed.
> 
> Reported-by: Hulk Robot 
> Fixes: 5da25090ab04 ("mm/hmm: kconfig split HMM address space mirroring from 
> device memory")
> Signed-off-by: YueHaibing 

Reviewed-by: Jérôme Glisse 

> ---
>  drivers/gpu/drm/nouveau/Kconfig | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/Kconfig b/drivers/gpu/drm/nouveau/Kconfig
> index 00cd9ab..99e30c1 100644
> --- a/drivers/gpu/drm/nouveau/Kconfig
> +++ b/drivers/gpu/drm/nouveau/Kconfig
> @@ -74,7 +74,8 @@ config DRM_NOUVEAU_BACKLIGHT
>  
>  config DRM_NOUVEAU_SVM
>   bool "(EXPERIMENTAL) Enable SVM (Shared Virtual Memory) support"
> - depends on ARCH_HAS_HMM
> + depends on ARCH_HAS_HMM_DEVICE
> + depends on ZONE_DEVICE
>   depends on DRM_NOUVEAU
>   depends on STAGING
>   select HMM_MIRROR
> -- 
> 2.7.4
> 
> 

Re: [PATCH 2/9] mm: Add an apply_to_pfn_range interface

2019-04-17 Thread Jerome Glisse
On Wed, Apr 17, 2019 at 09:15:52AM +, Thomas Hellstrom wrote:
> On Tue, 2019-04-16 at 10:46 -0400, Jerome Glisse wrote:
> > On Sat, Apr 13, 2019 at 08:34:02AM +, Thomas Hellstrom wrote:
> > > Hi, Jérôme
> > > 
> > > On Fri, 2019-04-12 at 17:07 -0400, Jerome Glisse wrote:
> > > > On Fri, Apr 12, 2019 at 04:04:18PM +, Thomas Hellstrom wrote:

[...]

> > > > > -/*
> > > > > - * Scan a region of virtual memory, filling in page tables as
> > > > > necessary
> > > > > - * and calling a provided function on each leaf page table.
> > > > > +/**
> > > > > + * apply_to_pfn_range - Scan a region of virtual memory,
> > > > > calling a
> > > > > provided
> > > > > + * function on each leaf page table entry
> > > > > + * @closure: Details about how to scan and what function to
> > > > > apply
> > > > > + * @addr: Start virtual address
> > > > > + * @size: Size of the region
> > > > > + *
> > > > > + * If @closure->alloc is set to 1, the function will fill in
> > > > > the
> > > > > page table
> > > > > + * as necessary. Otherwise it will skip non-present parts.
> > > > > + * Note: The caller must ensure that the range does not
> > > > > contain
> > > > > huge pages.
> > > > > + * The caller must also assure that the proper mmu_notifier
> > > > > functions are
> > > > > + * called. Either in the pte leaf function or before and after
> > > > > the
> > > > > call to
> > > > > + * apply_to_pfn_range.
> > > > 
> > > > This is wrong there should be a big FAT warning that this can
> > > > only be
> > > > use
> > > > against mmap of device file. The page table walking above is
> > > > broken
> > > > for
> > > > various thing you might find in any other vma like THP, device
> > > > pte,
> > > > hugetlbfs,
> > > 
> > > I was figuring since we didn't export the function anymore, the
> > > warning
> > > and checks could be left to its users, assuming that any other
> > > future
> > > usage of this function would require mm people audit anyway. But I
> > > can
> > > of course add that warning also to this function if you still want
> > > that?
> > 
> > Yeah more warning are better, people might start using this, i know
> > some poeple use unexported symbol and then report bugs while they
> > just were doing something illegal.
> > 
> > > > ...
> > > > 
> > > > Also the mmu notifier can not be call from the pfn callback as
> > > > that
> > > > callback
> > > > happens under page table lock (the change_pte notifier callback
> > > > is
> > > > useless
> > > > and not enough). So it _must_ happen around the call to
> > > > apply_to_pfn_range
> > > 
> > > In the comments I was having in mind usage of, for example
> > > ptep_clear_flush_notify(). But you're the mmu_notifier expert here.
> > > Are
> > > you saying that function by itself would not be sufficient?
> > > In that case, should I just scratch the text mentioning the pte
> > > leaf
> > > function?
> > 
> > ptep_clear_flush_notify() is useless ... i have posted patches to
> > either
> > restore it or remove it. In any case you must call mmu notifier range
> > and
> > they can not happen under lock. You usage looked fine (in the next
> > patch)
> > but i would rather have a bit of comment here to make sure people are
> > also
> > aware of that.
> > 
> > While we can hope that people would cc mm when using mm function, it
> > is
> > not always the case. So i rather be cautious and warn in comment as
> > much
> > as possible.
> > 
> 
> OK. Understood. All this actually makes me tend to want to try a bit
> harder using a slight modification to the pagewalk code instead. Don't
> really want to encourage two parallel code paths doing essentially the
> same thing; one good and one bad.
> 
> One thing that confuses me a bit with the pagewalk code is that callers
> (for example softdirty) typically call
> mmu_notifier_invalidate_range_start() around the pagewalk, but then if
> it ends up splitting a pmd, mmu_notifier_invalidate_range is called
> again, within the first range. 

Re: [PATCH 2/9] mm: Add an apply_to_pfn_range interface

2019-04-16 Thread Jerome Glisse
On Sat, Apr 13, 2019 at 08:34:02AM +, Thomas Hellstrom wrote:
> Hi, Jérôme
> 
> On Fri, 2019-04-12 at 17:07 -0400, Jerome Glisse wrote:
> > On Fri, Apr 12, 2019 at 04:04:18PM +, Thomas Hellstrom wrote:
> > > This is basically apply_to_page_range with added functionality:
> > > Allocating missing parts of the page table becomes optional, which
> > > means that the function can be guaranteed not to error if
> > > allocation
> > > is disabled. Also passing of the closure struct and callback
> > > function
> > > becomes different and more in line with how things are done
> > > elsewhere.
> > > 
> > > Finally we keep apply_to_page_range as a wrapper around
> > > apply_to_pfn_range
> > > 
> > > The reason for not using the page-walk code is that we want to
> > > perform
> > > the page-walk on vmas pointing to an address space without
> > > requiring the
> > > mmap_sem to be held rather thand on vmas belonging to a process
> > > with the
> > > mmap_sem held.
> > > 
> > > Notable changes since RFC:
> > > Don't export apply_to_pfn range.
> > > 
> > > Cc: Andrew Morton 
> > > Cc: Matthew Wilcox 
> > > Cc: Will Deacon 
> > > Cc: Peter Zijlstra 
> > > Cc: Rik van Riel 
> > > Cc: Minchan Kim 
> > > Cc: Michal Hocko 
> > > Cc: Huang Ying 
> > > Cc: Souptick Joarder 
> > > Cc: "Jérôme Glisse" 
> > > Cc: linux...@kvack.org
> > > Cc: linux-ker...@vger.kernel.org
> > > Signed-off-by: Thomas Hellstrom 
> > > ---
> > >  include/linux/mm.h |  10 
> > >  mm/memory.c| 130 ++---
> > > 
> > >  2 files changed, 108 insertions(+), 32 deletions(-)
> > > 
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 80bb6408fe73..b7dd4ddd6efb 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -2632,6 +2632,16 @@ typedef int (*pte_fn_t)(pte_t *pte,
> > > pgtable_t token, unsigned long addr,
> > >  extern int apply_to_page_range(struct mm_struct *mm, unsigned long
> > > address,
> > >  unsigned long size, pte_fn_t fn, void
> > > *data);
> > >  
> > > +struct pfn_range_apply;
> > > +typedef int (*pter_fn_t)(pte_t *pte, pgtable_t token, unsigned
> > > long addr,
> > > +  struct pfn_range_apply *closure);
> > > +struct pfn_range_apply {
> > > + struct mm_struct *mm;
> > > + pter_fn_t ptefn;
> > > + unsigned int alloc;
> > > +};
> > > +extern int apply_to_pfn_range(struct pfn_range_apply *closure,
> > > +   unsigned long address, unsigned long
> > > size);
> > >  
> > >  #ifdef CONFIG_PAGE_POISONING
> > >  extern bool page_poisoning_enabled(void);
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index a95b4a3b1ae2..60d67158964f 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1938,18 +1938,17 @@ int vm_iomap_memory(struct vm_area_struct
> > > *vma, phys_addr_t start, unsigned long
> > >  }
> > >  EXPORT_SYMBOL(vm_iomap_memory);
> > >  
> > > -static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> > > -  unsigned long addr, unsigned long
> > > end,
> > > -  pte_fn_t fn, void *data)
> > > +static int apply_to_pte_range(struct pfn_range_apply *closure,
> > > pmd_t *pmd,
> > > +   unsigned long addr, unsigned long end)
> > >  {
> > >   pte_t *pte;
> > >   int err;
> > >   pgtable_t token;
> > >   spinlock_t *uninitialized_var(ptl);
> > >  
> > > - pte = (mm == _mm) ?
> > > + pte = (closure->mm == _mm) ?
> > >   pte_alloc_kernel(pmd, addr) :
> > > - pte_alloc_map_lock(mm, pmd, addr, );
> > > + pte_alloc_map_lock(closure->mm, pmd, addr, );
> > >   if (!pte)
> > >   return -ENOMEM;
> > >  
> > > @@ -1960,86 +1959,107 @@ static int apply_to_pte_range(struct
> > > mm_struct *mm, pmd_t *pmd,
> > >   token = pmd_pgtable(*pmd);
> > >  
> > >   do {
> > > - err = fn(pte++, token, addr, data);
> > > + err = closure->ptefn(pte++, token, addr, closure);
> > >   if (err)
>

Re: [PATCH 2/9] mm: Add an apply_to_pfn_range interface

2019-04-12 Thread Jerome Glisse
On Fri, Apr 12, 2019 at 04:04:18PM +, Thomas Hellstrom wrote:
> This is basically apply_to_page_range with added functionality:
> Allocating missing parts of the page table becomes optional, which
> means that the function can be guaranteed not to error if allocation
> is disabled. Also passing of the closure struct and callback function
> becomes different and more in line with how things are done elsewhere.
> 
> Finally we keep apply_to_page_range as a wrapper around apply_to_pfn_range
> 
> The reason for not using the page-walk code is that we want to perform
> the page-walk on vmas pointing to an address space without requiring the
> mmap_sem to be held rather thand on vmas belonging to a process with the
> mmap_sem held.
> 
> Notable changes since RFC:
> Don't export apply_to_pfn range.
> 
> Cc: Andrew Morton 
> Cc: Matthew Wilcox 
> Cc: Will Deacon 
> Cc: Peter Zijlstra 
> Cc: Rik van Riel 
> Cc: Minchan Kim 
> Cc: Michal Hocko 
> Cc: Huang Ying 
> Cc: Souptick Joarder 
> Cc: "Jérôme Glisse" 
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Thomas Hellstrom 
> ---
>  include/linux/mm.h |  10 
>  mm/memory.c| 130 ++---
>  2 files changed, 108 insertions(+), 32 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 80bb6408fe73..b7dd4ddd6efb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2632,6 +2632,16 @@ typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, 
> unsigned long addr,
>  extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
>  unsigned long size, pte_fn_t fn, void *data);
>  
> +struct pfn_range_apply;
> +typedef int (*pter_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
> +  struct pfn_range_apply *closure);
> +struct pfn_range_apply {
> + struct mm_struct *mm;
> + pter_fn_t ptefn;
> + unsigned int alloc;
> +};
> +extern int apply_to_pfn_range(struct pfn_range_apply *closure,
> +   unsigned long address, unsigned long size);
>  
>  #ifdef CONFIG_PAGE_POISONING
>  extern bool page_poisoning_enabled(void);
> diff --git a/mm/memory.c b/mm/memory.c
> index a95b4a3b1ae2..60d67158964f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1938,18 +1938,17 @@ int vm_iomap_memory(struct vm_area_struct *vma, 
> phys_addr_t start, unsigned long
>  }
>  EXPORT_SYMBOL(vm_iomap_memory);
>  
> -static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> -  unsigned long addr, unsigned long end,
> -  pte_fn_t fn, void *data)
> +static int apply_to_pte_range(struct pfn_range_apply *closure, pmd_t *pmd,
> +   unsigned long addr, unsigned long end)
>  {
>   pte_t *pte;
>   int err;
>   pgtable_t token;
>   spinlock_t *uninitialized_var(ptl);
>  
> - pte = (mm == _mm) ?
> + pte = (closure->mm == _mm) ?
>   pte_alloc_kernel(pmd, addr) :
> - pte_alloc_map_lock(mm, pmd, addr, );
> + pte_alloc_map_lock(closure->mm, pmd, addr, );
>   if (!pte)
>   return -ENOMEM;
>  
> @@ -1960,86 +1959,107 @@ static int apply_to_pte_range(struct mm_struct *mm, 
> pmd_t *pmd,
>   token = pmd_pgtable(*pmd);
>  
>   do {
> - err = fn(pte++, token, addr, data);
> + err = closure->ptefn(pte++, token, addr, closure);
>   if (err)
>   break;
>   } while (addr += PAGE_SIZE, addr != end);
>  
>   arch_leave_lazy_mmu_mode();
>  
> - if (mm != _mm)
> + if (closure->mm != _mm)
>   pte_unmap_unlock(pte-1, ptl);
>   return err;
>  }
>  
> -static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
> -  unsigned long addr, unsigned long end,
> -  pte_fn_t fn, void *data)
> +static int apply_to_pmd_range(struct pfn_range_apply *closure, pud_t *pud,
> +   unsigned long addr, unsigned long end)
>  {
>   pmd_t *pmd;
>   unsigned long next;
> - int err;
> + int err = 0;
>  
>   BUG_ON(pud_huge(*pud));
>  
> - pmd = pmd_alloc(mm, pud, addr);
> + pmd = pmd_alloc(closure->mm, pud, addr);
>   if (!pmd)
>   return -ENOMEM;
> +
>   do {
>   next = pmd_addr_end(addr, end);
> - err = apply_to_pte_range(mm, pmd, addr, next, fn, data);
> + if (!closure->alloc && pmd_none_or_clear_bad(pmd))
> + continue;
> + err = apply_to_pte_range(closure, pmd, addr, next);
>   if (err)
>   break;
>   } while (pmd++, addr = next, addr != end);
>   return err;
>  }
>  
> -static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
> -  unsigned long addr, unsigned long end,
> -  

Re: [PATCH v6 7/8] mm/mmu_notifier: pass down vma and reasons why mmu notifier is happening v2

2019-04-11 Thread Jerome Glisse
On Thu, Apr 11, 2019 at 03:21:08PM +, Weiny, Ira wrote:
> > On Wed, Apr 10, 2019 at 04:41:57PM -0700, Ira Weiny wrote:
> > > On Tue, Mar 26, 2019 at 12:47:46PM -0400, Jerome Glisse wrote:
> > > > From: Jérôme Glisse 
> > > >
> > > > CPU page table update can happens for many reasons, not only as a
> > > > result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...)
> > > > but also as a result of kernel activities (memory compression,
> > > > reclaim, migration, ...).
> > > >
> > > > Users of mmu notifier API track changes to the CPU page table and
> > > > take specific action for them. While current API only provide range
> > > > of virtual address affected by the change, not why the changes is
> > > > happening
> > > >
> > > > This patch is just passing down the new informations by adding it to
> > > > the mmu_notifier_range structure.
> > > >
> > > > Changes since v1:
> > > > - Initialize flags field from mmu_notifier_range_init()
> > > > arguments
> > > >
> > > > Signed-off-by: Jérôme Glisse 
> > > > Cc: Andrew Morton 
> > > > Cc: linux...@kvack.org
> > > > Cc: Christian König 
> > > > Cc: Joonas Lahtinen 
> > > > Cc: Jani Nikula 
> > > > Cc: Rodrigo Vivi 
> > > > Cc: Jan Kara 
> > > > Cc: Andrea Arcangeli 
> > > > Cc: Peter Xu 
> > > > Cc: Felix Kuehling 
> > > > Cc: Jason Gunthorpe 
> > > > Cc: Ross Zwisler 
> > > > Cc: Dan Williams 
> > > > Cc: Paolo Bonzini 
> > > > Cc: Radim Krčmář 
> > > > Cc: Michal Hocko 
> > > > Cc: Christian Koenig 
> > > > Cc: Ralph Campbell 
> > > > Cc: John Hubbard 
> > > > Cc: k...@vger.kernel.org
> > > > Cc: dri-devel@lists.freedesktop.org
> > > > Cc: linux-r...@vger.kernel.org
> > > > Cc: Arnd Bergmann 
> > > > ---
> > > >  include/linux/mmu_notifier.h | 6 +-
> > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/include/linux/mmu_notifier.h
> > > > b/include/linux/mmu_notifier.h index 62f94cd85455..0379956fff23
> > > > 100644
> > > > --- a/include/linux/mmu_notifier.h
> > > > +++ b/include/linux/mmu_notifier.h
> > > > @@ -58,10 +58,12 @@ struct mmu_notifier_mm {  #define
> > > > MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> > > >
> > > >  struct mmu_notifier_range {
> > > > +   struct vm_area_struct *vma;
> > > > struct mm_struct *mm;
> > > > unsigned long start;
> > > > unsigned long end;
> > > > unsigned flags;
> > > > +   enum mmu_notifier_event event;
> > > >  };
> > > >
> > > >  struct mmu_notifier_ops {
> > > > @@ -363,10 +365,12 @@ static inline void
> > mmu_notifier_range_init (struct mmu_notifier_range *range,
> > > >unsigned long start,
> > > >unsigned long end)
> > > >  {
> > > > +   range->vma = vma;
> > > > +   range->event = event;
> > > > range->mm = mm;
> > > > range->start = start;
> > > > range->end = end;
> > > > -   range->flags = 0;
> > > > +   range->flags = flags;
> > >
> > > Which of the "user patch sets" uses the new flags?
> > >
> > > I'm not seeing that user yet.  In general I don't see anything wrong
> > > with the series and I like the idea of telling drivers why the invalidate 
> > > has
> > fired.
> > >
> > > But is the flags a future feature?
> > >
> > 
> > I believe the link were in the cover:
> > 
> > https://lkml.org/lkml/2019/1/23/833
> > https://lkml.org/lkml/2019/1/23/834
> > https://lkml.org/lkml/2019/1/23/832
> > https://lkml.org/lkml/2019/1/23/831
> > 
> > I have more coming for HMM but i am waiting after 5.2 once amdgpu HMM
> > patch are merge upstream as it will change what is passed down to driver
> > and it would conflict with non merged HMM driver (like amdgpu today).
> > 
> 
> Unfortunately this does not answer my question.  Yes I saw the links to the 
> patches which use this in the he

Re: [PATCH v6 7/8] mm/mmu_notifier: pass down vma and reasons why mmu notifier is happening v2

2019-04-11 Thread Jerome Glisse
On Wed, Apr 10, 2019 at 04:41:57PM -0700, Ira Weiny wrote:
> On Tue, Mar 26, 2019 at 12:47:46PM -0400, Jerome Glisse wrote:
> > From: Jérôme Glisse 
> > 
> > CPU page table update can happens for many reasons, not only as a result
> > of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also
> > as a result of kernel activities (memory compression, reclaim, migration,
> > ...).
> > 
> > Users of mmu notifier API track changes to the CPU page table and take
> > specific action for them. While current API only provide range of virtual
> > address affected by the change, not why the changes is happening
> > 
> > This patch is just passing down the new informations by adding it to the
> > mmu_notifier_range structure.
> > 
> > Changes since v1:
> > - Initialize flags field from mmu_notifier_range_init() arguments
> > 
> > Signed-off-by: Jérôme Glisse 
> > Cc: Andrew Morton 
> > Cc: linux...@kvack.org
> > Cc: Christian König 
> > Cc: Joonas Lahtinen 
> > Cc: Jani Nikula 
> > Cc: Rodrigo Vivi 
> > Cc: Jan Kara 
> > Cc: Andrea Arcangeli 
> > Cc: Peter Xu 
> > Cc: Felix Kuehling 
> > Cc: Jason Gunthorpe 
> > Cc: Ross Zwisler 
> > Cc: Dan Williams 
> > Cc: Paolo Bonzini 
> > Cc: Radim Krčmář 
> > Cc: Michal Hocko 
> > Cc: Christian Koenig 
> > Cc: Ralph Campbell 
> > Cc: John Hubbard 
> > Cc: k...@vger.kernel.org
> > Cc: dri-devel@lists.freedesktop.org
> > Cc: linux-r...@vger.kernel.org
> > Cc: Arnd Bergmann 
> > ---
> >  include/linux/mmu_notifier.h | 6 +-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 62f94cd85455..0379956fff23 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -58,10 +58,12 @@ struct mmu_notifier_mm {
> >  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> >  
> >  struct mmu_notifier_range {
> > +   struct vm_area_struct *vma;
> > struct mm_struct *mm;
> > unsigned long start;
> > unsigned long end;
> > unsigned flags;
> > +   enum mmu_notifier_event event;
> >  };
> >  
> >  struct mmu_notifier_ops {
> > @@ -363,10 +365,12 @@ static inline void mmu_notifier_range_init(struct 
> > mmu_notifier_range *range,
> >unsigned long start,
> >unsigned long end)
> >  {
> > +   range->vma = vma;
> > +   range->event = event;
> > range->mm = mm;
> > range->start = start;
> > range->end = end;
> > -   range->flags = 0;
> > +   range->flags = flags;
> 
> Which of the "user patch sets" uses the new flags?
> 
> I'm not seeing that user yet.  In general I don't see anything wrong with the
> series and I like the idea of telling drivers why the invalidate has fired.
> 
> But is the flags a future feature?
> 

I believe the links were in the cover letter:

https://lkml.org/lkml/2019/1/23/833
https://lkml.org/lkml/2019/1/23/834
https://lkml.org/lkml/2019/1/23/832
https://lkml.org/lkml/2019/1/23/831
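
(As for the flags field itself: within this series its only content is
MMU_NOTIFIER_RANGE_BLOCKABLE, and the consumer side is just the helper added
earlier in the series -- roughly, on the listener side:)

	/* do not sleep when the invalidation is not allowed to block */
	if (!mmu_notifier_range_blockable(range))
		return -EAGAIN;	/* or take a non-sleeping path */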

I have more coming for HMM, but I am waiting until after 5.2, once the
amdgpu HMM patches are merged upstream, as it will change what is passed
down to drivers and it would conflict with non-merged HMM drivers (like
amdgpu today).

Cheers,
Jérôme

Re: [PATCH v6 0/8] mmu notifier provide context informations

2019-04-10 Thread Jerome Glisse
On Tue, Apr 09, 2019 at 03:08:55PM -0700, Andrew Morton wrote:
> On Tue, 26 Mar 2019 12:47:39 -0400 jgli...@redhat.com wrote:
> 
> > From: Jérôme Glisse 
> > 
> > (Andrew this apply on top of my HMM patchset as otherwise you will have
> >  conflict with changes to mm/hmm.c)
> > 
> > Changes since v5:
> > - drop KVM bits waiting for KVM people to express interest if they
> >   do not then i will post patchset to remove change_pte_notify as
> >   without the changes in v5 change_pte_notify is just useless (it
> >   it is useless today upstream it is just wasting cpu cycles)
> > - rebase on top of lastest Linus tree
> > 
> > Previous cover letter with minor update:
> > 
> > 
> > Here i am not posting users of this, they already have been posted to
> > appropriate mailing list [6] and will be merge through the appropriate
> > tree once this patchset is upstream.
> > 
> > Note that this serie does not change any behavior for any existing
> > code. It just pass down more information to mmu notifier listener.
> > 
> > The rational for this patchset:
> > 
> > CPU page table update can happens for many reasons, not only as a
> > result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...)
> > but also as a result of kernel activities (memory compression, reclaim,
> > migration, ...).
> > 
> > This patch introduce a set of enums that can be associated with each
> > of the events triggering a mmu notifier:
> > 
> > - UNMAP: munmap() or mremap()
> > - CLEAR: page table is cleared (migration, compaction, reclaim, ...)
> > - PROTECTION_VMA: change in access protections for the range
> > - PROTECTION_PAGE: change in access protections for page in the range
> > - SOFT_DIRTY: soft dirtyness tracking
> > 
> > Being able to identify munmap() and mremap() from other reasons why the
> > page table is cleared is important to allow user of mmu notifier to
> > update their own internal tracking structure accordingly (on munmap or
> > mremap it is not longer needed to track range of virtual address as it
> > becomes invalid). Without this serie, driver are force to assume that
> > every notification is an munmap which triggers useless trashing within
> > drivers that associate structure with range of virtual address. Each
> > driver is force to free up its tracking structure and then restore it
> > on next device page fault. With this serie we can also optimize device
> > page table update [6].
> > 
> > More over this can also be use to optimize out some page table updates
> > like for KVM where we can update the secondary MMU directly from the
> > callback instead of clearing it.
> 
> We seem to be rather short of review input on this patchset.  ie: there
> is none.

I forgot to update the review tag but Ralph did review v5:
https://lkml.org/lkml/2019/2/22/564
https://lkml.org/lkml/2019/2/22/561
https://lkml.org/lkml/2019/2/22/558
https://lkml.org/lkml/2019/2/22/710
https://lkml.org/lkml/2019/2/22/711
https://lkml.org/lkml/2019/2/22/695
https://lkml.org/lkml/2019/2/22/738
https://lkml.org/lkml/2019/2/22/757

and since this v6 is just a rebase with better comments here and
there, I believe those reviews still hold.

> 
> > ACKS AMD/RADEON https://lkml.org/lkml/2019/2/1/395
> 
> OK, kind of ackish, but not a review.
> 
> > ACKS RDMA https://lkml.org/lkml/2018/12/6/1473
> 
> This actually acks the infiniband part of a patch which isn't in this
> series.

This is to show that they are end users and that those end users are
wanted. Also, obviously I will be using this within HMM and thus
it will be used by mlx5, nouveau and amdgpu (which are all the
HMM users that are either upstream or queued up for 5.2 or 5.3).

> So we have some work to do, please.  Who would be suitable reviewers?

Anyone willing to review mmu notifier code. I believe this patchset is
not that hard to review: it is about giving contextual information on
why mmu notifier invalidations are happening, it does not change the
logic of anything. There are no maintainers for the mmu notifier, so I
don't have a person I can single out for review, though given I have
been the one doing most changes in that area it could fall on me ...
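
(To make the "does not change the logic" point concrete, here is a sketch
of how a driver-side listener is meant to consume the new context; the
MMU_NOTIFY_* names come from the series, the my_*() helpers are made up:)

	static int my_invalidate_range_start(struct mmu_notifier *mn,
				const struct mmu_notifier_range *range)
	{
		/* always shoot down the device mapping for the range */
		my_unmap_device_range(mn, range->start, range->end);

		/* only munmap()/mremap() needs to drop the tracking itself */
		if (range->event == MMU_NOTIFY_UNMAP)
			my_free_range_tracking(mn, range->start, range->end);

		return 0;
	}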

Cheers,
Jérôme

Re: [PATCH v6 0/8] mmu notifier provide context informations

2019-04-09 Thread Jerome Glisse
Andrew, anything blocking this for 5.2? Should I ask people (i.e. the end
users of this) to re-ack v6 (it is the same as the previous version, just
rebased and with the kvm bits dropped)?



On Tue, Mar 26, 2019 at 12:47:39PM -0400, jgli...@redhat.com wrote:
> From: Jérôme Glisse 
> 
> (Andrew this apply on top of my HMM patchset as otherwise you will have
>  conflict with changes to mm/hmm.c)
> 
> Changes since v5:
> - drop KVM bits waiting for KVM people to express interest if they
>   do not then i will post patchset to remove change_pte_notify as
>   without the changes in v5 change_pte_notify is just useless (it
>   it is useless today upstream it is just wasting cpu cycles)
> - rebase on top of lastest Linus tree
> 
> Previous cover letter with minor update:
> 
> 
> Here i am not posting users of this, they already have been posted to
> appropriate mailing list [6] and will be merge through the appropriate
> tree once this patchset is upstream.
> 
> Note that this serie does not change any behavior for any existing
> code. It just pass down more information to mmu notifier listener.
> 
> The rational for this patchset:
> 
> CPU page table update can happens for many reasons, not only as a
> result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...)
> but also as a result of kernel activities (memory compression, reclaim,
> migration, ...).
> 
> This patch introduce a set of enums that can be associated with each
> of the events triggering a mmu notifier:
> 
> - UNMAP: munmap() or mremap()
> - CLEAR: page table is cleared (migration, compaction, reclaim, ...)
> - PROTECTION_VMA: change in access protections for the range
> - PROTECTION_PAGE: change in access protections for page in the range
> - SOFT_DIRTY: soft dirtyness tracking
> 
> Being able to identify munmap() and mremap() from other reasons why the
> page table is cleared is important to allow user of mmu notifier to
> update their own internal tracking structure accordingly (on munmap or
> mremap it is not longer needed to track range of virtual address as it
> becomes invalid). Without this serie, driver are force to assume that
> every notification is an munmap which triggers useless trashing within
> drivers that associate structure with range of virtual address. Each
> driver is force to free up its tracking structure and then restore it
> on next device page fault. With this serie we can also optimize device
> page table update [6].
> 
> More over this can also be use to optimize out some page table updates
> like for KVM where we can update the secondary MMU directly from the
> callback instead of clearing it.
> 
> ACKS AMD/RADEON https://lkml.org/lkml/2019/2/1/395
> ACKS RDMA https://lkml.org/lkml/2018/12/6/1473
> 
> Cheers,
> Jérôme
> 
> [1] v1 https://lkml.org/lkml/2018/3/23/1049
> [2] v2 https://lkml.org/lkml/2018/12/5/10
> [3] v3 https://lkml.org/lkml/2018/12/13/620
> [4] v4 https://lkml.org/lkml/2019/1/23/838
> [5] v5 https://lkml.org/lkml/2019/2/19/752
> [6] patches to use this:
> https://lkml.org/lkml/2019/1/23/833
> https://lkml.org/lkml/2019/1/23/834
> https://lkml.org/lkml/2019/1/23/832
> https://lkml.org/lkml/2019/1/23/831
> 
> Cc: Andrew Morton 
> Cc: linux...@kvack.org
> Cc: Christian König 
> Cc: Joonas Lahtinen 
> Cc: Jani Nikula 
> Cc: Rodrigo Vivi 
> Cc: Jan Kara 
> Cc: Andrea Arcangeli 
> Cc: Peter Xu 
> Cc: Felix Kuehling 
> Cc: Jason Gunthorpe 
> Cc: Ross Zwisler 
> Cc: Dan Williams 
> Cc: Paolo Bonzini 
> Cc: Alex Deucher 
> Cc: Radim Krčmář 
> Cc: Michal Hocko 
> Cc: Christian Koenig 
> Cc: Ben Skeggs 
> Cc: Ralph Campbell 
> Cc: John Hubbard 
> Cc: k...@vger.kernel.org
> Cc: dri-devel@lists.freedesktop.org
> Cc: linux-r...@vger.kernel.org
> Cc: Arnd Bergmann 
> 
> Jérôme Glisse (8):
>   mm/mmu_notifier: helper to test if a range invalidation is blockable
>   mm/mmu_notifier: convert user range->blockable to helper function
>   mm/mmu_notifier: convert mmu_notifier_range->blockable to a flags
>   mm/mmu_notifier: contextual information for event enums
>   mm/mmu_notifier: contextual information for event triggering
> invalidation v2
>   mm/mmu_notifier: use correct mmu_notifier events for each invalidation
>   mm/mmu_notifier: pass down vma and reasons why mmu notifier is
> happening v2
>   mm/mmu_notifier: mmu_notifier_range_update_to_read_only() helper
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  |  8 ++--
>  drivers/gpu/drm/i915/i915_gem_userptr.c |  2 +-
>  drivers/gpu/drm/radeon/radeon_mn.c  |  4 +-
>  drivers/infiniband/core/umem_odp.c  |  5 +-
>  drivers/xen/gntdev.c|  6 +--
>  fs/proc/task_mmu.c  |  3 +-
>  include/linux/mmu_notifier.h| 63 +++--
>  kernel/events/uprobes.c |  3 +-
>  mm/hmm.c|  6 +--
>  mm/huge_memory.c| 14 +++---
>  mm/hugetlb.c| 12 +++--
>  

Re: [RFC PATCH Xilinx Alveo 0/6] Xilinx PCIe accelerator driver

2019-04-03 Thread Jerome Glisse
On Fri, Mar 29, 2019 at 06:09:18PM -0700, Ronan KERYELL wrote:
> I am adding linux-f...@vger.kernel.org, since this is why I missed this
> thread in the first place...
> > On Fri, 29 Mar 2019 14:56:17 +1000, Dave Airlie  
> > said:
> Dave> On Thu, 28 Mar 2019 at 10:14, Sonal Santan  
> wrote:
> >>> From: Daniel Vetter [mailto:daniel.vet...@ffwll.ch]

[...]

> Long answer:
> 
> - processors, GPU and other digital circuits are designed from a lot of
>   elementary transistors, wires, capacitors, resistors... using some
>   very complex (and expensive) tools from some EDA companies but at the
>   end, after months of work, they come often with a "simple" public
>   interface, the... instruction set! So it is rather "easy" at the end
>   to generate some instructions with a compiler such as LLVM from a
>   description of this ISA or some reverse engineering. Note that even if
>   the ISA is public, it is very difficult to make another efficient
>   processor from scratch just from this ISA, so there is often no
>   concern about making this ISA public to develop the ecosystem ;
> 
> - FPGA are field-programmable gate arrays, made also from a lot of
>   elementary transistors, wires, capacitors, resistors... but organized
>   in billions of very low-level elementary gates, memory elements, DSP
>   blocks, I/O blocks, clock generators, specific
>   accelerators... directly exposed to the user and that can be
>   programmed according to a configuration memory (the bitstream) that
>   details how to connect each part, routing element, configuring each
>   elemental piece of hardware.  So instead of just writing instructions
>   like on a CPU or a GPU, you need to configure each bit of the
>   architecture in such a way it does something interesting for
>   you. Concretely, you write some programs in RTL languages (Verilog,
>   VHDL) or higher-level (C/C++, OpenCL, SYCL...)  and you use some very
>   complex (and expensive) tools from some EDA companies to generate the
>   bitstream implementing an equivalent circuit with the same
>   semantics. Since the architecture is so low level, there is a direct
>   mapping between the configuration memory (bitstream) and the hardware
>   architecture itself, so if it is public then it is easy to duplicate
>   the FPGA itself and to start a new FPGA company. That is unfortunately
>   something the existing FPGA companies do not want... ;-)

This is a completely bogus argument. All FPGA documentation I have seen so
far _extensively_ describes _each_ basic block within the FPGA; this includes
the excellent documentation Xilinx provides on the inner working and layout
of Xilinx FPGAs. The same applies to Altera, Atmel, Lattice, ...

The extensive public documentation is enough for anyone with the money and
with half-decent engineers to produce an FPGA.

The real know-how of an FPGA vendor is how to produce big chips on a small
process, capable of sustaining high clocks with the best power consumption
possible. This is the part where the years of experience of each company pay
off. The cost for anyone to come to the market is in the hundreds of millions
just in setup cost and to catch up with established vendors on the hardware
side. This without any guarantee of revenue at the end.

The bitstream only gives away which bits correspond to which wire, where the
LUT boolean table is stored, ... Bitstreams that have been reverse engineered
never revealed anything of value that was not already publicly documented.


So no, the bitstream has _no_ value; please prove me wrong with the Lattice
bitstream for instance. If anything, the fact that Lattice has a
reverse-engineered bitstream has made that FPGA popular with the maker
community, as it allows people to do experiments for which the closed source
tools are an impediment. So I would argue that an open bitstream is actually
beneficial.


The only valid reason I have ever seen for hiding the bitstream is to protect
the IP of the customer, i.e. those customers that can pour quite a lot of
money into designing something with an FPGA and then want to keep the
VHDL/Verilog protected and "safe" from reverse engineering.

But this is security by obscurity, and FPGA companies would be better off
providing strong bitstream encryption (and most already do, but I have seen
some papers on how to break them).


I would rather not see any bogus argument used to try to justify something
that is not justifiable.


Daniel already stressed that we need to know what the bitstream can do, and
it is even more important with FPGAs, where on some FPGAs AFAICT the
bitstream can have total control over the PCIe bus and thus can be used to
attack either main memory or other PCIe devices.

For instance, with ATS/PASID you can have the device send pre-translated
requests to the IOMMU and access any memory despite the IOMMU.

So without total confidence in what the bitstream can and cannot do, and thus
without knowledge of the bitstream format and how it maps to LUTs, switches,
crossbars, clocks, fixed blocks (PCIE, 

Re: [RFC PATCH RESEND 3/3] mm: Add write-protect and clean utilities for address space ranges

2019-03-21 Thread Jerome Glisse
On Thu, Mar 21, 2019 at 08:29:31PM +, Thomas Hellstrom wrote:
> On Thu, 2019-03-21 at 10:12 -0400, Jerome Glisse wrote:
> > On Thu, Mar 21, 2019 at 01:22:41PM +, Thomas Hellstrom wrote:
> > > Add two utilities to a) write-protect and b) clean all ptes
> > > pointing into
> > > a range of an address space
> > > The utilities are intended to aid in tracking dirty pages (either
> > > driver-allocated system memory or pci device memory).
> > > The write-protect utility should be used in conjunction with
> > > page_mkwrite() and pfn_mkwrite() to trigger write page-faults on
> > > page
> > > accesses. Typically one would want to use this on sparse accesses
> > > into
> > > large memory regions. The clean utility should be used to utilize
> > > hardware dirtying functionality and avoid the overhead of page-
> > > faults,
> > > typically on large accesses into small memory regions.
> > 
> > Again this does not use mmu notifier and there is no scary comment to
> > explain the very limited use case it should be use for ie mmap of a
> > device file and only by the device driver.
> 
> Scary comment and asserts will be added.
> 
> > 
> > Using it ouside of this would break softdirty or trigger false COW or
> > other scary thing.
> 
> This is something that should clearly be avoided if at all possible.
> False COWs could be avoided by asserting that VMAs are shared. I need
> to look deaper into softdirty, but note that the __mkwrite / dirty /
> clean pattern is already used in a very similar way in
> drivers/video/fb_defio.c although it operates only on real pages one at
> a time.

It should just be allowed only for mappings of device files for which none
of the above apply (softdirty, COW, ...).
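
(For context, a sketch of the driver-side call pattern I would expect around
a GPU access, based only on the prototypes quoted below -- not code from the
series:)

	pgoff_t start, end;

	/* collect pages dirtied by the CPU since the last GPU access */
	apply_as_clean(mapping, first_index, nr,
		       bitmap_pgoff, bitmap, &start, &end);
	/* then make the [start, end) part of the buffer visible to the GPU */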

> 
> > 
> > > Cc: Andrew Morton 
> > > Cc: Matthew Wilcox 
> > > Cc: Will Deacon 
> > > Cc: Peter Zijlstra 
> > > Cc: Rik van Riel 
> > > Cc: Minchan Kim 
> > > Cc: Michal Hocko 
> > > Cc: Huang Ying 
> > > Cc: Souptick Joarder 
> > > Cc: "Jérôme Glisse" 
> > > Cc: linux...@kvack.org
> > > Cc: linux-ker...@vger.kernel.org
> > > Signed-off-by: Thomas Hellstrom 
> > > ---
> > >  include/linux/mm.h  |   9 +-
> > >  mm/Makefile |   2 +-
> > >  mm/apply_as_range.c | 257
> > > 
> > >  3 files changed, 266 insertions(+), 2 deletions(-)
> > >  create mode 100644 mm/apply_as_range.c
> > > 
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index b7dd4ddd6efb..62f24dd0bfa0 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -2642,7 +2642,14 @@ struct pfn_range_apply {
> > >  };
> > >  extern int apply_to_pfn_range(struct pfn_range_apply *closure,
> > > unsigned long address, unsigned long
> > > size);
> > > -
> > > +unsigned long apply_as_wrprotect(struct address_space *mapping,
> > > +  pgoff_t first_index, pgoff_t nr);
> > > +unsigned long apply_as_clean(struct address_space *mapping,
> > > +  pgoff_t first_index, pgoff_t nr,
> > > +  pgoff_t bitmap_pgoff,
> > > +  unsigned long *bitmap,
> > > +  pgoff_t *start,
> > > +  pgoff_t *end);
> > >  #ifdef CONFIG_PAGE_POISONING
> > >  extern bool page_poisoning_enabled(void);
> > >  extern void kernel_poison_pages(struct page *page, int numpages,
> > > int enable);
> > > diff --git a/mm/Makefile b/mm/Makefile
> > > index d210cc9d6f80..a94b78f12692 100644
> > > --- a/mm/Makefile
> > > +++ b/mm/Makefile
> > > @@ -39,7 +39,7 @@ obj-y   := filemap.o mempool.o
> > > oom_kill.o fadvise.o \
> > >  mm_init.o mmu_context.o percpu.o
> > > slab_common.o \
> > >  compaction.o vmacache.o \
> > >  interval_tree.o list_lru.o workingset.o \
> > > -debug.o $(mmu-y)
> > > +debug.o apply_as_range.o $(mmu-y)
> > >  
> > >  obj-y += init-mm.o
> > >  obj-y += memblock.o
> > > diff --git a/mm/apply_as_range.c b/mm/apply_as_range.c
> > > new file mode 100644
> > > index ..9f03e272ebd0
> > > --- /dev/null
> > > +++ b/mm/apply_as_range.c
> > > @@ -0,0 +1,257 @@

Re: [RFC PATCH RESEND 0/3] mm modifications / helpers for emulated GPU coherent memory

2019-03-21 Thread Jerome Glisse
On Thu, Mar 21, 2019 at 07:51:16PM +, Thomas Hellstrom wrote:
> Hi, Jérôme,
> 
> Thanks for commenting. I have a couple of questions / clarifications
> below.
> 
> On Thu, 2019-03-21 at 09:46 -0400, Jerome Glisse wrote:
> > On Thu, Mar 21, 2019 at 01:22:22PM +, Thomas Hellstrom wrote:
> > > Resending since last series was sent through a mis-configured SMTP
> > > server.
> > > 
> > > Hi,
> > > This is an early RFC to make sure I don't go too far in the wrong
> > > direction.
> > > 
> > > Non-coherent GPUs that can't directly see contents in CPU-visible
> > > memory,
> > > like VMWare's SVGA device, run into trouble when trying to
> > > implement
> > > coherent memory requirements of modern graphics APIs. Examples are
> > > Vulkan and OpenGL 4.4's ARB_buffer_storage.
> > > 
> > > To remedy, we need to emulate coherent memory. Typically when it's
> > > detected
> > > that a buffer object is about to be accessed by the GPU, we need to
> > > gather the ranges that have been dirtied by the CPU since the last
> > > operation,
> > > apply an operation to make the content visible to the GPU and clear
> > > the
> > > the dirty tracking.
> > > 
> > > Depending on the size of the buffer object and the access pattern
> > > there are
> > > two major possibilities:
> > > 
> > > 1) Use page_mkwrite() and pfn_mkwrite(). (GPU buffer objects are
> > > backed
> > > either by PCI device memory or by driver-alloced pages).
> > > The dirty-tracking needs to be reset by write-protecting the
> > > affected ptes
> > > and flush tlb. This has a complexity of O(num_dirty_pages), but the
> > > write page-fault is of course costly.
> > > 
> > > 2) Use hardware dirty-flags in the ptes. The dirty-tracking needs
> > > to be reset
> > > by clearing the dirty bits and flush tlb. This has a complexity of
> > > O(num_buffer_object_pages) and dirty bits need to be scanned in
> > > full before
> > > each gpu-access.
> > > 
> > > So in practice the two methods need to be interleaved for best
> > > performance.
> > > 
> > > So to facilitate this, I propose two new helpers,
> > > apply_as_wrprotect() and
> > > apply_as_clean() ("as" stands for address-space) both inspired by
> > > unmap_mapping_range(). Users of these helpers are in the making,
> > > but needs
> > > some cleaning-up.
> > 
> > To be clear this should _only be use_ for mmap of device file ? If so
> > the API should try to enforce that as much as possible for instance
> > by
> > mandating the file as argument so that the function can check it is
> > only use in that case. Also big scary comment to make sure no one
> > just
> > start using those outside this very limited frame.
> 
> Fine with me. Perhaps we could BUG() / WARN() on certain VMA flags 
> instead of mandating the file as argument. That can make sure we
> don't accidently hit pages we shouldn't hit.

You already provide the mapping as an argument, so it should not be
hard to check that it is a mapping of a device file; the vma flags
alone will not be enough to identify this case.
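
A minimal sketch of such a check, assuming the helpers keep their
struct address_space argument; the S_ISCHR() test and the early return
are purely illustrative, not part of the proposed API:

	/* Illustrative only: refuse anything that is not an mmap of a
	 * device (character) file.  'mapping' is the struct address_space
	 * passed to apply_as_wrprotect()/apply_as_clean(). */
	struct inode *inode = mapping->host;

	if (WARN_ON_ONCE(!inode || !S_ISCHR(inode->i_mode)))
		return 0;	/* nothing touched; only device-file mappings supported */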

> 
> > 
> > > There's also a change to x_mkwrite() to allow dropping the mmap_sem
> > > while
> > > waiting.
> > 
> > This will most likely conflict with userfaultfd write protection. 
> 
> Are you referring to the x_mkwrite() usage itself or the mmap_sem
> dropping facilitation?

Both, I believe; however, I have not tried to apply your patches on top
of the userfaultfd patchset.

Cheers,
Jérôme

Re: [RFC PATCH RESEND 2/3] mm: Add an apply_to_pfn_range interface

2019-03-21 Thread Jerome Glisse
On Thu, Mar 21, 2019 at 07:59:35PM +, Thomas Hellstrom wrote:
> On Thu, 2019-03-21 at 09:52 -0400, Jerome Glisse wrote:
> > On Thu, Mar 21, 2019 at 01:22:35PM +, Thomas Hellstrom wrote:
> > > This is basically apply_to_page_range with added functionality:
> > > Allocating missing parts of the page table becomes optional, which
> > > means that the function can be guaranteed not to error if
> > > allocation
> > > is disabled. Also passing of the closure struct and callback
> > > function
> > > becomes different and more in line with how things are done
> > > elsewhere.
> > > 
> > > Finally we keep apply_to_page_range as a wrapper around
> > > apply_to_pfn_range
> > 
> > The apply_to_page_range() is dangerous API it does not follow other
> > mm patterns like mmu notifier. It is suppose to be use in arch code
> > or vmalloc or similar thing but not in regular driver code. I see
> > it has crept out of this and is being use by few device driver. I am
> > not sure we should encourage that.
> 
> I can certainly remove the EXPORT of the new apply_to_pfn_range() which
> will make sure its use stays within the mm code. I don't expect any
> additional usage except for the two address-space utilities.
> 
> I'm looking for examples to see how it could be more in line with the
> rest of the mm code. The main difference from the pattern in, for
> example, page_mkclean() seems to be that it's lacking the
> mmu_notifier_invalidate_start() and mmu_notifier_invalidate_end()?
> Perhaps the intention is to have the pte leaf functions notify on pte
> updates? How does this relate to arch_enter_lazy_mmu() which is called
> outside of the page table locks? The documentation appears a bit
> scarce...

Best is to use something like walk_page_range() and have proper mmu
notifier calls around the callback. apply_to_page_range() is broken for
huge pages (THP) and other things like that. Though you should not have
THP within an mmap of a device file (at least I do not think any driver
does that).
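
A rough sketch of that pattern, assuming a kernel around v5.0; the
mm_walk and mmu_notifier_range_init() signatures differ in other
versions (this very series adds an event argument to the latter), so
treat the exact arguments as illustrative:

	/* Sketch only: write-protect a range with a proper mmu notifier
	 * bracket around a walk_page_range() walk.  Caller holds mmap_sem. */
	static int wrprotect_pte_entry(pte_t *pte, unsigned long addr,
				       unsigned long next, struct mm_walk *walk)
	{
		/* write-protect *pte here (locking and flushing omitted) */
		return 0;
	}

	static void wrprotect_range(struct vm_area_struct *vma,
				    unsigned long start, unsigned long end)
	{
		struct mmu_notifier_range range;
		struct mm_walk walk = {
			.pte_entry	= wrprotect_pte_entry,
			.mm		= vma->vm_mm,
		};

		mmu_notifier_range_init(&range, vma->vm_mm, start, end);
		mmu_notifier_invalidate_range_start(&range);
		walk_page_range(start, end, &walk);
		mmu_notifier_invalidate_range_end(&range);
	}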

Cheers,
Jérôme

Re: [RFC PATCH RESEND 3/3] mm: Add write-protect and clean utilities for address space ranges

2019-03-21 Thread Jerome Glisse
On Thu, Mar 21, 2019 at 01:22:41PM +, Thomas Hellstrom wrote:
> Add two utilities to a) write-protect and b) clean all ptes pointing into
> a range of an address space
> The utilities are intended to aid in tracking dirty pages (either
> driver-allocated system memory or pci device memory).
> The write-protect utility should be used in conjunction with
> page_mkwrite() and pfn_mkwrite() to trigger write page-faults on page
> accesses. Typically one would want to use this on sparse accesses into
> large memory regions. The clean utility should be used to utilize
> hardware dirtying functionality and avoid the overhead of page-faults,
> typically on large accesses into small memory regions.


Again this does not use mmu notifiers, and there is no scary comment to
explain the very limited use case it should be used for, i.e. mmap of a
device file and only by the device driver.

Using it outside of this would break soft-dirty tracking or trigger
false COW or other scary things.

> 
> Cc: Andrew Morton 
> Cc: Matthew Wilcox 
> Cc: Will Deacon 
> Cc: Peter Zijlstra 
> Cc: Rik van Riel 
> Cc: Minchan Kim 
> Cc: Michal Hocko 
> Cc: Huang Ying 
> Cc: Souptick Joarder 
> Cc: "Jérôme Glisse" 
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Thomas Hellstrom 
> ---
>  include/linux/mm.h  |   9 +-
>  mm/Makefile |   2 +-
>  mm/apply_as_range.c | 257 
>  3 files changed, 266 insertions(+), 2 deletions(-)
>  create mode 100644 mm/apply_as_range.c
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b7dd4ddd6efb..62f24dd0bfa0 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2642,7 +2642,14 @@ struct pfn_range_apply {
>  };
>  extern int apply_to_pfn_range(struct pfn_range_apply *closure,
> unsigned long address, unsigned long size);
> -
> +unsigned long apply_as_wrprotect(struct address_space *mapping,
> +  pgoff_t first_index, pgoff_t nr);
> +unsigned long apply_as_clean(struct address_space *mapping,
> +  pgoff_t first_index, pgoff_t nr,
> +  pgoff_t bitmap_pgoff,
> +  unsigned long *bitmap,
> +  pgoff_t *start,
> +  pgoff_t *end);
>  #ifdef CONFIG_PAGE_POISONING
>  extern bool page_poisoning_enabled(void);
>  extern void kernel_poison_pages(struct page *page, int numpages, int enable);
> diff --git a/mm/Makefile b/mm/Makefile
> index d210cc9d6f80..a94b78f12692 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -39,7 +39,7 @@ obj-y   := filemap.o mempool.o 
> oom_kill.o fadvise.o \
>  mm_init.o mmu_context.o percpu.o slab_common.o \
>  compaction.o vmacache.o \
>  interval_tree.o list_lru.o workingset.o \
> -debug.o $(mmu-y)
> +debug.o apply_as_range.o $(mmu-y)
>  
>  obj-y += init-mm.o
>  obj-y += memblock.o
> diff --git a/mm/apply_as_range.c b/mm/apply_as_range.c
> new file mode 100644
> index ..9f03e272ebd0
> --- /dev/null
> +++ b/mm/apply_as_range.c
> @@ -0,0 +1,257 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/**
> + * struct apply_as - Closure structure for apply_as_range
> + * @base: struct pfn_range_apply we derive from
> + * @start: Address of first modified pte
> + * @end: Address of last modified pte + 1
> + * @total: Total number of modified ptes
> + * @vma: Pointer to the struct vm_area_struct we're currently operating on
> + * @flush_cache: Whether to call a cache flush before modifying a pte
> + * @flush_tlb: Whether to flush the tlb after modifying a pte
> + */
> +struct apply_as {
> + struct pfn_range_apply base;
> + unsigned long start, end;
> + unsigned long total;
> + const struct vm_area_struct *vma;
> + u32 flush_cache : 1;
> + u32 flush_tlb : 1;
> +};
> +
> +/**
> + * apply_pt_wrprotect - Leaf pte callback to write-protect a pte
> + * @pte: Pointer to the pte
> + * @token: Page table token, see apply_to_pfn_range()
> + * @addr: The virtual page address
> + * @closure: Pointer to a struct pfn_range_apply embedded in a
> + * struct apply_as
> + *
> + * The function write-protects a pte and records the range in
> + * virtual address space of touched ptes for efficient TLB flushes.
> + *
> + * Return: Always zero.
> + */
> +static int apply_pt_wrprotect(pte_t *pte, pgtable_t token,
> +   unsigned long addr,
> +   struct pfn_range_apply *closure)
> +{
> + struct apply_as *aas = container_of(closure, typeof(*aas), base);
> +
> + if (pte_write(*pte)) {
> + set_pte_at(closure->mm, addr, pte, pte_wrprotect(*pte));

So there is no flushing here; even for x86 this is wrong. It should be
something like:
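
For comparison, the flush-aware sequence page_mkclean_one() in
mm/rmap.c uses looks roughly like the following; this is a generic
sketch, not the code from this series, and the exact helpers vary by
architecture:

	/* Sketch of a flush-aware write-protect, modelled on page_mkclean_one() */
	pte_t entry;

	flush_cache_page(vma, addr, pte_pfn(*pte));
	entry = ptep_clear_flush(vma, addr, pte);	/* clear pte and flush TLB */
	entry = pte_wrprotect(entry);
	entry = pte_mkclean(entry);
	set_pte_at(vma->vm_mm, addr, pte, entry);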

Re: [RFC PATCH RESEND 2/3] mm: Add an apply_to_pfn_range interface

2019-03-21 Thread Jerome Glisse
On Thu, Mar 21, 2019 at 01:22:35PM +, Thomas Hellstrom wrote:
> This is basically apply_to_page_range with added functionality:
> Allocating missing parts of the page table becomes optional, which
> means that the function can be guaranteed not to error if allocation
> is disabled. Also passing of the closure struct and callback function
> becomes different and more in line with how things are done elsewhere.
> 
> Finally we keep apply_to_page_range as a wrapper around apply_to_pfn_range

apply_to_page_range() is a dangerous API; it does not follow other mm
patterns like mmu notifiers. It is supposed to be used in arch code or
vmalloc or similar things, but not in regular driver code. I see it has
crept out of this and is being used by a few device drivers. I am not
sure we should encourage that.

> 
> Cc: Andrew Morton 
> Cc: Matthew Wilcox 
> Cc: Will Deacon 
> Cc: Peter Zijlstra 
> Cc: Rik van Riel 
> Cc: Minchan Kim 
> Cc: Michal Hocko 
> Cc: Huang Ying 
> Cc: Souptick Joarder 
> Cc: "Jérôme Glisse" 
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Thomas Hellstrom 
> ---
>  include/linux/mm.h |  10 
>  mm/memory.c| 121 +
>  2 files changed, 99 insertions(+), 32 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 80bb6408fe73..b7dd4ddd6efb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2632,6 +2632,16 @@ typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, 
> unsigned long addr,
>  extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
>  unsigned long size, pte_fn_t fn, void *data);
>  
> +struct pfn_range_apply;
> +typedef int (*pter_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
> +  struct pfn_range_apply *closure);
> +struct pfn_range_apply {
> + struct mm_struct *mm;
> + pter_fn_t ptefn;
> + unsigned int alloc;
> +};
> +extern int apply_to_pfn_range(struct pfn_range_apply *closure,
> +   unsigned long address, unsigned long size);
>  
>  #ifdef CONFIG_PAGE_POISONING
>  extern bool page_poisoning_enabled(void);
> diff --git a/mm/memory.c b/mm/memory.c
> index dcd80313cf10..0feb7191c2d2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1938,18 +1938,17 @@ int vm_iomap_memory(struct vm_area_struct *vma, 
> phys_addr_t start, unsigned long
>  }
>  EXPORT_SYMBOL(vm_iomap_memory);
>  
> -static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> -  unsigned long addr, unsigned long end,
> -  pte_fn_t fn, void *data)
> +static int apply_to_pte_range(struct pfn_range_apply *closure, pmd_t *pmd,
> +   unsigned long addr, unsigned long end)
>  {
>   pte_t *pte;
>   int err;
>   pgtable_t token;
>   spinlock_t *uninitialized_var(ptl);
>  
> - pte = (mm == _mm) ?
> + pte = (closure->mm == _mm) ?
>   pte_alloc_kernel(pmd, addr) :
> - pte_alloc_map_lock(mm, pmd, addr, );
> + pte_alloc_map_lock(closure->mm, pmd, addr, );
>   if (!pte)
>   return -ENOMEM;
>  
> @@ -1960,86 +1959,103 @@ static int apply_to_pte_range(struct mm_struct *mm, 
> pmd_t *pmd,
>   token = pmd_pgtable(*pmd);
>  
>   do {
> - err = fn(pte++, token, addr, data);
> + err = closure->ptefn(pte++, token, addr, closure);
>   if (err)
>   break;
>   } while (addr += PAGE_SIZE, addr != end);
>  
>   arch_leave_lazy_mmu_mode();
>  
> - if (mm != _mm)
> + if (closure->mm != _mm)
>   pte_unmap_unlock(pte-1, ptl);
>   return err;
>  }
>  
> -static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
> -  unsigned long addr, unsigned long end,
> -  pte_fn_t fn, void *data)
> +static int apply_to_pmd_range(struct pfn_range_apply *closure, pud_t *pud,
> +   unsigned long addr, unsigned long end)
>  {
>   pmd_t *pmd;
>   unsigned long next;
> - int err;
> + int err = 0;
>  
>   BUG_ON(pud_huge(*pud));
>  
> - pmd = pmd_alloc(mm, pud, addr);
> + pmd = pmd_alloc(closure->mm, pud, addr);
>   if (!pmd)
>   return -ENOMEM;
> +
>   do {
>   next = pmd_addr_end(addr, end);
> - err = apply_to_pte_range(mm, pmd, addr, next, fn, data);
> + if (!closure->alloc && pmd_none_or_clear_bad(pmd))
> + continue;
> + err = apply_to_pte_range(closure, pmd, addr, next);
>   if (err)
>   break;
>   } while (pmd++, addr = next, addr != end);
>   return err;
>  }
>  
> -static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
> -  unsigned long addr, unsigned long end,
> - 

Re: [RFC PATCH RESEND 0/3] mm modifications / helpers for emulated GPU coherent memory

2019-03-21 Thread Jerome Glisse
On Thu, Mar 21, 2019 at 01:22:22PM +, Thomas Hellstrom wrote:
> Resending since last series was sent through a mis-configured SMTP server.
> 
> Hi,
> This is an early RFC to make sure I don't go too far in the wrong direction.
> 
> Non-coherent GPUs that can't directly see contents in CPU-visible memory,
> like VMWare's SVGA device, run into trouble when trying to implement
> coherent memory requirements of modern graphics APIs. Examples are
> Vulkan and OpenGL 4.4's ARB_buffer_storage.
> 
> To remedy, we need to emulate coherent memory. Typically when it's detected
> that a buffer object is about to be accessed by the GPU, we need to
> gather the ranges that have been dirtied by the CPU since the last operation,
> apply an operation to make the content visible to the GPU and clear the
> the dirty tracking.
> 
> Depending on the size of the buffer object and the access pattern there are
> two major possibilities:
> 
> 1) Use page_mkwrite() and pfn_mkwrite(). (GPU buffer objects are backed
> either by PCI device memory or by driver-alloced pages).
> The dirty-tracking needs to be reset by write-protecting the affected ptes
> and flush tlb. This has a complexity of O(num_dirty_pages), but the
> write page-fault is of course costly.
> 
> 2) Use hardware dirty-flags in the ptes. The dirty-tracking needs to be reset
> by clearing the dirty bits and flush tlb. This has a complexity of
> O(num_buffer_object_pages) and dirty bits need to be scanned in full before
> each gpu-access.
> 
> So in practice the two methods need to be interleaved for best performance.
> 
> So to facilitate this, I propose two new helpers, apply_as_wrprotect() and
> apply_as_clean() ("as" stands for address-space) both inspired by
> unmap_mapping_range(). Users of these helpers are in the making, but needs
> some cleaning-up.

To be clear, this should _only be used_ for mmap of a device file? If
so, the API should try to enforce that as much as possible, for
instance by mandating the file as an argument so that the function can
check it is only used in that case. Also a big scary comment to make
sure no one just starts using these outside this very limited frame.

> 
> There's also a change to x_mkwrite() to allow dropping the mmap_sem while
> waiting.

This will most likely conflict with userfaultfd write protection. Maybe
building your thing on top of that would be better.

https://lwn.net/Articles/783571/

I will take a cursory look at the patches.

Cheers,
Jérôme

Re: [RFC][PATCH 0/5 v2] DMA-BUF Heaps (destaging ION)

2019-03-15 Thread Jerome Glisse
On Tue, Mar 05, 2019 at 12:54:28PM -0800, John Stultz wrote:
> Here is a initial RFC of the dma-buf heaps patchset Andrew and I
> have been working on which tries to destage a fair chunk of ION
> functionality.
> 
> The patchset implements per-heap devices which can be opened
> directly and then an ioctl is used to allocate a dmabuf from the
> heap.
> 
> The interface is similar, but much simpler then IONs, only
> providing an ALLOC ioctl.
> 
> Also, I've provided simple system and cma heaps. The system
> heap in particular is missing the page-pool optimizations ION
> had, but works well enough to validate the interface.
> 
> I've booted and tested these patches with AOSP on the HiKey960
> using the kernel tree here:
>   
> https://git.linaro.org/people/john.stultz/android-dev.git/log/?h=dev/dma-buf-heap
> 
> And the userspace changes here:
>   https://android-review.googlesource.com/c/device/linaro/hikey/+/909436

What upstream driver will use this eventually? And why is it needed?

Cheers,
Jérôme

Re: [PATCH v5 0/9] mmu notifier provide context informations

2019-02-19 Thread Jerome Glisse
On Tue, Feb 19, 2019 at 01:19:09PM -0800, Dan Williams wrote:
> On Tue, Feb 19, 2019 at 12:58 PM Jerome Glisse  wrote:
> >
> > On Tue, Feb 19, 2019 at 12:40:37PM -0800, Dan Williams wrote:
> > > On Tue, Feb 19, 2019 at 12:30 PM Jerome Glisse  wrote:
> > > >
> > > > On Tue, Feb 19, 2019 at 12:15:55PM -0800, Dan Williams wrote:
> > > > > On Tue, Feb 19, 2019 at 12:04 PM  wrote:
> > > > > >
> > > > > > From: Jérôme Glisse 
> > > > > >
> > > > > > Since last version [4] i added the extra bits needed for the 
> > > > > > change_pte
> > > > > > optimization (which is a KSM thing). Here i am not posting users of
> > > > > > this, they will be posted to the appropriate sub-systems (KVM, GPU,
> > > > > > RDMA, ...) once this serie get upstream. If you want to look at 
> > > > > > users
> > > > > > of this see [5] [6]. If this gets in 5.1 then i will be submitting
> > > > > > those users for 5.2 (including KVM if KVM folks feel comfortable 
> > > > > > with
> > > > > > it).
> > > > >
> > > > > The users look small and straightforward. Why not await acks and
> > > > > reviewed-by's for the users like a typical upstream submission and
> > > > > merge them together? Is all of the functionality of this
> > > > > infrastructure consumed by the proposed users? Last time I checked it
> > > > > was only a subset.
> > > >
> > > > Yes pretty much all is use, the unuse case is SOFT_DIRTY and CLEAR
> > > > vs UNMAP. Both of which i intend to use. The RDMA folks already ack
> > > > the patches IIRC, so did radeon and amdgpu. I believe the i915 folks
> > > > were ok with it too. I do not want to merge things through Andrew
> > > > for all of this we discussed that in the past, merge mm bits through
> > > > Andrew in one release and bits that use things in the next release.
> > >
> > > Ok, I was trying to find the links to the acks on the mailing list,
> > > those references would address my concerns. I see no reason to rush
> > > SOFT_DIRTY and CLEAR ahead of the upstream user.
> >
> > I intend to post user for those in next couple weeks for 5.2 HMM bits.
> > So user for this (CLEAR/UNMAP/SOFTDIRTY) will definitly materialize in
> > time for 5.2.
> >
> > ACKS AMD/RADEON https://lkml.org/lkml/2019/2/1/395
> > ACKS RDMA https://lkml.org/lkml/2018/12/6/1473
> 
> Nice, thanks!
> 
> > For KVM Andrea Arcangeli seems to like the whole idea to restore the
> > change_pte optimization but i have not got ACK from Radim or Paolo,
> > however given the small performance improvement figure i get with it
> > i do not see while they would not ACK.
> 
> Sure, but no need to push ahead without that confirmation, right? At
> least for the piece that KVM cares about, maybe that's already covered
> in the infrastructure RDMA and RADEON are using?

The change_pte() bit for KVM is just one flag on top of the rest, so I
don't see much value in holding back this last patch. I will be working
with the KVM folks to merge the KVM bits in 5.2. If they do not want
that, then removing that extra flag is not much work.

But if you prefer, Andrew can drop the last patch in the series.

Cheers,
Jérôme

Re: [PATCH v5 0/9] mmu notifier provide context informations

2019-02-19 Thread Jerome Glisse
On Tue, Feb 19, 2019 at 12:40:37PM -0800, Dan Williams wrote:
> On Tue, Feb 19, 2019 at 12:30 PM Jerome Glisse  wrote:
> >
> > On Tue, Feb 19, 2019 at 12:15:55PM -0800, Dan Williams wrote:
> > > On Tue, Feb 19, 2019 at 12:04 PM  wrote:
> > > >
> > > > From: Jérôme Glisse 
> > > >
> > > > Since last version [4] i added the extra bits needed for the change_pte
> > > > optimization (which is a KSM thing). Here i am not posting users of
> > > > this, they will be posted to the appropriate sub-systems (KVM, GPU,
> > > > RDMA, ...) once this serie get upstream. If you want to look at users
> > > > of this see [5] [6]. If this gets in 5.1 then i will be submitting
> > > > those users for 5.2 (including KVM if KVM folks feel comfortable with
> > > > it).
> > >
> > > The users look small and straightforward. Why not await acks and
> > > reviewed-by's for the users like a typical upstream submission and
> > > merge them together? Is all of the functionality of this
> > > infrastructure consumed by the proposed users? Last time I checked it
> > > was only a subset.
> >
> > Yes pretty much all is use, the unuse case is SOFT_DIRTY and CLEAR
> > vs UNMAP. Both of which i intend to use. The RDMA folks already ack
> > the patches IIRC, so did radeon and amdgpu. I believe the i915 folks
> > were ok with it too. I do not want to merge things through Andrew
> > for all of this we discussed that in the past, merge mm bits through
> > Andrew in one release and bits that use things in the next release.
> 
> Ok, I was trying to find the links to the acks on the mailing list,
> those references would address my concerns. I see no reason to rush
> SOFT_DIRTY and CLEAR ahead of the upstream user.

I intend to post users for those in the next couple of weeks with the
5.2 HMM bits. So users for this (CLEAR/UNMAP/SOFTDIRTY) will definitely
materialize in time for 5.2.

ACKS AMD/RADEON https://lkml.org/lkml/2019/2/1/395
ACKS RDMA https://lkml.org/lkml/2018/12/6/1473

For KVM, Andrea Arcangeli seems to like the whole idea of restoring the
change_pte optimization, but I have not got an ACK from Radim or Paolo.
However, given the small performance improvement figure I get with it,
I do not see why they would not ACK.

https://lkml.org/lkml/2019/2/18/1530

Cheers,
Jérôme

Re: [PATCH v5 0/9] mmu notifier provide context informations

2019-02-19 Thread Jerome Glisse
On Tue, Feb 19, 2019 at 12:15:55PM -0800, Dan Williams wrote:
> On Tue, Feb 19, 2019 at 12:04 PM  wrote:
> >
> > From: Jérôme Glisse 
> >
> > Since last version [4] i added the extra bits needed for the change_pte
> > optimization (which is a KSM thing). Here i am not posting users of
> > this, they will be posted to the appropriate sub-systems (KVM, GPU,
> > RDMA, ...) once this serie get upstream. If you want to look at users
> > of this see [5] [6]. If this gets in 5.1 then i will be submitting
> > those users for 5.2 (including KVM if KVM folks feel comfortable with
> > it).
> 
> The users look small and straightforward. Why not await acks and
> reviewed-by's for the users like a typical upstream submission and
> merge them together? Is all of the functionality of this
> infrastructure consumed by the proposed users? Last time I checked it
> was only a subset.

Yes, pretty much all of it is used; the unused cases are SOFT_DIRTY and
CLEAR vs UNMAP, both of which I intend to use. The RDMA folks already
acked the patches IIRC, and so did radeon and amdgpu. I believe the
i915 folks were OK with it too. I do not want to merge everything
through Andrew; we discussed that in the past: merge the mm bits
through Andrew in one release and the bits that use them in the next
release.

Cheers,
Jérôme

Re: [PATCH v4 0/9] mmu notifier provide context informations

2019-02-11 Thread Jerome Glisse
On Fri, Feb 01, 2019 at 10:02:30PM +0100, Jan Kara wrote:
> On Thu 31-01-19 11:10:06, Jerome Glisse wrote:
> > 
> > Andrew what is your plan for this ? I had a discussion with Peter Xu
> > and Andrea about change_pte() and kvm. Today the change_pte() kvm
> > optimization is effectively disabled because of invalidate_range
> > calls. With a minimal couple lines patch on top of this patchset
> > we can bring back the kvm change_pte optimization and we can also
> > optimize some other cases like for instance when write protecting
> > after fork (but i am not sure this is something qemu does often so
> > it might not help for real kvm workload).
> > 
> > I will be posting a the extra patch as an RFC, but in the meantime
> > i wanted to know what was the status for this.
> > 
> > Jan, Christian does your previous ACK still holds for this ?
> 
> Yes, I still think the approach makes sense. Dan's concern about in tree
> users is valid but it seems you have those just not merged yet, right?

(Catching up on email)

This version included some of the first users for this, but I do not
want to merge them through Andrew; they should go through the
individual driver trees. Also, in the meantime I found a use for this
with KVM, and I expect a few other users of mmu notifiers will leverage
this extra information.

Cheers,
Jérôme


Re: [PATCH v8 3/7] mm, devm_memremap_pages: Fix shutdown handling

2019-02-11 Thread Jerome Glisse
On Sun, Feb 10, 2019 at 12:09:08PM +0100, Krzysztof Grygiencz wrote:
> Dear Sir,
> 
> I'm using ArchLinux distribution. After kernel upgrade form 4.19.14 to
> 4.19.15 my X environment stopped working. I have AMD HD3300 (RS780D)
> graphics card. I have bisected kernel and found a failing commit:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v4.19.20=ec5471c92fb29ad848c81875840478be201eeb3f

This is a false positive; you should skip that commit. It will not
impact the GPU driver for your specific GPU. My advice is to first
bisect on drivers/gpu/drm/radeon only.

Cheers,
Jérôme


Re: [PATCH v4 0/9] mmu notifier provide context informations

2019-01-31 Thread Jerome Glisse
On Thu, Jan 31, 2019 at 11:55:35AM -0800, Andrew Morton wrote:
> On Thu, 31 Jan 2019 11:10:06 -0500 Jerome Glisse  wrote:
> 
> > Andrew what is your plan for this ? I had a discussion with Peter Xu
> > and Andrea about change_pte() and kvm. Today the change_pte() kvm
> > optimization is effectively disabled because of invalidate_range
> > calls. With a minimal couple lines patch on top of this patchset
> > we can bring back the kvm change_pte optimization and we can also
> > optimize some other cases like for instance when write protecting
> > after fork (but i am not sure this is something qemu does often so
> > it might not help for real kvm workload).
> > 
> > I will be posting a the extra patch as an RFC, but in the meantime
> > i wanted to know what was the status for this.
> 
> The various drm patches appear to be headed for collisions with drm
> tree development so we'll need to figure out how to handle that and in
> what order things happen.
> 
> It's quite unclear from the v4 patchset's changelogs that this has
> anything to do with KVM and "the change_pte() kvm optimization" hasn't
> been described anywhere(?).
> 
> So..  I expect the thing to do here is to get everything finished, get
> the changelogs completed with this new information and do a resend.
> 
> Can we omit the drm and rdma patches for now?  Feed them in via the
> subsystem maintainers when the dust has settled?

Yes, I should have pointed out that you can ignore the driver patches;
I will resubmit them through the appropriate trees once the mm bits are
in. I just wanted to showcase how I intended to use this. Next time I
will try not to forget to clearly tag things that are only there as a
showcase and that will be merged later through a different tree.

I will do a v5 with the KVM bits once we have enough testing and
confidence. So I guess this will all be delayed to 5.2, and to 5.3 for
the driver bits. The KVM bits are the outcome of private emails and
previous face-to-face discussions around mmu notifiers and KVM. I
believe the context information will turn out to be useful to more
users than the ones I am doing it for.
Cheers,
Jérôme


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Jerome Glisse
On Thu, Jan 31, 2019 at 07:02:15PM +, Jason Gunthorpe wrote:
> On Thu, Jan 31, 2019 at 09:13:55AM +0100, Christoph Hellwig wrote:
> > On Wed, Jan 30, 2019 at 03:52:13PM -0700, Logan Gunthorpe wrote:
> > > > *shrug* so what if the special GUP called a VMA op instead of
> > > > traversing the VMA PTEs today? Why does it really matter? It could
> > > > easily change to a struct page flow tomorrow..
> > > 
> > > Well it's so that it's composable. We want the SGL->DMA side to work for
> > > APIs from kernel space and not have to run a completely different flow
> > > for kernel drivers than from userspace memory.
> > 
> > Yes, I think that is the important point.
> > 
> > All the other struct page discussion is not about anyone of us wanting
> > struct page - heck it is a pain to deal with, but then again it is
> > there for a reason.
> > 
> > In the typical GUP flows we have three uses of a struct page:
> > 
> >  (1) to carry a physical address.  This is mostly through
> >  struct scatterlist and struct bio_vec.  We could just store
> >  a magic PFN-like value that encodes the physical address
> >  and allow looking up a page if it exists, and we had at least
> >  two attempts at it.  In some way I think that would actually
> >  make the interfaces cleaner, but Linus has NACKed it in the
> >  past, so we'll have to convince him first that this is the
> >  way forward
> 
> Something like this (and more) has always been the roadblock with
> trying to mix BAR memory into SGL. I think it is such a big problem as
> to be unsolvable in one step.. 
> 
> Struct page doesn't even really help anything beyond dma_map as we
> still can't pretend that __iomem is normal memory for general SGL
> users.
> 
> >  (2) to keep a reference to the memory so that it doesn't go away
> >  under us due to swapping, process exit, unmapping, etc.
> >  No idea how we want to solve this, but I guess you have
> >  some smart ideas?
> 
> Jerome, how does this work anyhow? Did you do something to make the
> VMA lifetime match the p2p_map/unmap? Or can we get into a situation
> were the VMA is destroyed and the importing driver can't call the
> unmap anymore?
> 
> I know in the case of notifiers the VMA liftime should be strictly
> longer than the map/unmap - but does this mean we can never support
> non-notifier users via this scheme?

So in this version the requirement is that the importer also has an mmu
notifier registered, and that is what all GPU drivers do already. Any
driver that maps some range of a vma to a device should register itself
as an mmu notifier listener so it can do something when the vma goes
away. I posted a patchset a while ago to allow listeners to tell a vma
going away apart from other types of invalidation [1].

With that in place you can easily handle the pin case. A driver really
needs to do something when the vma goes away, GUP or not, as the
device is then reading from or writing to something that no longer
matches anything in the process address space.

So a user that wants pinning would register a notifier, call p2p_map()
with the pin flag, and ignore all notifier callbacks except the unmap
one. When the unmap callback happens they have the vma, and they should
call p2p_unmap() from their invalidate callback and update their device
to point at some dummy memory, or program it in a way that the
userspace application will notice.

This can all be handled by some helper, so that a driver does not have
to write more than five lines of code plus a function to update its
device mapping to something of its choosing.
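
A minimal sketch of that importer-side flow, assuming the
p2p_map()/p2p_unmap() hooks proposed in this series; struct importer,
the dummy-memory helper and the event check are made up for the example
(MMU_NOTIFY_UNMAP comes from the mmu notifier context-information
series mentioned above):

	/* Illustrative importer-side teardown: not an existing kernel API. */
	struct importer {
		struct mmu_notifier	mn;
		struct vm_area_struct	*vma;
		struct device		*dev;
		dma_addr_t		*pa;
		unsigned long		start, end;
	};

	static int importer_invalidate_start(struct mmu_notifier *mn,
					     const struct mmu_notifier_range *range)
	{
		struct importer *imp = container_of(mn, struct importer, mn);

		/* ignore invalidations outside the mapped range */
		if (range->start >= imp->end || range->end <= imp->start)
			return 0;
		/* only the unmap case matters for a pinned p2p mapping */
		if (range->event != MMU_NOTIFY_UNMAP)
			return 0;

		/* tear down the p2p mapping, however the series exposes it */
		p2p_unmap(imp->vma, imp->dev, imp->start, imp->end, imp->pa);
		importer_point_at_dummy(imp);	/* made-up helper */
		return 0;
	}

	static const struct mmu_notifier_ops importer_mn_ops = {
		.invalidate_range_start	= importer_invalidate_start,
	};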


> 
> >  (3) to make the PTEs dirty after writing to them.  Again no sure
> >  what our preferred interface here would be
> 
> This need doesn't really apply to BAR memory..
> 
> > If we solve all of the above problems I'd be more than happy to
> > go with a non-struct page based interface for BAR P2P.  But we'll
> > have to solve these issues in a generic way first.
> 
> I still think the right direction is to build on what Logan has done -
> realize that he created a DMA-only SGL - make that a formal type of
> the kernel and provide the right set of APIs to work with this type,
> without being forced to expose struct page.
> 
> Basically invert the API flow - the DMA map would be done close to
> GUP, not buried in the driver. This absolutely doesn't work for every
> flow we have, but it does enable the ones that people seem to care
> about when talking about P2P.

This does not work for GPUs, really; I do not want to have to rewrite
the GPU drivers for this. Struct page is a burden and it does not bring
anything to the table. I would rather provide a one-stop shop for
drivers to use this without having to worry about regular vmas versus
special vmas.

Note that in this patchset I reuse chunks of Logan's work, and the
intention is to also allow the PCI struct page approach to keep
working. But it should not be the only mechanism.

> 
> To get to where we are today we'd need a few new IB APIs, and some
> nvme change to work with DMA-only 

Re: [PATCH v4 0/9] mmu notifier provide context informations

2019-01-31 Thread Jerome Glisse

Andrew, what is your plan for this? I had a discussion with Peter Xu
and Andrea about change_pte() and KVM. Today the change_pte() KVM
optimization is effectively disabled because of the invalidate_range
calls. With a minimal couple-of-lines patch on top of this patchset we
can bring back the KVM change_pte optimization, and we can also
optimize some other cases, for instance when write-protecting after
fork (but I am not sure this is something QEMU does often, so it might
not help real KVM workloads).

I will be posting the extra patch as an RFC, but in the meantime I
wanted to know what the status is for this.


Jan, Christian, does your previous ACK still hold for this?


On Wed, Jan 23, 2019 at 05:23:06PM -0500, jgli...@redhat.com wrote:
> From: Jérôme Glisse 
> 
> Hi Andrew, i see that you still have my event patch in you queue [1].
> This patchset replace that single patch and is broken down in further
> step so that it is easier to review and ascertain that no mistake were
> made during mechanical changes. Here are the step:
> 
> Patch 1 - add the enum values
> Patch 2 - coccinelle semantic patch to convert all call site of
>   mmu_notifier_range_init to default enum value and also
>   to passing down the vma when it is available
> Patch 3 - update many call site to more accurate enum values
> Patch 4 - add the information to the mmu_notifier_range struct
> Patch 5 - helper to test if a range is updated to read only
> 
> All the remaining patches are update to various driver to demonstrate
> how this new information get use by device driver. I build tested
> with make all and make all minus everything that enable mmu notifier
> ie building with MMU_NOTIFIER=no. Also tested with some radeon,amd
> gpu and intel gpu.
> 
> If they are no objections i believe best plan would be to merge the
> the first 5 patches (all mm changes) through your queue for 5.1 and
> then to delay driver update to each individual driver tree for 5.2.
> This will allow each individual device driver maintainer time to more
> thouroughly test this more then my own testing.
> 
> Note that i also intend to use this feature further in nouveau and
> HMM down the road. I also expect that other user like KVM might be
> interested into leveraging this new information to optimize some of
> there secondary page table invalidation.
> 
> Here is an explaination on the rational for this patchset:
> 
> 
> CPU page table update can happens for many reasons, not only as a result
> of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also
> as a result of kernel activities (memory compression, reclaim, migration,
> ...).
> 
> This patch introduce a set of enums that can be associated with each of
> the events triggering a mmu notifier. Latter patches take advantages of
> those enum values.
> 
> - UNMAP: munmap() or mremap()
> - CLEAR: page table is cleared (migration, compaction, reclaim, ...)
> - PROTECTION_VMA: change in access protections for the range
> - PROTECTION_PAGE: change in access protections for page in the range
> - SOFT_DIRTY: soft dirtyness tracking
> 
> Being able to identify munmap() and mremap() from other reasons why the
> page table is cleared is important to allow user of mmu notifier to
> update their own internal tracking structure accordingly (on munmap or
> mremap it is not longer needed to track range of virtual address as it
> becomes invalid).
> 
> [1] 
> https://www.ozlabs.org/~akpm/mmotm/broken-out/mm-mmu_notifier-contextual-information-for-event-triggering-invalidation-v2.patch
> 
> Cc: Christian König 
> Cc: Jan Kara 
> Cc: Felix Kuehling 
> Cc: Jason Gunthorpe 
> Cc: Andrew Morton 
> Cc: Matthew Wilcox 
> Cc: Ross Zwisler 
> Cc: Dan Williams 
> Cc: Paolo Bonzini 
> Cc: Radim Krčmář 
> Cc: Michal Hocko 
> Cc: Ralph Campbell 
> Cc: John Hubbard 
> Cc: k...@vger.kernel.org
> Cc: dri-devel@lists.freedesktop.org
> Cc: linux-r...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> Cc: Arnd Bergmann 
> 
> Jérôme Glisse (9):
>   mm/mmu_notifier: contextual information for event enums
>   mm/mmu_notifier: contextual information for event triggering
> invalidation
>   mm/mmu_notifier: use correct mmu_notifier events for each invalidation
>   mm/mmu_notifier: pass down vma and reasons why mmu notifier is
> happening
>   mm/mmu_notifier: mmu_notifier_range_update_to_read_only() helper
>   gpu/drm/radeon: optimize out the case when a range is updated to read
> only
>   gpu/drm/amdgpu: optimize out the case when a range is updated to read
> only
>   gpu/drm/i915: optimize out the case when a range is updated to read
> only
>   RDMA/umem_odp: optimize out the case when a range is updated to read
> only
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 13 
>  drivers/gpu/drm/i915/i915_gem_userptr.c | 16 ++
>  drivers/gpu/drm/radeon/radeon_mn.c  | 13 
>  drivers/infiniband/core/umem_odp.c  | 22 +++--
>  

Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Jerome Glisse
On Thu, Jan 31, 2019 at 09:13:55AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 03:52:13PM -0700, Logan Gunthorpe wrote:
> > > *shrug* so what if the special GUP called a VMA op instead of
> > > traversing the VMA PTEs today? Why does it really matter? It could
> > > easily change to a struct page flow tomorrow..
> > 
> > Well it's so that it's composable. We want the SGL->DMA side to work for
> > APIs from kernel space and not have to run a completely different flow
> > for kernel drivers than from userspace memory.
> 
> Yes, I think that is the important point.
> 
> All the other struct page discussion is not about anyone of us wanting
> struct page - heck it is a pain to deal with, but then again it is
> there for a reason.
> 
> In the typical GUP flows we have three uses of a struct page:

We do not want GUP. Yes, some RDMA drivers and others use GUP, but they
should only use GUP on regular vmas, not on special vmas (i.e. mmap of
a device file). Allowing GUP on those is insane. It is better to
special case the peer-to-peer mapping because _it is_ special: nothing
inside those vmas is managed by core mm, and drivers can deal with them
in weird ways (GPUs certainly do, and for very good reasons without
which they would perform badly).

> 
>  (1) to carry a physical address.  This is mostly through
>  struct scatterlist and struct bio_vec.  We could just store
>  a magic PFN-like value that encodes the physical address
>  and allow looking up a page if it exists, and we had at least
>  two attempts at it.  In some way I think that would actually
>  make the interfaces cleaner, but Linus has NACKed it in the
>  past, so we'll have to convince him first that this is the
>  way forward

Wasting 64 bytes just to carry an address is a waste for everyone.

>  (2) to keep a reference to the memory so that it doesn't go away
>  under us due to swapping, process exit, unmapping, etc.
>  No idea how we want to solve this, but I guess you have
>  some smart ideas?

The DMA API has _never_ dealt with page refcounts, and it has always
been up to the user of the DMA API to ascertain that it is safe for
them to map/unmap the pages/resources they are providing to the DMA
API.

The lifetime management of a page or resource provided to the DMA API
should remain the problem of the caller and not be something the DMA
API cares one bit about.

>  (3) to make the PTEs dirty after writing to them.  Again no sure
>  what our preferred interface here would be

Again, the DMA API has never dealt with that, nor should it. What does
a dirty pte mean for a special mapping (mmap of a device file)? There
is no single common definition for that; most drivers do not care about
it and it gets fully ignored.

> 
> If we solve all of the above problems I'd be more than happy to
> go with a non-struct page based interface for BAR P2P.  But we'll
> have to solve these issues in a generic way first.

None of the above are problems the DMA API needs to solve. The DMA API
is about mapping some memory resource to a device. For regular main
memory it is easy on most architectures (anything with a sane IOMMU).
For IO resources it is not as straightforward, as it was often left
undefined in the architecture platform documentation or the
interconnect standard. AFAIK mapping a BAR from one PCIe device to
another through the IOMMU works well on recent Intel and AMD platforms.
We will probably need to use some whitelist, as I am not sure this is
something Intel or AMD guarantee, though I believe they want to start
guaranteeing it.

So having one DMA API for regular memory and one for IO memory, aka
resources (dma_map_resource()), sounds like the only sane approach
here. It is fundamentally different memory and we should not try to
muddy the water by having it go through a single common API. There is
no benefit to that besides saving a couple hundred lines of code in
some drivers, and that couple hundred lines of code can be moved to a
common helper.
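
For reference, a minimal sketch of the existing dma_map_resource() path
for mapping a chunk of a peer device's BAR; 'bar_phys' and 'len' are
assumed to come from the exporting driver:

	/* Sketch only: no struct page involved, just a physical BAR address. */
	dma_addr_t dma;

	dma = dma_map_resource(dev, bar_phys, len, DMA_BIDIRECTIONAL, 0);
	if (dma_mapping_error(dev, dma))
		return -EIO;

	/* ... program the importing device with 'dma' ... */

	dma_unmap_resource(dev, dma, len, DMA_BIDIRECTIONAL, 0);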

So to me it is a lot saner to provide a helper that deals with the
different vma types on behalf of the device than to force struct page
into the picture. Something like:

vma_dma_map_range(vma, device, start, end, flags, pa[])
vma_dma_unmap_range(vma, device, start, end, flags, pa[])

VMA_DMA_MAP_FLAG_WRITE
VMA_DMA_MAP_FLAG_PIN

Which would use GUP or special vma handling on behalf of the calling
device, or use a special p2p code path for special vmas. A device that
needs pinning sets the flag, and it is up to the exporting device to
accept that or not. Pinning when using GUP is obvious.

When the vma goes away the importing device must update its device
page table to point at some dummy page or do something sane, because
keeping things mapped after that point does not make sense anymore.
The device is no longer operating on a range of virtual addresses that
makes sense.
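
A hedged usage sketch of these proposed helpers; vma_dma_map_range(),
vma_dma_unmap_range() and the flags are only the names sketched above,
not an existing kernel interface:

	/* Illustrative only: map a pinned, writable range through the
	 * proposed helper, then tear it down. */
	unsigned long npages = (end - start) >> PAGE_SHIFT;
	dma_addr_t *pa;
	int ret;

	pa = kcalloc(npages, sizeof(*pa), GFP_KERNEL);
	if (!pa)
		return -ENOMEM;

	ret = vma_dma_map_range(vma, dev, start, end,
				VMA_DMA_MAP_FLAG_WRITE | VMA_DMA_MAP_FLAG_PIN, pa);
	if (ret) {
		kfree(pa);
		return ret;	/* exporter refused pinning, or GUP failed */
	}

	/* ... program the importing device with pa[] ... */

	vma_dma_unmap_range(vma, dev, start, end,
			    VMA_DMA_MAP_FLAG_WRITE | VMA_DMA_MAP_FLAG_PIN, pa);
	kfree(pa);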

So instead of pushing p2p handling into GUP so as not to disrupt the
existing driver workflow, it is better to provide a helper that handles
all the gory details for the device driver. It

Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Jerome Glisse
On Thu, Jan 31, 2019 at 09:05:01AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 08:44:20PM +, Jason Gunthorpe wrote:
> > Not really, for MRs most drivers care about DMA addresses only. The
> > only reason struct page ever gets involved is because it is part of
> > the GUP, SGL and dma_map family of APIs.
> 
> And the only way you get the DMA address is through the dma mapping
> APIs.  Which except for the little oddball dma_map_resource expect
> a struct page in some form.  And dma_map_resource isn't really up
> to speed for full blown P2P.
> 
> Now we could and maybe eventually should change all this.  But that
> is a pre-requisitive for doing anything more fancy, and not something
> to be hacked around.
> 
> > O_DIRECT seems to be the justification for struct page, but nobody is
> > signing up to make O_DIRECT have the required special GUP/SGL/P2P flow
> > that would be needed to *actually* make that work - so it really isn't
> > a justification today.
> 
> O_DIRECT is just the messenger.  Anything using GUP will need a struct
> page, which is all our interfaces that do I/O directly to user pages.

I do not want to allow GUP to pin I/O space; this would open a
Pandora's box that we do not want to open at all. Many drivers manage
their IO space, and if they get random pinning because some other
kernel bit they have never heard of starts to do GUP on their stuff, it
is going to cause havoc.

So far mmap of a device file has always been special, and that has been
reflected to userspace in all the instances I know of (media and GPU).
Pretending we can handle them like any other vma is a lie, because they
were never designed that way in the first place, and it would be
disruptive to all those drivers.

Minimum disruption with minimum changes is what we should aim for and
is what I am trying to do with this patchset. Using struct page and
allowing GUP would mean rewriting huge chunks of the GPU drivers
(pretty much rewriting their whole memory management) with no benefit
in the end.

When something is special it is better to leave it that way.

Cheers,
Jérôme


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-31 Thread Jerome Glisse
On Thu, Jan 31, 2019 at 09:02:03AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 01:50:27PM -0500, Jerome Glisse wrote:
> > I do not see how VMA changes are any different than using struct page
> > in respect to userspace exposure. Those vma callback do not need to be
> > set by everyone, in fact expectation is that only handful of driver
> > will set those.
> > 
> > How can we do p2p between RDMA and GPU for instance, without exposure
> > to userspace ? At some point you need to tell userspace hey this kernel
> > does allow you to do that :)
> 
> To do RDMA on a memory region you need struct page backіng to start
> with..

No, you do not with this patchset, and there is no reason to tie RDMA
to struct page; it does not provide a single feature we would need. So,
as it can be done without one, and there is no benefit to using one, I
do not see why we should use one.

Cheers,
Jérôme


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-30 Thread Jerome Glisse
On Wed, Jan 30, 2019 at 10:51:55PM +, Jason Gunthorpe wrote:
> On Wed, Jan 30, 2019 at 05:47:05PM -0500, Jerome Glisse wrote:
> > On Wed, Jan 30, 2019 at 10:33:04PM +, Jason Gunthorpe wrote:
> > > On Wed, Jan 30, 2019 at 05:30:27PM -0500, Jerome Glisse wrote:
> > > 
> > > > > What is the problem in the HMM mirror that it needs this restriction?
> > > > 
> > > > No restriction at all here. I think i just wasn't understood.
> > > 
> > > Are you are talking about from the exporting side - where the thing
> > > creating the VMA can really only put one distinct object into it?
> > 
> > The message i was trying to get accross is that HMM mirror will
> > always succeed for everything* except for special vma ie mmap of
> > device file. For those it can only succeed if a p2p_map() call
> > succeed.
> > 
> > So any user of HMM mirror might to know why the mirroring fail ie
> > was it because something exceptional is happening ? Or is it because
> > i was trying to map a special vma which can be forbiden.
> > 
> > Hence why i assume that you might want to know about such p2p_map
> > failure at the time you create the umem odp object as it might be
> > some failure you might want to report differently and handle
> > differently. If you do not care about differentiating OOM or
> > exceptional failure from p2p_map failure than you have nothing to
> > worry about you will get the same error from HMM for both.
> 
> I think my hope here was that we could have some kind of 'trial'
> interface where very eary users can call
> 'hmm_mirror_is_maybe_supported(dev, user_ptr, len)' and get a failure
> indication.
> 
> We probably wouldn't call this on the full address space though

Yes, we can have a special wrapper around the general case that allows
the caller to differentiate failures. So at creation you call the
special flavor and get a proper distinction between errors. Afterward,
during normal operation, any failure is just treated the same way no
matter what the reason is (munmap, mremap, mprotect, ...).


> Beyond that it is just inevitable there can be problems faulting if
> the memory map is messed with after MR is created.
> 
> And here again, I don't want to worry about any particular VMA
> boundaries..

You do not have to worry about boundaries: HMM will return -EFAULT if
there is no valid vma behind the address you are trying to map (or if
the vma protection does not allow you to access it). So then you can
handle that failure just like you do now, which is what my ODP HMM
patch preserves.

Cheers,
Jérôme


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-30 Thread Jerome Glisse
On Wed, Jan 30, 2019 at 10:33:04PM +, Jason Gunthorpe wrote:
> On Wed, Jan 30, 2019 at 05:30:27PM -0500, Jerome Glisse wrote:
> 
> > > What is the problem in the HMM mirror that it needs this restriction?
> > 
> > No restriction at all here. I think i just wasn't understood.
> 
> Are you are talking about from the exporting side - where the thing
> creating the VMA can really only put one distinct object into it?

The message I was trying to get across is that HMM mirror will always
succeed for everything* except for special vmas, i.e. mmap of a device
file. For those it can only succeed if a p2p_map() call succeeds.

So any user of HMM mirror might want to know why the mirroring failed:
was it because something exceptional is happening? Or is it because it
was trying to map a special vma, which can be forbidden?

Hence why I assume that you might want to know about such a p2p_map
failure at the time you create the umem odp object, as it might be a
failure you want to report differently and handle differently. If you
do not care about differentiating OOM or exceptional failures from
p2p_map failures, then you have nothing to worry about; you will get
the same error from HMM for both.

Cheers,
Jérôme

* Everything except when there are exceptional conditions like OOM or
  poisonous memory.


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-30 Thread Jerome Glisse
On Wed, Jan 30, 2019 at 09:56:07PM +, Jason Gunthorpe wrote:
> On Wed, Jan 30, 2019 at 04:45:25PM -0500, Jerome Glisse wrote:
> > On Wed, Jan 30, 2019 at 08:50:00PM +, Jason Gunthorpe wrote:
> > > On Wed, Jan 30, 2019 at 03:43:32PM -0500, Jerome Glisse wrote:
> > > > On Wed, Jan 30, 2019 at 08:11:19PM +, Jason Gunthorpe wrote:
> > > > > On Wed, Jan 30, 2019 at 01:00:02PM -0700, Logan Gunthorpe wrote:
> > > > > 
> > > > > > We never changed SGLs. We still use them to pass p2pdma pages, only 
> > > > > > we
> > > > > > need to be a bit careful where we send the entire SGL. I see no 
> > > > > > reason
> > > > > > why we can't continue to be careful once their in userspace if 
> > > > > > there's
> > > > > > something in GUP to deny them.
> > > > > > 
> > > > > > It would be nice to have heterogeneous SGLs and it is something we
> > > > > > should work toward but in practice they aren't really necessary at 
> > > > > > the
> > > > > > moment.
> > > > > 
> > > > > RDMA generally cannot cope well with an API that requires homogeneous
> > > > > SGLs.. User space can construct complex MRs (particularly with the
> > > > > proposed SGL MR flow) and we must marshal that into a single SGL or
> > > > > the drivers fall apart.
> > > > > 
> > > > > Jerome explained that GPU is worse, a single VMA may have a random mix
> > > > > of CPU or device pages..
> > > > > 
> > > > > This is a pretty big blocker that would have to somehow be fixed.
> > > > 
> > > > Note that HMM takes care of that RDMA ODP with my ODP to HMM patch,
> > > > so what you get for an ODP umem is just a list of dma address you
> > > > can program your device to. The aim is to avoid the driver to care
> > > > about that. The access policy when the UMEM object is created by
> > > > userspace through verbs API should however ascertain that for mmap
> > > > of device file it is only creating a UMEM that is fully covered by
> > > > one and only one vma. GPU device driver will have one vma per logical
> > > > GPU object. I expect other kind of device do that same so that they
> > > > can match a vma to a unique object in their driver.
> > > 
> > > A one VMA rule is not really workable.
> > > 
> > > With ODP VMA boundaries can move around across the lifetime of the MR
> > > and we have no obvious way to fail anything if userpace puts a VMA
> > > boundary in the middle of an existing ODP MR address range.
> > 
> > This is true only for vma that are not mmap of a device file. This is
> > what i was trying to get accross. An mmap of a file is never merge
> > so it can only get split/butcher by munmap/mremap but when that happen
> > you also need to reflect the virtual address space change to the
> > device ie any access to a now invalid range must trigger error.
> 
> Why is it invalid? The address range still has valid process memory?

If you do munmap(A, size) then all addresses in the range [A, A+size)
are invalid. This is what I am referring to here. Same for mremap.

> 
> What is the problem in the HMM mirror that it needs this restriction?

No restriction at all here. I think i just wasn't understood.

> There is also the situation where we create an ODP MR that spans 0 ->
> U64_MAX in the process address space. In this case there are lots of
> different VMAs it covers and we expect it to fully track all changes
> to all VMAs.

Yes, and that works; however, any memory access above TASK_SIZE will
return -EFAULT as that is kernel address space, so you can only access
what is a valid process virtual address.

> 
> So we have to spin up dedicated umem_odps that carefully span single
> VMAs, and somehow track changes to VMA ?

No you do not.

> 
> mlx5 odp does some of this already.. But yikes, this needs some pretty
> careful testing in all these situations.

Sorry if I confused you even more than the first time. Everything
works; you have nothing to worry about :)

> 
> > > I think the HMM mirror API really needs to deal with this for the
> > > driver somehow.
> > 
> > Yes the HMM does deal with this for you, you do not have to worry about
> > it. Sorry if that was not clear. I just wanted to stress that vma that
> > are mmap of a file do not behave like other vma hence when you create
> > the UMEM you can check for those if y

Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-30 Thread Jerome Glisse
On Wed, Jan 30, 2019 at 08:50:00PM +, Jason Gunthorpe wrote:
> On Wed, Jan 30, 2019 at 03:43:32PM -0500, Jerome Glisse wrote:
> > On Wed, Jan 30, 2019 at 08:11:19PM +, Jason Gunthorpe wrote:
> > > On Wed, Jan 30, 2019 at 01:00:02PM -0700, Logan Gunthorpe wrote:
> > > 
> > > > We never changed SGLs. We still use them to pass p2pdma pages, only we
> > > > need to be a bit careful where we send the entire SGL. I see no reason
> > > > why we can't continue to be careful once their in userspace if there's
> > > > something in GUP to deny them.
> > > > 
> > > > It would be nice to have heterogeneous SGLs and it is something we
> > > > should work toward but in practice they aren't really necessary at the
> > > > moment.
> > > 
> > > RDMA generally cannot cope well with an API that requires homogeneous
> > > SGLs.. User space can construct complex MRs (particularly with the
> > > proposed SGL MR flow) and we must marshal that into a single SGL or
> > > the drivers fall apart.
> > > 
> > > Jerome explained that GPU is worse, a single VMA may have a random mix
> > > of CPU or device pages..
> > > 
> > > This is a pretty big blocker that would have to somehow be fixed.
> > 
> > Note that HMM takes care of that RDMA ODP with my ODP to HMM patch,
> > so what you get for an ODP umem is just a list of dma address you
> > can program your device to. The aim is to avoid the driver to care
> > about that. The access policy when the UMEM object is created by
> > userspace through verbs API should however ascertain that for mmap
> > of device file it is only creating a UMEM that is fully covered by
> > one and only one vma. GPU device driver will have one vma per logical
> > GPU object. I expect other kind of device do that same so that they
> > can match a vma to a unique object in their driver.
> 
> A one VMA rule is not really workable.
> 
> With ODP VMA boundaries can move around across the lifetime of the MR
> and we have no obvious way to fail anything if userpace puts a VMA
> boundary in the middle of an existing ODP MR address range.

This is true only for vmas that are not mmaps of a device file. This is
what I was trying to get across. An mmap of a file is never merged, so it
can only get split/butchered by munmap/mremap, but when that happens you
also need to reflect the virtual address space change to the device, ie
any access to a now invalid range must trigger an error.

> 
> I think the HMM mirror API really needs to deal with this for the
> driver somehow.

Yes the HMM does deal with this for you, you do not have to worry about
it. Sorry if that was not clear. I just wanted to stress that vmas that
are mmaps of a file do not behave like other vmas, hence when you create
the UMEM you can check for those if you feel the need.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-30 Thread Jerome Glisse
On Wed, Jan 30, 2019 at 08:11:19PM +, Jason Gunthorpe wrote:
> On Wed, Jan 30, 2019 at 01:00:02PM -0700, Logan Gunthorpe wrote:
> 
> > We never changed SGLs. We still use them to pass p2pdma pages, only we
> > need to be a bit careful where we send the entire SGL. I see no reason
> > why we can't continue to be careful once their in userspace if there's
> > something in GUP to deny them.
> > 
> > It would be nice to have heterogeneous SGLs and it is something we
> > should work toward but in practice they aren't really necessary at the
> > moment.
> 
> RDMA generally cannot cope well with an API that requires homogeneous
> SGLs.. User space can construct complex MRs (particularly with the
> proposed SGL MR flow) and we must marshal that into a single SGL or
> the drivers fall apart.
> 
> Jerome explained that GPU is worse, a single VMA may have a random mix
> of CPU or device pages..
> 
> This is a pretty big blocker that would have to somehow be fixed.

Note that HMM takes care of that for RDMA ODP with my ODP-to-HMM patch,
so what you get for an ODP umem is just a list of dma addresses you
can program your device with. The aim is to avoid the driver having to
care about that. The access policy when the UMEM object is created by
userspace through the verbs API should however ascertain that, for an
mmap of a device file, it is only creating a UMEM that is fully covered
by one and only one vma (a rough sketch of such a check follows below).
GPU device drivers will have one vma per logical GPU object. I expect
other kinds of device to do the same so that they can match a vma to a
unique object in their driver.
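
To make that check concrete, something along these lines at umem creation
time would do. This is a rough sketch only; the helper name and the exact
"is this a device file vma" test are made up here, not code from the
ODP/HMM patches:

    static bool umem_covered_by_single_device_vma(struct mm_struct *mm,
                                                  unsigned long start,
                                                  unsigned long length)
    {
        struct vm_area_struct *vma;
        bool ret = false;

        down_read(&mm->mmap_sem);
        vma = find_vma(mm, start);
        /*
         * One and only one vma must cover [start, start + length) and it
         * should look like an mmap of a device file (VM_PFNMAP/VM_IO is a
         * common giveaway; the real check would be subsystem specific).
         */
        if (vma && vma->vm_start <= start &&
            start + length <= vma->vm_end &&
            vma->vm_file && (vma->vm_flags & (VM_PFNMAP | VM_IO)))
            ret = true;
        up_read(&mm->mmap_sem);

        return ret;
    }

The UMEM creation path could then refuse (or fall back to the regular ODP
path) when this returns false.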

> 
> > That doesn't even necessarily need to be the case. For HMM, I
> > understand, struct pages may not point to any accessible memory and the
> > memory that backs it (or not) may change over the life time of it. So
> > they don't have to be strictly tied to BARs addresses. p2pdma pages are
> > strictly tied to BAR addresses though.
> 
> No idea, but at least for this case I don't think we need magic HMM
> pages to make simple VMA ops p2p_map/umap work..

Yes, you do not need struct page for a simple driver. If we start creating
struct page for every PCIE BAR we are going to waste a lot of memory and
resources for no good reason. I doubt all of the PCIE BARs of a device
enabling p2p will ever be mapped as p2p. So simple drivers do not need
struct page, and GPU drivers that do not use HMM (all GPUs that are more
than 2 years old) do not need struct page. Struct page is a burden here
more than anything else. I have not seen one good thing that struct page
gives you.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-30 Thread Jerome Glisse
On Wed, Jan 30, 2019 at 12:52:44PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-30 12:22 p.m., Jerome Glisse wrote:
> > On Wed, Jan 30, 2019 at 06:56:59PM +, Jason Gunthorpe wrote:
> >> On Wed, Jan 30, 2019 at 10:17:27AM -0700, Logan Gunthorpe wrote:
> >>>
> >>>
> >>> On 2019-01-29 9:18 p.m., Jason Gunthorpe wrote:
> >>>> Every attempt to give BAR memory to struct page has run into major
> >>>> trouble, IMHO, so I like that this approach avoids that.
> >>>>
> >>>> And if you don't have struct page then the only kernel object left to
> >>>> hang meta data off is the VMA itself.
> >>>>
> >>>> It seems very similar to the existing P2P work between in-kernel
> >>>> consumers, just that VMA is now mediating a general user space driven
> >>>> discovery process instead of being hard wired into a driver.
> >>>
> >>> But the kernel now has P2P bars backed by struct pages and it works
> >>> well. 
> >>
> >> I don't think it works that well..
> >>
> >> We ended up with a 'sgl' that is not really a sgl, and doesn't work
> >> with many of the common SGL patterns. sg_copy_buffer doesn't work,
> >> dma_map, doesn't work, sg_page doesn't work quite right, etc.
> >>
> >> Only nvme and rdma got the special hacks to make them understand these
> >> p2p-sgls, and I'm still not convinced some of the RDMA drivers that
> >> want access to CPU addresses from the SGL (rxe, usnic, hfi, qib) don't
> >> break in this scenario.
> >>
> >> Since the SGLs become broken, it pretty much means there is no path to
> >> make GUP work generically, we have to go through and make everything
> >> safe to use with p2p-sgls before allowing GUP. Which, frankly, sounds
> >> impossible with all the competing objections.
> >>
> >> But GPU seems to have a problem unrelated to this - what Jerome wants
> >> is to have two faulting domains for VMA's - visible-to-cpu and
> >> visible-to-dma. The new op is essentially faulting the pages into the
> >> visible-to-dma category and leaving them invisible-to-cpu.
> >>
> >> So that duality would still have to exists, and I think p2p_map/unmap
> >> is a much simpler implementation than trying to create some kind of
> >> special PTE in the VMA..
> >>
> >> At least for RDMA, struct page or not doesn't really matter. 
> >>
> >> We can make struct pages for the BAR the same way NVMe does.  GPU is
> >> probably the same, just with more mememory at stake?  
> >>
> >> And maybe this should be the first implementation. The p2p_map VMA
> >> operation should return a SGL and the caller should do the existing
> >> pci_p2pdma_map_sg() flow.. 
> > 
> > For GPU it would not work, GPU might want to use main memory (because
> > it is running out of BAR space) it is a lot easier if the p2p_map
> > callback calls the right dma map function (for page or io) rather than
> > having to define some format that would pass down the information.
> 
> >>
> >> Worry about optimizing away the struct page overhead later?
> > 
> > Struct page do not fit well for GPU as the BAR address can be reprogram
> > to point to any page inside the device memory (think 256M BAR versus
> > 16GB device memory). Forcing struct page on GPU driver would require
> > major surgery to the GPU driver inner working and there is no benefit
> > to have from the struct page. So it is hard to justify this.
> 
> I think we have to consider the struct pages to track the address space,
> not what backs it (essentially what HMM is doing). If we need to add
> operations for the driver to map the address space/struct pages back to
> physical memory then do that. Creating a whole new idea that's tied to
> userspace VMAs still seems wrong to me.

VMA is the object RDMA works on, and GPU drivers have been working with
VMAs too, where a VMA is tied to only one specific GPU object. So the most
disruptive approach here is using struct page. It was never used and
will not be used by many drivers. Updating those to struct page is too
risky and too many changes. The vma callback is something you can
remove at any time if you have something better that does not need major
surgery to GPU drivers.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-30 Thread Jerome Glisse
On Wed, Jan 30, 2019 at 06:56:59PM +, Jason Gunthorpe wrote:
> On Wed, Jan 30, 2019 at 10:17:27AM -0700, Logan Gunthorpe wrote:
> > 
> > 
> > On 2019-01-29 9:18 p.m., Jason Gunthorpe wrote:
> > > Every attempt to give BAR memory to struct page has run into major
> > > trouble, IMHO, so I like that this approach avoids that.
> > > 
> > > And if you don't have struct page then the only kernel object left to
> > > hang meta data off is the VMA itself.
> > > 
> > > It seems very similar to the existing P2P work between in-kernel
> > > consumers, just that VMA is now mediating a general user space driven
> > > discovery process instead of being hard wired into a driver.
> > 
> > But the kernel now has P2P bars backed by struct pages and it works
> > well. 
> 
> I don't think it works that well..
> 
> We ended up with a 'sgl' that is not really a sgl, and doesn't work
> with many of the common SGL patterns. sg_copy_buffer doesn't work,
> dma_map, doesn't work, sg_page doesn't work quite right, etc.
> 
> Only nvme and rdma got the special hacks to make them understand these
> p2p-sgls, and I'm still not convinced some of the RDMA drivers that
> want access to CPU addresses from the SGL (rxe, usnic, hfi, qib) don't
> break in this scenario.
> 
> Since the SGLs become broken, it pretty much means there is no path to
> make GUP work generically, we have to go through and make everything
> safe to use with p2p-sgls before allowing GUP. Which, frankly, sounds
> impossible with all the competing objections.
> 
> But GPU seems to have a problem unrelated to this - what Jerome wants
> is to have two faulting domains for VMA's - visible-to-cpu and
> visible-to-dma. The new op is essentially faulting the pages into the
> visible-to-dma category and leaving them invisible-to-cpu.
> 
> So that duality would still have to exists, and I think p2p_map/unmap
> is a much simpler implementation than trying to create some kind of
> special PTE in the VMA..
> 
> At least for RDMA, struct page or not doesn't really matter. 
> 
> We can make struct pages for the BAR the same way NVMe does.  GPU is
> probably the same, just with more mememory at stake?  
> 
> And maybe this should be the first implementation. The p2p_map VMA
> operation should return a SGL and the caller should do the existing
> pci_p2pdma_map_sg() flow.. 

For GPU it would not work. A GPU might want to use main memory (because
it is running out of BAR space), and it is a lot easier if the p2p_map
callback calls the right dma map function (for a page or for io) rather
than having to define some format that would pass down the information.

> 
> Worry about optimizing away the struct page overhead later?

Struct page does not fit well for GPU as the BAR address can be
reprogrammed to point to any page inside the device memory (think 256M BAR
versus 16GB device memory). Forcing struct page on GPU drivers would
require major surgery to the GPU driver inner workings and there is no
benefit to be had from struct page. So it is hard to justify this.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-30 Thread Jerome Glisse
On Wed, Jan 30, 2019 at 11:13:11AM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-30 10:44 a.m., Jason Gunthorpe wrote:
> > I don't see why a special case with a VMA is really that different.
> 
> Well one *really* big difference is the VMA changes necessarily expose
> specialized new functionality to userspace which has to be supported
> forever and may be difficult to change. The p2pdma code is largely
> in-kernel and we can rework and change the interfaces all we want as we
> improve our struct page infrastructure.

I do not see how VMA changes are any different from using struct page
with respect to userspace exposure. Those vma callbacks do not need to be
set by everyone; in fact the expectation is that only a handful of drivers
will set them.

How can we do p2p between RDMA and GPU, for instance, without exposure
to userspace? At some point you need to tell userspace: hey, this kernel
does allow you to do that :)

RDMA works on vmas, and a GPU driver can easily set up a vma for an
object, hence why the vma sounds like a logical place. In fact a vma
(mmap of a device file) is a very common device driver pattern.

In the model I am proposing the exporting device is in control of
policy, ie whether or not to allow the peer to peer mapping. So each
device driver can define a proper device-specific API to enable and
expose that feature to userspace.

If they do, the only thing we have to preserve is the end result for
the user. The userspace does not care one bit if we achieve this in
the kernel with a set of new callbacks within the vm_operations struct
or in some other way. Only the end result matters.

So the question is: do we want to allow RDMA to access GPU driver objects?
I believe we do; there are people using non-upstream solutions with open
source drivers to do just that, so it is testimony that there are users
for this. More use cases have been proposed too.

> 
> I'd also argue that p2pdma isn't nearly as specialized as this VMA thing
> and can be used pretty generically to do other things. Though, the other
> ideas we've talked about doing are pretty far off and may have other
> challenges.

I believe p2p is highly specialized on non cache-coherent interconnect
platforms like x86 with PCIE. So I do not think that using struct page
for this is a good idea; it is not warranted/needed, and it can only be
problematic if some random kernel code gets hold of those struct pages
without understanding that they are not regular memory.

I believe the vma callbacks are the simplest solution with the minimum
burden for the device driver and for the kernel. If any better solution
emerges there is nothing that would block us from removing this and
replacing it with the other solution.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-30 Thread Jerome Glisse
On Wed, Jan 30, 2019 at 06:26:53PM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 10:55:43AM -0500, Jerome Glisse wrote:
> > Even outside GPU driver, device driver like RDMA just want to share their
> > doorbell to other device and they do not want to see those doorbell page
> > use in direct I/O or anything similar AFAICT.
> 
> At least Mellanox HCA support and inline data feature where you
> can copy data directly into the BAR.  For something like a usrspace
> NVMe target it might be very useful to do direct I/O straight into
> the BAR for that.

And what I am proposing is not exclusive of that. If an exporting device
wants to have struct page for its BAR then it can do so. What I do not
want is to impose that burden on everyone, as many devices do not want
or do not care for that. Moreover, having struct page and allowing that
struct page to trickle down into obscure corners of the kernel means that
exporters that want that will also have the burden to check that what
they are doing does not end up in something terribly bad.

While I would like one API that fits all, I do not think that we can
sanely do that for P2P. There are too many differences between how
different devices expose and manage their BAR to make any such attempt
reasonably sane.

Maybe things will evolve organically, but for now I do not see a way
outside the API I am proposing (again, this is not exclusive of the struct
page API that is upstream; both can co-exist and a device can use both
or just one).

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-30 Thread Jerome Glisse
On Wed, Jan 30, 2019 at 10:33:39AM +, Koenig, Christian wrote:
> Am 30.01.19 um 09:02 schrieb Christoph Hellwig:
> > On Tue, Jan 29, 2019 at 08:58:35PM +, Jason Gunthorpe wrote:
> >> On Tue, Jan 29, 2019 at 01:39:49PM -0700, Logan Gunthorpe wrote:
> >>
> >>> implement the mapping. And I don't think we should have 'special' vma's
> >>> for this (though we may need something to ensure we don't get mapping
> >>> requests mixed with different types of pages...).
> >> I think Jerome explained the point here is to have a 'special vma'
> >> rather than a 'special struct page' as, really, we don't need a
> >> struct page at all to make this work.
> >>
> >> If I recall your earlier attempts at adding struct page for BAR
> >> memory, it ran aground on issues related to O_DIRECT/sgls, etc, etc.
> > Struct page is what makes O_DIRECT work, using sgls or biovecs, etc on
> > it work.  Without struct page none of the above can work at all.  That
> > is why we use struct page for backing BARs in the existing P2P code.
> > Not that I'm a particular fan of creating struct page for this device
> > memory, but without major invasive surgery to large parts of the kernel
> > it is the only way to make it work.
> 
> The problem seems to be that struct page does two things:
> 
> 1. Memory management for system memory.
> 2. The object to work with in the I/O layer.
> 
> This was done because a good part of that stuff overlaps, like reference 
> counting how often a page is used.  The problem now is that this doesn't 
> work very well for device memory in some cases.
> 
> For example on GPUs you usually have a large amount of memory which is 
> not even accessible by the CPU. In other words you can't easily create a 
> struct page for it because you can't reference it with a physical CPU 
> address.
> 
> Maybe struct page should be split up into smaller structures? I mean 
> it's really overloaded with data.

I think the simpler answer is that we do not want to allow GUP or
anything similar to pin BAR or device memory. Doing so can only hurt us
long term by fragmenting the GPU memory and forbidding us to move things
around. For transparent use of device memory within a process, pinning
is definitely forbidden.

I do not see any good reason we would like to pin device memory for
the existing GPU GEM objects. Userspace has always had very low
expectations on what it can do with an mmap of those objects and I believe
it is better to keep expectations low here and say nothing will work with
those pointers. I just do not see a valid and compelling use case to
change that :)

Even outside GPU drivers, device drivers like RDMA just want to share
their doorbell with other devices and they do not want to see those
doorbell pages used in direct I/O or anything similar AFAICT.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-30 Thread Jerome Glisse
On Wed, Jan 30, 2019 at 09:00:06AM +0100, Christoph Hellwig wrote:
> On Wed, Jan 30, 2019 at 04:18:48AM +, Jason Gunthorpe wrote:
> > Every attempt to give BAR memory to struct page has run into major
> > trouble, IMHO, so I like that this approach avoids that.
> 
> Way less problems than not having struct page for doing anything
> non-trivial.  If you map the BAR to userspace with remap_pfn_range
> and friends the mapping is indeed very simple.  But any operation
> that expects a page structure, which is at least everything using
> get_user_pages won't work.
> 
> So you can't do direct I/O to your remapped BAR, you can't create MRs
> on it, etc, etc.

We do not want direct I/O; in fact at least for GPU we want to seldom
allow access to an object's vma, so the fewer things that can access it
the happier we are :) All the GPU userspace driver APIs (OpenGL, OpenCL,
Vulkan, ...) that expose any such mapping to the application are very
clear on the limitation, which is often worded: the only valid thing is
direct CPU access (no syscall can be used with those pointers).

So application developers already have low expectations on what is valid
and allowed to do.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-30 Thread Jerome Glisse
On Wed, Jan 30, 2019 at 04:30:27AM +, Jason Gunthorpe wrote:
> On Tue, Jan 29, 2019 at 07:08:06PM -0500, Jerome Glisse wrote:
> > On Tue, Jan 29, 2019 at 11:02:25PM +, Jason Gunthorpe wrote:
> > > On Tue, Jan 29, 2019 at 03:44:00PM -0500, Jerome Glisse wrote:
> > > 
> > > > > But this API doesn't seem to offer any control - I thought that
> > > > > control was all coming from the mm/hmm notifiers triggering 
> > > > > p2p_unmaps?
> > > > 
> > > > The control is within the driver implementation of those callbacks. 
> > > 
> > > Seems like what you mean by control is 'the exporter gets to choose
> > > the physical address at the instant of map' - which seems reasonable
> > > for GPU.
> > > 
> > > 
> > > > will only allow p2p map to succeed for objects that have been tagged by 
> > > > the
> > > > userspace in some way ie the userspace application is in control of what
> > > > can be map to peer device.
> > > 
> > > I would have thought this means the VMA for the object is created
> > > without the map/unmap ops? Or are GPU objects and VMAs unrelated?
> > 
> > GPU object and VMA are unrelated in all open source GPU driver i am
> > somewhat familiar with (AMD, Intel, NVidia). You can create a GPU
> > object and never map it (and thus never have it associated with a
> > vma) and in fact this is very common. For graphic you usualy only
> > have hand full of the hundreds of GPU object your application have
> > mapped.
> 
> I mean the other way does every VMA with a p2p_map/unmap point to
> exactly one GPU object?
> 
> ie I'm surprised you say that p2p_map needs to have policy, I would
> have though the policy is applied when the VMA is created (ie objects
> that are not for p2p do not have p2p_map set), and even for GPU
> p2p_map should really only have to do with window allocation and pure
> 'can I even do p2p' type functionality.

All userspace APIs to enable p2p happen after object creation, and in
some cases they are mutable, ie you can decide to no longer share the
object (userspace application decision). The BAR address space is a
resource from the GPU driver point of view and thus from the userspace
point of view. As such, decisions that affect how it is used and what
objects can use it can change over the application lifetime.

This is why I would like to allow the kernel driver to apply any such
access policy decided by the application on its objects (on top of
which the kernel GPU driver can apply its own policy for GPU resource
sharing by forcing some objects to main memory).


> 
> > Idea is that we can only ask exporter to be predictable and still allow
> > them to fail if things are really going bad.
> 
> I think hot unplug / PCI error recovery is one of the 'really going
> bad' cases..

A GPU can hang and all data becomes _undefined_; it can also be suspended
to save power (think laptop with a discrete GPU for instance). GPU threads
can be killed ... So there are a few cases I can think of where either you
want to kill the p2p mapping and make sure the importer is aware and
might have a chance to report back through its own userspace API, or
at the very least fall back to dummy pages. In some of the above cases,
for instance suspend, you just want to move things around to allow
shutting down device memory.


> > I think i put it in the comment above the ops but in any cases i should
> > write something in documentation with example and thorough guideline.
> > Note that there won't be any mmu notifier to mmap of a device file
> > unless the device driver calls for it or there is a syscall like munmap
> > or mremap or mprotect well any syscall that work on vma.
> 
> This is something we might need to explore, does calling
> zap_vma_ptes() invoke enough notifiers that a MMU notifiers or HMM
> mirror consumer will release any p2p maps on that VMA?

Yes it does.

> 
> > If we ever want to support full pin then we might have to add a
> > flag so that GPU driver can refuse an importer that wants things
> > pin forever.
> 
> This would become interesting for VFIO and RDMA at least - I don't
> think VFIO has anything like SVA so it would want to import a p2p_map
> and indicate that it will not respond to MMU notifiers.
> 
> GPU can refuse, but maybe RDMA would allow it...

Ok, I will add a flag field in the next post. GPUs could allow pinning but
they would most likely use main memory for any such object, hence it is no
longer really p2p, but at least both devices look at the same data.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-29 Thread Jerome Glisse
On Tue, Jan 29, 2019 at 06:17:43PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-29 4:47 p.m., Jerome Glisse wrote:
> > The whole point is to allow to use device memory for range of virtual
> > address of a process when it does make sense to use device memory for
> > that range. So they are multiple cases where it does make sense:
> > [1] - Only the device is accessing the range and they are no CPU access
> >   For instance the program is executing/running a big function on
> >   the GPU and they are not concurrent CPU access, this is very
> >   common in all the existing GPGPU code. In fact AFAICT It is the
> >   most common pattern. So here you can use HMM private or public
> >   memory.
> > [2] - Both device and CPU access a common range of virtul address
> >   concurrently. In that case if you are on a platform with cache
> >   coherent inter-connect like OpenCAPI or CCIX then you can use
> >   HMM public device memory and have both access the same memory.
> >   You can not use HMM private memory.
> > 
> > So far on x86 we only have PCIE and thus so far on x86 we only have
> > private HMM device memory that is not accessible by the CPU in any
> > way.
> 
> I feel like you're just moving the rug out from under us... Before you
> said ignore HMM and I was asking about the use case that wasn't using
> HMM and how it works without HMM. In response, you just give me *way*
> too much information describing HMM. And still, as best as I can see,
> managing DMA mappings (which is different from the userspace mappings)
> for GPU P2P should be handled by HMM and the userspace mappings should
> *just* link VMAs to HMM pages using the standard infrastructure we
> already have.

For HMM P2P mapping we need to call into the driver to know if the driver
wants to fall back to main memory (running out of BAR addresses) or if
it can allow a peer device to directly access its memory. We also need
the call into the exporting device driver, as only the exporting device
driver can map the HMM page pfn to some physical BAR address (which would
be allocated by the driver for the GPU).

I wanted to make sure the HMM case was understood too, sorry if it
caused confusion with the non-HMM case which I describe below.


> >> And what struct pages are actually going to be backing these VMAs if
> >> it's not using HMM?
> > 
> > When you have some range of virtual address migrated to HMM private
> > memory then the CPU pte are special swap entry and they behave just
> > as if the memory was swapped to disk. So CPU access to those will
> > fault and trigger a migration back to main memory.
> 
> This isn't answering my question at all... I specifically asked what is
> backing the VMA when we are *not* using HMM.

So when you are not using HMM, ie an existing GPU object without HMM, then
like I said you do not have any valid pte most of the time inside the
CPU page table, ie the GPU driver only populates the pte with a valid
entry when there is a CPU page fault, and it clears those as soon as the
corresponding object is used by the GPU. In fact some drivers also unmap
it aggressively from the BAR, making the memory totally inaccessible to
anything but the GPU.

GPU drivers do not like CPU mappings, they are quite aggressive about
clearing them. Then everything I said about having userspace decide
which objects can be shared, and with whom, does apply here. So for GPU
you do want to give control to the GPU driver and you do not want to
require valid CPU ptes for the vma, so that the exporting driver can
return valid addresses to the importing peer device only.

Also the exporting device driver might decide to fall back to main memory
(running out of BAR addresses for instance). So again here we want to
go through the exporting device driver so that it can take the right
action.

So the expected pattern (for a GPU driver) is (rough code sketch after
these lists):
    - no valid pte for the special vma (mmap of a device file)
    - the importing device calls p2p_map() for the vma; if it succeeds the
      first time then we expect it will succeed for the same vma and
      range the next time we call it.
    - the exporting driver can either return physical addresses of pages
      in its BAR space that point to the correct device memory, or
      fall back to main memory

Then at any point in time:
    - if the GPU driver wants to move the object around (for whatever
      reason) it calls zap_vma_ptes(); the fact that there is no
      valid CPU pte does not matter, it will call the mmu notifiers and
      thus any importing device driver will invalidate its mapping
    - an importing device driver that lost the mapping due to mmu
      notification can re-map by re-calling p2p_map() (it should
      check that the vma is still valid ...) and the guideline is for
      the exporting device driver to succeed and return valid
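
To illustrate the above pattern, here is a rough sketch of the exporter
side. All the my_gpu_* names are hypothetical, and the p2p_map/p2p_unmap
members are the callbacks proposed in this series, whose exact signatures
may differ; this is an illustration, not driver code:

    static const struct vm_operations_struct my_gpu_vm_ops = {
        .fault     = my_gpu_vma_fault,  /* CPU ptes populated only on CPU fault */
        .p2p_map   = my_gpu_p2p_map,    /* exporter picks BAR or main memory */
        .p2p_unmap = my_gpu_p2p_unmap,
    };

    static int my_gpu_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        vma->vm_private_data = my_gpu_object_lookup(filp, vma->vm_pgoff);
        vma->vm_ops = &my_gpu_vm_ops;
        /* no pte is established here; most of the time the vma has none */
        return 0;
    }

    /* when the GPU driver wants to move the object (eviction, defrag, ...) */
    static void my_gpu_evict_object(struct my_gpu_object *obj)
    {
        /*
         * zap_vma_ptes() goes through the mmu notifiers, so any importing
         * device invalidates its mapping and must re-call p2p_map() later.
         */
        zap_vma_ptes(obj->vma, obj->vma->vm_start,
                     obj->vma->vm_end - obj->vma->vm_start);
    }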
  

Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-29 Thread Jerome Glisse
On Tue, Jan 29, 2019 at 11:02:25PM +, Jason Gunthorpe wrote:
> On Tue, Jan 29, 2019 at 03:44:00PM -0500, Jerome Glisse wrote:
> 
> > > But this API doesn't seem to offer any control - I thought that
> > > control was all coming from the mm/hmm notifiers triggering p2p_unmaps?
> > 
> > The control is within the driver implementation of those callbacks. 
> 
> Seems like what you mean by control is 'the exporter gets to choose
> the physical address at the instant of map' - which seems reasonable
> for GPU.
> 
> 
> > will only allow p2p map to succeed for objects that have been tagged by the
> > userspace in some way ie the userspace application is in control of what
> > can be map to peer device.
> 
> I would have thought this means the VMA for the object is created
> without the map/unmap ops? Or are GPU objects and VMAs unrelated?

GPU objects and VMAs are unrelated in all the open source GPU drivers I am
somewhat familiar with (AMD, Intel, NVidia). You can create a GPU
object and never map it (and thus never have it associated with a
vma), and in fact this is very common. For graphics you usually only
have a handful of the hundreds of GPU objects of your application
actually mapped.

The control for peer to peer can also be a mutable property of the
object, ie userspace does an ioctl on the GPU driver which creates an
object; some time after the object is created the userspace does other
ioctls to allow exporting the object to another specific device; again
this results in an ioctl to the device driver, and those ioctls set flags
and update the GPU object kernel structure with all the info.

In the meantime you have no control over when another driver might call
the vma p2p callbacks. So you must have registered the vma with
vm_operations that include the p2p_map and p2p_unmap. Those driver
functions will check the object kernel structure each time they get
called and act accordingly.



> > For moving things around after a successful p2p_map yes the exporting
> > device have to call for instance zap_vma_ptes() or something
> > similar.
> 
> Okay, great, RDMA needs this flow for hotplug - we zap the VMA's when
> unplugging the PCI device and we can delay the PCI unplug completion
> until all the p2p_unmaps are called...
> 
> But in this case a future p2p_map will have to fail as the BAR no
> longer exists. How to handle this?

So the comment above the callback (I should write a more thorough
guideline and documentation) states that the exporter should/(must?) be
predictable, ie if an importer device calls p2p_map() once on a vma and it
does succeed, then if the same device calls p2p_map() again on the same
vma, and if the vma is still valid (ie no unmap and it does not correspond
to a different object ...), then p2p_map() should/(must?) succeed.

The idea is that the importer would do a first call to p2p_map() when it
sets up its own object, and report failure to userspace if that fails. If
it does succeed then we should never have an issue the next time we call
p2p_map() (after the mapping has been invalidated by an mmu notifier for
instance). So it will succeed just like the first call (again assuming the
vma is still valid).

The idea is that we can only ask the exporter to be predictable and still
allow it to fail if things are really going bad.


> > > I would think that the importing driver can assume the BAR page is
> > > kept alive until it calls unmap (presumably triggered by notifiers)?
> > > 
> > > ie the exporting driver sees the BAR page as pinned until unmap.
> > 
> > The intention with this patchset is that it is not pin ie the importer
> > device _must_ abide by all mmu notifier invalidations and they can
> > happen at anytime. The importing device can however re-p2p_map the
> > same range after an invalidation.
> >
> > I would like to restrict this to importer that can invalidate for
> > now because i believe all the first device to use can support the
> > invalidation.
> 
> This seems reasonable (and sort of says importers not getting this
> from HMM need careful checking), was this in the comment above the
> ops?

I think I put it in the comment above the ops, but in any case I should
write something in the documentation with examples and a thorough
guideline. Note that there won't be any mmu notifier for an mmap of a
device file unless the device driver calls for it or there is a syscall
like munmap or mremap or mprotect, well, any syscall that works on the vma.

So assuming neither the application nor the driver is doing something
stupid, then the result of p2p_map can stay valid until the importer
is done and calls p2p_unmap of its own free will. This is what I do
expect for this. But for GPU I would still like to allow the GPU driver
to evict (and thus invalidate importer mappings) to main memory or
defragment their BAR

Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-29 Thread Jerome Glisse
On Tue, Jan 29, 2019 at 03:58:45PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-29 2:50 p.m., Jerome Glisse wrote:
> > No this is the non HMM case i am talking about here. Fully ignore HMM
> > in this frame. A GPU driver that do not support or use HMM in anyway
> > has all the properties and requirement i do list above. So all the points
> > i was making are without HMM in the picture whatsoever. I should have
> > posted this a separate patches to avoid this confusion.
> > 
> > Regarding your HMM question. You can not map HMM pages, all code path
> > that would try that would trigger a migration back to regular memory
> > and will use the regular memory for CPU access.
> > 
> 
> I thought this was the whole point of HMM... And eventually it would
> support being able to map the pages through the BAR in cooperation with
> the driver. If not, what's that whole layer for? Why not just have HMM
> handle this situation?

The whole point is to allow using device memory for a range of virtual
addresses of a process when it does make sense to use device memory for
that range. There are multiple cases where it does make sense:
[1] - Only the device is accessing the range and there is no CPU access.
      For instance the program is executing/running a big function on
      the GPU and there is no concurrent CPU access; this is very
      common in all the existing GPGPU code. In fact AFAICT it is the
      most common pattern. So here you can use HMM private or public
      memory.
[2] - Both device and CPU access a common range of virtual addresses
      concurrently. In that case, if you are on a platform with a cache
      coherent inter-connect like OpenCAPI or CCIX then you can use
      HMM public device memory and have both access the same memory.
      You can not use HMM private memory.

So far on x86 we only have PCIE and thus so far on x86 we only have
private HMM device memory that is not accessible by the CPU in any
way.

It does not make that memory useless, far from it. Having only the
device work on the dataset while CPU is either waiting or accessing
something else is very common.


Then HMM is a toolbox, so here are some of the tools:
HMM mirror - helper to mirror a process address space onto a device,
    ie this is SVM (Shared Virtual Memory)/SVA (Shared Virtual Address)
    in software

HMM private memory - allows registering device memory with the linux
    kernel. The memory is not CPU accessible. The memory is fully managed
    by the device driver. What and when to migrate is under the control
    of the device driver.

HMM public memory - allows registering device memory with the linux
    kernel. The memory must be CPU accessible and cache coherent and
    abide by the platform memory model. The memory is fully managed by
    the device driver because otherwise it would disrupt the device
    driver operation (for instance a GPU can also be used for graphics).

Migration - helper to perform migration to and from device memory.
    It does not make any decision on its own, it just performs all the
    steps in the right order and calls back into the driver to get the
    migration going.

It is up to the device driver to implement heuristics and provide a
userspace API to control memory migration to and from device memory. For
device private memory, on CPU page fault the kernel will force a migration
back to system memory so that the CPU can access the memory. What matters
here is that the memory model of the platform is intact and thus you can
safely use CPU atomic operations or rely on your platform memory model
for your program. Note that long term I would like to define a common API
to expose to userspace to manage memory binding to specific device
memory, so that we can mix and match multiple device memories into a
single process and define policy too.

Also, CPU atomic instructions to a PCIE BAR give _undefined_ results, and
in fact on some AMD/Intel platforms it leads to weirdness/crash/freeze. So
obviously we can not map a PCIE BAR to the CPU without breaking the memory
model. Moreover, on PCIE you might not be able to resize the BAR to
expose all the device memory. GPUs can have several gigabytes of memory
and not all of them support PCIE BAR resize, and sometimes PCIE BAR
resize does not work either because of bios/firmware issues or simply
because you are running out of IO space.

So on x86 we are stuck with HMM private memory; I am hoping that some
day in the future we will have CCIX or something similar. But for now
we have to work with what we have.

> And what struct pages are actually going to be backing these VMAs if
> it's not using HMM?

When you have some range of virtual addresses migrated to HMM private
memory then the CPU ptes are special swap entries and they behave just
as if the memory was swapped to disk. So CPU accesses to those will
fault and trigger a migration back to main memory.

We still want to 

Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-29 Thread Jerome Glisse
On Tue, Jan 29, 2019 at 02:30:49PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-29 1:57 p.m., Jerome Glisse wrote:
> > GPU driver must be in control and must be call to. Here there is 2 cases
> > in this patchset and i should have instead posted 2 separate patchset as
> > it seems that it is confusing things.
> > 
> > For the HMM page, the physical address of the page ie the pfn does not
> > correspond to anything ie there is nothing behind it. So the importing
> > device has no idea how to get a valid physical address from an HMM page
> > only the device driver exporting its memory with HMM device memory knows
> > that.
> > 
> > 
> > For the special vma ie mmap of a device file. GPU driver do manage their
> > BAR ie the GPU have a page table that map BAR page to GPU memory and the
> > driver _constantly_ update this page table, it is reflected by invalidating
> > the CPU mapping. In fact most of the time the CPU mapping of GPU object are
> > invalid they are valid only a small fraction of their lifetime. So you
> > _must_ have some call to inform the exporting device driver that another
> > device would like to map one of its vma. The exporting device can then
> > try to avoid as much churn as possible for the importing device. But this
> > has consequence and the exporting device driver must be allow to apply
> > policy and make decission on wether or not it authorize the other device
> > to peer map its memory. For GPU the userspace application have to call
> > specific API that translate into specific ioctl which themself set flags
> > on object (in the kernel struct tracking the user space object). The only
> > way to allow program predictability is if the application can ask and know
> > if it can peer export an object (ie is there enough BAR space left).
> 
> This all seems like it's an HMM problem and not related to mapping
> BARs/"potential BARs" to userspace. If some code wants to DMA map HMM
> pages, it calls an HMM function to map them. If HMM needs to consult
> with the driver on aspects of how that's mapped, then that's between HMM
> and the driver and not something I really care about. But making the
> entire mapping stuff tied to userspace VMAs does not make sense to me.
> What if somebody wants to map some HMM pages in the same way but from
> kernel space and they therefore don't have a VMA?

No, this is the non-HMM case I am talking about here. Fully ignore HMM
in this frame. A GPU driver that does not support or use HMM in any way
has all the properties and requirements I list above. So all the points
I was making are without HMM in the picture whatsoever. I should have
posted this as separate patches to avoid this confusion.

Regarding your HMM question. You can not map HMM pages; all code paths
that would try that would trigger a migration back to regular memory
and will use the regular memory for CPU access.


> >> I also figured there'd be a fault version of p2p_ioremap_device_memory()
> >> for when you are mapping P2P memory and you want to assign the pages
> >> lazily. Though, this can come later when someone wants to implement that.
> > 
> > For GPU the BAR address space is manage page by page and thus you do not
> > want to map a range of BAR addresses but you want to allow mapping of
> > multiple page of BAR address that are not adjacent to each other nor
> > ordered in anyway. But providing helper for simpler device does make sense.
> 
> Well, this has little do with the backing device but how the memory is
> mapped into userspace. With p2p_ioremap_device_memory() the entire range
> is mapped into the userspace VMA immediately during the call to mmap().
> With p2p_fault_device_memory(), mmap() would not actually map anything
> and a page in the VMA would be mapped only when userspace accesses it
> (using fault()). It seems to me like GPUs would prefer the latter but if
> HMM takes care of the mapping from userspace potential pages to actual
> GPU pages through the BAR then that may not be true.

Again HMM has nothing to do here, ignore HMM, it does not play any role
and it is not involved in any way here. GPUs want to control what objects
they allow other devices to access and what objects they do not. GPU
drivers _constantly_ invalidate the CPU page table, and in fact the CPU
page table does not have any valid pte for a vma that is an mmap of a GPU
device file for most of the vma's lifetime. Changing that would highly
disrupt and break GPU drivers. They need to control that, they need to
control what to do if another device tries to peer map some of their
memory. Hence why they need to implement the callback and decide on
whether or not they allow the peer mapping or use device memory for it (t

Re: [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability

2019-01-29 Thread Jerome Glisse
On Tue, Jan 29, 2019 at 01:44:09PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-29 12:44 p.m., Greg Kroah-Hartman wrote:
> > On Tue, Jan 29, 2019 at 11:24:09AM -0700, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2019-01-29 10:47 a.m., jgli...@redhat.com wrote:
> >>> +bool pci_test_p2p(struct device *devA, struct device *devB)
> >>> +{
> >>> + struct pci_dev *pciA, *pciB;
> >>> + bool ret;
> >>> + int tmp;
> >>> +
> >>> + /*
> >>> +  * For now we only support PCIE peer to peer but other inter-connect
> >>> +  * can be added.
> >>> +  */
> >>> + pciA = find_parent_pci_dev(devA);
> >>> + pciB = find_parent_pci_dev(devB);
> >>> + if (pciA == NULL || pciB == NULL) {
> >>> + ret = false;
> >>> + goto out;
> >>> + }
> >>> +
> >>> + tmp = upstream_bridge_distance(pciA, pciB, NULL);
> >>> + ret = tmp < 0 ? false : true;
> >>> +
> >>> +out:
> >>> + pci_dev_put(pciB);
> >>> + pci_dev_put(pciA);
> >>> + return false;
> >>> +}
> >>> +EXPORT_SYMBOL_GPL(pci_test_p2p);
> >>
> >> This function only ever returns false
> > 
> > I guess it was nevr actually tested :(
> > 
> > I feel really worried about passing random 'struct device' pointers into
> > the PCI layer.  Are we _sure_ it can handle this properly?
> 
> Yes, there are a couple of pci_p2pdma functions that take struct devices
> directly simply because it's way more convenient for the caller. That's
> what find_parent_pci_dev() takes care of (it returns false if the device
> is not a PCI device). Whether that's appropriate here is hard to say
> seeing we haven't seen any caller code.

Caller code as a reference (I already gave that link in another part of
the thread, but just so that people don't have to follow all branches).

https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-p2p&id=401a567696eafb1d4faf7054ab0d7c3a16a5ef06

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-29 Thread Jerome Glisse
On Tue, Jan 29, 2019 at 01:39:49PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-29 12:32 p.m., Jason Gunthorpe wrote:
> > Jerome, I think it would be nice to have a helper scheme - I think the
> > simple case would be simple remapping of PCI BAR memory, so if we
> > could have, say something like:
> > 
> > static const struct vm_operations_struct my_ops {
> >   .p2p_map = p2p_ioremap_map_op,
> >   .p2p_unmap = p2p_ioremap_unmap_op,
> > }
> > 
> > struct ioremap_data {
> >   [..]
> > }
> > 
> > fops_mmap() {
> >vma->private_data = _priv->ioremap_data;
> >return p2p_ioremap_device_memory(vma, exporting_device, [..]);
> > }
> 
> This is roughly what I was expecting, except I don't see exactly what
> the p2p_map and p2p_unmap callbacks are for. The importing driver should
> see p2pdma/hmm struct pages and use the appropriate function to map
> them. It shouldn't be the responsibility of the exporting driver to
> implement the mapping. And I don't think we should have 'special' vma's
> for this (though we may need something to ensure we don't get mapping
> requests mixed with different types of pages...).

The GPU driver must be in control and must be called into. There are 2
cases in this patchset, and I should instead have posted 2 separate
patchsets as it seems that this is confusing things.

For the HMM page, the physical address of the page, ie the pfn, does not
correspond to anything, ie there is nothing behind it. So the importing
device has no idea how to get a valid physical address from an HMM page;
only the device driver exporting its memory with HMM device memory knows
that.


For the special vma, ie an mmap of a device file: GPU drivers do manage
their BAR, ie the GPU has a page table that maps BAR pages to GPU memory
and the driver _constantly_ updates this page table; this is reflected by
invalidating the CPU mapping. In fact most of the time the CPU mappings of
GPU objects are invalid; they are valid only a small fraction of their
lifetime. So you _must_ have some call to inform the exporting device
driver that another device would like to map one of its vmas. The
exporting device can then try to avoid as much churn as possible for the
importing device. But this has consequences, and the exporting device
driver must be allowed to apply policy and make decisions on whether or
not it authorizes the other device to peer map its memory. For GPU, the
userspace application has to call specific APIs that translate into
specific ioctls which themselves set flags on the object (in the kernel
struct tracking the user space object). The only way to allow program
predictability is if the application can ask and know if it can peer
export an object (ie is there enough BAR space left).

Moreover I would like to be able to use this API between GPUs that are
inter-connected with each other, and for those the CPU page tables are
just invalid and the physical addresses to use are only meaningful to the
exporting and importing devices. So again here the core kernel has no idea
what the physical address would be.


So in both cases, at the very least for GPU, we do want total control to
be given to the exporter.

> 
> I also figured there'd be a fault version of p2p_ioremap_device_memory()
> for when you are mapping P2P memory and you want to assign the pages
> lazily. Though, this can come later when someone wants to implement that.

For GPU the BAR address space is managed page by page and thus you do not
want to map a range of BAR addresses, but you want to allow mapping of
multiple pages of BAR addresses that are not adjacent to each other nor
ordered in any way. But providing helpers for simpler devices does make
sense.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-29 Thread Jerome Glisse
On Tue, Jan 29, 2019 at 08:24:36PM +, Jason Gunthorpe wrote:
> On Tue, Jan 29, 2019 at 02:50:55PM -0500, Jerome Glisse wrote:
> 
> > GPU driver do want more control :) GPU driver are moving things around
> > all the time and they have more memory than bar space (on newer platform
> > AMD GPU do resize the bar but it is not the rule for all GPUs). So
> > GPU driver do actualy manage their BAR address space and they map and
> > unmap thing there. They can not allow someone to just pin stuff there
> > randomly or this would disrupt their regular work flow. Hence they need
> > control and they might implement threshold for instance if they have
> > more than N pages of bar space map for peer to peer then they can decide
> > to fall back to main memory for any new peer mapping.
> 
> But this API doesn't seem to offer any control - I thought that
> control was all coming from the mm/hmm notifiers triggering p2p_unmaps?

The control is within the driver implementation of those callbacks. So
the driver implementation can refuse to map by returning an error on
p2p_map, or it can decide to use main memory by migrating its object to
main memory and populating the dma address array with dma_map_page() of
the main memory pages. Drivers like GPU can have policy on top of that;
for instance they will only allow a p2p map to succeed for objects that
have been tagged by the userspace in some way, ie the userspace
application is in control of what can be mapped to a peer device. This is
needed for GPU drivers as we do want userspace involvement on what objects
are allowed to have p2p access, and also so that we can report to
userspace when we are running out of BAR addresses for this to work as
intended (ie not falling back to main memory) so that the application can
take appropriate actions (like deciding what to prioritize).

For moving things around after a successful p2p_map, yes, the exporting
device has to call for instance zap_vma_ptes() or something similar.
This will trigger notifier calls and the importing device will invalidate
its mapping. Once it is invalidated then the exporting device can
point new calls to p2p_map (for the same range) at new memory (obviously
the exporting device has to synchronize any concurrent calls to p2p_map
with the invalidation).
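
As a rough sketch of what such an exporter callback could look like (all
the my_gpu_* helpers and the callback signature are hypothetical, this is
only to illustrate the refuse/fallback policy, not the actual RFC code):

    static int my_gpu_p2p_map(struct vm_area_struct *vma,
                              struct device *importer,
                              unsigned long start, unsigned long end,
                              dma_addr_t *dma_addrs)
    {
        struct my_gpu_object *obj = vma->vm_private_data;
        unsigned long i, npages = (end - start) >> PAGE_SHIFT;

        /* refuse outright if userspace never flagged this object for p2p */
        if (!obj->allow_p2p)
            return -EPERM;

        /* enough BAR space: hand out BAR addresses for the importer */
        if (my_gpu_bar_space_available(obj, npages))
            return my_gpu_map_bar(obj, importer, start, end, dma_addrs);

        /* out of BAR space: migrate the object to main memory and map that */
        if (my_gpu_migrate_to_sysmem(obj))
            return -ENOMEM;

        for (i = 0; i < npages; i++) {
            dma_addrs[i] = dma_map_page(importer, obj->sysmem_pages[i], 0,
                                        PAGE_SIZE, DMA_BIDIRECTIONAL);
            if (dma_mapping_error(importer, dma_addrs[i]))
                return -EIO;
        }
        return 0;
    }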

> 
> I would think that the importing driver can assume the BAR page is
> kept alive until it calls unmap (presumably triggered by notifiers)?
> 
> ie the exporting driver sees the BAR page as pinned until unmap.

The intention with this patchset is that it is not a pin, ie the importer
device _must_ abide by all mmu notifier invalidations and they can
happen at any time. The importing device can however re-p2p_map the
same range after an invalidation.

I would like to restrict this to importers that can invalidate for
now because I believe all the first devices to use it can support the
invalidation.

Also when using HMM private device memory we _can not_ pin a virtual
address to device memory, as otherwise CPU access would have to SIGBUS
or SEGFAULT and we do not want that. So this was also a motivation to
keep things consistent for the importer in both cases.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability

2019-01-29 Thread Jerome Glisse
On Tue, Jan 29, 2019 at 02:56:38PM -0500, Alex Deucher wrote:
> On Tue, Jan 29, 2019 at 12:47 PM  wrote:
> >
> > From: Jérôme Glisse 
> >
> > device_test_p2p() return true if two devices can peer to peer to
> > each other. We add a generic function as different inter-connect
> > can support peer to peer and we want to genericaly test this no
> > matter what the inter-connect might be. However this version only
> > support PCIE for now.
> >
> 
> What about something like these patches:
> https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=4fab9ff69cb968183f717551441b475fabce6c1c
> https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=f90b12d41c277335d08c9dab62433f27c0fadbe5
> They are a bit more thorough.

Yes, it would be better, I forgot about those. I can include them
next time I post. Thank you for reminding me about those :)

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH 2/5] drivers/base: add a function to test peer to peer capability

2019-01-29 Thread Jerome Glisse
On Tue, Jan 29, 2019 at 08:46:05PM +0100, Greg Kroah-Hartman wrote:
> On Tue, Jan 29, 2019 at 12:47:25PM -0500, jgli...@redhat.com wrote:
> > From: Jérôme Glisse 
> > 
> > device_test_p2p() return true if two devices can peer to peer to
> > each other. We add a generic function as different inter-connect
> > can support peer to peer and we want to genericaly test this no
> > matter what the inter-connect might be. However this version only
> > support PCIE for now.
> 
> There is no defintion of "peer to peer" in the driver/device model, so
> why should this be in the driver core at all?
> 
> Especially as you only do this for PCI, why not just keep it in the PCI
> layer, that way you _know_ you are dealing with the right pointer types
> and there is no need to mess around with the driver core at all.

Ok, I will drop the core device change. I wanted to allow other non-PCI
devices to join later on (ie allow a PCI device to export to a non-PCI device)
but if that ever happens then we can update the PCI exporter at the same
time we introduce the non-PCI importer.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH 2/5] drivers/base: add a function to test peer to peer capability

2019-01-29 Thread Jerome Glisse
On Tue, Jan 29, 2019 at 11:26:01AM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-29 10:47 a.m., jgli...@redhat.com wrote:
> > From: Jérôme Glisse 
> > 
> > device_test_p2p() return true if two devices can peer to peer to
> > each other. We add a generic function as different inter-connect
> > can support peer to peer and we want to genericaly test this no
> > matter what the inter-connect might be. However this version only
> > support PCIE for now.
> 
> This doesn't appear to be used in any of the further patches; so it's
> very confusing.
> 
> I'm not sure a struct device wrapper is really necessary...

I wanted to allow other non-PCI devices to join in the fun but
yes, right now I have only been doing this on PCI devices.

Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH 1/5] pci/p2p: add a function to test peer to peer capability

2019-01-29 Thread Jerome Glisse
On Tue, Jan 29, 2019 at 08:44:26PM +0100, Greg Kroah-Hartman wrote:
> On Tue, Jan 29, 2019 at 11:24:09AM -0700, Logan Gunthorpe wrote:
> > 
> > 
> > On 2019-01-29 10:47 a.m., jgli...@redhat.com wrote:
> > > +bool pci_test_p2p(struct device *devA, struct device *devB)
> > > +{
> > > + struct pci_dev *pciA, *pciB;
> > > + bool ret;
> > > + int tmp;
> > > +
> > > + /*
> > > +  * For now we only support PCIE peer to peer but other inter-connect
> > > +  * can be added.
> > > +  */
> > > + pciA = find_parent_pci_dev(devA);
> > > + pciB = find_parent_pci_dev(devB);
> > > + if (pciA == NULL || pciB == NULL) {
> > > + ret = false;
> > > + goto out;
> > > + }
> > > +
> > > + tmp = upstream_bridge_distance(pciA, pciB, NULL);
> > > + ret = tmp < 0 ? false : true;
> > > +
> > > +out:
> > > + pci_dev_put(pciB);
> > > + pci_dev_put(pciA);
> > > + return false;
> > > +}
> > > +EXPORT_SYMBOL_GPL(pci_test_p2p);
> > 
> > This function only ever returns false
> 
> I guess it was nevr actually tested :(
> 
> I feel really worried about passing random 'struct device' pointers into
> the PCI layer.  Are we _sure_ it can handle this properly?
> 

Oh yes, I fixed it on the test rig and forgot to patch
my local git tree. My bad.
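
For reference, the fix is presumably just to return the computed result;
a sketch of the corrected function (otherwise the same as the posted patch):

bool pci_test_p2p(struct device *devA, struct device *devB)
{
        struct pci_dev *pciA, *pciB;
        bool ret = false;
        int tmp;

        /* For now we only support PCIE peer to peer. */
        pciA = find_parent_pci_dev(devA);
        pciB = find_parent_pci_dev(devB);
        if (pciA == NULL || pciB == NULL)
                goto out;

        tmp = upstream_bridge_distance(pciA, pciB, NULL);
        ret = tmp < 0 ? false : true;

out:
        pci_dev_put(pciB);
        pci_dev_put(pciA);
        return ret;     /* was unconditionally "return false" */
}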

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-29 Thread Jerome Glisse
On Tue, Jan 29, 2019 at 07:32:57PM +, Jason Gunthorpe wrote:
> On Tue, Jan 29, 2019 at 02:11:23PM -0500, Jerome Glisse wrote:
> > On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
> > > 
> > > 
> > > On 2019-01-29 10:47 a.m., jgli...@redhat.com wrote:
> > > 
> > > > +   /*
> > > > +* Optional for device driver that want to allow peer to peer 
> > > > (p2p)
> > > > +* mapping of their vma (which can be back by some device 
> > > > memory) to
> > > > +* another device.
> > > > +*
> > > > +* Note that the exporting device driver might not have map 
> > > > anything
> > > > +* inside the vma for the CPU but might still want to allow a 
> > > > peer
> > > > +* device to access the range of memory corresponding to a 
> > > > range in
> > > > +* that vma.
> > > > +*
> > > > +* FOR PREDICTABILITY IF DRIVER SUCCESSFULY MAP A RANGE ONCE 
> > > > FOR A
> > > > +* DEVICE THEN FURTHER MAPPING OF THE SAME IF THE VMA IS STILL 
> > > > VALID
> > > > +* SHOULD ALSO BE SUCCESSFUL. Following this rule allow the 
> > > > importing
> > > > +* device to map once during setup and report any failure at 
> > > > that time
> > > > +* to the userspace. Further mapping of the same range might 
> > > > happen
> > > > +* after mmu notifier invalidation over the range. The 
> > > > exporting device
> > > > +* can use this to move things around (defrag BAR space for 
> > > > instance)
> > > > +* or do other similar task.
> > > > +*
> > > > +* IMPORTER MUST OBEY mmu_notifier NOTIFICATION AND CALL 
> > > > p2p_unmap()
> > > > +* WHEN A NOTIFIER IS CALL FOR THE RANGE ! THIS CAN HAPPEN AT 
> > > > ANY
> > > > +* POINT IN TIME WITH NO LOCK HELD.
> > > > +*
> > > > +* In below function, the device argument is the importing 
> > > > device,
> > > > +* the exporting device is the device to which the vma belongs.
> > > > +*/
> > > > +   long (*p2p_map)(struct vm_area_struct *vma,
> > > > +   struct device *device,
> > > > +   unsigned long start,
> > > > +   unsigned long end,
> > > > +   dma_addr_t *pa,
> > > > +   bool write);
> > > > +   long (*p2p_unmap)(struct vm_area_struct *vma,
> > > > + struct device *device,
> > > > + unsigned long start,
> > > > + unsigned long end,
> > > > + dma_addr_t *pa);
> > > 
> > > I don't understand why we need new p2p_[un]map function pointers for
> > > this. In subsequent patches, they never appear to be set anywhere and
> > > are only called by the HMM code. I'd have expected it to be called by
> > > some core VMA code and set by HMM as that's what vm_operations_struct is
> > > for.
> > > 
> > > But the code as all very confusing, hard to follow and seems to be
> > > missing significant chunks. So I'm not really sure what is going on.
> > 
> > It is set by device driver when userspace do mmap(fd) where fd comes
> > from open("/dev/somedevicefile"). So it is set by device driver. HMM
> > has nothing to do with this. It must be set by device driver mmap
> > call back (mmap callback of struct file_operations). For this patch
> > you can completely ignore all the HMM patches. Maybe posting this as
> > 2 separate patchset would make it clearer.
> > 
> > For instance see [1] for how a non HMM driver can export its memory
> > by just setting those callback. Note that a proper implementation of
> > this should also include some kind of driver policy on what to allow
> > to map and what to not allow ... All this is driver specific in any
> > way.
> 
> I'm imagining that the RDMA drivers would use this interface on their
> per-process 'doorbell' BAR pages - we also wish to have P2P DMA to
> this memory. Also the entire VFIO PCI BAR mmap would be good to cover
> with this too.

Correct, you would set those callbacks in the mmap of your doorbell.

> 
> J

Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-29 Thread Jerome Glisse
On Tue, Jan 29, 2019 at 12:24:04PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-29 12:11 p.m., Jerome Glisse wrote:
> > On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2019-01-29 10:47 a.m., jgli...@redhat.com wrote:
> >>
> >>> + /*
> >>> +  * Optional for device driver that want to allow peer to peer (p2p)
> >>> +  * mapping of their vma (which can be back by some device memory) to
> >>> +  * another device.
> >>> +  *
> >>> +  * Note that the exporting device driver might not have map anything
> >>> +  * inside the vma for the CPU but might still want to allow a peer
> >>> +  * device to access the range of memory corresponding to a range in
> >>> +  * that vma.
> >>> +  *
> >>> +  * FOR PREDICTABILITY IF DRIVER SUCCESSFULY MAP A RANGE ONCE FOR A
> >>> +  * DEVICE THEN FURTHER MAPPING OF THE SAME IF THE VMA IS STILL VALID
> >>> +  * SHOULD ALSO BE SUCCESSFUL. Following this rule allow the importing
> >>> +  * device to map once during setup and report any failure at that time
> >>> +  * to the userspace. Further mapping of the same range might happen
> >>> +  * after mmu notifier invalidation over the range. The exporting device
> >>> +  * can use this to move things around (defrag BAR space for instance)
> >>> +  * or do other similar task.
> >>> +  *
> >>> +  * IMPORTER MUST OBEY mmu_notifier NOTIFICATION AND CALL p2p_unmap()
> >>> +  * WHEN A NOTIFIER IS CALL FOR THE RANGE ! THIS CAN HAPPEN AT ANY
> >>> +  * POINT IN TIME WITH NO LOCK HELD.
> >>> +  *
> >>> +  * In below function, the device argument is the importing device,
> >>> +  * the exporting device is the device to which the vma belongs.
> >>> +  */
> >>> + long (*p2p_map)(struct vm_area_struct *vma,
> >>> + struct device *device,
> >>> + unsigned long start,
> >>> + unsigned long end,
> >>> + dma_addr_t *pa,
> >>> + bool write);
> >>> + long (*p2p_unmap)(struct vm_area_struct *vma,
> >>> +   struct device *device,
> >>> +   unsigned long start,
> >>> +   unsigned long end,
> >>> +   dma_addr_t *pa);
> >>
> >> I don't understand why we need new p2p_[un]map function pointers for
> >> this. In subsequent patches, they never appear to be set anywhere and
> >> are only called by the HMM code. I'd have expected it to be called by
> >> some core VMA code and set by HMM as that's what vm_operations_struct is
> >> for.
> >>
> >> But the code as all very confusing, hard to follow and seems to be
> >> missing significant chunks. So I'm not really sure what is going on.
> > 
> > It is set by device driver when userspace do mmap(fd) where fd comes
> > from open("/dev/somedevicefile"). So it is set by device driver. HMM
> > has nothing to do with this. It must be set by device driver mmap
> > call back (mmap callback of struct file_operations). For this patch
> > you can completely ignore all the HMM patches. Maybe posting this as
> > 2 separate patchset would make it clearer.
> > 
> > For instance see [1] for how a non HMM driver can export its memory
> > by just setting those callback. Note that a proper implementation of
> > this should also include some kind of driver policy on what to allow
> > to map and what to not allow ... All this is driver specific in any
> > way.
> 
> I'd suggest [1] should be a part of the patchset so we can actually see
> a user of the stuff you're adding.

I did not want to clutter the patchset with device-driver-specific usage
of this, as the API can be reasoned about in an abstract way.

> 
> But it still doesn't explain everything as without the HMM code nothing
> calls the new vm_ops. And there's still no callers for the p2p_test
> functions you added. And I still don't understand why we need the new
> vm_ops or who calls them and when. Why can't drivers use the existing
> 'fault' vm_op and call a new helper function to map p2p when appropriate
> or a different helper function to map a large range in its mmap
> operation? Just like regular mmap code...

HMM code is just one user: if you have a driver that uses HMM mirror
then your driver gets support for this for free. If you do not want to
use HMM then you can directly call this in your driver.

The flow is, devic

Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

2019-01-29 Thread Jerome Glisse
On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-01-29 10:47 a.m., jgli...@redhat.com wrote:
> 
> > +   /*
> > +* Optional for device driver that want to allow peer to peer (p2p)
> > +* mapping of their vma (which can be back by some device memory) to
> > +* another device.
> > +*
> > +* Note that the exporting device driver might not have map anything
> > +* inside the vma for the CPU but might still want to allow a peer
> > +* device to access the range of memory corresponding to a range in
> > +* that vma.
> > +*
> > +* FOR PREDICTABILITY IF DRIVER SUCCESSFULY MAP A RANGE ONCE FOR A
> > +* DEVICE THEN FURTHER MAPPING OF THE SAME IF THE VMA IS STILL VALID
> > +* SHOULD ALSO BE SUCCESSFUL. Following this rule allow the importing
> > +* device to map once during setup and report any failure at that time
> > +* to the userspace. Further mapping of the same range might happen
> > +* after mmu notifier invalidation over the range. The exporting device
> > +* can use this to move things around (defrag BAR space for instance)
> > +* or do other similar task.
> > +*
> > +* IMPORTER MUST OBEY mmu_notifier NOTIFICATION AND CALL p2p_unmap()
> > +* WHEN A NOTIFIER IS CALL FOR THE RANGE ! THIS CAN HAPPEN AT ANY
> > +* POINT IN TIME WITH NO LOCK HELD.
> > +*
> > +* In below function, the device argument is the importing device,
> > +* the exporting device is the device to which the vma belongs.
> > +*/
> > +   long (*p2p_map)(struct vm_area_struct *vma,
> > +   struct device *device,
> > +   unsigned long start,
> > +   unsigned long end,
> > +   dma_addr_t *pa,
> > +   bool write);
> > +   long (*p2p_unmap)(struct vm_area_struct *vma,
> > + struct device *device,
> > + unsigned long start,
> > + unsigned long end,
> > + dma_addr_t *pa);
> 
> I don't understand why we need new p2p_[un]map function pointers for
> this. In subsequent patches, they never appear to be set anywhere and
> are only called by the HMM code. I'd have expected it to be called by
> some core VMA code and set by HMM as that's what vm_operations_struct is
> for.
> 
> But the code as all very confusing, hard to follow and seems to be
> missing significant chunks. So I'm not really sure what is going on.

It is set by the device driver when userspace does mmap(fd) where fd comes
from open("/dev/somedevicefile"). So it is set by the device driver; HMM
has nothing to do with this. It must be set by the device driver's mmap
callback (the mmap callback of struct file_operations). For this patch
you can completely ignore all the HMM patches. Maybe posting this as
2 separate patchsets would make it clearer.

For instance see [1] for how a non-HMM driver can export its memory
by just setting those callbacks (a minimal sketch is also below). Note
that a proper implementation of this should also include some kind of
driver policy on what to allow to map and what not to allow ... All
this is driver specific anyway.
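
A minimal sketch of that wiring, with hypothetical mydev_* names and assuming
the p2p_map/p2p_unmap fields proposed by this RFC:

static const struct vm_operations_struct mydev_vm_ops = {
        .fault = mydev_vma_fault,       /* regular CPU access, if any */
        .p2p_map = mydev_p2p_map,       /* proposed in this RFC */
        .p2p_unmap = mydev_p2p_unmap,   /* proposed in this RFC */
};

static int mydev_mmap(struct file *file, struct vm_area_struct *vma)
{
        vma->vm_private_data = file->private_data;
        vma->vm_ops = &mydev_vm_ops;
        return 0;
}

static const struct file_operations mydev_fops = {
        .owner = THIS_MODULE,
        .mmap = mydev_mmap,
        /* open, release, ioctl, ... */
};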

Cheers,
Jérôme

[1] 
https://cgit.freedesktop.org/~glisse/linux/commit/?h=wip-p2p-showcase=964214dcd4df96f296e2214042e8cfce135ae3d4
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH v4 8/9] gpu/drm/i915: optimize out the case when a range is updated to read only

2019-01-29 Thread Jerome Glisse
On Tue, Jan 29, 2019 at 04:20:00PM +0200, Joonas Lahtinen wrote:
> Quoting Jerome Glisse (2019-01-24 17:30:32)
> > On Thu, Jan 24, 2019 at 02:09:12PM +0200, Joonas Lahtinen wrote:
> > > Hi Jerome,
> > > 
> > > This patch seems to have plenty of Cc:s, but none of the right ones :)
> > 
> > So sorry, i am bad with git commands.
> > 
> > > For further iterations, I guess you could use git option --cc to make
> > > sure everyone gets the whole series, and still keep the Cc:s in the
> > > patches themselves relevant to subsystems.
> > 
> > Will do.
> > 
> > > This doesn't seem to be on top of drm-tip, but on top of your previous
> > > patches(?) that I had some comments about. Could you take a moment to
> > > first address the couple of question I had, before proceeding to discuss
> > > what is built on top of that base.
> > 
> > It is on top of Linus tree so roughly ~ rc3 it does not depend on any
> > of the previous patch i posted.
> 
> You actually managed to race a point in time just when Chris rewrote much
> of the userptr code in drm-tip, which I didn't remember of. My bad.
> 
> Still interested to hearing replies to my questions in the previous
> thread, if the series is still relevant. Trying to get my head around
> how the different aspects of HMM pan out for devices without fault handling.

HMM mirror does not need page fault handling for everything; in fact,
for user ptr you can use HMM mirror without page fault support in hardware.
The page fault requirement is more like a __very__ nice to have feature.

So sorry, I missed that mail, I must have had it in the middle of bugzilla
spam and deleted it. So here is a paste of it and an answer. This was for a
patch to convert i915 to use HMM mirror instead of having i915 do its
own thing with GUP (get_user_page).

> Bit late reply, but here goes :)
>
> We're working quite hard to avoid pinning any pages unless they're in
> the GPU page tables. And when they are in the GPU page tables, they must
> be pinned for whole of that duration, for the reason that our GPUs can
> not take a fault. And to avoid thrashing GPU page tables, we do leave
> objects in page tables with the expectation that smart userspace
> recycles buffers.

You do not need to pin the page because you obey the mmu notifier, ie
it is perfectly fine for you to keep the page mapped into the GPU until
you get an mmu notifier callback for the range of virtual addresses.

The pin from GUP in fact does not protect you from anything. GUP is
really misleading: by the time GUP returns, the page you get might not
correspond to the memory backing the virtual address anymore.

In i915 code this is not an issue because you synchronize against the
mmu notifier callback.

So my intention in converting GPU drivers from GUP to HMM mirror is
just to avoid the useless page pin. As long as you obey the mmu
notifier callback (or the HMM sync page table callback) then you are
fine. Schematically that looks like the sketch below.
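
A sketch of what obeying the notifier looks like, with hypothetical
my_gpu_* names (no real driver API implied):

/* On invalidation: wait for GPU work touching the range, then unmap it
 * from the GPU page table. No page refcount is ever taken, so reclaim,
 * compaction and migration are never blocked by the mapping itself. */
static void my_gpu_invalidate(struct my_gpu_mirror *mirror,
                              unsigned long start, unsigned long end)
{
        my_gpu_wait_range_idle(mirror, start, end);
        my_gpu_unmap_range(mirror, start, end);
        /* A later GPU access re-faults (or the driver re-maps) the range. */
}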

> So what I understand of your proposal, it wouldn't really make a
> difference for us in the amount of pinned pages (which I agree,
> we'd love to see going down). When we're unable to take a fault,
> the first use effectively forces us to pin any pages and keep them
> pinned to avoid thrashing GPU page tables.

With HMM there is no pin, we never pin the page, ie we never increment
the refcount on the page, as it is useless to do so if you abide by the
mmu notifier. Again, the pin GUP takes is misleading: it does not block
mm events.

However, without a pin and still abiding by the mmu notifier you will not
see any difference in thrashing, ie in the number of times you will get a
mmu notifier callback, as those callbacks really happen for good reasons:
for instance running out of memory and the kernel trying to reclaim, or
because userspace did a syscall that affects the range of virtual addresses.

This should not happen in a regular workload, and when it happens the pin
from GUP will not inhibit it either. In the end you will get the exact
same amount of thrashing, but you will inhibit things like memory compaction
or migration, while HMM does not block those (ie HMM is a good citizen ;)
while GUP users are not).

Also we are in the process of changing GUP, and GUP will now have a more
profound impact on filesystems and mm (inhibiting and breaking some of
the filesystem behavior). Converting GPU drivers to HMM will avoid those
adverse impacts, and it is one of the motivations behind my crusade to
convert all GUP users that abide by the mmu notifier to use HMM instead.


> So from i915 perspective, it just seems to be mostly an exchange of
> an API to an another for getting the pages. You already mentioned
> the fast path is being worked on, which is an obvious difference.
> But is there some other improvement one would be expecting, beyond
> the page pinning?

So for HMM i have a bunch

Re: [PATCH v4 8/9] gpu/drm/i915: optimize out the case when a range is updated to read only

2019-01-24 Thread Jerome Glisse
On Thu, Jan 24, 2019 at 02:09:12PM +0200, Joonas Lahtinen wrote:
> Hi Jerome,
> 
> This patch seems to have plenty of Cc:s, but none of the right ones :)

So sorry, I am bad with git commands.

> For further iterations, I guess you could use git option --cc to make
> sure everyone gets the whole series, and still keep the Cc:s in the
> patches themselves relevant to subsystems.

Will do.

> This doesn't seem to be on top of drm-tip, but on top of your previous
> patches(?) that I had some comments about. Could you take a moment to
> first address the couple of question I had, before proceeding to discuss
> what is built on top of that base.

It is on top of Linus' tree, so roughly ~rc3; it does not depend on any
of the previous patches I posted. I still intend to propose removing
GUP from i915 once I get around to implementing the equivalent of GUP_fast
for HMM, and other bonus cookies with it.

The plan is, once I have all the mm bits properly upstream, then I can propose
patches to individual drivers against the proper driver tree, ie following
the rules of each individual device driver sub-system and Cc'ing only people
there to avoid spamming the mm folks :)


> 
> My reply's Message-ID is:
> 154289518994.19402.3481838548028068...@jlahtine-desk.ger.corp.intel.com
> 
> Regards, Joonas
> 
> PS. Please keep me Cc:d in the following patches, I'm keen on
> understanding the motive and benefits.
> 
> Quoting jgli...@redhat.com (2019-01-24 00:23:14)
> > From: Jérôme Glisse 
> > 
> > When range of virtual address is updated read only and corresponding
> > user ptr object are already read only it is pointless to do anything.
> > Optimize this case out.
> > 
> > Signed-off-by: Jérôme Glisse 
> > Cc: Christian König 
> > Cc: Jan Kara 
> > Cc: Felix Kuehling 
> > Cc: Jason Gunthorpe 
> > Cc: Andrew Morton 
> > Cc: Matthew Wilcox 
> > Cc: Ross Zwisler 
> > Cc: Dan Williams 
> > Cc: Paolo Bonzini 
> > Cc: Radim Krčmář 
> > Cc: Michal Hocko 
> > Cc: Ralph Campbell 
> > Cc: John Hubbard 
> > Cc: k...@vger.kernel.org
> > Cc: dri-devel@lists.freedesktop.org
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Arnd Bergmann 
> > ---
> >  drivers/gpu/drm/i915/i915_gem_userptr.c | 16 
> >  1 file changed, 16 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c 
> > b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > index 9558582c105e..23330ac3d7ea 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > @@ -59,6 +59,7 @@ struct i915_mmu_object {
> > struct interval_tree_node it;
> > struct list_head link;
> > struct work_struct work;
> > +   bool read_only;
> > bool attached;
> >  };
> >  
> > @@ -119,6 +120,7 @@ static int 
> > i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > container_of(_mn, struct i915_mmu_notifier, mn);
> > struct i915_mmu_object *mo;
> > struct interval_tree_node *it;
> > +   bool update_to_read_only;
> > LIST_HEAD(cancelled);
> > unsigned long end;
> >  
> > @@ -128,6 +130,8 @@ static int 
> > i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > /* interval ranges are inclusive, but invalidate range is exclusive 
> > */
> > end = range->end - 1;
> >  
> > +   update_to_read_only = mmu_notifier_range_update_to_read_only(range);
> > +
> > spin_lock(>lock);
> > it = interval_tree_iter_first(>objects, range->start, end);
> > while (it) {
> > @@ -145,6 +149,17 @@ static int 
> > i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> >  * object if it is not in the process of being destroyed.
> >  */
> > mo = container_of(it, struct i915_mmu_object, it);
> > +
> > +   /*
> > +* If it is already read only and we are updating to
> > +* read only then we do not need to change anything.
> > +* So save time and skip this one.
> > +*/
> > +   if (update_to_read_only && mo->read_only) {
> > +   it = interval_tree_iter_next(it, range->start, end);
> > +   continue;
> > +   }
> > +
> > if (kref_get_unless_zero(>obj->base.refcount))
> > queue_work(mn->wq, >work);
> >  
> > @@ -270,6 +285,7 @@ i915_gem_userptr_init__mmu_notifier(struct 
> > drm_i915_gem_object *obj,
> > mo->mn = mn;
> > mo->obj = obj;
> > mo->it.start = obj->userptr.ptr;
> > +   mo->read_only = i915_gem_object_is_readonly(obj);
> > mo->it.last = obj->userptr.ptr + obj->base.size - 1;
> > INIT_WORK(>work, cancel_userptr);
> >  
> > -- 
> > 2.17.2
> > 
> > ___
> > dri-devel mailing list
> > dri-devel@lists.freedesktop.org
> > 

Re: [PATCH v4 0/9] mmu notifier provide context informations

2019-01-23 Thread Jerome Glisse
On Wed, Jan 23, 2019 at 02:54:40PM -0800, Dan Williams wrote:
> On Wed, Jan 23, 2019 at 2:23 PM  wrote:
> >
> > From: Jérôme Glisse 
> >
> > Hi Andrew, i see that you still have my event patch in you queue [1].
> > This patchset replace that single patch and is broken down in further
> > step so that it is easier to review and ascertain that no mistake were
> > made during mechanical changes. Here are the step:
> >
> > Patch 1 - add the enum values
> > Patch 2 - coccinelle semantic patch to convert all call site of
> >   mmu_notifier_range_init to default enum value and also
> >   to passing down the vma when it is available
> > Patch 3 - update many call site to more accurate enum values
> > Patch 4 - add the information to the mmu_notifier_range struct
> > Patch 5 - helper to test if a range is updated to read only
> >
> > All the remaining patches are update to various driver to demonstrate
> > how this new information get use by device driver. I build tested
> > with make all and make all minus everything that enable mmu notifier
> > ie building with MMU_NOTIFIER=no. Also tested with some radeon,amd
> > gpu and intel gpu.
> >
> > If they are no objections i believe best plan would be to merge the
> > the first 5 patches (all mm changes) through your queue for 5.1 and
> > then to delay driver update to each individual driver tree for 5.2.
> > This will allow each individual device driver maintainer time to more
> > thouroughly test this more then my own testing.
> >
> > Note that i also intend to use this feature further in nouveau and
> > HMM down the road. I also expect that other user like KVM might be
> > interested into leveraging this new information to optimize some of
> > there secondary page table invalidation.
> 
> "Down the road" users should introduce the functionality they want to
> consume. The common concern with preemptively including
> forward-looking infrastructure is realizing later that the
> infrastructure is not needed, or needs changing. If it has no current
> consumer, leave it out.

This patchset already shows that this is useful, what more can I do?
I know I will use this information: in nouveau, for memory policy, we
allocate our own structure for every vma the GPU ever accessed or that
userspace hinted we should set a policy for. Right now, with the existing
mmu notifier, I _must_ free those structures because I do not know if
the invalidation is a munmap or something else. So I am losing
important information and unnecessarily freeing structs that I will have
to re-allocate just a couple of jiffies later (see the sketch below).
That's one way I am using this. The other way is to optimize GPU page
table updates, just like I am doing with all the patches to RDMA/ODP and
various GPU drivers.
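
A sketch of that first use, with hypothetical policy_* names; the event
field is the one this series adds to the range struct:

/* Per-vma policy structures survive CLEAR/PROTECTION invalidations and
 * are freed only on a real munmap/mremap. */
static void policy_range_invalidate(struct policy_mirror *m,
                                    const struct mmu_notifier_range *range)
{
        policy_unmap(m, range->start, range->end);
        if (range->event == MMU_NOTIFY_UNMAP)
                policy_free(m, range->start, range->end);
        /* otherwise keep the struct, it will be re-used shortly */
}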


> > Here is an explaination on the rational for this patchset:
> >
> >
> > CPU page table update can happens for many reasons, not only as a result
> > of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also
> > as a result of kernel activities (memory compression, reclaim, migration,
> > ...).
> >
> > This patch introduce a set of enums that can be associated with each of
> > the events triggering a mmu notifier. Latter patches take advantages of
> > those enum values.
> >
> > - UNMAP: munmap() or mremap()
> > - CLEAR: page table is cleared (migration, compaction, reclaim, ...)
> > - PROTECTION_VMA: change in access protections for the range
> > - PROTECTION_PAGE: change in access protections for page in the range
> > - SOFT_DIRTY: soft dirtyness tracking
> >
> > Being able to identify munmap() and mremap() from other reasons why the
> > page table is cleared is important to allow user of mmu notifier to
> > update their own internal tracking structure accordingly (on munmap or
> > mremap it is not longer needed to track range of virtual address as it
> > becomes invalid).
> 
> The only context information consumed in this patch set is
> MMU_NOTIFY_PROTECTION_VMA.
> 
> What is the practical benefit of these "optimize out the case when a
> range is updated to read only" optimizations? Any numbers to show this
> is worth the code thrash?

It depends on the workload. For instance, if you map to RDMA a file
read only, like a log file exported for reading, all the writeback that
would otherwise disrupt the RDMA mapping can be optimized out.

See above for more reasons why it is beneficial (knowing when it is
a munmap/mremap versus something else).

I would not have thought that passing down this information was something
that controversial. Hope this helps you see the benefit of this.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH v4 9/9] RDMA/umem_odp: optimize out the case when a range is updated to read only

2019-01-23 Thread Jerome Glisse
On Wed, Jan 23, 2019 at 10:32:00PM +, Jason Gunthorpe wrote:
> On Wed, Jan 23, 2019 at 05:23:15PM -0500, jgli...@redhat.com wrote:
> > From: Jérôme Glisse 
> > 
> > When range of virtual address is updated read only and corresponding
> > user ptr object are already read only it is pointless to do anything.
> > Optimize this case out.
> > 
> > Signed-off-by: Jérôme Glisse 
> > Cc: Christian König 
> > Cc: Jan Kara 
> > Cc: Felix Kuehling 
> > Cc: Jason Gunthorpe 
> > Cc: Andrew Morton 
> > Cc: Matthew Wilcox 
> > Cc: Ross Zwisler 
> > Cc: Dan Williams 
> > Cc: Paolo Bonzini 
> > Cc: Radim Krčmář 
> > Cc: Michal Hocko 
> > Cc: Ralph Campbell 
> > Cc: John Hubbard 
> > Cc: k...@vger.kernel.org
> > Cc: dri-devel@lists.freedesktop.org
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Arnd Bergmann 
> >  drivers/infiniband/core/umem_odp.c | 22 +++---
> >  include/rdma/ib_umem_odp.h |  1 +
> >  2 files changed, 20 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/infiniband/core/umem_odp.c 
> > b/drivers/infiniband/core/umem_odp.c
> > index a4ec43093cb3..fa4e7fdcabfc 100644
> > +++ b/drivers/infiniband/core/umem_odp.c
> > @@ -140,8 +140,15 @@ static void ib_umem_notifier_release(struct 
> > mmu_notifier *mn,
> >  static int invalidate_range_start_trampoline(struct ib_umem_odp *item,
> >  u64 start, u64 end, void *cookie)
> >  {
> > +   bool update_to_read_only = *((bool *)cookie);
> > +
> > ib_umem_notifier_start_account(item);
> > -   item->umem.context->invalidate_range(item, start, end);
> > +   /*
> > +* If it is already read only and we are updating to read only then we
> > +* do not need to change anything. So save time and skip this one.
> > +*/
> > +   if (!update_to_read_only || !item->read_only)
> > +   item->umem.context->invalidate_range(item, start, end);
> > return 0;
> >  }
> >  
> > @@ -150,6 +157,7 @@ static int 
> > ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
> >  {
> > struct ib_ucontext_per_mm *per_mm =
> > container_of(mn, struct ib_ucontext_per_mm, mn);
> > +   bool update_to_read_only;
> >  
> > if (range->blockable)
> > down_read(_mm->umem_rwsem);
> > @@ -166,10 +174,13 @@ static int 
> > ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
> > return 0;
> > }
> >  
> > +   update_to_read_only = mmu_notifier_range_update_to_read_only(range);
> > +
> > return rbt_ib_umem_for_each_in_range(_mm->umem_tree, range->start,
> >  range->end,
> >  invalidate_range_start_trampoline,
> > -range->blockable, NULL);
> > +range->blockable,
> > +_to_read_only);
> >  }
> >  
> >  static int invalidate_range_end_trampoline(struct ib_umem_odp *item, u64 
> > start,
> > @@ -363,6 +374,9 @@ struct ib_umem_odp *ib_alloc_odp_umem(struct 
> > ib_ucontext_per_mm *per_mm,
> > goto out_odp_data;
> > }
> >  
> > +   /* Assume read only at first, each time GUP is call this is updated. */
> > +   odp_data->read_only = true;
> > +
> > odp_data->dma_list =
> > vzalloc(array_size(pages, sizeof(*odp_data->dma_list)));
> > if (!odp_data->dma_list) {
> > @@ -619,8 +633,10 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp 
> > *umem_odp, u64 user_virt,
> > goto out_put_task;
> > }
> >  
> > -   if (access_mask & ODP_WRITE_ALLOWED_BIT)
> > +   if (access_mask & ODP_WRITE_ALLOWED_BIT) {
> > +   umem_odp->read_only = false;
> 
> No locking?

The mmu notifier exclusion will ensure that it is not missed, ie it
will be false before any mmu notifier might be called on a page GUPed
with the write flag, which is what matters here. So locks are useless
here.

> 
> > flags |= FOLL_WRITE;
> > +   }
> >  
> > start_idx = (user_virt - ib_umem_start(umem)) >> page_shift;
> > k = start_idx;
> > diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
> > index 0b1446fe2fab..8256668c6170 100644
> > +++ b/include/rdma/ib_umem_odp.h
> > @@ -76,6 +76,7 @@ struct ib_umem_odp {
> > struct completion   notifier_completion;
> > int dying;
> > struct work_struct  work;
> > +   bool read_only;
> >  };
> 
> The ib_umem already has a writeable flag. This reflects if the user
> asked for write permission to be granted.. The tracking here is if any
> remote fault thus far has requested write, is this an important
> difference to justify the new flag?

I did that patch a couple of weeks ago and now I do not remember why
I did not use that; I remember thinking about it ... damn, I need
to keep better notes. I will review the code again.

Cheers,
Jérôme
___
dri-devel mailing list

Re: [PATCH v2 1/3] mm/mmu_notifier: use structure for invalidate_range_start/end callback

2018-12-07 Thread Jerome Glisse
On Thu, Dec 06, 2018 at 08:30:27PM -0700, Jason Gunthorpe wrote:
> On Wed, Dec 05, 2018 at 12:36:26AM -0500, jgli...@redhat.com wrote:
> > From: Jérôme Glisse 
> > 
> > To avoid having to change many callback definition everytime we want
> > to add a parameter use a structure to group all parameters for the
> > mmu_notifier invalidate_range_start/end callback. No functional changes
> > with this patch.
> > 
> > Signed-off-by: Jérôme Glisse 
> > Cc: Andrew Morton 
> > Cc: Matthew Wilcox 
> > Cc: Ross Zwisler 
> > Cc: Jan Kara 
> > Cc: Dan Williams 
> > Cc: Paolo Bonzini 
> > Cc: Radim Krčmář 
> > Cc: Michal Hocko 
> > Cc: Christian Koenig 
> > Cc: Felix Kuehling 
> > Cc: Ralph Campbell 
> > Cc: John Hubbard 
> > Cc: k...@vger.kernel.org
> > Cc: dri-devel@lists.freedesktop.org
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-fsde...@vger.kernel.org
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 43 +++--
> >  drivers/gpu/drm/i915/i915_gem_userptr.c | 14 
> >  drivers/gpu/drm/radeon/radeon_mn.c  | 16 -
> >  drivers/infiniband/core/umem_odp.c  | 20 +---
> >  drivers/infiniband/hw/hfi1/mmu_rb.c | 13 +++-
> >  drivers/misc/mic/scif/scif_dma.c| 11 ++-
> >  drivers/misc/sgi-gru/grutlbpurge.c  | 14 
> >  drivers/xen/gntdev.c| 12 +++
> >  include/linux/mmu_notifier.h| 14 +---
> >  mm/hmm.c| 23 ++---
> >  mm/mmu_notifier.c   | 21 ++--
> >  virt/kvm/kvm_main.c | 14 +++-
> >  12 files changed, 102 insertions(+), 113 deletions(-)
> 
> The changes to drivers/infiniband look mechanical and fine to me.
> 
> It even looks like this avoids merge conflicts with the other changes
> to these files :)
> 
> For infiniband:
> 
> Acked-by: Jason Gunthorpe 
> 
> I assume this will go through the mm tree?

Yes, this is my expectation, as in the end it touches more mm
stuff than anything else. Andrew already added v1 to his
patchset.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 3/3] mm/mmu_notifier: contextual information for event triggering invalidation

2018-12-06 Thread Jerome Glisse
Should all be fixed in v2; I built with and without mmu notifier and
did not have any issue in v2.

On Fri, Dec 07, 2018 at 05:19:21AM +0800, kbuild test robot wrote:
> Hi Jérôme,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.20-rc5]
> [cannot apply to next-20181206]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/jglisse-redhat-com/mmu-notifier-contextual-informations/20181207-031930
> config: x86_64-randconfig-x017-201848 (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=x86_64 
> 
> All errors (new ones prefixed by >>):
> 
>fs///proc/task_mmu.c: In function 'clear_refs_write':
>fs///proc/task_mmu.c:1099:29: error: storage size of 'range' isn't known
>   struct mmu_notifier_range range;
> ^
> >> fs///proc/task_mmu.c:1147:18: error: 'MMU_NOTIFY_SOFT_DIRTY' undeclared 
> >> (first use in this function); did you mean 'CLEAR_REFS_SOFT_DIRTY'?
>range.event = MMU_NOTIFY_SOFT_DIRTY;
>  ^
>  CLEAR_REFS_SOFT_DIRTY
>fs///proc/task_mmu.c:1147:18: note: each undeclared identifier is reported 
> only once for each function it appears in
>fs///proc/task_mmu.c:1099:29: warning: unused variable 'range' 
> [-Wunused-variable]
>   struct mmu_notifier_range range;
> ^
> 
> vim +1147 fs///proc/task_mmu.c
> 
>   1069
>   1070static ssize_t clear_refs_write(struct file *file, const char 
> __user *buf,
>   1071size_t count, loff_t *ppos)
>   1072{
>   1073struct task_struct *task;
>   1074char buffer[PROC_NUMBUF];
>   1075struct mm_struct *mm;
>   1076struct vm_area_struct *vma;
>   1077enum clear_refs_types type;
>   1078struct mmu_gather tlb;
>   1079int itype;
>   1080int rv;
>   1081
>   1082memset(buffer, 0, sizeof(buffer));
>   1083if (count > sizeof(buffer) - 1)
>   1084count = sizeof(buffer) - 1;
>   1085if (copy_from_user(buffer, buf, count))
>   1086return -EFAULT;
>   1087rv = kstrtoint(strstrip(buffer), 10, );
>   1088if (rv < 0)
>   1089return rv;
>   1090type = (enum clear_refs_types)itype;
>   1091if (type < CLEAR_REFS_ALL || type >= CLEAR_REFS_LAST)
>   1092return -EINVAL;
>   1093
>   1094task = get_proc_task(file_inode(file));
>   1095if (!task)
>   1096return -ESRCH;
>   1097mm = get_task_mm(task);
>   1098if (mm) {
> > 1099struct mmu_notifier_range range;
>   1100struct clear_refs_private cp = {
>   1101.type = type,
>   1102};
>   1103struct mm_walk clear_refs_walk = {
>   1104.pmd_entry = clear_refs_pte_range,
>   1105.test_walk = clear_refs_test_walk,
>   1106.mm = mm,
>   1107.private = ,
>   1108};
>   1109
>   1110if (type == CLEAR_REFS_MM_HIWATER_RSS) {
>   if (down_write_killable(>mmap_sem)) 
> {
>   1112count = -EINTR;
>   1113goto out_mm;
>   1114}
>   1115
>   1116/*
>   1117 * Writing 5 to /proc/pid/clear_refs 
> resets the peak
>   1118 * resident set size to this mm's 
> current rss value.
>   1119 */
>   1120reset_mm_hiwater_rss(mm);
>   1121up_write(>mmap_sem);
>   1122goto out_mm;
>   1123}
>   1124
>   1125down_read(>mmap_sem);
>   1126tlb_gather_mmu(, mm, 0, -1);
>   1127if (type == CLEAR_REFS_SOFT_DIRTY) {
>   1128for (vma = mm->mmap; vma; vma = 
> vma->vm_next) {
>   1129if (!(vma->vm_flags & 
> VM_SOFTDIRTY))
>   1130continue;
>   1131   

Re: [PATCH] dma-buf: fix debugfs versus rcu and fence dumping

2018-12-06 Thread Jerome Glisse
On Thu, Dec 06, 2018 at 04:08:12PM +, Koenig, Christian wrote:
> Am 06.12.18 um 16:21 schrieb Jerome Glisse:
> > On Thu, Dec 06, 2018 at 08:09:28AM +, Koenig, Christian wrote:
> >> Am 06.12.18 um 02:41 schrieb jgli...@redhat.com:
> >>> From: Jérôme Glisse 
> >>>
> >>> The debugfs take reference on fence without dropping them. Also the
> >>> rcu section are not well balance. Fix all that ...
> >>>
> >>> Signed-off-by: Jérôme Glisse 
> >>> Cc: Christian König 
> >>> Cc: Daniel Vetter 
> >>> Cc: Sumit Semwal 
> >>> Cc: linux-me...@vger.kernel.org
> >>> Cc: dri-devel@lists.freedesktop.org
> >>> Cc: linaro-mm-...@lists.linaro.org
> >>> Cc: Stéphane Marchesin 
> >>> Cc: sta...@vger.kernel.org
> >> Well NAK, you are now taking the RCU lock twice and dropping the RCU and
> >> still accessing fobj has a huge potential for accessing freed up memory.
> >>
> >> The only correct thing I can see here is to grab a reference to the
> >> fence before printing any info on it,
> >> Christian.
> > Hu ? That is exactly what i am doing, take reference under rcu,
> > rcu_unlock print the fence info, drop the fence reference, rcu
> > lock rinse and repeat ...
> >
> > Note that the fobj in _existing_ code is access outside the rcu
> > end that there is an rcu imbalance in that code ie a lonlely
> > rcu_unlock after the for loop.
> >
> > So that the existing code is broken.
> 
> No, the existing code is perfectly fine.
> 
> Please note the break in the loop before the rcu_unlock();
> > if (!read_seqcount_retry(>seq, seq))
> > break; <- HERE!
> > rcu_read_unlock();
> > }
> 
> So your patch breaks that and take the RCU read lock twice.

Ok, I missed that. I wonder if the refcount imbalance explains
the crash that was reported to me ... I sent a patch just for
that.

Thank you for reviewing and pointing out the code I was
oblivious to :)

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] dma-buf: fix debugfs versus rcu and fence dumping

2018-12-06 Thread Jerome Glisse
On Thu, Dec 06, 2018 at 08:09:28AM +, Koenig, Christian wrote:
> Am 06.12.18 um 02:41 schrieb jgli...@redhat.com:
> > From: Jérôme Glisse 
> >
> > The debugfs take reference on fence without dropping them. Also the
> > rcu section are not well balance. Fix all that ...
> >
> > Signed-off-by: Jérôme Glisse 
> > Cc: Christian König 
> > Cc: Daniel Vetter 
> > Cc: Sumit Semwal 
> > Cc: linux-me...@vger.kernel.org
> > Cc: dri-devel@lists.freedesktop.org
> > Cc: linaro-mm-...@lists.linaro.org
> > Cc: Stéphane Marchesin 
> > Cc: sta...@vger.kernel.org
> 
> Well NAK, you are now taking the RCU lock twice and dropping the RCU and 
> still accessing fobj has a huge potential for accessing freed up memory.
> 
> The only correct thing I can see here is to grab a reference to the 
> fence before printing any info on it,
> Christian.

Huh? That is exactly what I am doing: take a reference under rcu,
rcu_unlock, print the fence info, drop the fence reference, rcu
lock, rinse and repeat ...

Note that the fobj in the _existing_ code is accessed outside the rcu,
and that there is an rcu imbalance in that code, ie a lonely
rcu_unlock after the for loop.

So the existing code is broken.

> 
> > ---
> >   drivers/dma-buf/dma-buf.c | 11 +--
> >   1 file changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > index 13884474d158..f6f4de42ac49 100644
> > --- a/drivers/dma-buf/dma-buf.c
> > +++ b/drivers/dma-buf/dma-buf.c
> > @@ -1051,24 +1051,31 @@ static int dma_buf_debug_show(struct seq_file *s, 
> > void *unused)
> > fobj = rcu_dereference(robj->fence);
> > shared_count = fobj ? fobj->shared_count : 0;
> > fence = rcu_dereference(robj->fence_excl);
> > +   fence = dma_fence_get_rcu(fence);
> > if (!read_seqcount_retry(>seq, seq))
> > break;
> > rcu_read_unlock();
> > }
> > -
> > -   if (fence)
> > +   if (fence) {
> > seq_printf(s, "\tExclusive fence: %s %s %ssignalled\n",
> >fence->ops->get_driver_name(fence),
> >fence->ops->get_timeline_name(fence),
> >dma_fence_is_signaled(fence) ? "" : "un");
> > +   dma_fence_put(fence);
> > +   }
> > +
> > +   rcu_read_lock();
> > for (i = 0; i < shared_count; i++) {
> > fence = rcu_dereference(fobj->shared[i]);
> > if (!dma_fence_get_rcu(fence))
> > continue;
> > +   rcu_read_unlock();
> > seq_printf(s, "\tShared fence: %s %s %ssignalled\n",
> >fence->ops->get_driver_name(fence),
> >fence->ops->get_timeline_name(fence),
> >dma_fence_is_signaled(fence) ? "" : "un");
> > +   dma_fence_put(fence);
> > +   rcu_read_lock();
> > }
> > rcu_read_unlock();
> >   
> 
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH v2 1/3] mm/mmu_notifier: use structure for invalidate_range_start/end callback

2018-12-05 Thread Jerome Glisse
On Wed, Dec 05, 2018 at 09:42:45PM +, Kuehling, Felix wrote:
> The amdgpu part looks good to me.
> 
> A minor nit-pick in mmu_notifier.c (inline).
> 
> Either way, the series is Acked-by: Felix Kuehling 
> 
> On 2018-12-05 12:36 a.m., jgli...@redhat.com wrote:
> > From: Jérôme Glisse 
> >
> > To avoid having to change many callback definition everytime we want
> > to add a parameter use a structure to group all parameters for the
> > mmu_notifier invalidate_range_start/end callback. No functional changes
> > with this patch.
> >
> > Signed-off-by: Jérôme Glisse 
> > Cc: Andrew Morton 
> > Cc: Matthew Wilcox 
> > Cc: Ross Zwisler 
> > Cc: Jan Kara 
> > Cc: Dan Williams 
> > Cc: Paolo Bonzini 
> > Cc: Radim Krčmář 
> > Cc: Michal Hocko 
> > Cc: Christian Koenig 
> > Cc: Felix Kuehling 
> > Cc: Ralph Campbell 
> > Cc: John Hubbard 
> > Cc: k...@vger.kernel.org
> > Cc: dri-devel@lists.freedesktop.org
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-fsde...@vger.kernel.org
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 43 +++--
> >  drivers/gpu/drm/i915/i915_gem_userptr.c | 14 
> >  drivers/gpu/drm/radeon/radeon_mn.c  | 16 -
> >  drivers/infiniband/core/umem_odp.c  | 20 +---
> >  drivers/infiniband/hw/hfi1/mmu_rb.c | 13 +++-
> >  drivers/misc/mic/scif/scif_dma.c| 11 ++-
> >  drivers/misc/sgi-gru/grutlbpurge.c  | 14 
> >  drivers/xen/gntdev.c| 12 +++
> >  include/linux/mmu_notifier.h| 14 +---
> >  mm/hmm.c| 23 ++---
> >  mm/mmu_notifier.c   | 21 ++--
> >  virt/kvm/kvm_main.c | 14 +++-
> >  12 files changed, 102 insertions(+), 113 deletions(-)
> >
> [snip]
> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> > index 5119ff846769..5f6665ae3ee2 100644
> > --- a/mm/mmu_notifier.c
> > +++ b/mm/mmu_notifier.c
> > @@ -178,14 +178,20 @@ int __mmu_notifier_invalidate_range_start(struct 
> > mm_struct *mm,
> >   unsigned long start, unsigned long end,
> >   bool blockable)
> >  {
> > +   struct mmu_notifier_range _range, *range = &_range;
> 
> I'm not sure why you need to access _range indirectly through a pointer.
> See below.
> 
> 
> > struct mmu_notifier *mn;
> > int ret = 0;
> > int id;
> >  
> > +   range->blockable = blockable;
> > +   range->start = start;
> > +   range->end = end;
> > +   range->mm = mm;
> 
> This could just assign _range.blockable, _range.start, etc. without the
> indirection. Or you could even use an initializer instead:
> 
> struct mmu_notifier_range range = {
>     .blockable = blockable,
>     .start = start,
>     ...
> };
> 
> 
> > +
> > id = srcu_read_lock();
> > hlist_for_each_entry_rcu(mn, >mmu_notifier_mm->list, hlist) {
> > if (mn->ops->invalidate_range_start) {
> > -   int _ret = mn->ops->invalidate_range_start(mn, mm, 
> > start, end, blockable);
> > +   int _ret = mn->ops->invalidate_range_start(mn, range);
> 
> This could just use &_range without the indirection.
> 
> Same in ..._invalidate_range_end below.

So the explanation is that this is a temporary step: all this code is
removed in the second patch. It was done this way in this patch to
minimize the diff within the next patch.

I did this because I wanted to do the conversion in 2 steps: in the
first step I convert all the listeners of the mmu notifier, and in the
second step I convert all the call sites that trigger a mmu notifier.

I did that to help people review only the part they care about.

Apparently it ended up confusing people more than it helped :)

Do people have strong feelings about getting this code, which is
deleted in the second patch, fixed in the first patch anyway?

I can respin if so, but I don't see much value in formatting code
that is deleted in the series.

Thank you for reviewing

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH v2 1/3] mm/mmu_notifier: use structure for invalidate_range_start/end callback

2018-12-05 Thread Jerome Glisse
On Wed, Dec 05, 2018 at 05:35:20PM +0100, Jan Kara wrote:
> On Wed 05-12-18 00:36:26, jgli...@redhat.com wrote:
> > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> > index 5119ff846769..5f6665ae3ee2 100644
> > --- a/mm/mmu_notifier.c
> > +++ b/mm/mmu_notifier.c
> > @@ -178,14 +178,20 @@ int __mmu_notifier_invalidate_range_start(struct 
> > mm_struct *mm,
> >   unsigned long start, unsigned long end,
> >   bool blockable)
> >  {
> > +   struct mmu_notifier_range _range, *range = &_range;
> 
> Why these games with two variables?

This is a temporary step. I decided to do the conversion in 2 steps:
first I convert the callbacks to use the structure, so that people
having a mmu notifier callback only have to review this patch and do
not get distracted by the second step, which updates all the mm call
sites that trigger invalidation.

In the final result this code disappears. I did it that way to make
the thing more reviewable. Sorry if that is a bit confusing.

> 
> > struct mmu_notifier *mn;
> > int ret = 0;
> > int id;
> >  
> > +   range->blockable = blockable;
> > +   range->start = start;
> > +   range->end = end;
> > +   range->mm = mm;
> > +
> 
> Use your init function for this?

This gets removed in the next patch; I can respin with the init
function, but this is a temporary step, as explained above.

> 
> > id = srcu_read_lock();
> > hlist_for_each_entry_rcu(mn, >mmu_notifier_mm->list, hlist) {
> > if (mn->ops->invalidate_range_start) {
> > -   int _ret = mn->ops->invalidate_range_start(mn, mm, 
> > start, end, blockable);
> > +   int _ret = mn->ops->invalidate_range_start(mn, range);
> > if (_ret) {
> > pr_info("%pS callback failed with %d in 
> > %sblockable context.\n",
> > 
> > mn->ops->invalidate_range_start, _ret,
> > @@ -205,9 +211,20 @@ void __mmu_notifier_invalidate_range_end(struct 
> > mm_struct *mm,
> >  unsigned long end,
> >  bool only_end)
> >  {
> > +   struct mmu_notifier_range _range, *range = &_range;
> > struct mmu_notifier *mn;
> > int id;
> >  
> > +   /*
> > +* The end call back will never be call if the start refused to go
> > +* through because of blockable was false so here assume that we
> > +* can block.
> > +*/
> > +   range->blockable = true;
> > +   range->start = start;
> > +   range->end = end;
> > +   range->mm = mm;
> > +
> 
> The same as above.
> 
> Otherwise the patch looks good to me.

Thank you for reviewing.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 2/3] mm/mmu_notifier: use structure for invalidate_range_start/end calls

2018-12-05 Thread Jerome Glisse
On Wed, Dec 05, 2018 at 12:04:16PM +0100, Jan Kara wrote:
> Hi Jerome!
> 
> On Mon 03-12-18 15:18:16, jgli...@redhat.com wrote:
> > From: Jérôme Glisse 
> > 
> > To avoid having to change many call sites everytime we want to add a
> > parameter use a structure to group all parameters for the mmu_notifier
> > invalidate_range_start/end cakks. No functional changes with this
> > patch.
> 
> Two suggestions for the patch below:
> 
> > @@ -772,7 +775,8 @@ static void dax_entry_mkclean(struct address_space 
> > *mapping, pgoff_t index,
> >  * call mmu_notifier_invalidate_range_start() on our behalf
> >  * before taking any lock.
> >  */
> > -   if (follow_pte_pmd(vma->vm_mm, address, , , , 
> > , ))
> > +   if (follow_pte_pmd(vma->vm_mm, address, ,
> > +  , , ))
> > continue;
> 
> The change of follow_pte_pmd() arguments looks unexpected. Why should that
> care about mmu notifier range? I see it may be convenient but it doesn't look
> like a good API to me.

Sadly, I do not see a way around that one. This is because fs/dax.c
does the mmu_notifier_invalidate_range_end while follow_pte_pmd
does the mmu_notifier_invalidate_range_start.

follow_pte_pmd adjusts the start and end addresses, so that the dax
function does not need the logic to find those addresses. So instead of
duplicating that follow_pte_pmd logic inside the dax code, I rather passed
the range struct down to follow_pte_pmd.

> 
> > @@ -1139,11 +1140,15 @@ static ssize_t clear_refs_write(struct file *file, 
> > const char __user *buf,
> > downgrade_write(>mmap_sem);
> > break;
> > }
> > -   mmu_notifier_invalidate_range_start(mm, 0, -1);
> > +
> > +   range.start = 0;
> > +   range.end = -1UL;
> > +   range.mm = mm;
> > +   mmu_notifier_invalidate_range_start();
> 
> Also how about providing initializer for struct mmu_notifier_range? Or
> something like DECLARE_MMU_NOTIFIER_RANGE? That will make sure that
> unused arguments for particular notification places have defined values and
> also if you add another mandatory argument (like you do in your third
> patch), you just add another argument to the initializer and that way
> the compiler makes sure you haven't missed any place. Finally the code will
> remain more compact that way (less lines needed to initialize the struct).

That is what I do in v2 :) Roughly something of this shape (a sketch only,
the details may differ in v2):
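
/* Group the assignments behind one helper so call sites stay compact
 * and a new mandatory field is caught by the compiler everywhere. */
static inline void mmu_notifier_range_init(struct mmu_notifier_range *range,
                                           struct mm_struct *mm,
                                           unsigned long start,
                                           unsigned long end)
{
        range->mm = mm;
        range->start = start;
        range->end = end;
}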

Thank you for looking at all this.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 3/3] mm/mmu_notifier: contextual information for event triggering invalidation

2018-12-04 Thread Jerome Glisse
On Tue, Dec 04, 2018 at 10:17:48AM +0200, Mike Rapoport wrote:
> On Mon, Dec 03, 2018 at 03:18:17PM -0500, jgli...@redhat.com wrote:
> > From: Jérôme Glisse 

[...]

> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index cbeece8e47d4..3077d487be8b 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -25,10 +25,43 @@ struct mmu_notifier_mm {
> > spinlock_t lock;
> >  };
> > 
> > +/*
> > + * What event is triggering the invalidation:
> 
> Can you please make it kernel-doc comment?

Sorry, I should have done that in the first place. Andrew, I will post a v2
with that and with my one stupid bug fixed.



> > + *
> > + * MMU_NOTIFY_UNMAP
> > + *either munmap() that unmap the range or a mremap() that move the 
> > range
> > + *
> > + * MMU_NOTIFY_CLEAR
> > + *clear page table entry (many reasons for this like madvise() or 
> > replacing
> > + *a page by another one, ...).
> > + *
> > + * MMU_NOTIFY_PROTECTION_VMA
> > + *update is due to protection change for the range ie using the vma 
> > access
> > + *permission (vm_page_prot) to update the whole range is enough no 
> > need to
> > + *inspect changes to the CPU page table (mprotect() syscall)
> > + *
> > + * MMU_NOTIFY_PROTECTION_PAGE
> > + *update is due to change in read/write flag for pages in the range so 
> > to
> > + *mirror those changes the user must inspect the CPU page table (from 
> > the
> > + *end callback).
> > + *
> > + *
> > + * MMU_NOTIFY_SOFT_DIRTY
> > + *soft dirty accounting (still same page and same access flags)
> > + */
> > +enum mmu_notifier_event {
> > +   MMU_NOTIFY_UNMAP = 0,
> > +   MMU_NOTIFY_CLEAR,
> > +   MMU_NOTIFY_PROTECTION_VMA,
> > +   MMU_NOTIFY_PROTECTION_PAGE,
> > +   MMU_NOTIFY_SOFT_DIRTY,
> > +};
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 2/3] mm/mmu_notifier: use structure for invalidate_range_start/end calls

2018-12-03 Thread Jerome Glisse
On Mon, Dec 03, 2018 at 03:18:16PM -0500, jgli...@redhat.com wrote:
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f7e4bfdc13b7..4896dd9d8b28 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2303,8 +2303,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   */
>  static void migrate_vma_collect(struct migrate_vma *migrate)
>  {
> + struct mmu_notifier_range range;
>   struct mm_walk mm_walk;
>  
> + range.start = migrate->start;
> + range.end = migrate->end;
> + range.mm = mm_walk.mm;

Andrew, can you replace the above line with:

+   range.mm = migrate->vma->vm_mm;

I made a mistake here when I was rebasing before posting. I checked
the patchset again and I believe this is the only mistake I made.

Do you want me to repost ?

Sorry for my stupid mistake.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH v8 0/7] mm: Merge hmm into devm_memremap_pages, mark GPL-only

2018-12-03 Thread Jerome Glisse
On Wed, Nov 21, 2018 at 05:20:55PM -0800, Andrew Morton wrote:
> On Tue, 20 Nov 2018 15:12:49 -0800 Dan Williams  
> wrote:

[...]

> > I am also concerned that HMM was designed in a way to minimize further
> > engagement with the core-MM. That, with these hooks in place,
> > device-drivers are free to implement their own policies without much
> > consideration for whether and how the core-MM could grow to meet that
> > need. Going forward not only should HMM be EXPORT_SYMBOL_GPL, but the
> > core-MM should be allowed the opportunity and stimulus to change and
> > address these new use cases as first class functionality.
> > 
> 
> The arguments are compelling.  I apologize for not thinking of and/or
> not being made aware of them at the time.

So I wanted to comment on that part. Yes, HMM is an impedance layer
that goes both ways, ie device drivers are shielded from core mm, and
core mm folks do not need to understand individual drivers to modify
mm; they only need to understand what is provided to the driver by
HMM (and keep the HMM promise intact from the driver POV, no matter how
it is achieved). So this is intentional.

Nonetheless I want to grow core mm involvement in managing those
memories (see the patchset I just posted about hbind() and heterogeneous
memory systems). But I do not expect that core mm will be in full
control, at least not for some time. The historical reason is that
devices like GPUs are not only used for compute (which is where HMM
gets used) but also for graphics (simple desktop or even games).
Those are two different workloads using different APIs (CUDA/OpenCL
for compute, OpenGL/Vulkan for graphics) on the same underlying
hardware.

Those APIs expose the hardware in incompatible ways when it comes to
memory management (especially APIs like Vulkan). Managing memory
page-wise is not well suited for graphics. The issue comes from the
fact that we do not want to exclude either workload from happening
concurrently (running your desktop while some compute job is running
in the background). So for this to work we need to keep the device
driver in control of its memory (hence the callback when pages are
freed, for instance). We also need to forbid things like pinning any
device memory pages ...

I still expect some commonality to emerge across different hardware
so that we can grow more things and share more code into core mm, but
I want to get there organically, not force everyone into a design
today. I expect this will happen by going from the high level concept,
how things get used in userspace from the end user POV, and working
backward from there to see what common API (if any) we can provide to
cater to those common use cases.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Jerome Glisse
On Fri, Aug 24, 2018 at 06:40:03PM +0200, Michal Hocko wrote:
> On Fri 24-08-18 11:12:40, Jerome Glisse wrote:
> [...]
> > I am fine with Michal patch, i already said so couple month ago first time
> > this discussion did pop up, Michal you can add:
> > 
> > Reviewed-by: Jérôme Glisse 
> 
> So I guess the below is the patch you were talking about?
> 
> From f7ac75277d526dccd011f343818dc6af627af2af Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Fri, 24 Aug 2018 15:32:24 +0200
> Subject: [PATCH] mm, mmu_notifier: be explicit about range invalition
>  non-blocking mode
> 
> If invalidate_range_start is called for !blocking mode then all
> callbacks have to guarantee they will no block/sleep. The same obviously
> applies to invalidate_range_end because this operation pairs with the
> former and they are called from the same context. Make sure this is
> appropriately documented.

In my branch I already updated HMM to be like the other existing users,
i.e. all blocking operations in the start callback. But yes, it would
be wise to add such a comment.
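
To illustrate the direction (a rough sketch only, not the final code from
my branch): keep everything that may block in the start callback and bail
out when called in non-blocking mode:

static int hmm_invalidate_range_start(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start,
				      unsigned long end,
				      bool blockable)
{
	struct hmm *hmm = mm->hmm;

	if (!blockable)
		return -EAGAIN;

	/* All potentially blocking work (mirror callbacks, ...) goes here. */
	return 0;
}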


> 
> Signed-off-by: Michal Hocko 
> ---
>  include/linux/mmu_notifier.h | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 133ba78820ee..698e371aafe3 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
>*
>* If blockable argument is set to false then the callback cannot
>* sleep and has to return with -EAGAIN. 0 should be returned
> -  * otherwise.
> +  * otherwise. Please note that if invalidate_range_start approves
> +  * a non-blocking behavior then the same applies to
> +  * invalidate_range_end.
>*
>*/
>   int (*invalidate_range_start)(struct mmu_notifier *mn,
> -- 
> 2.18.0
> 
> -- 
> Michal Hocko
> SUSE Labs
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Jerome Glisse
On Fri, Aug 24, 2018 at 11:52:25PM +0900, Tetsuo Handa wrote:
> On 2018/08/24 22:32, Michal Hocko wrote:
> > On Fri 24-08-18 22:02:23, Tetsuo Handa wrote:
> >> I worry that (currently
> >> out-of-tree) users of this API are involving work / recursion.
> > 
> > I do not give a slightest about out-of-tree modules. They will have to
> > accomodate to the new API. I have no problems to extend the
> > documentation and be explicit about this expectation.
> 
> You don't need to care about out-of-tree modules. But you need to hear from
> mm/hmm.c authors/maintainers when making changes for mmu-notifiers.
> 
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index 133ba78820ee..698e371aafe3 100644
> > --- a/include/linux/mmu_notifier.h
> > +++ b/include/linux/mmu_notifier.h
> > @@ -153,7 +153,9 @@ struct mmu_notifier_ops {
> >  *
> >  * If blockable argument is set to false then the callback cannot
> >  * sleep and has to return with -EAGAIN. 0 should be returned
> > -* otherwise.
> > +* otherwise. Please note that if invalidate_range_start approves
> > +* a non-blocking behavior then the same applies to
> > +* invalidate_range_end.
> 
> Prior to 93065ac753e44438 ("mm, oom: distinguish blockable mode for mmu
> notifiers"), whether to utilize MMU_INVALIDATE_DOES_NOT_BLOCK was up to
> mmu-notifiers users.
> 
>   -* If both of these callbacks cannot block, and invalidate_range
>   -* cannot block, mmu_notifier_ops.flags should have
>   -* MMU_INVALIDATE_DOES_NOT_BLOCK set.
>   +* If blockable argument is set to false then the callback 
> cannot
>   +* sleep and has to return with -EAGAIN. 0 should be returned
>   +* otherwise.
> 
> Even out-of-tree mmu-notifiers users had rights not to accommodate (i.e.
> make changes) immediately by not setting MMU_INVALIDATE_DOES_NOT_BLOCK.
> 
> Now we are in a merge window. And we noticed a possibility that out-of-tree
> mmu-notifiers users might have trouble with making changes immediately in 
> order
> to follow 93065ac753e44438 if expectation for mm/hmm.c changes immediately.
> And you are trying to ignore such possibility by just updating expected 
> behavior
> description instead of giving out-of-tree users a grace period to check and 
> update
> their code.

The intention is that 99% of HMM users will be upstream; as long as they
are not, people shouldn't worry. We have been working on nouveau to use it
for the last year or so. Many bits were added in 4.16, 4.17, 4.18 and I
hope it will all be there in the 4.20/4.21 timeframe.

See my other mail for a list of the other users.

> 
> >> and keeps "all operations protected by hmm->mirrors_sem held for write are
> >> atomic". This suggests that "some operations protected by hmm->mirrors_sem 
> >> held
> >> for read will sleep (and in the worst case involves memory allocation
> >> dependency)".
> > 
> > Yes and so what? The clear expectation is that neither of the range
> > notifiers do not sleep in !blocking mode. I really fail to see what you
> > are trying to say.
> 
> I'm saying "Get ACK from Jérôme about mm/hmm.c changes".

I am fine with Michal's patch, I already said so a couple of months ago the
first time this discussion popped up. Michal, you can add:

Reviewed-by: Jérôme Glisse 
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Jerome Glisse
On Fri, Aug 24, 2018 at 02:33:41PM +0200, Michal Hocko wrote:
> On Fri 24-08-18 14:18:44, Christian König wrote:
> > Am 24.08.2018 um 14:03 schrieb Michal Hocko:
> > > On Fri 24-08-18 13:57:52, Christian König wrote:
> > > > Am 24.08.2018 um 13:52 schrieb Michal Hocko:
> > > > > On Fri 24-08-18 13:43:16, Christian König wrote:
> > > [...]
> > > > > > That won't work like this there might be multiple
> > > > > > invalidate_range_start()/invalidate_range_end() pairs open at the 
> > > > > > same time.
> > > > > > E.g. the lock might be taken recursively and that is illegal for a
> > > > > > rw_semaphore.
> > > > > I am not sure I follow. Are you saying that one invalidate_range might
> > > > > trigger another one from the same path?
> > > > No, but what can happen is:
> > > > 
> > > > invalidate_range_start(A,B);
> > > > invalidate_range_start(C,D);
> > > > ...
> > > > invalidate_range_end(C,D);
> > > > invalidate_range_end(A,B);
> > > > 
> > > > Grabbing the read lock twice would be illegal in this case.
> > > I am sorry but I still do not follow. What is the context the two are
> > > called from?
> > 
> > I don't have the slightest idea.
> > 
> > > Can you give me an example. I simply do not see it in the
> > > code, mostly because I am not familiar with it.
> > 
> > I'm neither.
> > 
> > We stumbled over that by pure observation and after discussing the problem
> > with Jerome came up with this solution.
> > 
> > No idea where exactly that case comes from, but I can confirm that it indeed
> > happens.
> 
> Thiking about it some more, I can imagine that a notifier callback which
> performs an allocation might trigger a memory reclaim and that in turn
> might trigger a notifier to be invoked and recurse. But notifier
> shouldn't really allocate memory. They are called from deep MM code
> paths and this would be extremely deadlock prone. Maybe Jerome can come
> up some more realistic scenario. If not then I would propose to simplify
> the locking here. We have lockdep to catch self deadlocks and it is
> always better to handle a specific issue rather than having a code
> without a clear indication how it can recurse.

Multiple concurrent mmu notifiers, for overlapping ranges or not, are
common (each concurrent thread can trigger some). So you might have
multiple invalidate_range_start() calls in flight for the same mm, and
they might complete (invalidate_range_end()) in a different order. IIRC
this is what this lock was trying to protect against.

I can't think of a reason for a recursive mmu notifier call right now.
I will ponder it and see if I remember something about it.
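
To make the hazard concrete, a hypothetical driver pattern (simplified
signatures, helper names made up, not taken from any in-tree driver) that
holds a read lock across the start/end pair would look like this; if the
nested sequence quoted above ever happened within one task, that task
would call down_read() twice on the same rw_semaphore, which can deadlock
as soon as a writer queues in between:

static void drv_invalidate_range_start(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start, unsigned long end)
{
	struct drv_mirror *m = container_of(mn, struct drv_mirror, mn);

	down_read(&m->update_sem);	/* taken here ... */
	drv_stop_jobs(m, start, end);
}

static void drv_invalidate_range_end(struct mmu_notifier *mn,
				     struct mm_struct *mm,
				     unsigned long start, unsigned long end)
{
	struct drv_mirror *m = container_of(mn, struct drv_mirror, mn);

	drv_update_device_page_table(m, start, end);
	up_read(&m->update_sem);	/* ... released only here */
}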

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-08-24 Thread Jerome Glisse
On Fri, Aug 24, 2018 at 07:54:19PM +0900, Tetsuo Handa wrote:
> Two more worries for this patch.

[...]

> 
> > --- a/mm/hmm.c
> > +++ b/mm/hmm.c
> > @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, 
> > struct mm_struct *mm)
> > up_write(>mirrors_sem);
> >  }
> > 
> > -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> > +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
> >struct mm_struct *mm,
> >unsigned long start,
> > -  unsigned long end)
> > +  unsigned long end,
> > +  bool blockable)
> >  {
> > struct hmm *hmm = mm->hmm;
> > 
> > VM_BUG_ON(!hmm);
> > 
> > atomic_inc(>sequence);
> > +
> > +   return 0;
> >  }
> > 
> >  static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> 
> This assumes that hmm_invalidate_range_end() does not have memory
> allocation dependency. But hmm_invalidate_range() from
> hmm_invalidate_range_end() involves
> 
> down_read(>mirrors_sem);
> list_for_each_entry(mirror, >mirrors, list)
> mirror->ops->sync_cpu_device_pagetables(mirror, action,
> start, end);
> up_read(>mirrors_sem);
> 
> sequence. What is surprising is that there is no in-tree user who assigns
> sync_cpu_device_pagetables field.
> 
>   $ grep -Fr sync_cpu_device_pagetables *
>   Documentation/vm/hmm.rst: /* sync_cpu_device_pagetables() - synchronize 
> page tables
>   include/linux/hmm.h: * will get callbacks through 
> sync_cpu_device_pagetables() operation (see
>   include/linux/hmm.h:/* sync_cpu_device_pagetables() - synchronize page 
> tables
>   include/linux/hmm.h:void (*sync_cpu_device_pagetables)(struct 
> hmm_mirror *mirror,
>   include/linux/hmm.h: * hmm_mirror_ops.sync_cpu_device_pagetables() 
> callback, so that CPU page
>   mm/hmm.c:   mirror->ops->sync_cpu_device_pagetables(mirror, 
> action,
> 
> That is, this API seems to be currently used by only out-of-tree users. Since
> we can't check that nobody has memory allocation dependency, I think that
> hmm_invalidate_range_start() should return -EAGAIN if blockable == false for 
> now.

So you can see the updates and users of this here:

https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-intel-v00
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-nouveau-v01
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-radeon-v00

I am still working on the Mellanox and AMD GPU patchsets.

I will post the HMM changes that adapt to Michal's change shortly, as those
have anyway been sufficiently tested by now.

https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-4.20=78785dcb5ba0924c2c5e7be027793f99ebbc39f3
https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-4.20=4fc25571dc893f2b278e90cda9e71e139e01de70

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-06-22 Thread Jerome Glisse
On Fri, Jun 22, 2018 at 06:42:43PM +0200, Michal Hocko wrote:
> [Resnding with the CC list fixed]
> 
> On Fri 22-06-18 18:40:26, Michal Hocko wrote:
> > On Fri 22-06-18 12:18:46, Jerome Glisse wrote:
> > > On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote:
> > > > On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > > > > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > > > > Hi,
> > > > > > this is an RFC and not tested at all. I am not very familiar with 
> > > > > > the
> > > > > > mmu notifiers semantics very much so this is a crude attempt to 
> > > > > > achieve
> > > > > > what I need basically. It might be completely wrong but I would like
> > > > > > to discuss what would be a better way if that is the case.
> > > > > > 
> > > > > > get_maintainers gave me quite large list of people to CC so I had 
> > > > > > to trim
> > > > > > it down. If you think I have forgot somebody, please let me know
> > > > > 
> > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c 
> > > > > > b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > index 854bd51b9478..5285df9331fa 100644
> > > > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object 
> > > > > > *mo)
> > > > > > mo->attached = false;
> > > > > >  }
> > > > > >  
> > > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct 
> > > > > > mmu_notifier *_mn,
> > > > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct 
> > > > > > mmu_notifier *_mn,
> > > > > >struct 
> > > > > > mm_struct *mm,
> > > > > >unsigned 
> > > > > > long start,
> > > > > > -  unsigned 
> > > > > > long end)
> > > > > > +  unsigned 
> > > > > > long end,
> > > > > > +  bool 
> > > > > > blockable)
> > > > > >  {
> > > > > > struct i915_mmu_notifier *mn =
> > > > > > container_of(_mn, struct i915_mmu_notifier, mn);
> > > > > > @@ -124,7 +125,7 @@ static void 
> > > > > > i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > > > LIST_HEAD(cancelled);
> > > > > >  
> > > > > > if (RB_EMPTY_ROOT(>objects.rb_root))
> > > > > > -   return;
> > > > > > +   return 0;
> > > > > 
> > > > > The principle wait here is for the HW (even after fixing all the locks
> > > > > to be not so coarse, we still have to wait for the HW to finish its
> > > > > access).
> > > > 
> > > > Is this wait bound or it can take basically arbitrary amount of time?
> > > 
> > > Arbitrary amount of time but in desktop use case you can assume that
> > > it should never go above 16ms for a 60frame per second rendering of
> > > your desktop (in GPU compute case this kind of assumption does not
> > > hold). Is the process exit_state already updated by the time this mmu
> > > notifier callbacks happen ?
> > 
> > What do you mean? The process is killed (by SIGKILL) at the time but we
> > do not know much more than that. The task might be stuck anywhere in the
> > kernel before handling that signal.

I was wondering if another thread might still be dereferencing any of
the structures concurrently with the OOM mmu notifier callback. Sadly
yes, it would be simpler if we could make such an assumption.

> > 
> > > > > The first pass would be then to not do anything here if
> > > > > !blockable.
> > > > 
> > > > something like this? (incremental diff)
> > > 
> > > What i wanted to do with HMM and mmu notifier is split the invalidation
> > > in 2 pass. First pass tell the drivers t

Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers

2018-06-22 Thread Jerome Glisse
On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote:
> On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > Hi,
> > > this is an RFC and not tested at all. I am not very familiar with the
> > > mmu notifiers semantics very much so this is a crude attempt to achieve
> > > what I need basically. It might be completely wrong but I would like
> > > to discuss what would be a better way if that is the case.
> > > 
> > > get_maintainers gave me quite large list of people to CC so I had to trim
> > > it down. If you think I have forgot somebody, please let me know
> > 
> > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c 
> > > b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > index 854bd51b9478..5285df9331fa 100644
> > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> > > mo->attached = false;
> > >  }
> > >  
> > > -static void i915_gem_userptr_mn_invalidate_range_start(struct 
> > > mmu_notifier *_mn,
> > > +static int i915_gem_userptr_mn_invalidate_range_start(struct 
> > > mmu_notifier *_mn,
> > >struct mm_struct 
> > > *mm,
> > >unsigned long 
> > > start,
> > > -  unsigned long end)
> > > +  unsigned long end,
> > > +  bool blockable)
> > >  {
> > > struct i915_mmu_notifier *mn =
> > > container_of(_mn, struct i915_mmu_notifier, mn);
> > > @@ -124,7 +125,7 @@ static void 
> > > i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > LIST_HEAD(cancelled);
> > >  
> > > if (RB_EMPTY_ROOT(>objects.rb_root))
> > > -   return;
> > > +   return 0;
> > 
> > The principle wait here is for the HW (even after fixing all the locks
> > to be not so coarse, we still have to wait for the HW to finish its
> > access).
> 
> Is this wait bound or it can take basically arbitrary amount of time?

An arbitrary amount of time, but in the desktop use case you can assume
that it should never go above 16ms for a 60-frames-per-second rendering of
your desktop (in the GPU compute case this kind of assumption does not
hold). Is the process exit_state already updated by the time these mmu
notifier callbacks happen?

> 
> > The first pass would be then to not do anything here if
> > !blockable.
> 
> something like this? (incremental diff)

What I wanted to do with HMM and mmu notifiers is split the invalidation
into 2 passes. The first pass tells the drivers to stop/cancel pending
jobs that depend on the range and to invalidate internal driver state
(like clearing the buffer object pages array in the GPU case, but not the
GPU page table), while the second callback would do the actual wait for
the GPU to be done and update the GPU page table.
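
Very roughly, the split maps onto the two callbacks like this (a sketch
only, the drv_*() helpers and to_drv_mirror() are made up, not an existing
API):

static void drv_invalidate_range_start(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start, unsigned long end)
{
	struct drv_mirror *mirror = to_drv_mirror(mn);	/* hypothetical */

	/* Pass 1: no waiting on the hardware required. */
	drv_cancel_pending_jobs(mirror, start, end);
	drv_clear_bo_pages_array(mirror, start, end);	/* driver state only */
}

static void drv_invalidate_range_end(struct mmu_notifier *mn,
				     struct mm_struct *mm,
				     unsigned long start, unsigned long end)
{
	struct drv_mirror *mirror = to_drv_mirror(mn);	/* hypothetical */

	/* Pass 2: wait for the GPU, then update the GPU page table. */
	drv_wait_gpu_idle(mirror);
	drv_update_gpu_page_table(mirror, start, end);
}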

Now in this scheme, if the task is already in some exit state and all its
CPU threads are frozen/killed, then we can probably find a way to do the
first pass mostly locklessly. AFAICR neither AMD nor Intel allow sharing
userptr BOs, hence a userptr BO should only ever be accessed through
ioctls submitted by the process.

The second call can then be delayed, and we can ping from time to time to
see if the GPU jobs are done.


Note that what you propose might still be useful: in case there is no
buffer object for a range, OOM can make progress in freeing a range of
memory. It is very likely that a significant part of a process's virtual
address range and its backing memory can be reclaimed that way. This
assumes OOM reclaims vma by vma or at some granularity like 1GB by 1GB.
Or we could also update the blocking callback to return the ranges that
are blocking, so that OOM can reclaim around them.

Cheers,
Jérôme
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


  1   2   3   4   5   6   7   8   9   10   >