Re: [PATCH v4 04/34] pgtable: Create struct ptdesc

2023-06-20 Thread Jason Gunthorpe
On Tue, Jun 20, 2023 at 01:01:39PM -0700, Vishal Moola wrote:
> On Fri, Jun 16, 2023 at 5:38 AM Jason Gunthorpe  wrote:
> >
> > On Mon, Jun 12, 2023 at 02:03:53PM -0700, Vishal Moola (Oracle) wrote:
> > > Currently, page table information is stored within struct page. As part
> > > of simplifying struct page, create struct ptdesc for page table
> > > information.
> > >
> > > Signed-off-by: Vishal Moola (Oracle) 
> > > ---
> > >  include/linux/pgtable.h | 51 +
> > >  1 file changed, 51 insertions(+)
> > >
> > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > > index c5a51481bbb9..330de96ebfd6 100644
> > > --- a/include/linux/pgtable.h
> > > +++ b/include/linux/pgtable.h
> > > @@ -975,6 +975,57 @@ static inline void ptep_modify_prot_commit(struct 
> > > vm_area_struct *vma,
> > >  #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
> > >  #endif /* CONFIG_MMU */
> > >
> > > +
> > > +/**
> > > + * struct ptdesc - Memory descriptor for page tables.
> > > + * @__page_flags: Same as page flags. Unused for page tables.
> > > + * @pt_list: List of used page tables. Used for s390 and x86.
> > > + * @_pt_pad_1: Padding that aliases with page's compound head.
> > > + * @pmd_huge_pte: Protected by ptdesc->ptl, used for THPs.
> > > + * @_pt_s390_gaddr: Aliases with page's mapping. Used for s390 gmap only.
> > > + * @pt_mm: Used for x86 pgds.
> > > + * @pt_frag_refcount: For fragmented page table tracking. Powerpc and 
> > > s390 only.
> > > + * @ptl: Lock for the page table.
> > > + *
> > > + * This struct overlays struct page for now. Do not modify without a good
> > > + * understanding of the issues.
> > > + */
> > > +struct ptdesc {
> > > + unsigned long __page_flags;
> > > +
> > > + union {
> > > + struct list_head pt_list;
> > > + struct {
> > > + unsigned long _pt_pad_1;
> > > + pgtable_t pmd_huge_pte;
> > > + };
> > > + };
> > > + unsigned long _pt_s390_gaddr;
> > > +
> > > + union {
> > > + struct mm_struct *pt_mm;
> > > + atomic_t pt_frag_refcount;
> > > + };
> > > +
> > > +#if ALLOC_SPLIT_PTLOCKS
> > > + spinlock_t *ptl;
> > > +#else
> > > + spinlock_t ptl;
> > > +#endif
> > > +};
> >
> > I think you should include the memcg here too? It needs to be valid
> > for a ptdesc, even if we don't currently deref it through the ptdesc
> > type.
> 
> Yes, thanks for catching that! I'll add it to v5.
> 
> > Also, do you see a way to someday put a 'struct rcu_head' into here?
> 
> Eventually, when they're being dynamically allocated independent of
> struct page. Although at that point I'm not sure if we'll need one.

Sooner than dynamic struct page?

Probably it can overlap pt_list in a lot of arches?
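For illustration, a sketch of where those two fields could land in the overlay; the names pt_rcu_head and pt_memcg_data are assumptions for this sketch, not part of the posted patch, and the exact placement would have to line up with struct page:

struct ptdesc {
	unsigned long __page_flags;

	union {
		struct rcu_head pt_rcu_head;	/* overlaps pt_list, for deferred frees */
		struct list_head pt_list;
		struct {
			unsigned long _pt_pad_1;
			pgtable_t pmd_huge_pte;
		};
	};
	unsigned long _pt_s390_gaddr;

	union {
		struct mm_struct *pt_mm;
		atomic_t pt_frag_refcount;
	};

#if ALLOC_SPLIT_PTLOCKS
	spinlock_t *ptl;
#else
	spinlock_t ptl;
#endif
	unsigned long pt_memcg_data;	/* memcg pointer; must alias page->memcg_data */
};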

Jason



Re: [PATCH v4 04/34] pgtable: Create struct ptdesc

2023-06-16 Thread Jason Gunthorpe
On Mon, Jun 12, 2023 at 02:03:53PM -0700, Vishal Moola (Oracle) wrote:
> Currently, page table information is stored within struct page. As part
> of simplifying struct page, create struct ptdesc for page table
> information.
> 
> Signed-off-by: Vishal Moola (Oracle) 
> ---
>  include/linux/pgtable.h | 51 +
>  1 file changed, 51 insertions(+)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index c5a51481bbb9..330de96ebfd6 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -975,6 +975,57 @@ static inline void ptep_modify_prot_commit(struct 
> vm_area_struct *vma,
>  #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
>  #endif /* CONFIG_MMU */
>  
> +
> +/**
> + * struct ptdesc - Memory descriptor for page tables.
> + * @__page_flags: Same as page flags. Unused for page tables.
> + * @pt_list: List of used page tables. Used for s390 and x86.
> + * @_pt_pad_1: Padding that aliases with page's compound head.
> + * @pmd_huge_pte: Protected by ptdesc->ptl, used for THPs.
> + * @_pt_s390_gaddr: Aliases with page's mapping. Used for s390 gmap only.
> + * @pt_mm: Used for x86 pgds.
> + * @pt_frag_refcount: For fragmented page table tracking. Powerpc and s390 
> only.
> + * @ptl: Lock for the page table.
> + *
> + * This struct overlays struct page for now. Do not modify without a good
> + * understanding of the issues.
> + */
> +struct ptdesc {
> + unsigned long __page_flags;
> +
> + union {
> + struct list_head pt_list;
> + struct {
> + unsigned long _pt_pad_1;
> + pgtable_t pmd_huge_pte;
> + };
> + };
> + unsigned long _pt_s390_gaddr;
> +
> + union {
> + struct mm_struct *pt_mm;
> + atomic_t pt_frag_refcount;
> + };
> +
> +#if ALLOC_SPLIT_PTLOCKS
> + spinlock_t *ptl;
> +#else
> + spinlock_t ptl;
> +#endif
> +};

I think you should include the memcg here too? It needs to be valid
for a ptdesc, even if we don't currently deref it through the ptdesc
type.

Also, do you see a way to someday put a 'struct rcu_head' into here?

Thanks,
Jason



Re: [PATCH v2 00/34] Split ptdesc from struct page

2023-05-18 Thread Jason Gunthorpe
On Mon, May 01, 2023 at 12:27:55PM -0700, Vishal Moola (Oracle) wrote:
> The MM subsystem is trying to shrink struct page. This patchset
> introduces a memory descriptor for page table tracking - struct ptdesc.
> 
> This patchset introduces ptdesc, splits ptdesc from struct page, and
> converts many callers of page table constructor/destructors to use ptdescs.

Lightly related food for future thought - based on some discussions at
LSF/MM it would be really nice if an end result of this was that a
rcu_head was always available in the ptdesc so we don't need to
allocate memory to free a page table.
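For example, with the rcu_head embedded in the descriptor the free path needs no allocation at all; a minimal sketch, where pt_rcu_head and ptdesc_free() are assumed names rather than existing interfaces:

static void pt_free_rcu(struct rcu_head *head)
{
	struct ptdesc *pt = container_of(head, struct ptdesc, pt_rcu_head);

	ptdesc_free(pt);	/* hypothetical helper returning the page to the allocator */
}

static void pt_free_deferred(struct ptdesc *pt)
{
	/* No extra allocation: the rcu_head lives inside the descriptor. */
	call_rcu(&pt->pt_rcu_head, pt_free_rcu);
}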

Jason



Re: [PATCH v2] PCI/MSI: Correct use of can_mask in msi_add_msi_desc()

2022-08-26 Thread Jason Gunthorpe
On Fri, Aug 05, 2022 at 10:53:36AM -0500, Bjorn Helgaas wrote:
> On Fri, Aug 05, 2022 at 09:10:41AM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 05, 2022 at 12:03:15PM +0200, Josef Johansson wrote:
> > > On 2/14/22 11:07, Josef Johansson wrote:
> > > > From: Josef Johansson 
> > > > 
> > > > PCI/MSI: Correct use of can_mask in msi_add_msi_desc()
> > > > Commit 71020a3c0dff4 ("PCI/MSI: Use msi_add_msi_desc()") modifies
> > > > the logic of checking msi_attrib.can_mask, without any reason.
> > > > This commits restores that logic.
> > > >
> > > > Fixes: 71020a3c0dff4 ("PCI/MSI: Use msi_add_msi_desc()")
> > > > Signed-off-by: Josef Johansson 
> > > > 
> > > > ---
> > > > v2: Changing subject line to fit earlier commits.
> > > > 
> > > > Trying to fix a NULL BUG in the NVMe MSIX implementation I stumbled 
> > > > upon this code,
> > > > which ironically was what my last MSI patch resulted into.
> > > > 
> > > > I don't see any reason why this logic was change, and it did not break 
> > > > anything
> > > > correcting the logic.
> > > > 
> > > > CC xen-devel since it very much relates to Xen kernel (via 
> > > > pci_msi_ignore_mask).
> > > > ---
> > > > 
> > > > diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
> > > > index c19c7ca58186..146e7b9a01cc 100644
> > > > --- a/drivers/pci/msi/msi.c
> > > > +++ b/drivers/pci/msi/msi.c
> > > > @@ -526,7 +526,7 @@ static int msix_setup_msi_descs(struct pci_dev 
> > > > *dev, void __iomem *base,
> > > > desc.pci.msi_attrib.can_mask = !pci_msi_ignore_mask &&
> > > >    
> > > > !desc.pci.msi_attrib.is_virtual;
> > > > -   if (!desc.pci.msi_attrib.can_mask) {
> > > > +   if (desc.pci.msi_attrib.can_mask) {
> > > > addr = pci_msix_desc_addr(&desc);
> > > > desc.pci.msix_ctrl = readl(addr + 
> > > > PCI_MSIX_ENTRY_VECTOR_CTRL);
> > > > }
> > > > 
> > 
> > Reviewed-by: Jason Gunthorpe 
> > 
> > Bjorn, please take it?
> 
> Thanks for the ping.  Since 71020a3c0dff4 is by Thomas, and he merged
> that along with a whole series of MSI work, I think I probably
> expected him to take care of this.
> 
> This looks like a simple typo, so I think the commit log should be
> reworded along that line, e.g., something like:
> 
>   71020a3c0dff4 ("PCI/MSI: Use msi_add_msi_desc()") inadvertently
>   reversed the sense of "msi_attrib.can_mask" in one use:
> 
> - if (entry->pci.msi_attrib.can_mask) {
> - addr = pci_msix_desc_addr(entry);
> - entry->pci.msix_ctrl = readl(addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
> + if (!desc.pci.msi_attrib.can_mask) {
> + addr = pci_msix_desc_addr(&desc);
> + desc.pci.msix_ctrl = readl(addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
> 
>   Restore the original test.
> 
> Thomas, do you want to take this?  I'm happy to merge it, but would
> like your reviewed-by or ack first.

At this point I think you should take it Bjorn..

Jason



Re: [PATCH v2] PCI/MSI: Correct use of can_mask in msi_add_msi_desc()

2022-08-05 Thread Jason Gunthorpe
On Fri, Aug 05, 2022 at 12:03:15PM +0200, Josef Johansson wrote:
> On 2/14/22 11:07, Josef Johansson wrote:
> > From: Josef Johansson 
> > 
> > PCI/MSI: Correct use of can_mask in msi_add_msi_desc()
> > Commit 71020a3c0dff4 ("PCI/MSI: Use msi_add_msi_desc()") modifies
> > the logic of checking msi_attrib.can_mask, without any reason.
> > This commits restores that logic.
> >
> > Fixes: 71020a3c0dff4 ("PCI/MSI: Use msi_add_msi_desc()")
> > Signed-off-by: Josef Johansson 
> > 
> > ---
> > v2: Changing subject line to fit earlier commits.
> > 
> > Trying to fix a NULL BUG in the NVMe MSIX implementation I stumbled upon 
> > this code,
> > which ironically was what my last MSI patch resulted into.
> > 
> > I don't see any reason why this logic was change, and it did not break 
> > anything
> > correcting the logic.
> > 
> > CC xen-devel since it very much relates to Xen kernel (via 
> > pci_msi_ignore_mask).
> > ---
> > 
> > diff --git a/drivers/pci/msi/msi.c b/drivers/pci/msi/msi.c
> > index c19c7ca58186..146e7b9a01cc 100644
> > --- a/drivers/pci/msi/msi.c
> > +++ b/drivers/pci/msi/msi.c
> > @@ -526,7 +526,7 @@ static int msix_setup_msi_descs(struct pci_dev *dev, 
> > void __iomem *base,
> > desc.pci.msi_attrib.can_mask = !pci_msi_ignore_mask &&
> >!desc.pci.msi_attrib.is_virtual;
> > -   if (!desc.pci.msi_attrib.can_mask) {
> > +   if (desc.pci.msi_attrib.can_mask) {
> > addr = pci_msix_desc_addr(&desc);
> > desc.pci.msix_ctrl = readl(addr + 
> > PCI_MSIX_ENTRY_VECTOR_CTRL);
> > }
> > 

Reviewed-by: Jason Gunthorpe 

Bjorn, please take it?

Jason



Re: blocking vs. non-blocking mmu notifiers

2022-03-23 Thread Jason Gunthorpe
On Wed, Mar 23, 2022 at 05:49:43PM +0100, Michal Hocko wrote:
> > The bug here is that prior to commit a81461b0546c ("xen/gntdev: update
> > to new mmu_notifier semantic") wired the mn_invl_range_start() which
> > takes a mutex to invalidate_page, which is defined to run in an atomic
> > context.
> 
> Yeah, we have already identified that but quickly realized that the
> whole mmu notifier overhaul which this fix depends on would be no no for
> backporting to our older code base. So we are trying to find our way
> around that.

IMHO you don't need everything, just commit 369ea8242c0f ("mm/rmap:
update to new mmu_notifier semantic v2") which adds the missing
start/end outside the lock for the page callbacks.

Then you can safely take a8146 into gntdev.
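The shape of that change, roughly: the sleepable start/end notifiers bracket the section that used to fire invalidate_page under the PTL. A simplified sketch using the three-argument notifier API of that kernel generation; clear_ptes_under_ptl() is a stand-in for the real PTL-protected work, not an upstream function:

static bool clear_ptes_under_ptl(struct mm_struct *mm, struct page *page,
				 unsigned long addr);	/* placeholder */

static bool unmap_one_page(struct mm_struct *mm, struct page *page,
			   unsigned long addr)
{
	unsigned long end = addr + PAGE_SIZE;
	bool ret;

	/* May sleep: runs before the page table lock is taken. */
	mmu_notifier_invalidate_range_start(mm, addr, end);

	ret = clear_ptes_under_ptl(mm, page, addr);	/* atomic, PTL held inside */

	/* May sleep: runs after the page table lock is dropped. */
	mmu_notifier_invalidate_range_end(mm, addr, end);
	return ret;
}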

Jason



Re: blocking vs. non-blocking mmu notifiers

2022-03-23 Thread Jason Gunthorpe
On Wed, Mar 23, 2022 at 10:45:30AM +0100, Michal Hocko wrote:
> [Let me add more people to the CC list - I am not really sure who is the
>  most familiar with all the tricks that mmu notifiers might do]
> 
> On Wed 23-03-22 09:43:59, Juergen Gross wrote:
> > Hi,
> > 
> > during analysis of a customer's problem on a 4.12 based kernel
> > (deadlock due to a blocking mmu notifier in a Xen driver) I came
> > across upstream patches 93065ac753e4 ("mm, oom: distinguish
> > blockable mode for mmu notifiers") et al.
> > 
> > The backtrace of the blocked tasks was typically something like:
> > 
> >  #0 [c9004222f228] __schedule at 817223e2
> >  #1 [c9004222f2b8] schedule at 81722a02
> >  #2 [c9004222f2c8] schedule_preempt_disabled at 81722d0a
> >  #3 [c9004222f2d0] __mutex_lock at 81724104
> >  #4 [c9004222f360] mn_invl_range_start at c01fd398 [xen_gntdev]
> >  #5 [c9004222f398] __mmu_notifier_invalidate_page at 8123375a
> >  #6 [c9004222f3c0] try_to_unmap_one at 812112cb
> >  #7 [c9004222f478] rmap_walk_file at 812105cd
> >  #8 [c9004222f4d0] try_to_unmap at 81212450
> >  #9 [c9004222f508] shrink_page_list at 811e0755
> > #10 [c9004222f5c8] shrink_inactive_list at 811e13cf
> > #11 [c9004222f6a8] shrink_node_memcg at 811e241f
> > #12 [c9004222f790] shrink_node at 811e29c5
> > #13 [c9004222f808] do_try_to_free_pages at 811e2ee1
> > #14 [c9004222f868] try_to_free_pages at 811e3248
> > #15 [c9004222f8e8] __alloc_pages_slowpath at 81262c37
> > #16 [c9004222f9f0] __alloc_pages_nodemask at 8121afc1
> > #17 [c9004222fa48] alloc_pages_current at 8122f350
> > #18 [c9004222fa78] __get_free_pages at 8121685a
> > #19 [c9004222fa80] __pollwait at 8127e795
> > #20 [c9004222faa8] evtchn_poll at c00e802b [xen_evtchn]
> > #21 [c9004222fab8] do_sys_poll at 8127f953
> > #22 [c9004222fec8] sys_ppoll at 81280478
> > #23 [c9004222ff30] do_syscall_64 at 81004954
> > #24 [c9004222ff50] entry_SYSCALL_64_after_hwframe at 818000b6
> > 
> > It was found that the notifier of the Xen gntdev driver was using a
> > mutex resulting in the deadlock.

The bug here is that prior to commit a81461b0546c ("xen/gntdev: update
to new mmu_notifier semantic") wired the mn_invl_range_start() which
takes a mutex to invalidate_page, which is defined to run in an atomic
context.

> > Michal Hocko suggested that backporting above mentioned patch might
> > help, as the mmu notifier call is happening in atomic context.

IIRC "blocking" was not supposed to be about atomic context or not, but
more about time to completion/guaranteed completion as it is used from
a memory reclaim path.

> Just to be more explicit. The current upstream code calls
> mmu_notifier_invalidate_range while the page table locks are held.
> Are there any notifiers which could sleep in those? 

There should not be, that would be a bug that lockdep would find.

> In other words should we use the nonblock start/stop in
> try_to_unmap?

AFAICT, no.
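For reference, later kernels expose that distinction to the notifier itself via mmu_notifier_range_blockable(); a sketch of a driver-side invalidate_range_start() honouring the non-blocking reclaim path (my_ctx and my_unmap_range() are illustrative, not gntdev's actual code):

struct my_ctx {
	struct mmu_notifier mn;
	struct mutex lock;
};

static void my_unmap_range(struct my_ctx *ctx, unsigned long start,
			   unsigned long end);	/* driver's actual teardown */

static int my_invalidate_range_start(struct mmu_notifier *mn,
				     const struct mmu_notifier_range *range)
{
	struct my_ctx *ctx = container_of(mn, struct my_ctx, mn);

	if (mmu_notifier_range_blockable(range))
		mutex_lock(&ctx->lock);			/* caller allows sleeping */
	else if (!mutex_trylock(&ctx->lock))
		return -EAGAIN;				/* non-blocking caller backs off */

	my_unmap_range(ctx, range->start, range->end);
	mutex_unlock(&ctx->lock);
	return 0;
}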

Jason



Re: [PATCH] PCI/MSI: msix_setup_msi_descs: Restore logic for msi_attrib.can_mask

2022-02-10 Thread Jason Gunthorpe
On Thu, Feb 10, 2022 at 05:55:32PM -0600, Bjorn Helgaas wrote:
> > Commit 71020a3c0dff4 ("PCI/MSI: Use msi_add_msi_desc()") modifies
> > the logic of checking msi_attrib.can_mask, without any reason.
> > 
> > This commits restores that logic.
> 
> I agree, this looks like a typo in 71020a3c0dff4, but I might be
> missing something, so Thomas should take a look, and I added Jason
> since he reviewed it.

I concur

Jason



Re: [patch V3 27/35] PCI/MSI: Use __msi_get_virq() in pci_get_vector()

2021-12-13 Thread Jason Gunthorpe
On Fri, Dec 10, 2021 at 11:19:25PM +0100, Thomas Gleixner wrote:
> From: Thomas Gleixner 
> 
> Use msi_get_vector() and handle the return value to be compatible.
> 
> No functional change intended.
> 
> Signed-off-by: Thomas Gleixner 
> Reviewed-by: Greg Kroah-Hartman 
> ---
> V2: Handle the INTx case directly instead of trying to be overly smart - Marc
> ---
>  drivers/pci/msi/msi.c |   25 +
>  1 file changed, 5 insertions(+), 20 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason



Re: [patch V3 26/35] genirq/msi: Provide interface to retrieve Linux interrupt number

2021-12-13 Thread Jason Gunthorpe
On Fri, Dec 10, 2021 at 11:19:23PM +0100, Thomas Gleixner wrote:
> From: Thomas Gleixner 
> 
> This allows drivers to retrieve the Linux interrupt number instead of
> fiddling with MSI descriptors.
> 
> msi_get_virq() returns the Linux interrupt number or 0 in case that there
> is no entry for the given MSI index.
> 
> Signed-off-by: Thomas Gleixner 
> Reviewed-by: Greg Kroah-Hartman 
> ---
> V2: Simplify the implementation and let PCI deal with the PCI specialities - 
> Marc
> ---
>  include/linux/msi.h |2 ++
>  kernel/irq/msi.c|   36 
>  2 files changed, 38 insertions(+)

Reviewed-by: Jason Gunthorpe 

Jason



Re: [patch V3 25/35] powerpc/pseries/msi: Let core code check for contiguous entries

2021-12-13 Thread Jason Gunthorpe
On Fri, Dec 10, 2021 at 11:19:22PM +0100, Thomas Gleixner wrote:
> From: Thomas Gleixner 
> 
> Set the domain info flag and remove the check.
> 
> Signed-off-by: Thomas Gleixner 
> Reviewed-by: Greg Kroah-Hartman 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: "Cédric Le Goater" 
> Cc: linuxppc-...@lists.ozlabs.org
> 
> ---
> V2: Remove it completely - Cedric
> ---
>  arch/powerpc/platforms/pseries/msi.c |   33 -
>  1 file changed, 8 insertions(+), 25 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason
 



Re: [patch V3 06/35] powerpc/pseries/msi: Use PCI device properties

2021-12-13 Thread Jason Gunthorpe
On Fri, Dec 10, 2021 at 11:18:52PM +0100, Thomas Gleixner wrote:
> From: Thomas Gleixner 
> 
> instead of fiddling with MSI descriptors.
> 
> Signed-off-by: Thomas Gleixner 
> Cc: Michael Ellerman 
> Cc: linuxppc-...@lists.ozlabs.org
> ---
> V3: Use pci_dev->msix_enabled - Jason
> ---
>  arch/powerpc/platforms/pseries/msi.c |3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)

Reviewed-by: Jason Gunthorpe 
 
> --- a/arch/powerpc/platforms/pseries/msi.c
> +++ b/arch/powerpc/platforms/pseries/msi.c
> @@ -448,8 +448,7 @@ static int pseries_msi_ops_prepare(struc
>  int nvec, msi_alloc_info_t *arg)
>  {
>   struct pci_dev *pdev = to_pci_dev(dev);
> - struct msi_desc *desc = first_pci_msi_entry(pdev);
> - int type = desc->pci.msi_attrib.is_msix ? PCI_CAP_ID_MSIX : 
> PCI_CAP_ID_MSI;
> + int type = pdev->msix_enabled ? PCI_CAP_ID_MSIX : PCI_CAP_ID_MSI;

Long term it probably makes sense to change the msi_domain_ops so that
it has PCI versions of the ops to use in places like this that hard
assume PCI is the only kind of MSI at all.

If the non-PCI op isn't provided then things like IMS would be denied
- and the PCI op can directly pass in a pci_dev * so we don't have all
these to_pci_devs() in drivers.
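A rough sketch of that idea; the pci_prepare callback and the dispatch helper below are hypothetical, nothing like them exists in msi_domain_ops today:

struct msi_domain_ops_sketch {
	/* Generic path, used by platform/device MSI (e.g. IMS). */
	int (*prepare)(struct irq_domain *domain, struct device *dev,
		       int nvec, msi_alloc_info_t *arg);
	/* PCI path: gets the pci_dev directly, no to_pci_dev() in drivers. */
	int (*pci_prepare)(struct irq_domain *domain, struct pci_dev *pdev,
			   int nvec, msi_alloc_info_t *arg);
};

static int msi_prepare_sketch(struct irq_domain *domain, struct device *dev,
			      int nvec, msi_alloc_info_t *arg,
			      const struct msi_domain_ops_sketch *ops)
{
	if (dev_is_pci(dev)) {
		if (!ops->pci_prepare)
			return -EINVAL;	/* PCI not supported by this domain */
		return ops->pci_prepare(domain, to_pci_dev(dev), nvec, arg);
	}
	/* Non-PCI (device MSI / IMS): denied unless the generic op exists. */
	return ops->prepare ? ops->prepare(domain, dev, nvec, arg) : -EINVAL;
}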

Jason



Re: [patch V3 05/35] powerpc/cell/axon_msi: Use PCI device property

2021-12-13 Thread Jason Gunthorpe
On Fri, Dec 10, 2021 at 11:18:51PM +0100, Thomas Gleixner wrote:
> From: Thomas Gleixner 
> 
> instead of fiddling with MSI descriptors.
> 
> Signed-off-by: Thomas Gleixner 
> Cc: Arnd Bergmann 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: linuxppc-...@lists.ozlabs.org
> ---
> V3: Use pci_dev property - Jason
> V2: Invoke the function with the correct number of arguments - Andy
> ---
>  arch/powerpc/platforms/cell/axon_msi.c |5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason



Re: [patch V3 04/35] genirq/msi: Use PCI device property

2021-12-13 Thread Jason Gunthorpe
On Fri, Dec 10, 2021 at 11:18:49PM +0100, Thomas Gleixner wrote:
> From: Thomas Gleixner 
> 
> to determine whether this is MSI or MSIX instead of consulting MSI
> descriptors.
> 
> Signed-off-by: Thomas Gleixner 
> ---
> V2: Use PCI device property - Jason
> ---
>  kernel/irq/msi.c |   17 ++---
>  1 file changed, 2 insertions(+), 15 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason



Re: [patch V3 03/35] x86/apic/msi: Use PCI device MSI property

2021-12-13 Thread Jason Gunthorpe
On Fri, Dec 10, 2021 at 11:18:47PM +0100, Thomas Gleixner wrote:
> From: Thomas Gleixner 
> 
> instead of fiddling with MSI descriptors.
> 
> Signed-off-by: Thomas Gleixner 
> ---
> V3: Use pci_dev->msix_enabled - Jason
> ---
>  arch/x86/kernel/apic/msi.c |5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason



Re: [patch V3 02/35] x86/pci/XEN: Use PCI device property

2021-12-13 Thread Jason Gunthorpe
On Fri, Dec 10, 2021 at 11:18:46PM +0100, Thomas Gleixner wrote:
> From: Thomas Gleixner 
> 
> instead of fiddling with MSI descriptors.
> 
> Signed-off-by: Thomas Gleixner 
> Cc: Juergen Gross 
> Cc: xen-devel@lists.xenproject.org
> ---
> V3: Use pci_dev->msix_enabled.
> ---
>  arch/x86/pci/xen.c |9 ++---
>  1 file changed, 2 insertions(+), 7 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason



Re: [patch V3 01/35] PCI/MSI: Set pci_dev::msi[x]_enabled early

2021-12-13 Thread Jason Gunthorpe
On Fri, Dec 10, 2021 at 11:18:44PM +0100, Thomas Gleixner wrote:
> There are quite some places which retrieve the first MSI descriptor to
> evaluate whether the setup is for MSI or MSI-X. That's required because
> pci_dev::msi[x]_enabled is only set when the setup completed successfully.
> 
> There is no real reason why msi[x]_enabled can't be set at the beginning of
> the setup sequence and cleared in case of a failure.
> 
> Implement that so the MSI descriptor evaluations can be converted to simple
> property queries.
> 
> Signed-off-by: Thomas Gleixner 
> ---
> V3: New patch
> ---
>  drivers/pci/msi/msi.c |   23 +--
>  1 file changed, 17 insertions(+), 6 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason



Re: [patch V2 00/31] genirq/msi, PCI/MSI: Spring cleaning - Part 3

2021-12-08 Thread Jason Gunthorpe
On Mon, Dec 06, 2021 at 11:51:02PM +0100, Thomas Gleixner wrote:
> This is the third part of [PCI]MSI refactoring which aims to provide the
> ability of expanding MSI-X vectors after enabling MSI-X.

I read through this and didn't have any substantive remarks

Reviewed-by: Jason Gunthorpe 

Jason



Re: [patch V2 02/31] genirq/msi: Add mutex for MSI list protection

2021-12-08 Thread Jason Gunthorpe
On Mon, Dec 06, 2021 at 11:51:05PM +0100, Thomas Gleixner wrote:
> +++ b/kernel/irq/msi.c
> @@ -127,12 +127,37 @@ int msi_setup_device_data(struct device
>   return -ENOMEM;
>  
>   INIT_LIST_HEAD(&md->list);
> + mutex_init(&md->mutex);
>   dev->msi.data = md;
>   devres_add(dev, md);
>   return 0;
>  }
>  
>  /**
> + * msi_lock_descs - Lock the MSI descriptor storage of a device
> + * @dev: Device to operate on
> + */
> +void msi_lock_descs(struct device *dev)
> +{
> + if (WARN_ON_ONCE(!dev->msi.data))
> + return;

Is this useful? Other places that call msi_lock_descs will continue on and deref
null dev->msi anyhow - is the dump from the WARN_ON that much better
than the oops from the null deref here:

> + mutex_lock(&dev->msi.data->mutex);

?

Honestly, still a bit unclear on what the community consensus is for
using WARN_ON.

Jason



Re: [patch V2 19/36] PCI/MSI: Store properties in device::msi::data

2021-12-08 Thread Jason Gunthorpe
On Mon, Dec 06, 2021 at 11:39:26PM +0100, Thomas Gleixner wrote:
> Store the properties which are interesting for various places so the MSI
> descriptor fiddling can be removed.
> 
> Signed-off-by: Thomas Gleixner 
> ---
> V2: Use the setter function
> ---
>  drivers/pci/msi/msi.c |8 
>  1 file changed, 8 insertions(+)

I took more time to look at this, to summarize my remarks on the other
patches

I think we don't need properties. The info in the msi_desc can come
from the pci_dev which we have easy access to. This seems overall
clearer

The notable one is the sysfs, but that is probably better handled by
storing a

  const char *sysfs_label

in the dev->msi and emitting that instead of computing it.
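A minimal sketch of that suggestion; the field, the accessor and the show routine are all hypothetical:

/* Hypothetical field added to the per-device MSI data. */
struct msi_device_data_sketch {
	const char *sysfs_label;	/* "msi" or "msix", set once when MSI[-X] is enabled */
};

static struct msi_device_data_sketch *get_msi_data(struct device *dev);	/* hypothetical */

static ssize_t msi_mode_show(struct device *dev, struct device_attribute *attr,
			     char *buf)
{
	struct msi_device_data_sketch *md = get_msi_data(dev);

	/* Emit the stored string instead of recomputing it from MSI descriptors. */
	return sysfs_emit(buf, "%s\n", md->sysfs_label);
}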

Jason



Re: [patch V2 23/36] powerpc/cell/axon_msi: Use MSI device properties

2021-12-08 Thread Jason Gunthorpe
On Mon, Dec 06, 2021 at 11:39:33PM +0100, Thomas Gleixner wrote:

> @@ -209,10 +209,10 @@ static int setup_msi_msg_address(struct
>   return -ENODEV;
>   }
>  
> - entry = first_pci_msi_entry(dev);
> + is_64bit = msi_device_has_property(&dev->dev, MSI_PROP_64BIT);

How about  !dev->no_64bit_msi ?

Jason



Re: [patch V2 20/36] x86/pci/XEN: Use device MSI properties

2021-12-08 Thread Jason Gunthorpe
On Mon, Dec 06, 2021 at 11:39:28PM +0100, Thomas Gleixner wrote:
> instead of fiddling with MSI descriptors.
> 
> Signed-off-by: Thomas Gleixner 
> Reviewed-by: Greg Kroah-Hartman 
> Reviewed-by: Jason Gunthorpe 
>  arch/x86/pci/xen.c |6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> +++ b/arch/x86/pci/xen.c
> @@ -399,9 +399,7 @@ static void xen_teardown_msi_irqs(struct
>  
>  static void xen_pv_teardown_msi_irqs(struct pci_dev *dev)
>  {
> - struct msi_desc *msidesc = first_pci_msi_entry(dev);
> -
> - if (msidesc->pci.msi_attrib.is_msix)
> + if (msi_device_has_property(&dev->dev, MSI_PROP_PCI_MSIX))
>   xen_pci_frontend_disable_msix(dev);
>   else
>   xen_pci_frontend_disable_msi(dev);

Same remark as for power, we have a pci_dev, so can it be dev->msix_enabled?

> @@ -417,7 +415,7 @@ static int xen_msi_domain_alloc_irqs(str
>   if (WARN_ON_ONCE(!dev_is_pci(dev)))
>   return -EINVAL;
>  
> - if (first_msi_entry(dev)->pci.msi_attrib.is_msix)
> + if (msi_device_has_property(dev, MSI_PROP_PCI_MSIX))

And this WARNS if it is not a pci_dev, so same

Jason



Re: [patch V2 24/36] powerpc/pseries/msi: Use MSI device properties

2021-12-08 Thread Jason Gunthorpe
On Mon, Dec 06, 2021 at 11:39:34PM +0100, Thomas Gleixner wrote:
> instead of fiddling with MSI descriptors.
> 
> Signed-off-by: Thomas Gleixner 
> Reviewed-by: Greg Kroah-Hartman 
> Reviewed-by: Jason Gunthorpe 
>  arch/powerpc/platforms/pseries/msi.c |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> +++ b/arch/powerpc/platforms/pseries/msi.c
> @@ -447,9 +447,9 @@ static int rtas_prepare_msi_irqs(struct
>  static int pseries_msi_ops_prepare(struct irq_domain *domain, struct device 
> *dev,
>  int nvec, msi_alloc_info_t *arg)
>  {
> + bool is_msix = msi_device_has_property(dev, MSI_PROP_PCI_MSIX);
> + int type = is_msix ? PCI_CAP_ID_MSIX : PCI_CAP_ID_MSI;
>   struct pci_dev *pdev = to_pci_dev(dev);

We have the pci_dev here, why not just do something like

   bool is_msix = pdev->msix_enabled;

?

Jason



Re: [patch V2 21/36] x86/apic/msi: Use device MSI properties

2021-12-08 Thread Jason Gunthorpe
On Mon, Dec 06, 2021 at 11:39:29PM +0100, Thomas Gleixner wrote:
> instead of fiddling with MSI descriptors.
> 
> Signed-off-by: Thomas Gleixner 
> Reviewed-by: Greg Kroah-Hartman 
> Reviewed-by: Jason Gunthorpe 
> ---
>  arch/x86/kernel/apic/msi.c |5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> --- a/arch/x86/kernel/apic/msi.c
> +++ b/arch/x86/kernel/apic/msi.c
> @@ -160,11 +160,8 @@ static struct irq_chip pci_msi_controlle
>  int pci_msi_prepare(struct irq_domain *domain, struct device *dev, int nvec,
>   msi_alloc_info_t *arg)
>  {
> - struct pci_dev *pdev = to_pci_dev(dev);
> - struct msi_desc *desc = first_pci_msi_entry(pdev);
> -
>   init_irq_alloc_info(arg, NULL);
> - if (desc->pci.msi_attrib.is_msix) {
> + if (msi_device_has_property(dev, MSI_PROP_PCI_MSIX)) {
>   arg->type = X86_IRQ_ALLOC_TYPE_PCI_MSIX;
>   } else {
>   arg->type = X86_IRQ_ALLOC_TYPE_PCI_MSI;
>

Just thought for future

It looks like the only use of this is to link to the irq_remapping
which is only using it to get back to the physical device:

case X86_IRQ_ALLOC_TYPE_PCI_MSI:
case X86_IRQ_ALLOC_TYPE_PCI_MSIX:
set_msi_sid(irte,
pci_real_dma_dev(msi_desc_to_pci_dev(info->desc)));

case X86_IRQ_ALLOC_TYPE_PCI_MSI:
case X86_IRQ_ALLOC_TYPE_PCI_MSIX:
return get_device_id(msi_desc_to_dev(info->desc));

And this is super confusing:

static inline int get_device_id(struct device *dev)
{
int devid;

if (dev_is_pci(dev))
devid = get_pci_device_id(dev);
else
devid = get_acpihid_device_id(dev, NULL);

return devid;
}

How does an ACPI device have a *PCI* MSI or MSI-X ??

IMHO this makes more sense written as:

  struct device *origin_device = msi_desc_get_origin_dev(info->desc);

  if (dev_is_pci(origin_device))
          devid = get_pci_device_id(origin_device);
  else if (dev_is_acpi(origin_device))
          devid = get_acpihid_device_id(dev, NULL);

And similar in all places touching X86_IRQ_ALLOC_TYPE_PCI_MSI/X

Like this oddball thing in AMD too:

} else if (info->type == X86_IRQ_ALLOC_TYPE_PCI_MSI ||
   info->type == X86_IRQ_ALLOC_TYPE_PCI_MSIX) {
bool align = (info->type == X86_IRQ_ALLOC_TYPE_PCI_MSI);

index = alloc_irq_index(devid, nr_irqs, align,
msi_desc_to_pci_dev(info->desc));
} else {
index = alloc_irq_index(devid, nr_irqs, false, NULL);

This should just use a dev and inside alloc_irq_table do the dev_is_pci()
thing to guard the pci_for_each_dma_alias()

Then just call it X86_IRQ_ALLOC_TYPE_DEVICE (ie allocated for a struct device)

Jason



Re: [patch V2 20/23] PCI/MSI: Move msi_lock to struct pci_dev

2021-12-08 Thread Jason Gunthorpe
On Mon, Dec 06, 2021 at 11:27:56PM +0100, Thomas Gleixner wrote:
> It's only required for PCI/MSI. So no point in having it in every struct
> device.
> 
> Signed-off-by: Thomas Gleixner 
> ---
> V2: New patch
> ---
>  drivers/base/core.c|1 -
>  drivers/pci/msi/msi.c  |2 +-
>  drivers/pci/probe.c|4 +++-
>  include/linux/device.h |2 --
>  include/linux/pci.h|1 +
>  5 files changed, 5 insertions(+), 5 deletions(-)

Reviewed-by: Jason Gunthorpe 
 
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -2875,7 +2875,6 @@ void device_initialize(struct device *de
>   device_pm_init(dev);
>   set_dev_node(dev, NUMA_NO_NODE);
>  #ifdef CONFIG_GENERIC_MSI_IRQ
> - raw_spin_lock_init(&dev->msi_lock);
>   INIT_LIST_HEAD(&dev->msi_list);
>  #endif
>   INIT_LIST_HEAD(>links.consumers);
> --- a/drivers/pci/msi/msi.c
> +++ b/drivers/pci/msi/msi.c
> @@ -18,7 +18,7 @@ int pci_msi_ignore_mask;
>  
>  static noinline void pci_msi_update_mask(struct msi_desc *desc, u32 clear, 
> u32 set)
>  {
> - raw_spinlock_t *lock = &desc->dev->msi_lock;
> + raw_spinlock_t *lock = &to_pci_dev(desc->dev)->msi_lock;
>   unsigned long flags;
>  
>   if (!desc->pci.msi_attrib.can_mask)

It looks like most of the time this is called by an irq_chip, except
for a few special cases like pci_msi_shutdown()

Is this something that should ideally go away someday and use some
lock in the irq_chip - not unlike what we've thought is needed for
IMS?

Jason



Re: [patch 00/22] genirq/msi, PCI/MSI: Spring cleaning - Part 1

2021-11-27 Thread Jason Gunthorpe
On Sat, Nov 27, 2021 at 02:18:34AM +0100, Thomas Gleixner wrote:
> The [PCI] MSI code has gained quite some warts over time. A recent
> discussion unearthed a shortcoming: the lack of support for expanding
> PCI/MSI-X vectors after initialization of MSI-X.
> 
> PCI/MSI-X has no requirement to setup all vectors when MSI-X is enabled in
> the device. The non-used vectors have just to be masked in the vector
> table. For PCI/MSI this is not possible because the number of vectors
> cannot be changed after initialization.
> 
> The PCI/MSI code, but also the core MSI irq domain code are built around
> the assumption that all required vectors are installed at initialization
> time and freed when the device is shut down by the driver.
> 
> Supporting dynamic expansion at least for MSI-X is important for VFIO so
> that the host side interrupts for passthrough devices can be installed on
> demand.
> 
> This is the first part of a large (total 101 patches) series which
> refactors the [PCI]MSI infrastructure to make runtime expansion of MSI-X
> vectors possible. The last part (10 patches) provide this functionality.
> 
> The first part is mostly a cleanup which consolidates code, moves the PCI
> MSI code into a separate directory and splits it up into several parts.
> 
> No functional change intended except for patch 2/N which changes the
> behaviour of pci_get_vector()/affinity() to get rid of the assumption that
> the provided index is the "index" into the descriptor list instead of using
> it as the actual MSI[X] index as seen by the hardware. This would break
> users of sparse allocated MSI-X entries, but non of them use these
> functions.

I don't know all the irqdomain stuff all that well anymore, but I read
through all the patches and only noticed a small spello

[patch 02/22] PCI/MSI: Fix pci_irq_vector()/pci_irq_get_attinity()
  ff

It all seems good, I especially like the splitting of msi.c and
removal of ops..

Reviewed-by: Jason Gunthorpe 

Thanks,
Jason



Re: [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-23 Thread Jason Gunthorpe
On Fri, Nov 20, 2020 at 12:21:39PM -0600, Gustavo A. R. Silva wrote:

>   IB/hfi1: Fix fall-through warnings for Clang
>   IB/mlx4: Fix fall-through warnings for Clang
>   IB/qedr: Fix fall-through warnings for Clang
>   RDMA/mlx5: Fix fall-through warnings for Clang

I picked these four to the rdma tree, thanks

Jason



REGRESSION: Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

2020-11-12 Thread Jason Gunthorpe
On Wed, Aug 26, 2020 at 01:16:28PM +0200, Thomas Gleixner wrote:
> This is the second version of providing a base to support device MSI (non
> PCI based) and on top of that support for IMS (Interrupt Message Storm)
> based devices in a halfways architecture independent way.

Hi Thomas,

Our test team has been struggling with a regression on bare metal
SRIOV VFs since -rc1 that they were able to bisect to this series

This commit tests good:

5712c3ed549e ("Merge tag 'armsoc-fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc")

This commit tests bad:

981aa1d366bf ("PCI: MSI: Fix Kconfig dependencies for PCI_MSI_ARCH_FALLBACKS")

They were unable to bisect further into the series because some of the
interior commits don't boot :(

When we try to load the mlx5 driver on a bare metal VF it gets this:

[Thu Oct 22 08:54:51 2020] DMAR: DRHD: handling fault status reg 2
[Thu Oct 22 08:54:51 2020] DMAR: [INTR-REMAP] Request device [42:00.2] fault 
index 1600 [fault reason 37] Blocked a compatibility format interrupt request
[Thu Oct 22 08:55:04 2020] mlx5_core 0000:42:00.1 eth4: Link down
[Thu Oct 22 08:55:11 2020] mlx5_core 0000:42:00.1 eth4: Link up
[Thu Oct 22 08:55:54 2020] mlx5_core 0000:42:00.2: mlx5_cmd_eq_recover:264:(pid 3390): Recovered 1 EQEs on cmd_eq
[Thu Oct 22 08:55:54 2020] mlx5_core 0000:42:00.2: wait_func_handle_exec_timeout:1051:(pid 3390): cmd0: CREATE_EQ(0x301) recovered after timeout
[Thu Oct 22 08:55:54 2020] DMAR: DRHD: handling fault status reg 102
[Thu Oct 22 08:55:54 2020] DMAR: [INTR-REMAP] Request device [42:00.2] fault 
index 1600 [fault reason 37] Blocked a compatibility format interrupt request

If you have any idea Ziyad and Itay can run any debugging you like.

I suppose it is because this series is handing out compatibility
addr/data pairs while the IOMMU is setup to only accept remapped ones
from SRIOV VFs?

Thanks,
Jason



Re: [RFC] treewide: cleanup unreachable breaks

2020-10-19 Thread Jason Gunthorpe
On Mon, Oct 19, 2020 at 12:42:15PM -0700, Nick Desaulniers wrote:
> On Sat, Oct 17, 2020 at 10:43 PM Greg KH  wrote:
> >
> > On Sat, Oct 17, 2020 at 09:09:28AM -0700, t...@redhat.com wrote:
> > > From: Tom Rix 
> > >
> > > This is a upcoming change to clean up a new warning treewide.
> > > I am wondering if the change could be one mega patch (see below) or
> > > normal patch per file about 100 patches or somewhere half way by 
> > > collecting
> > > early acks.
> >
> > Please break it up into one-patch-per-subsystem, like normal, and get it
> > merged that way.
> >
> > Sending us a patch, without even a diffstat to review, isn't going to
> > get you very far...
> 
> Tom,
> If you're able to automate this cleanup, I suggest checking in a
> script that can be run on a directory.  Then for each subsystem you
> can say in your commit "I ran scripts/fix_whatever.py on this subdir."
>  Then others can help you drive the tree wide cleanup.  Then we can
> enable -Wunreachable-code-break either by default, or W=2 right now
> might be a good idea.

I remember using clang-modernize in the past to fix issues very
similar to this, if clang machinery can generate the warning, can't
something like clang-tidy directly generate the patch?

You can send me a patch for drivers/infiniband/* as well

Thanks,
Jason



Re: [patch V2 24/46] PCI: vmd: Mark VMD irqdomain with DOMAIN_BUS_VMD_MSI

2020-09-30 Thread Jason Gunthorpe
On Wed, Sep 30, 2020 at 01:08:27PM +, Derrick, Jonathan wrote:
> +Megha
> 
> On Wed, 2020-09-30 at 09:57 -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 30, 2020 at 12:45:30PM +, Derrick, Jonathan wrote:
> > > Hi Jason
> > > 
> > > On Mon, 2020-08-31 at 11:39 -0300, Jason Gunthorpe wrote:
> > > > On Wed, Aug 26, 2020 at 01:16:52PM +0200, Thomas Gleixner wrote:
> > > > > From: Thomas Gleixner 
> > > > > 
> > > > > Devices on the VMD bus use their own MSI irq domain, but it is not
> > > > > distinguishable from regular PCI/MSI irq domains. This is required
> > > > > to exclude VMD devices from getting the irq domain pointer set by
> > > > > interrupt remapping.
> > > > > 
> > > > > Override the default bus token.
> > > > > 
> > > > > Signed-off-by: Thomas Gleixner 
> > > > > Acked-by: Bjorn Helgaas 
> > > > >  drivers/pci/controller/vmd.c |6 ++
> > > > >  1 file changed, 6 insertions(+)
> > > > > 
> > > > > +++ b/drivers/pci/controller/vmd.c
> > > > > @@ -579,6 +579,12 @@ static int vmd_enable_domain(struct vmd_
> > > > >   return -ENODEV;
> > > > >   }
> > > > >  
> > > > > + /*
> > > > > +  * Override the irq domain bus token so the domain can be 
> > > > > distinguished
> > > > > +  * from a regular PCI/MSI domain.
> > > > > +  */
> > > > > + irq_domain_update_bus_token(vmd->irq_domain, 
> > > > > DOMAIN_BUS_VMD_MSI);
> > > > > +
> > > > 
> > > > Having the non-transparent-bridge hold a MSI table and
> > > > multiplex/de-multiplex IRQs looks like another good use case for
> > > > something close to pci_subdevice_msi_create_irq_domain()?
> > > > 
> > > > If each device could have its own internal MSI-X table programmed
> > > > properly things would work a lot better. Disable capture/remap of the
> > > > MSI range in the NTB.
> > > We can disable the capture and remap in newer devices so we don't even
> > > need the irq domain.
> > 
> > You'd still need an irq domain, it just comes from
> > pci_subdevice_msi_create_irq_domain() instead of internal to this
> > driver.
> I have this set which disables remapping and bypasses the creation of
> the IRQ domain:
> https://patchwork.ozlabs.org/project/linux-pci/list/?series=192936

After Thomas's series VMD needs to supply a hierarchical IRQ domain to
get the correct PCI originator. This instead of the x86 patch in that
series. I think that domain should be created by
pci_subdevice_msi_create_irq_domain(), at least I'd start there.

Jason



Re: [patch V2 24/46] PCI: vmd: Mark VMD irqdomain with DOMAIN_BUS_VMD_MSI

2020-09-30 Thread Jason Gunthorpe
On Wed, Sep 30, 2020 at 12:45:30PM +, Derrick, Jonathan wrote:
> Hi Jason
> 
> On Mon, 2020-08-31 at 11:39 -0300, Jason Gunthorpe wrote:
> > On Wed, Aug 26, 2020 at 01:16:52PM +0200, Thomas Gleixner wrote:
> > > From: Thomas Gleixner 
> > > 
> > > Devices on the VMD bus use their own MSI irq domain, but it is not
> > > distinguishable from regular PCI/MSI irq domains. This is required
> > > to exclude VMD devices from getting the irq domain pointer set by
> > > interrupt remapping.
> > > 
> > > Override the default bus token.
> > > 
> > > Signed-off-by: Thomas Gleixner 
> > > Acked-by: Bjorn Helgaas 
> > >  drivers/pci/controller/vmd.c |6 ++
> > >  1 file changed, 6 insertions(+)
> > > 
> > > +++ b/drivers/pci/controller/vmd.c
> > > @@ -579,6 +579,12 @@ static int vmd_enable_domain(struct vmd_
> > >   return -ENODEV;
> > >   }
> > >  
> > > + /*
> > > +  * Override the irq domain bus token so the domain can be distinguished
> > > +  * from a regular PCI/MSI domain.
> > > +  */
> > > + irq_domain_update_bus_token(vmd->irq_domain, DOMAIN_BUS_VMD_MSI);
> > > +
> > 
> > Having the non-transparent-bridge hold a MSI table and
> > multiplex/de-multiplex IRQs looks like another good use case for
> > something close to pci_subdevice_msi_create_irq_domain()?
> > 
> > If each device could have its own internal MSI-X table programmed
> > > properly things would work a lot better. Disable capture/remap of the
> > MSI range in the NTB.

> We can disable the capture and remap in newer devices so we don't even
> need the irq domain.

You'd still need an irq domain, it just comes from
pci_subdevice_msi_create_irq_domain() instead of internal to this
driver.

> I would however like to determine if the MSI data bits could be used
> for individual devices to have the IRQ domain subsystem demultiplex the
> virq from that and eliminate the IRQ list iteration.

Yes, exactly. This new pci_subdevice_msi_create_irq_domain() creates
*proper* fully functional interrupts, no need for any IRQ handler in
this driver.

> A concern I have about that scheme is virtualization as I think those
> bits are used to route to the virtual vector.

It should be fine with this patch series, consult with Megha how
virtualization is working with IDXD/etc

Jason



Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

2020-09-30 Thread Jason Gunthorpe
On Wed, Sep 30, 2020 at 08:41:48AM +0200, Thomas Gleixner wrote:
> On Tue, Sep 29 2020 at 16:03, Megha Dey wrote:
> > On 8/26/2020 4:16 AM, Thomas Gleixner wrote:
> >> #9 is obviously just for the folks interested in IMS
> >>
> >
> > I see that the tip tree (as of 9/29) has most of these patches but 
> > notice that the DEV_MSI related patches
> >
> > haven't made it. I have tested the tip tree(x86/irq branch) with your
> > DEV_MSI infra patches and our IMS patches with the IDXD driver and was
> 
> Your IMS patches? Why do you need something special again?
> 
> > wondering if we should push out those patches as part of our patchset?
> 
> As I don't have any hardware to test that, I was waiting for you and
> Jason to confirm that this actually works for the two different IMS
> implementations.

How urgently do you need this? The code looked good from what I
understood. It will be a while before we have all the parts to send an
actual patch though.

We might be able to put together a mockup just to prove it

Jason



Re: [patch V2 46/46] irqchip: Add IMS (Interrupt Message Storm) driver - NOT FOR MERGING

2020-08-31 Thread Jason Gunthorpe
On Wed, Aug 26, 2020 at 01:17:14PM +0200, Thomas Gleixner wrote:
> + * ims_queue_info - Information to create an IMS queue domain
> + * @queue_lock:  Callback which informs the device driver that
> + *   an interrupt management operation starts.
> + * @queue_sync_unlock:   Callback which informs the device driver that an
> + *   interrupt management operation ends.
> +
> + * @queue_get_shadow:   Callback to retrieve the shadow storage for a MSI
> + *   entry associated to a queue. The queue is
> + *   identified by the device struct which is used for
> + *   allocating interrupts and the msi entry index.
> + *
> + * @queue_lock() and @queue_sync_unlock() are only called for management
> + * operations on a particular interrupt: request, free, enable, disable,
> + * affinity setting.  These functions are never called from atomic context,
> + * like low level interrupt handling code. The purpose of these functions
> + * is to signal the device driver the start and end of an operation which
> + * affects the IMS queue shadow state. @queue_lock() allows the driver to
> + * do preparatory work, e.g. locking. Note, that @queue_lock() has to
> + * preserve the sleepable state on return. That means the driver cannot
> + * disable preemption and (soft)interrupts in @queue_lock and then undo
> + * that operation in @queue_sync_unlock() which restricts the lock types
> + * for eventual serialization of these operations to sleepable locks. Of
> + * course the driver can disable preemption and (soft)interrupts
> + * temporarily for internal work.
> + *
> + * On @queue_sync_unlock() the driver has to check whether the shadow state
> + * changed and issue a command to update the hardware state and wait for
> + * the command to complete. If the command fails or times out then the
> + * driver has to take care of the resulting mess as this is called from
> + * functions which have no return value and none of the callers can deal
> + * with the failure. The lock which is used by the driver to protect a
> + * operation sequence must obviously not be released before the command
> + * completes or fails. Otherwise new operations on the same interrupt line
> + * could take place and change the shadow state before the driver was able
> + * to compose the command.

I haven't looked through everything in detail, but this does look like
it is good for the mlx5 devices. Looked like it was only one small
update to the set_affinity, so not very disruptive?

Thanks,
Jason



Re: [patch V2 24/46] PCI: vmd: Mark VMD irqdomain with DOMAIN_BUS_VMD_MSI

2020-08-31 Thread Jason Gunthorpe
On Wed, Aug 26, 2020 at 01:16:52PM +0200, Thomas Gleixner wrote:
> From: Thomas Gleixner 
> 
> Devices on the VMD bus use their own MSI irq domain, but it is not
> distinguishable from regular PCI/MSI irq domains. This is required
> to exclude VMD devices from getting the irq domain pointer set by
> interrupt remapping.
> 
> Override the default bus token.
> 
> Signed-off-by: Thomas Gleixner 
> Acked-by: Bjorn Helgaas 
>  drivers/pci/controller/vmd.c |6 ++
>  1 file changed, 6 insertions(+)
> 
> +++ b/drivers/pci/controller/vmd.c
> @@ -579,6 +579,12 @@ static int vmd_enable_domain(struct vmd_
>   return -ENODEV;
>   }
>  
> + /*
> +  * Override the irq domain bus token so the domain can be distinguished
> +  * from a regular PCI/MSI domain.
> +  */
> + irq_domain_update_bus_token(vmd->irq_domain, DOMAIN_BUS_VMD_MSI);
> +

Having the non-transparent-bridge hold a MSI table and
multiplex/de-multiplex IRQs looks like another good use case for
something close to pci_subdevice_msi_create_irq_domain()?

If each device could have its own internal MSI-X table programmed
properly things would work a lot better. Disable capture/remap of the
MSI range in the NTB.

Then each device could have a proper non-multiplexed interrupt
delivered to the CPU.. Affinity would work properly, no need to share
IRQs (eg vmd_irq() goes away)/etc.

Something for the VMD maintainers to think about at least.

As I hear more about NTB these days a full MSI scheme for NTB seems
interesting to have in the PCI-E core code..

Jason



Re: [patch V2 34/46] PCI/MSI: Make arch_.*_msi_irq[s] fallbacks selectable

2020-08-28 Thread Jason Gunthorpe
On Fri, Aug 28, 2020 at 01:47:59PM +0100, Marc Zyngier wrote:

> > So the arch_setup_msi_irq/etc is not really an arch hook, but some
> > infrastructure to support those 4 PCI root port drivers.
> 
> I happen to have a *really old* patch addressing Tegra [1], which
> I was never able to test (no HW). Rebasing it shouldn't be too hard,
> and maybe you can find someone internally willing to give it a spin?

Sure, that helps a bunch, I will ask internally if someone in that BU
can take a look.

Thanks,
Jason



Re: [patch V2 34/46] PCI/MSI: Make arch_.*_msi_irq[s] fallbacks selectable

2020-08-28 Thread Jason Gunthorpe
On Fri, Aug 28, 2020 at 12:21:42PM +0100, Lorenzo Pieralisi wrote:
> On Thu, Aug 27, 2020 at 01:20:40PM -0500, Bjorn Helgaas wrote:
> 
> [...]
> 
> > And I can't figure out what's special about tegra, rcar, and xilinx
> > that makes them need it as well.  Is there something I could grep for
> > to identify them?  Is there a way to convert them so they don't need
> > it?
> 
> I think DT binding and related firmware support are needed to setup the
> MSI IRQ domains correctly, there is nothing special about tegra, rcar
> and xilinx AFAIK (well, all native host controllers MSI handling is
> *special* just to be polite but let's gloss over this for the time
> being).
> 
> struct msi_controller, to answer the first question.
> 
> I have doubts about pci_mvebu too, they do allocate an msi_controller
> but without methods so it looks pretty much useless.

Oh, I did once know things about mvebu.. 

I suspect the msi controller pointer assignment is dead code at this
point. The only implementation of MSI with that PCI root port is
drivers/irqchip/irq-armada-370-xp.c which looks like it uses
irq_domain.

Actually looks like things are very close to eliminating
msi_controller.

This is dead code, can't find a setter for hw_pci->msi_ctrl:

arch/arm/include/asm/mach/pci.h:struct msi_controller *msi_ctrl;
arch/arm/kernel/bios32.c:   bridge->msi = 
hw->msi_ctrl;

This is probably just copying NULL from one place to another:

drivers/pci/controller/pci-mvebu.c: struct msi_controller *msi;

These need conversion to irq_domain (right?):

drivers/pci/controller/pci-hyperv.c:struct msi_controller msi_chip;
drivers/pci/controller/pci-tegra.c: struct msi_controller chip;
drivers/pci/controller/pcie-rcar-host.c:struct msi_controller chip;
drivers/pci/controller/pcie-xilinx.c:static struct msi_controller 
xilinx_pcie_msi_chip = {

Then the stuff in drivers/pci/msi.c can go away.

So the arch_setup_msi_irq/etc is not really an arch hook, but some
infrastructure to support those 4 PCI root port drivers.

Jason



Re: [patch RFC 38/38] irqchip: Add IMS array driver - NOT FOR MERGING

2020-08-22 Thread Jason Gunthorpe
On Sat, Aug 22, 2020 at 03:34:45AM +0200, Thomas Gleixner wrote:
> >> One question is whether the device can see partial updates to that
> >> memory due to the async 'swap' of context from the device CPU.
> >
> > It is worse than just partial updates.. The device operation is much
> > more like you'd imagine a CPU cache. There could be copies of the RAM
> > in the device for long periods of time, dirty data in the device that
> > will flush back to CPU RAM overwriting CPU changes, etc.
> 
> TBH, that's insane. You clearly want to think about this some
> more. If

I think this general design is around 15 years old, across a healthy
number of silicon generations, and rather a lager number of shipped
devices. People have thought about it :)

> you swap out device state and device control state then you definitly
> want to have regions which are read only from the device POV and never
> written back. 

It is not as useful as you'd think - the issue with atomicity of
update still largely prevents doing much useful from the CPU, and to
make any CPU side changes visible a device command would still be
needed to synchronize the internal state to that modified memory.

So, CPU centric updates would cover a very limited number of
operations, and a device command is required anyhow. Little is
actually gained.

> The MSI msg store clearly belongs into that category.
> But that's not restricted to the MSI msg store, there is certainly other
> stuff which never wants to be written back by the device.

To get a design where you'd be able to run everything from a CPU
atomic context that can't trigger a WQ..

New silicon would have to implement some MSI-only 'cache' that can
invalidate entries based on a simple MemWr TLP.

Then the affinity update would write to the host memory, then send a
MemWr to the device to trigger invalidate.

As a silicon design it might work, but it means existing devices can't
be used with this dev_msi. It is also the sort of thing that would
need a standard document to have any hope of multiple vendors fitting
into it. Eg at PCI-SIG or something.
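As a CPU-side illustration of that hypothetical scheme (nothing here corresponds to existing hardware or kernel code; the entry layout and doorbell are made up):

struct host_msi_entry {
	u32 addr_lo;
	u32 addr_hi;
	u32 data;
	u32 ctrl;
};

static void hypothetical_update_msi(struct host_msi_entry *entry,
				    void __iomem *invalidate_doorbell,
				    const struct msi_msg *msg)
{
	entry->addr_lo = msg->address_lo;
	entry->addr_hi = msg->address_hi;
	entry->data    = msg->data;
	wmb();					/* entry visible before the MemWr doorbell */
	writel(1, invalidate_doorbell);		/* device drops its cached copy */
}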

> If you don't do that then you simply can't write to that space from the
> CPU and you have to transport this kind information always via command
> queues.

Yes, exactly. This is part of the architectural design of the device,
has been for a long time. Has positives and negatives.

> > I suppose the core code could provide this as a service? Sort of a
> > variant of the other lazy things above?
> 
> Kinda. That needs a lot of thought for the affinity setting stuff
> because it can be called from contexts which do not allow that. It's
> solvable though, but I clearly need to stare at the corner cases for a
> while.

If possible, this would be ideal, as we could use the dev_msi on a big
installed base of existing HW.

I suspect other HW can probably fit into this too as the basic
ingredients should be fairly widespread.

Even a restricted version for situations where affinity does not need
a device update would possibly be interesting (eg x86 IOMMU remap, ARM
GIC, etc)

> OTOH, in normal operation for MSI interrupts (edge type) masking is not
> used at all and just restricted to the startup teardown.

Yeah, at least this device doesn't need masking at runtime, just
startup/teardown and affinity update.

Thanks,
Jason



Re: [patch RFC 38/38] irqchip: Add IMS array driver - NOT FOR MERGING

2020-08-21 Thread Jason Gunthorpe
On Sat, Aug 22, 2020 at 01:47:12AM +0200, Thomas Gleixner wrote:
> On Fri, Aug 21 2020 at 17:17, Jason Gunthorpe wrote:
> > On Fri, Aug 21, 2020 at 09:47:43PM +0200, Thomas Gleixner wrote:
> >> So if I understand correctly then the queue memory where the MSI
> >> descriptor sits is in RAM.
> >
> > Yes, IMHO that is the whole point of this 'IMS' stuff. If devices
> > could have enough on-die memory then they could just use really big
> > MSI-X tables. Currently due to on-die memory constraints mlx5 is
> > limited to a few hundred MSI-X vectors.
> 
> Right, that's the limit of a particular device, but nothing prevents you
> to have a larger table on a new device.

Well, physics are a problem.. The SRAM to store the MSI vectors costs
die space and making the chip die larger is not an option. So the
question is what do you throw out of the chip to get a 10-20x increase
in MSI SRAM?

This is why using host memory is so appealing. It is
economically/functionally better.

I'm going to guess other HW is in the same situation, virtualization
is really pushing up the number of required IRQs.

> >> How is that supposed to work if interrupt remapping is disabled?
> >
> > The best we can do is issue a command to the device and spin/sleep
> > until completion. The device will serialize everything internally.
> >
> > If the device has died the driver has code to detect and trigger a
> > PCI function reset which will definitely stop the interrupt.
> 
> If that interrupt is gone into storm mode for some reason then this will
> render your machine unusable before you can do that.

Yes, but in general the HW design is to have one-shot interrupts, it
would have to be well off the rails to storm. The kind of off the
rails where it could also be doing crazy stuff on PCI-E that would be
very harmful.

> > So, the implementation of these functions would be to push any change
> > onto a command queue, trigger the device to DMA the command, spin/sleep
> > until the device returns a response and then continue on. If the
> > device doesn't return a response in a time window then trigger a WQ to
> > do a full device reset.
> 
> I really don't want to do that with the irq descriptor lock held or in
> case of affinity from the interrupt handler as we have to do with PCI
> MSI/MSI-X due to the horrors of the X86 interrupt delivery trainwreck.
> Also you cannot call into command queue code from interrupt disabled and
> interrupt descriptor lock held sections. You can try, but lockdep will
> yell at you immediately.

Yes, I wouldn't want to do this from an IRQ.

> One question is whether the device can see partial updates to that
> memory due to the async 'swap' of context from the device CPU.

It is worse than just partial updates.. The device operation is much
more like you'd imagine a CPU cache. There could be copies of the RAM
in the device for long periods of time, dirty data in the device that
will flush back to CPU RAM overwriting CPU changes, etc.

Without involving the device there is just no way to create data
consistency, and no way to change the data from the CPU. 

This is the down side of having device data in the RAM. It cannot be
so simple as 'just fetch it every time before you use it' as
performance would be horrible.

> irq chips have already a mechanism in place to deal with stuff which
> cannot be handled from within the irq descriptor spinlock held and
> interrupt disabled section.
> 
> The mechanism was invented to deal with interrupt chips which are
> connected to i2c, spi, etc.. The access to an interrupt chip control
> register has to queue stuff on the bus and wait for completion.
> Obviously not what you can do from interrupt disabled, raw spinlock held
> context either.

Ah, interesting, sounds like the right parts! I didn't know about this..
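For reference, the mechanism being pointed at is the irq_bus_lock / irq_bus_sync_unlock pair in struct irq_chip; a rough sketch of how an IMS-over-command-queue chip might use it, where ims_slot, ims_cmd_sync_entry() and ims_schedule_device_reset() are assumptions, not a real driver:

struct ims_slot {
	struct mutex lock;
	bool dirty;
	/* shadow msi_msg, device handle, ... */
};

static int ims_cmd_sync_entry(struct ims_slot *slot);		/* hypothetical: push cmd, wait */
static void ims_schedule_device_reset(struct ims_slot *slot);	/* hypothetical failure path */

static void ims_irq_bus_lock(struct irq_data *data)
{
	struct ims_slot *slot = irq_data_get_irq_chip_data(data);

	/* Sleepable context guaranteed: serialize updates to the shadow entry. */
	mutex_lock(&slot->lock);
}

static void ims_irq_bus_sync_unlock(struct irq_data *data)
{
	struct ims_slot *slot = irq_data_get_irq_chip_data(data);

	/* Flush any accumulated change with one device command and wait. */
	if (slot->dirty && ims_cmd_sync_entry(slot))
		ims_schedule_device_reset(slot);
	slot->dirty = false;
	mutex_unlock(&slot->lock);
}

static struct irq_chip ims_cmdq_chip_sketch = {
	.name			= "IMS-cmdq-sketch",
	.irq_bus_lock		= ims_irq_bus_lock,
	.irq_bus_sync_unlock	= ims_irq_bus_sync_unlock,
	/* .irq_mask/.irq_unmask/.irq_set_affinity would only touch the shadow. */
};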

> Now coming back to affinity setting. I'd love to avoid adding the bus
> lock magic to those interfaces because until now they can be called and
> are called from atomic contexts. And obviously none of the devices which
> use the buslock magic support affinity setting because they all deliver
> a single interrupt to a demultiplex interrupt and that one is usually
> sitting at the CPU level where interrupt steering works.
> 
> If we really can get away with atomically updating the message as
> outlined above and just let it happen at some point in the future then
> most problems are solved, except for the nastyness of CPU hotplug.

Since we can't avoid a device command, I'm think more along the lines
of having the affinity update trigger an async WQ to issue the command
from a thread context. Since it doesn't need to be synchronous it can
make it out 'eventually'.

I suppose the core code could provide this as a service? Sort of a
variant of the other lazy things above?

Jason

Re: [patch RFC 38/38] irqchip: Add IMS array driver - NOT FOR MERGING

2020-08-21 Thread Jason Gunthorpe
On Fri, Aug 21, 2020 at 09:47:43PM +0200, Thomas Gleixner wrote:
> On Fri, Aug 21 2020 at 09:45, Jason Gunthorpe wrote:
> > On Fri, Aug 21, 2020 at 02:25:02AM +0200, Thomas Gleixner wrote:
> >> +static void ims_mask_irq(struct irq_data *data)
> >> +{
> >> +  struct msi_desc *desc = irq_data_get_msi_desc(data);
> >> +  struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
> >> +  u32 __iomem *ctrl = &slot->ctrl;
> >> +
> >> +  iowrite32(ioread32(ctrl) & ~IMS_VECTOR_CTRL_UNMASK, ctrl);
> >
> > Just to be clear, this is exactly the sort of operation we can't do
> > with non-MSI interrupts. For a real PCI device to execute this it
> > would have to keep the data on die.
> 
> We means NVIDIA and your new device, right?

We'd like to use this in the current Mellanox NIC HW, eg the mlx5
driver. (NVIDIA acquired Mellanox recently)

> So if I understand correctly then the queue memory where the MSI
> descriptor sits is in RAM.

Yes, IMHO that is the whole point of this 'IMS' stuff. If devices
could have enough on-die memory then they could just use really big
MSI-X tables. Currently due to on-die memory constraints mlx5 is
limited to a few hundred MSI-X vectors.

Since MSI-X tables are exposed via MMIO they can't be 'swapped' to
RAM.

Moving away from MSI-X's MMIO access model allows them to be swapped
to RAM. The cost is that accessing them for update is a
command/response operation not a MMIO operation.

The HW is already swapping the queues causing the interrupts to RAM,
so adding a bit of additional data to store the MSI addr/data is
reasonable.
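
Concretely (purely illustrative field names, not the mlx5 or IDXD
layout), the queue context the device already swaps to and from RAM just
grows a slot for the message:

	struct foo_queue_context {
		__le64	ring_base;		/* existing queue state ... */
		__le32	ring_size;
		__le32	consumer_index;

		/* IMS slot: what MSI-X would have kept in MMIO table memory */
		__le64	msi_addr;
		__le32	msi_data;
		__le32	msi_ctrl;		/* mask bit etc., device defined */
	};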

To give some sense, a 'working set' for the NIC device in some cases
can be hundreds of megabytes of data. System RAM is used to store
this, and precious on-die memory holds some dynamic active set, much
like a processor cache.

> How is that supposed to work if interrupt remapping is disabled?

The best we can do is issue a command to the device and spin/sleep
until completion. The device will serialize everything internally.

If the device has died the driver has code to detect and trigger a
PCI function reset which will definitely stop the interrupt.

So, the implementation of these functions would be to push any change
onto a command queue, trigger the device to DMA the command, spin/sleep
until the device returns a response and then continue on. If the
device doesn't return a response in a time window then trigger a WQ to
do a full device reset.

The spin/sleep is only needed if the update has to be synchronous, so
things like rebalancing could just push the rebalancing work and
immediately return.
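
In pseudo-code (every foo_* helper here is invented), the synchronous
variant would look something like:

	static int foo_sync_irq_cmd(struct foo_dev *dev, struct foo_cmd *cmd)
	{
		unsigned long timeout =
			jiffies + msecs_to_jiffies(FOO_CMD_TIMEOUT_MS);

		foo_post_cmd(dev, cmd);		/* device DMAs the command */

		while (!foo_cmd_done(dev, cmd)) {
			if (time_after(jiffies, timeout)) {
				/* no answer: schedule a full device reset */
				queue_work(dev->wq, &dev->reset_work);
				return -ETIMEDOUT;
			}
			usleep_range(10, 50);	/* or cpu_relax() if atomic */
		}
		return foo_cmd_status(cmd);
	}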

> If interrupt remapping is enabled then both are trivial because then the
> irq chip can delegate everything to the parent chip, i.e. the remapping
> unit.

I did like this notion that IRQ remapping could avoid the overhead of
spin/sleep. Most of the use cases we have for this will require the
IOMMU anyhow.
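
i.e. with remapping enabled the device-level chip could mostly be the
stock parent-forwarding helpers (sketch only, the chip name is made up):

	static struct irq_chip foo_ims_remapped_chip = {
		.name			= "foo-ims",
		.irq_mask		= irq_chip_mask_parent,
		.irq_unmask		= irq_chip_unmask_parent,
		.irq_ack		= irq_chip_ack_parent,
		.irq_set_affinity	= irq_chip_set_affinity_parent,
	};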

> > I saw the idxd driver was doing something like this, I assume it
> > avoids trouble because it is a fake PCI device integrated with the
> > CPU, not on a real PCI bus?
> 
> That's how it is implemented as far as I understood the patches. It's
> device memory therefore iowrite32().

I don't know anything about idxd.. Given the scale of interrupt need I
assumed the idxd HW had some hidden swapping to RAM. 

Since it is on-die with the CPU there are a bunch of ways I could
imagine Intel could make MMIO triggered swapping work that are not
available to a true PCI-E device.

Jason



Re: [patch RFC 38/38] irqchip: Add IMS array driver - NOT FOR MERGING

2020-08-21 Thread Jason Gunthorpe
On Fri, Aug 21, 2020 at 02:25:02AM +0200, Thomas Gleixner wrote:
> +static void ims_mask_irq(struct irq_data *data)
> +{
> + struct msi_desc *desc = irq_data_get_msi_desc(data);
> + struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
> + u32 __iomem *ctrl = &slot->ctrl;
> +
> + iowrite32(ioread32(ctrl) & ~IMS_VECTOR_CTRL_UNMASK, ctrl);

Just to be clear, this is exactly the sort of operation we can't do
with non-MSI interrupts. For a real PCI device to execute this it
would have to keep the data on die.

I saw the idxd driver was doing something like this, I assume it
avoids trouble because it is a fake PCI device integrated with the
CPU, not on a real PCI bus?

It is really nice to see irq_domain used properly in x86!

Thanks,
Jason



Re: [Xen-devel] [PATCH] xen/gntdev: Do not use mm notifiers with autotranslating guests

2020-01-28 Thread Jason Gunthorpe
On Tue, Jan 28, 2020 at 11:24:19AM -0500, Boris Ostrovsky wrote:
> Commit d3eeb1d77c5d ("xen/gntdev: use mmu_interval_notifier_insert")
> missed a test for use_ptemod when calling mmu_interval_read_begin(). Fix
> that.
> 
> Fixes: d3eeb1d77c5d ("xen/gntdev: use mmu_interval_notifier_insert")
> CC: sta...@vger.kernel.org # 5.5
> Reported-by: Ilpo Järvinen 
> Tested-by: Ilpo Järvinen 
> Signed-off-by: Boris Ostrovsky 
> ---
>  drivers/xen/gntdev.c | 24 
>  1 file changed, 12 insertions(+), 12 deletions(-)

Yes, this looks like it for the ptemod miss, thanks

Reviewed-by: Jason Gunthorpe 

Jason

Re: [Xen-devel] [PATCH v3 02/14] mm/mmu_notifier: add an interval tree notifier

2019-11-23 Thread Jason Gunthorpe
On Fri, Nov 22, 2019 at 04:54:08PM -0800, Ralph Campbell wrote:

> Actually, I think you can remove the "need_wake" variable since it is
> unconditionally set to "true".

Oh, yes, thank you. An earlier revision had a different control flow
 
> Also, the comment in__mmu_interval_notifier_insert() says
> "mni->mr_invalidate_seq" and I think that should be
> "mni->invalidate_seq".

Got it.

I squashed this in:

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index b3a064b3b31807..30abbfdc25be55 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -129,7 +129,6 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
 {
struct mmu_interval_notifier *mni;
struct hlist_node *next;
-   bool need_wake = false;
 
	spin_lock(&mmn_mm->lock);
if (--mmn_mm->active_invalidate_ranges ||
@@ -140,7 +139,6 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
 
/* Make invalidate_seq even */
mmn_mm->invalidate_seq++;
-   need_wake = true;
 
/*
 * The inv_end incorporates a deferred mechanism like rtnl_unlock().
@@ -160,8 +158,7 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
}
	spin_unlock(&mmn_mm->lock);
 
-   if (need_wake)
-   wake_up_all(&mmn_mm->wq);
+   wake_up_all(&mmn_mm->wq);
 }
 
 /**
@@ -884,7 +881,7 @@ static int __mmu_interval_notifier_insert(
 * possibility for live lock, instead defer the add to
 * mn_itree_inv_end() so this algorithm is deterministic.
 *
-* In all cases the value for the mni->mr_invalidate_seq should be
+* In all cases the value for the mni->invalidate_seq should be
 * odd, see mmu_interval_read_begin()
 */
	spin_lock(&mmn_mm->lock);

Jason


Re: [Xen-devel] [PATCH v3 02/14] mm/mmu_notifier: add an interval tree notifier

2019-11-13 Thread Jason Gunthorpe
On Wed, Nov 13, 2019 at 05:59:52AM -0800, Christoph Hellwig wrote:
> > +int mmu_interval_notifier_insert(struct mmu_interval_notifier *mni,
> > + struct mm_struct *mm, unsigned long start,
> > + unsigned long length,
> > + const struct mmu_interval_notifier_ops 
> > *ops);
> > +int mmu_interval_notifier_insert_locked(
> > +   struct mmu_interval_notifier *mni, struct mm_struct *mm,
> > +   unsigned long start, unsigned long length,
> > +   const struct mmu_interval_notifier_ops *ops);
> 
> Very inconsistent indentation between these two related functions.

clang-format... The kernel config is set to prefer lining up under the
'(' if all the arguments fit within 80 columns; otherwise it does a
one-tab continuation indent.

> > +   /*
> > +* The inv_end incorporates a deferred mechanism like
> > +* rtnl_unlock(). Adds and removes are queued until the final inv_end
> > +* happens then they are progressed. This arrangement for tree updates
> > +* is used to avoid using a blocking lock during
> > +* invalidate_range_start.
> 
> Nitpick:  That comment can be condensed into one less line:

The rtnl_unlock can move up a line too. My editor is failing me on
this.

> > +   /*
> > +* TODO: Since we already have a spinlock above, this would be faster
> > +* as wake_up_q
> > +*/
> > +   if (need_wake)
> > +   wake_up_all(_mm->wq);
> 
> So why is this important enough for a TODO comment, but not important
> enough to do right away?

Let's drop the comment; I'm not sure wake_up_q is even a function this
layer should be calling.
 
> > +* release semantics on the initialization of the mmu_notifier_mm's
> > + * contents are provided for unlocked readers.  acquire can only be
> > + * used while holding the mmgrab or mmget, and is safe because once
> > + * created the mmu_notififer_mm is not freed until the mm is
> > + * destroyed.  As above, users holding the mmap_sem or one of the
> > + * mm_take_all_locks() do not need to use acquire semantics.
> >  */
> 
> Some spaces instead of tabs here.

Got it

> > +static int __mmu_interval_notifier_insert(
> > +   struct mmu_interval_notifier *mni, struct mm_struct *mm,
> > +   struct mmu_notifier_mm *mmn_mm, unsigned long start,
> > +   unsigned long length, const struct mmu_interval_notifier_ops *ops)
> 
> Odd indentation - we usuall do two tabs (my preference) or align after
> the opening brace.

This is one tab. I don't think one tab is odd, it seems pretty popular
even just in mm/.

But two tabs is considered usual? I didn't even know that.

> > + * This function must be paired with mmu_interval_notifier_insert(). It 
> > cannot be
> 
> line > 80 chars.

got it, was missed during the rename

> Otherwise this looks good and very similar to what I reviewed earlier:
> 
> Reviewed-by: Christoph Hellwig 

Thanks a bunch, your comments have been very helpful on this series!

Jason


[Xen-devel] [PATCH v3 14/14] xen/gntdev: use mmu_interval_notifier_insert

2019-11-12 Thread Jason Gunthorpe
From: Jason Gunthorpe 

gntdev simply wants to monitor a specific VMA for any notifier events,
this can be done straightforwardly using mmu_interval_notifier_insert()
over the VMA's VA range.

The notifier should be attached until the original VMA is destroyed.

It is unclear if any of this is even sane, but at least a lot of duplicate
code is removed.
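
In essence (a simplified sketch of the insertion done from gntdev's mmap
path; error handling and the use_ptemod details are omitted) there is one
notifier per grant map, covering exactly the VMA's range, with the ops
struct pointing at gntdev_invalidate() shown below:

	err = mmu_interval_notifier_insert_locked(&map->notifier, vma->vm_mm,
						  vma->vm_start,
						  vma->vm_end - vma->vm_start,
						  &gntdev_mmu_ops);
	if (err)
		goto out_unlock;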

Reviewed-by: Boris Ostrovsky 
Signed-off-by: Jason Gunthorpe 
---
 drivers/xen/gntdev-common.h |   8 +-
 drivers/xen/gntdev.c| 179 ++--
 2 files changed, 49 insertions(+), 138 deletions(-)

diff --git a/drivers/xen/gntdev-common.h b/drivers/xen/gntdev-common.h
index 2f8b949c3eeb14..91e44c04f7876c 100644
--- a/drivers/xen/gntdev-common.h
+++ b/drivers/xen/gntdev-common.h
@@ -21,15 +21,8 @@ struct gntdev_dmabuf_priv;
 struct gntdev_priv {
/* Maps with visible offsets in the file descriptor. */
struct list_head maps;
-   /*
-* Maps that are not visible; will be freed on munmap.
-* Only populated if populate_freeable_maps == 1
-*/
-   struct list_head freeable_maps;
/* lock protects maps and freeable_maps. */
struct mutex lock;
-   struct mm_struct *mm;
-   struct mmu_notifier mn;
 
 #ifdef CONFIG_XEN_GRANT_DMA_ALLOC
/* Device for which DMA memory is allocated. */
@@ -49,6 +42,7 @@ struct gntdev_unmap_notify {
 };
 
 struct gntdev_grant_map {
+   struct mmu_interval_notifier notifier;
struct list_head next;
struct vm_area_struct *vma;
int index;
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 81401f386c9ce0..a04ddf2a68afa5 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -63,7 +63,6 @@ MODULE_PARM_DESC(limit, "Maximum number of grants that may be 
mapped by "
 static atomic_t pages_mapped = ATOMIC_INIT(0);
 
 static int use_ptemod;
-#define populate_freeable_maps use_ptemod
 
 static int unmap_grant_pages(struct gntdev_grant_map *map,
 int offset, int pages);
@@ -249,12 +248,6 @@ void gntdev_put_map(struct gntdev_priv *priv, struct 
gntdev_grant_map *map)
evtchn_put(map->notify.event);
}
 
-   if (populate_freeable_maps && priv) {
-   mutex_lock(&priv->lock);
-   list_del(&map->next);
-   mutex_unlock(&priv->lock);
-   }
-
if (map->pages && !use_ptemod)
unmap_grant_pages(map, 0, map->count);
gntdev_free_map(map);
@@ -444,16 +437,9 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
 
pr_debug("gntdev_vma_close %p\n", vma);
if (use_ptemod) {
-   /* It is possible that an mmu notifier could be running
-* concurrently, so take priv->lock to ensure that the vma won't
-* vanishing during the unmap_grant_pages call, since we will
-* spin here until that completes. Such a concurrent call will
-* not do any unmapping, since that has been done prior to
-* closing the vma, but it may still iterate the unmap_ops list.
-*/
-   mutex_lock(&priv->lock);
+   WARN_ON(map->vma != vma);
+   mmu_interval_notifier_remove(&map->notifier);
	map->vma = NULL;
-   mutex_unlock(&priv->lock);
}
vma->vm_private_data = NULL;
gntdev_put_map(priv, map);
@@ -475,109 +461,44 @@ static const struct vm_operations_struct gntdev_vmops = {
 
 /* -- */
 
-static bool in_range(struct gntdev_grant_map *map,
- unsigned long start, unsigned long end)
-{
-   if (!map->vma)
-   return false;
-   if (map->vma->vm_start >= end)
-   return false;
-   if (map->vma->vm_end <= start)
-   return false;
-
-   return true;
-}
-
-static int unmap_if_in_range(struct gntdev_grant_map *map,
- unsigned long start, unsigned long end,
- bool blockable)
+static bool gntdev_invalidate(struct mmu_interval_notifier *mn,
+ const struct mmu_notifier_range *range,
+ unsigned long cur_seq)
 {
+   struct gntdev_grant_map *map =
+   container_of(mn, struct gntdev_grant_map, notifier);
unsigned long mstart, mend;
int err;
 
-   if (!in_range(map, start, end))
-   return 0;
+   if (!mmu_notifier_range_blockable(range))
+   return false;
 
-   if (!blockable)
-   return -EAGAIN;
+   /*
+* If the VMA is split or otherwise changed the notifier is not
+* updated, but we don't want to process VA's outside the modified
+* VMA. FIXME: It would be much more understandable to just prevent
+

[Xen-devel] [PATCH v3 05/14] RDMA/odp: Use mmu_interval_notifier_insert()

2019-11-12 Thread Jason Gunthorpe
From: Jason Gunthorpe 

Replace the internal interval tree based mmu notifier with the new common
mmu_interval_notifier_insert() API. This removes a lot of code and fixes a
deadlock that can be triggered in ODP:

 zap_page_range()
  mmu_notifier_invalidate_range_start()
   [..]
ib_umem_notifier_invalidate_range_start()
   down_read(&per_mm->umem_rwsem)
  unmap_single_vma()
[..]
  __split_huge_page_pmd()
mmu_notifier_invalidate_range_start()
[..]
   ib_umem_notifier_invalidate_range_start()
  down_read(&per_mm->umem_rwsem)   // DEADLOCK

mmu_notifier_invalidate_range_end()
   up_read(&per_mm->umem_rwsem)
  mmu_notifier_invalidate_range_end()
 up_read(&per_mm->umem_rwsem)

The umem_rwsem is held across the range_start/end as the ODP algorithm for
invalidate_range_end cannot tolerate changes to the interval
tree. However, due to the nested invalidation regions the second
down_read() can deadlock if there are competing writers. The new core code
provides an alternative scheme to solve this problem.
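
Schematically (not the actual ODP code), the failure mode is the classic
fair-rwsem pattern: once a writer is queued, new readers block, including
a task that already holds the lock for read:

	down_read(&per_mm->umem_rwsem);	/* outer range_start() */
		/* meanwhile another task calls down_write() and queues */
	down_read(&per_mm->umem_rwsem);	/* nested range_start() blocks behind
					 * the queued writer; the outer read
					 * is never released */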

Fixes: ca748c39ea3f ("RDMA/umem: Get rid of per_mm->notifier_count")
Tested-by: Artemy Kovalyov 
Signed-off-by: Jason Gunthorpe 
---
 drivers/infiniband/core/device.c |   1 -
 drivers/infiniband/core/umem_odp.c   | 303 ---
 drivers/infiniband/hw/mlx5/mlx5_ib.h |   7 +-
 drivers/infiniband/hw/mlx5/mr.c  |   3 +-
 drivers/infiniband/hw/mlx5/odp.c |  50 ++---
 include/rdma/ib_umem_odp.h   |  68 ++
 include/rdma/ib_verbs.h  |   2 -
 7 files changed, 82 insertions(+), 352 deletions(-)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 2dd2cfe9b56136..ac7924b3c73abe 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -2617,7 +2617,6 @@ void ib_set_device_ops(struct ib_device *dev, const 
struct ib_device_ops *ops)
SET_DEVICE_OP(dev_ops, get_vf_config);
SET_DEVICE_OP(dev_ops, get_vf_stats);
SET_DEVICE_OP(dev_ops, init_port);
-   SET_DEVICE_OP(dev_ops, invalidate_range);
SET_DEVICE_OP(dev_ops, iw_accept);
SET_DEVICE_OP(dev_ops, iw_add_ref);
SET_DEVICE_OP(dev_ops, iw_connect);
diff --git a/drivers/infiniband/core/umem_odp.c 
b/drivers/infiniband/core/umem_odp.c
index d7d5fadf0899ad..e42d44e501fd54 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -48,197 +48,33 @@
 
 #include "uverbs.h"
 
-static void ib_umem_notifier_start_account(struct ib_umem_odp *umem_odp)
+static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
+  const struct mmu_interval_notifier_ops *ops)
 {
-   mutex_lock(_odp->umem_mutex);
-   if (umem_odp->notifiers_count++ == 0)
-   /*
-* Initialize the completion object for waiting on
-* notifiers. Since notifier_count is zero, no one should be
-* waiting right now.
-*/
-   reinit_completion(_odp->notifier_completion);
-   mutex_unlock(_odp->umem_mutex);
-}
-
-static void ib_umem_notifier_end_account(struct ib_umem_odp *umem_odp)
-{
-   mutex_lock(_odp->umem_mutex);
-   /*
-* This sequence increase will notify the QP page fault that the page
-* that is going to be mapped in the spte could have been freed.
-*/
-   ++umem_odp->notifiers_seq;
-   if (--umem_odp->notifiers_count == 0)
-   complete_all(_odp->notifier_completion);
-   mutex_unlock(_odp->umem_mutex);
-}
-
-static void ib_umem_notifier_release(struct mmu_notifier *mn,
-struct mm_struct *mm)
-{
-   struct ib_ucontext_per_mm *per_mm =
-   container_of(mn, struct ib_ucontext_per_mm, mn);
-   struct rb_node *node;
-
-   down_read(_mm->umem_rwsem);
-   if (!per_mm->mn.users)
-   goto out;
-
-   for (node = rb_first_cached(_mm->umem_tree); node;
-node = rb_next(node)) {
-   struct ib_umem_odp *umem_odp =
-   rb_entry(node, struct ib_umem_odp, interval_tree.rb);
-
-   /*
-* Increase the number of notifiers running, to prevent any
-* further fault handling on this MR.
-*/
-   ib_umem_notifier_start_account(umem_odp);
-   complete_all(_odp->notifier_completion);
-   umem_odp->umem.ibdev->ops.invalidate_range(
-   umem_odp, ib_umem_start(umem_odp),
-   ib_umem_end(umem_odp));
-   }
-
-out:
-   up_read(_mm->umem_rwsem);
-}
-
-static int invalidate_range_start_trampoline(struct ib_umem_odp *item,
-u64 start, u64 end, void *cookie)
-{
-   ib_umem_notifier_start_account(item);
-   item->

[Xen-devel] [PATCH v3 11/14] drm/amdgpu: Use mmu_interval_insert instead of hmm_mirror

2019-11-12 Thread Jason Gunthorpe
From: Jason Gunthorpe 

Remove the interval tree in the driver and rely on the tree maintained by
the mmu_notifier for delivering mmu_notifier invalidation callbacks.

For some reason amdgpu has a very complicated arrangement where it tries
to prevent duplicate entries in the interval_tree, this is not necessary,
each amdgpu_bo can be its own stand alone entry. interval_tree already
allows duplicates and overlaps in the tree.

Also, there is no need to remove entries upon a release callback, the
mmu_interval API safely allows objects to remain registered beyond the
lifetime of the mm. The driver only has to stop touching the pages during
release.

Reviewed-by: Philip Yang 
Tested-by: Philip Yang 
Signed-off-by: Jason Gunthorpe 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h   |   2 +
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   5 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c| 333 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h|   4 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h|  13 +-
 6 files changed, 77 insertions(+), 281 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index bd37df5dd6d048..60591a5d420021 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1006,6 +1006,8 @@ struct amdgpu_device {
struct mutex  lock_reset;
struct amdgpu_doorbell_index doorbell_index;
 
+   struct mutexnotifier_lock;
+
int asic_reset_res;
struct work_struct  xgmi_reset_work;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 6d021ecc8d598f..47700302a08b7f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -481,8 +481,7 @@ static void remove_kgd_mem_from_kfd_bo_list(struct kgd_mem 
*mem,
  *
  * Returns 0 for success, negative errno for errors.
  */
-static int init_user_pages(struct kgd_mem *mem, struct mm_struct *mm,
-  uint64_t user_addr)
+static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr)
 {
struct amdkfd_process_info *process_info = mem->process_info;
struct amdgpu_bo *bo = mem->bo;
@@ -1195,7 +1194,7 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
add_kgd_mem_to_kfd_bo_list(*mem, avm->process_info, user_addr);
 
if (user_addr) {
-   ret = init_user_pages(*mem, current->mm, user_addr);
+   ret = init_user_pages(*mem, user_addr);
if (ret)
goto allocate_init_user_pages_failed;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5a1939dbd4e3e6..38f97998aaddb2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2633,6 +2633,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
mutex_init(>virt.vf_errors.lock);
hash_init(adev->mn_hash);
mutex_init(>lock_reset);
+   mutex_init(>notifier_lock);
mutex_init(>virt.dpm_mutex);
mutex_init(>psp.mutex);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 31d4deb5d29484..9fe1c31ce17a30 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -50,66 +50,6 @@
 #include "amdgpu.h"
 #include "amdgpu_amdkfd.h"
 
-/**
- * struct amdgpu_mn_node
- *
- * @it: interval node defining start-last of the affected address range
- * @bos: list of all BOs in the affected address range
- *
- * Manages all BOs which are affected of a certain range of address space.
- */
-struct amdgpu_mn_node {
-   struct interval_tree_node   it;
-   struct list_headbos;
-};
-
-/**
- * amdgpu_mn_destroy - destroy the HMM mirror
- *
- * @work: previously sheduled work item
- *
- * Lazy destroys the notifier from a work item
- */
-static void amdgpu_mn_destroy(struct work_struct *work)
-{
-   struct amdgpu_mn *amn = container_of(work, struct amdgpu_mn, work);
-   struct amdgpu_device *adev = amn->adev;
-   struct amdgpu_mn_node *node, *next_node;
-   struct amdgpu_bo *bo, *next_bo;
-
-   mutex_lock(>mn_lock);
-   down_write(>lock);
-   hash_del(>node);
-   rbtree_postorder_for_each_entry_safe(node, next_node,
->objects.rb_root, it.rb) {
-   list_for_each_entry_safe(bo, next_bo, >bos, mn_list) {
-   bo->mn = NULL;
-   list_del_init(>mn_list);
-   }
-   kfree(node);
-   }
-   up_write(>lock);
-   mutex_unlock(>mn_lock);
-
-   hmm_mirror_unregister(>mirror);
- 

[Xen-devel] [PATCH v3 04/14] mm/hmm: define the pre-processor related parts of hmm.h even if disabled

2019-11-12 Thread Jason Gunthorpe
From: Jason Gunthorpe 

Only the function calls are stubbed out with static inlines that always
fail. This is the standard way to write a header for an optional component
and makes it easier for drivers that only optionally need HMM_MIRROR.

Reviewed-by: Jérôme Glisse 
Tested-by: Ralph Campbell 
Signed-off-by: Jason Gunthorpe 
---
 include/linux/hmm.h | 59 -
 kernel/fork.c   |  1 -
 2 files changed, 47 insertions(+), 13 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index fbb35c78637e57..cb69bf10dc788c 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -62,8 +62,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_HMM_MIRROR
-
 #include 
 #include 
 #include 
@@ -374,6 +372,15 @@ struct hmm_mirror {
struct list_headlist;
 };
 
+/*
+ * Retry fault if non-blocking, drop mmap_sem and return -EAGAIN in that case.
+ */
+#define HMM_FAULT_ALLOW_RETRY  (1 << 0)
+
+/* Don't fault in missing PTEs, just snapshot the current state. */
+#define HMM_FAULT_SNAPSHOT (1 << 1)
+
+#ifdef CONFIG_HMM_MIRROR
 int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
 
@@ -383,14 +390,6 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
 int hmm_range_register(struct hmm_range *range, struct hmm_mirror *mirror);
 void hmm_range_unregister(struct hmm_range *range);
 
-/*
- * Retry fault if non-blocking, drop mmap_sem and return -EAGAIN in that case.
- */
-#define HMM_FAULT_ALLOW_RETRY  (1 << 0)
-
-/* Don't fault in missing PTEs, just snapshot the current state. */
-#define HMM_FAULT_SNAPSHOT (1 << 1)
-
 long hmm_range_fault(struct hmm_range *range, unsigned int flags);
 
 long hmm_range_dma_map(struct hmm_range *range,
@@ -401,6 +400,44 @@ long hmm_range_dma_unmap(struct hmm_range *range,
 struct device *device,
 dma_addr_t *daddrs,
 bool dirty);
+#else
+int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
+{
+   return -EOPNOTSUPP;
+}
+
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+}
+
+int hmm_range_register(struct hmm_range *range, struct hmm_mirror *mirror)
+{
+   return -EOPNOTSUPP;
+}
+
+void hmm_range_unregister(struct hmm_range *range)
+{
+}
+
+static inline long hmm_range_fault(struct hmm_range *range, unsigned int flags)
+{
+   return -EOPNOTSUPP;
+}
+
+static inline long hmm_range_dma_map(struct hmm_range *range,
+struct device *device, dma_addr_t *daddrs,
+unsigned int flags)
+{
+   return -EOPNOTSUPP;
+}
+
+static inline long hmm_range_dma_unmap(struct hmm_range *range,
+  struct device *device,
+  dma_addr_t *daddrs, bool dirty)
+{
+   return -EOPNOTSUPP;
+}
+#endif
 
 /*
  * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
@@ -411,6 +448,4 @@ long hmm_range_dma_unmap(struct hmm_range *range,
  */
 #define HMM_RANGE_DEFAULT_TIMEOUT 1000
 
-#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-
 #endif /* LINUX_HMM_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index bcdf5312521036..ca39cfc404e3db 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -40,7 +40,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
-- 
2.24.0



[Xen-devel] [PATCH v3 02/14] mm/mmu_notifier: add an interval tree notifier

2019-11-12 Thread Jason Gunthorpe
From: Jason Gunthorpe 

Of the 13 users of mmu_notifiers, 8 of them use only
invalidate_range_start/end() and immediately intersect the
mmu_notifier_range with some kind of internal list of VAs.  4 use an
interval tree (i915_gem, radeon_mn, umem_odp, hfi1). 4 use a linked list
of some kind (scif_dma, vhost, gntdev, hmm)

And the remaining 5 either don't use invalidate_range_start() or do some
special thing with it.

It turns out that building a correct scheme with an interval tree is
pretty complicated, particularly if the use case is synchronizing against
another thread doing get_user_pages().  Many of these implementations have
various subtle and difficult to fix races.

This approach puts the interval tree as common code at the top of the mmu
notifier call tree and implements a shareable locking scheme.

It includes:
 - An interval tree tracking VA ranges, with per-range callbacks
 - A read/write locking scheme for the interval tree that avoids
   sleeping in the notifier path (for OOM killer)
 - A sequence counter based collision-retry locking scheme to tell
   device page fault that a VA range is being concurrently invalidated.

This is based on various ideas:
- hmm accumulates invalidated VA ranges and releases them when all
  invalidates are done, via active_invalidate_ranges count.
  This approach avoids having to intersect the interval tree twice (as
  umem_odp does) at the potential cost of a longer device page fault.

- kvm/umem_odp use a sequence counter to drive the collision retry,
  via invalidate_seq

- a deferred work todo list on unlock scheme like RTNL, via deferred_list.
  This makes adding/removing interval tree members more deterministic

- seqlock, except this version makes the seqlock idea multi-holder on the
  write side by protecting it with active_invalidate_ranges and a spinlock

To minimize MM overhead when only the interval tree is being used, the
entire SRCU and hlist overheads are dropped using some simple
branches. Similarly the interval tree overhead is dropped when in hlist
mode.

The overhead from the mandatory spinlock is broadly the same as most of
existing users which already had a lock (or two) of some sort on the
invalidation path.
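
From a driver's point of view the collision-retry scheme boils down to
this skeleton (condensed from the hmm.rst example later in the series;
driver_lock stands in for whatever lock the driver's invalidate()
callback takes):

 again:
	seq = mmu_interval_read_begin(&mni);
	/* ... fault / snapshot the CPU page tables ... */

	take_lock(driver_lock);
	if (mmu_interval_read_retry(&mni, seq)) {
		/* an invalidation ran concurrently; discard and retry */
		release_lock(driver_lock);
		goto again;
	}
	/* ... program the device page tables ... */
	release_lock(driver_lock);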

Acked-by: Christian König 
Tested-by: Philip Yang 
Tested-by: Ralph Campbell 
Reviewed-by: John Hubbard 
Signed-off-by: Jason Gunthorpe 
---
 include/linux/mmu_notifier.h | 101 +++
 mm/Kconfig   |   1 +
 mm/mmu_notifier.c| 552 +--
 3 files changed, 628 insertions(+), 26 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 12bd603d318ce7..9e6caa8ecd1938 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -6,10 +6,12 @@
 #include 
 #include 
 #include 
+#include 
 
 struct mmu_notifier_mm;
 struct mmu_notifier;
 struct mmu_notifier_range;
+struct mmu_interval_notifier;
 
 /**
  * enum mmu_notifier_event - reason for the mmu notifier callback
@@ -32,6 +34,9 @@ struct mmu_notifier_range;
  * access flags). User should soft dirty the page in the end callback to make
  * sure that anyone relying on soft dirtyness catch pages that might be written
  * through non CPU mappings.
+ *
+ * @MMU_NOTIFY_RELEASE: used during mmu_interval_notifier invalidate to signal
+ * that the mm refcount is zero and the range is no longer accessible.
  */
 enum mmu_notifier_event {
MMU_NOTIFY_UNMAP = 0,
@@ -39,6 +44,7 @@ enum mmu_notifier_event {
MMU_NOTIFY_PROTECTION_VMA,
MMU_NOTIFY_PROTECTION_PAGE,
MMU_NOTIFY_SOFT_DIRTY,
+   MMU_NOTIFY_RELEASE,
 };
 
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
@@ -222,6 +228,26 @@ struct mmu_notifier {
unsigned int users;
 };
 
+/**
+ * struct mmu_interval_notifier_ops
+ * @invalidate: Upon return the caller must stop using any SPTEs within this
+ *  range. This function can sleep. Return false only if sleeping
+ *  was required but mmu_notifier_range_blockable(range) is false.
+ */
+struct mmu_interval_notifier_ops {
+   bool (*invalidate)(struct mmu_interval_notifier *mni,
+  const struct mmu_notifier_range *range,
+  unsigned long cur_seq);
+};
+
+struct mmu_interval_notifier {
+   struct interval_tree_node interval_tree;
+   const struct mmu_interval_notifier_ops *ops;
+   struct mm_struct *mm;
+   struct hlist_node deferred_item;
+   unsigned long invalidate_seq;
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 #ifdef CONFIG_LOCKDEP
@@ -263,6 +289,81 @@ extern int __mmu_notifier_register(struct mmu_notifier *mn,
   struct mm_struct *mm);
 extern void mmu_notifier_unregister(struct mmu_notifier *mn,
struct mm_struct *mm);
+
+unsigned long mmu_interval_read_begin(struct mmu_interval_notifier *mni);
+int mmu_interval_notifier_insert(struct mmu_interval_notifie

[Xen-devel] [PATCH v3 12/14] drm/amdgpu: Use mmu_interval_notifier instead of hmm_mirror

2019-11-12 Thread Jason Gunthorpe
From: Jason Gunthorpe 

Convert the collision-retry lock around hmm_range_fault to use the one now
provided by the mmu_interval notifier.

Although this driver does not seem to use the collision retry lock that
hmm provides correctly, it can still be converted over to use the
mmu_interval_notifier api instead of hmm_mirror without too much trouble.

This also deletes another place where a driver is associating additional
data (struct amdgpu_mn) with a mmu_struct.

Signed-off-by: Philip Yang 
Reviewed-by: Philip Yang 
Tested-by: Philip Yang 
Signed-off-by: Jason Gunthorpe 
---
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c|  14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c| 148 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h|  49 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   | 116 --
 5 files changed, 94 insertions(+), 237 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 47700302a08b7f..1bcedb9b477dce 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1738,6 +1738,10 @@ static int update_invalid_user_pages(struct 
amdkfd_process_info *process_info,
return ret;
}
 
+   /*
+* FIXME: Cannot ignore the return code, must hold
+* notifier_lock
+*/
amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm);
 
/* Mark the BO as valid unless it was invalidated
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 82823d9a8ba887..22c989bca7514c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -603,8 +603,6 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
e->tv.num_shared = 2;
 
amdgpu_bo_list_get_list(p->bo_list, >validated);
-   if (p->bo_list->first_userptr != p->bo_list->num_entries)
-   p->mn = amdgpu_mn_get(p->adev, AMDGPU_MN_TYPE_GFX);
 
INIT_LIST_HEAD();
amdgpu_vm_get_pd_bo(>vm, >validated, >vm_pd);
@@ -1287,11 +1285,11 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
if (r)
goto error_unlock;
 
-   /* No memory allocation is allowed while holding the mn lock.
-* p->mn is hold until amdgpu_cs_submit is finished and fence is added
-* to BOs.
+   /* No memory allocation is allowed while holding the notifier lock.
+* The lock is held until amdgpu_cs_submit is finished and fence is
+* added to BOs.
 */
-   amdgpu_mn_lock(p->mn);
+   mutex_lock(>adev->notifier_lock);
 
/* If userptr are invalidated after amdgpu_cs_parser_bos(), return
 * -EAGAIN, drmIoctl in libdrm will restart the amdgpu_cs_ioctl.
@@ -1334,13 +1332,13 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
amdgpu_vm_move_to_lru_tail(p->adev, >vm);
 
ttm_eu_fence_buffer_objects(>ticket, >validated, p->fence);
-   amdgpu_mn_unlock(p->mn);
+   mutex_unlock(>adev->notifier_lock);
 
return 0;
 
 error_abort:
drm_sched_job_cleanup(>base);
-   amdgpu_mn_unlock(p->mn);
+   mutex_unlock(>adev->notifier_lock);
 
 error_unlock:
amdgpu_job_free(job);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 9fe1c31ce17a30..828b5167ff128f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -50,28 +50,6 @@
 #include "amdgpu.h"
 #include "amdgpu_amdkfd.h"
 
-/**
- * amdgpu_mn_lock - take the write side lock for this notifier
- *
- * @mn: our notifier
- */
-void amdgpu_mn_lock(struct amdgpu_mn *mn)
-{
-   if (mn)
-   down_write(>lock);
-}
-
-/**
- * amdgpu_mn_unlock - drop the write side lock for this notifier
- *
- * @mn: our notifier
- */
-void amdgpu_mn_unlock(struct amdgpu_mn *mn)
-{
-   if (mn)
-   up_write(>lock);
-}
-
 /**
  * amdgpu_mn_invalidate_gfx - callback to notify about mm change
  *
@@ -94,6 +72,9 @@ static bool amdgpu_mn_invalidate_gfx(struct 
mmu_interval_notifier *mni,
return false;
 
mutex_lock(>notifier_lock);
+
+   mmu_interval_set_seq(mni, cur_seq);
+
r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv, true, false,
  MAX_SCHEDULE_TIMEOUT);
mutex_unlock(>notifier_lock);
@@ -127,6 +108,9 @@ static bool amdgpu_mn_invalidate_hsa(struct 
mmu_interval_notifier *mni,
return false;
 
mutex_lock(>notifier_lock);
+
+   mmu_interval_set_seq(mni, cur_seq);
+
amdgpu_amdkfd_evict_userptr(bo->kfd_

[Xen-devel] [PATCH v3 09/14] nouveau: use mmu_interval_notifier instead of hmm_mirror

2019-11-12 Thread Jason Gunthorpe
From: Jason Gunthorpe 

Remove the hmm_mirror object and use the mmu_interval_notifier API instead
for the range, and use the normal mmu_notifier API for the general
invalidation callback.

While here re-organize the pagefault path so the locking pattern is clear.

nouveau is the only driver that uses a temporary range object and instead
forwards nearly every invalidation range directly to the HW. While this is
not how the mmu_interval_notifier was intended to be used, the overheads on
the pagefaulting path are similar to the existing hmm_mirror version.
Particularly since the interval tree will be small.

Tested-by: Ralph Campbell 
Signed-off-by: Jason Gunthorpe 
---
 drivers/gpu/drm/nouveau/nouveau_svm.c | 179 ++
 1 file changed, 99 insertions(+), 80 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 577f8811925a59..df9bf1fd1bc0be 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -96,8 +96,6 @@ struct nouveau_svmm {
} unmanaged;
 
struct mutex mutex;
-
-   struct hmm_mirror mirror;
 };
 
 #define SVMM_DBG(s,f,a...) 
\
@@ -293,23 +291,11 @@ static const struct mmu_notifier_ops nouveau_mn_ops = {
.free_notifier = nouveau_svmm_free_notifier,
 };
 
-static int
-nouveau_svmm_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
-   const struct mmu_notifier_range *update)
-{
-   return 0;
-}
-
-static const struct hmm_mirror_ops nouveau_svmm = {
-   .sync_cpu_device_pagetables = nouveau_svmm_sync_cpu_device_pagetables,
-};
-
 void
 nouveau_svmm_fini(struct nouveau_svmm **psvmm)
 {
struct nouveau_svmm *svmm = *psvmm;
if (svmm) {
-   hmm_mirror_unregister(>mirror);
mutex_lock(>mutex);
svmm->vmm = NULL;
mutex_unlock(>mutex);
@@ -357,15 +343,10 @@ nouveau_svmm_init(struct drm_device *dev, void *data,
goto out_free;
 
down_write(>mm->mmap_sem);
-   svmm->mirror.ops = _svmm;
-   ret = hmm_mirror_register(>mirror, current->mm);
-   if (ret)
-   goto out_mm_unlock;
-
svmm->notifier.ops = _mn_ops;
ret = __mmu_notifier_register(>notifier, current->mm);
if (ret)
-   goto out_hmm_unregister;
+   goto out_mm_unlock;
/* Note, ownership of svmm transfers to mmu_notifier */
 
cli->svm.svmm = svmm;
@@ -374,8 +355,6 @@ nouveau_svmm_init(struct drm_device *dev, void *data,
mutex_unlock(>mutex);
return 0;
 
-out_hmm_unregister:
-   hmm_mirror_unregister(>mirror);
 out_mm_unlock:
up_write(>mm->mmap_sem);
 out_free:
@@ -503,43 +482,90 @@ nouveau_svm_fault_cache(struct nouveau_svm *svm,
fault->inst, fault->addr, fault->access);
 }
 
-static inline bool
-nouveau_range_done(struct hmm_range *range)
+struct svm_notifier {
+   struct mmu_interval_notifier notifier;
+   struct nouveau_svmm *svmm;
+};
+
+static bool nouveau_svm_range_invalidate(struct mmu_interval_notifier *mni,
+const struct mmu_notifier_range *range,
+unsigned long cur_seq)
 {
-   bool ret = hmm_range_valid(range);
+   struct svm_notifier *sn =
+   container_of(mni, struct svm_notifier, notifier);
 
-   hmm_range_unregister(range);
-   return ret;
+   /*
+* serializes the update to mni->invalidate_seq done by caller and
+* prevents invalidation of the PTE from progressing while HW is being
+* programmed. This is very hacky and only works because the normal
+* notifier that does invalidation is always called after the range
+* notifier.
+*/
+   if (mmu_notifier_range_blockable(range))
+   mutex_lock(&sn->svmm->mutex);
+   else if (!mutex_trylock(&sn->svmm->mutex))
+   return false;
+   mmu_interval_set_seq(mni, cur_seq);
+   mutex_unlock(&sn->svmm->mutex);
+   return true;
 }
 
-static int
-nouveau_range_fault(struct nouveau_svmm *svmm, struct hmm_range *range)
+static const struct mmu_interval_notifier_ops nouveau_svm_mni_ops = {
+   .invalidate = nouveau_svm_range_invalidate,
+};
+
+static int nouveau_range_fault(struct nouveau_svmm *svmm,
+  struct nouveau_drm *drm, void *data, u32 size,
+  u64 *pfns, struct svm_notifier *notifier)
 {
+   unsigned long timeout =
+   jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+   /* Have HMM fault pages within the fault window to the GPU. */
+   struct hmm_range range = {
+   .notifier = >notifier,
+   .start = notifier->notifier.interval_tree.start,

[Xen-devel] [PATCH v3 13/14] mm/hmm: remove hmm_mirror and related

2019-11-12 Thread Jason Gunthorpe
From: Jason Gunthorpe 

The only two users of this are now converted to use mmu_interval_notifier,
delete all the code and update hmm.rst.

Reviewed-by: Jérôme Glisse 
Tested-by: Ralph Campbell 
Signed-off-by: Jason Gunthorpe 
---
 Documentation/vm/hmm.rst | 105 ---
 include/linux/hmm.h  | 183 +
 mm/Kconfig   |   1 -
 mm/hmm.c | 285 ++-
 4 files changed, 34 insertions(+), 540 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 0a5960beccf76d..893a8ba0e9fefb 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -147,49 +147,16 @@ Address space mirroring implementation and API
 Address space mirroring's main objective is to allow duplication of a range of
 CPU page table into a device page table; HMM helps keep both synchronized. A
 device driver that wants to mirror a process address space must start with the
-registration of an hmm_mirror struct::
-
- int hmm_mirror_register(struct hmm_mirror *mirror,
- struct mm_struct *mm);
-
-The mirror struct has a set of callbacks that are used
-to propagate CPU page tables::
-
- struct hmm_mirror_ops {
- /* release() - release hmm_mirror
-  *
-  * @mirror: pointer to struct hmm_mirror
-  *
-  * This is called when the mm_struct is being released.  The callback
-  * must ensure that all access to any pages obtained from this mirror
-  * is halted before the callback returns. All future access should
-  * fault.
-  */
- void (*release)(struct hmm_mirror *mirror);
-
- /* sync_cpu_device_pagetables() - synchronize page tables
-  *
-  * @mirror: pointer to struct hmm_mirror
-  * @update: update information (see struct mmu_notifier_range)
-  * Return: -EAGAIN if update.blockable false and callback need to
-  * block, 0 otherwise.
-  *
-  * This callback ultimately originates from mmu_notifiers when the CPU
-  * page table is updated. The device driver must update its page table
-  * in response to this callback. The update argument tells what action
-  * to perform.
-  *
-  * The device driver must not return from this callback until the device
-  * page tables are completely updated (TLBs flushed, etc); this is a
-  * synchronous call.
-  */
- int (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
-   const struct hmm_update *update);
- };
-
-The device driver must perform the update action to the range (mark range
-read only, or fully unmap, etc.). The device must complete the update before
-the driver callback returns.
+registration of a mmu_interval_notifier::
+
+ mni->ops = _ops;
+ int mmu_interval_notifier_insert(struct mmu_interval_notifier *mni,
+ unsigned long start, unsigned long length,
+ struct mm_struct *mm);
+
+During the driver_ops->invalidate() callback the device driver must perform
+the update action to the range (mark range read only, or fully unmap,
+etc.). The device must complete the update before the driver callback returns.
 
 When the device driver wants to populate a range of virtual addresses, it can
 use::
@@ -216,70 +183,46 @@ The usage pattern is::
   struct hmm_range range;
   ...
 
+  range.notifier = &mni;
   range.start = ...;
   range.end = ...;
   range.pfns = ...;
   range.flags = ...;
   range.values = ...;
   range.pfn_shift = ...;
-  hmm_range_register(&range, mirror);
 
-  /*
-   * Just wait for range to be valid, safe to ignore return value as we
-   * will use the return value of hmm_range_fault() below under the
-   * mmap_sem to ascertain the validity of the range.
-   */
-  hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
+  if (!mmget_not_zero(mni->notifier.mm))
+  return -EFAULT;
 
  again:
+  range.notifier_seq = mmu_interval_read_begin(&mni);
   down_read(&mm->mmap_sem);
   ret = hmm_range_fault(&range, HMM_RANGE_SNAPSHOT);
   if (ret) {
   up_read(&mm->mmap_sem);
-  if (ret == -EBUSY) {
-/*
- * No need to check hmm_range_wait_until_valid() return value
- * on retry we will get proper error with hmm_range_fault()
- */
-hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
-goto again;
-  }
-  hmm_range_unregister(&range);
+  if (ret == -EBUSY)
+ goto again;
   return ret;
   }
+  up_read(&mm->mmap_sem);
+
   take_lock(driver->update);
-  if (!hmm_range_valid(&range)) {
+  if (mmu_interval_read_retry(&mni, range.notifier_seq) {
   release_lock(driver->update);
-  up_read(&mm->mmap_sem);
   goto again;
   }
 
-  // Use pfns array content to update device page table
+  /* Use pfns arr

[Xen-devel] [PATCH v3 10/14] drm/amdgpu: Call find_vma under mmap_sem

2019-11-12 Thread Jason Gunthorpe
From: Jason Gunthorpe 

find_vma() must be called under the mmap_sem, reorganize this code to
do the vma check after entering the lock.

Further, fix the unlocked use of struct task_struct's mm, instead use
the mm from hmm_mirror which has an active mm_grab. Also the mm_grab
must be converted to a mm_get before acquiring mmap_sem or calling
find_vma().

Fixes: 66c45500bfdc ("drm/amdgpu: use new HMM APIs and helpers")
Fixes: 0919195f2b0d ("drm/amdgpu: Enable amdgpu_ttm_tt_get_user_pages in worker 
threads")
Acked-by: Christian König 
Reviewed-by: Felix Kuehling 
Reviewed-by: Philip Yang 
Tested-by: Philip Yang 
Signed-off-by: Jason Gunthorpe 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 37 ++---
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index dff41d0a85fe96..c0e41f1f0c2365 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -788,7 +789,7 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, 
struct page **pages)
struct hmm_mirror *mirror = bo->mn ? >mn->mirror : NULL;
struct ttm_tt *ttm = bo->tbo.ttm;
struct amdgpu_ttm_tt *gtt = (void *)ttm;
-   struct mm_struct *mm = gtt->usertask->mm;
+   struct mm_struct *mm;
unsigned long start = gtt->userptr;
struct vm_area_struct *vma;
struct hmm_range *range;
@@ -796,25 +797,14 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, 
struct page **pages)
uint64_t *pfns;
int r = 0;
 
-   if (!mm) /* Happens during process shutdown */
-   return -ESRCH;
-
if (unlikely(!mirror)) {
DRM_DEBUG_DRIVER("Failed to get hmm_mirror\n");
-   r = -EFAULT;
-   goto out;
+   return -EFAULT;
}
 
-   vma = find_vma(mm, start);
-   if (unlikely(!vma || start < vma->vm_start)) {
-   r = -EFAULT;
-   goto out;
-   }
-   if (unlikely((gtt->userflags & AMDGPU_GEM_USERPTR_ANONONLY) &&
-   vma->vm_file)) {
-   r = -EPERM;
-   goto out;
-   }
+   mm = mirror->hmm->mmu_notifier.mm;
+   if (!mmget_not_zero(mm)) /* Happens during process shutdown */
+   return -ESRCH;
 
range = kzalloc(sizeof(*range), GFP_KERNEL);
if (unlikely(!range)) {
@@ -847,6 +837,17 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, 
struct page **pages)
hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT);
 
down_read(>mmap_sem);
+   vma = find_vma(mm, start);
+   if (unlikely(!vma || start < vma->vm_start)) {
+   r = -EFAULT;
+   goto out_unlock;
+   }
+   if (unlikely((gtt->userflags & AMDGPU_GEM_USERPTR_ANONONLY) &&
+   vma->vm_file)) {
+   r = -EPERM;
+   goto out_unlock;
+   }
+
r = hmm_range_fault(range, 0);
up_read(>mmap_sem);
 
@@ -865,15 +866,19 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, 
struct page **pages)
}
 
gtt->range = range;
+   mmput(mm);
 
return 0;
 
+out_unlock:
+   up_read(>mmap_sem);
 out_free_pfns:
hmm_range_unregister(range);
kvfree(pfns);
 out_free_ranges:
kfree(range);
 out:
+   mmput(mm);
return r;
 }
 
-- 
2.24.0



[Xen-devel] [PATCH v3 06/14] RDMA/hfi1: Use mmu_interval_notifier_insert for user_exp_rcv

2019-11-12 Thread Jason Gunthorpe
From: Jason Gunthorpe 

This converts one of the two users of mmu_notifiers to use the new API.
The conversion is fairly straightforward, however the existing use of
notifiers here seems to be racey.

Tested-by: Dennis Dalessandro 
Signed-off-by: Jason Gunthorpe 
---
 drivers/infiniband/hw/hfi1/file_ops.c |   2 +-
 drivers/infiniband/hw/hfi1/hfi.h  |   2 +-
 drivers/infiniband/hw/hfi1/user_exp_rcv.c | 146 +-
 drivers/infiniband/hw/hfi1/user_exp_rcv.h |   3 +-
 4 files changed, 60 insertions(+), 93 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/file_ops.c 
b/drivers/infiniband/hw/hfi1/file_ops.c
index f9a7e9d29c8ba2..7c5e3fb224139a 100644
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -1138,7 +1138,7 @@ static int get_ctxt_info(struct hfi1_filedata *fd, 
unsigned long arg, u32 len)
HFI1_CAP_UGET_MASK(uctxt->flags, MASK) |
HFI1_CAP_KGET_MASK(uctxt->flags, K2U);
/* adjust flag if this fd is not able to cache */
-   if (!fd->handler)
+   if (!fd->use_mn)
cinfo.runtime_flags |= HFI1_CAP_TID_UNMAP; /* no caching */
 
cinfo.num_active = hfi1_count_active_units();
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index fa45350a9a1d32..fc10d65fc3e13c 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1444,7 +1444,7 @@ struct hfi1_filedata {
/* for cpu affinity; -1 if none */
int rec_cpu_num;
u32 tid_n_pinned;
-   struct mmu_rb_handler *handler;
+   bool use_mn;
struct tid_rb_node **entry_to_rb;
spinlock_t tid_lock; /* protect tid_[limit,used] counters */
u32 tid_limit;
diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.c 
b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
index 3592a9ec155e85..75a378162162d3 100644
--- a/drivers/infiniband/hw/hfi1/user_exp_rcv.c
+++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
@@ -59,11 +59,11 @@ static int set_rcvarray_entry(struct hfi1_filedata *fd,
  struct tid_user_buf *tbuf,
  u32 rcventry, struct tid_group *grp,
  u16 pageidx, unsigned int npages);
-static int tid_rb_insert(void *arg, struct mmu_rb_node *node);
 static void cacheless_tid_rb_remove(struct hfi1_filedata *fdata,
struct tid_rb_node *tnode);
-static void tid_rb_remove(void *arg, struct mmu_rb_node *node);
-static int tid_rb_invalidate(void *arg, struct mmu_rb_node *mnode);
+static bool tid_rb_invalidate(struct mmu_interval_notifier *mni,
+ const struct mmu_notifier_range *range,
+ unsigned long cur_seq);
 static int program_rcvarray(struct hfi1_filedata *fd, struct tid_user_buf *,
struct tid_group *grp,
unsigned int start, u16 count,
@@ -73,10 +73,8 @@ static int unprogram_rcvarray(struct hfi1_filedata *fd, u32 
tidinfo,
  struct tid_group **grp);
 static void clear_tid_node(struct hfi1_filedata *fd, struct tid_rb_node *node);
 
-static struct mmu_rb_ops tid_rb_ops = {
-   .insert = tid_rb_insert,
-   .remove = tid_rb_remove,
-   .invalidate = tid_rb_invalidate
+static const struct mmu_interval_notifier_ops tid_mn_ops = {
+   .invalidate = tid_rb_invalidate,
 };
 
 /*
@@ -87,7 +85,6 @@ static struct mmu_rb_ops tid_rb_ops = {
 int hfi1_user_exp_rcv_init(struct hfi1_filedata *fd,
   struct hfi1_ctxtdata *uctxt)
 {
-   struct hfi1_devdata *dd = uctxt->dd;
int ret = 0;
 
spin_lock_init(>tid_lock);
@@ -109,20 +106,7 @@ int hfi1_user_exp_rcv_init(struct hfi1_filedata *fd,
fd->entry_to_rb = NULL;
return -ENOMEM;
}
-
-   /*
-* Register MMU notifier callbacks. If the registration
-* fails, continue without TID caching for this context.
-*/
-   ret = hfi1_mmu_rb_register(fd, fd->mm, _rb_ops,
-  dd->pport->hfi1_wq,
-  >handler);
-   if (ret) {
-   dd_dev_info(dd,
-   "Failed MMU notifier registration %d\n",
-   ret);
-   ret = 0;
-   }
+   fd->use_mn = true;
}
 
/*
@@ -139,7 +123,7 @@ int hfi1_user_exp_rcv_init(struct hfi1_filedata *fd,
 * init.
 */
spin_lock(>tid_lock);
-   if (uctxt->subctxt_cnt && fd->handler) {
+   if (uctxt->subctxt_cnt && fd->use_mn) {
u16 remainder;
 
fd->tid_limit = uctxt->expected_c

[Xen-devel] [PATCH v3 08/14] nouveau: use mmu_notifier directly for invalidate_range_start

2019-11-12 Thread Jason Gunthorpe
From: Jason Gunthorpe 

There is no reason to get the invalidate_range_start() callback via an
indirection through hmm_mirror, just register a normal notifier directly.

Tested-by: Ralph Campbell 
Signed-off-by: Jason Gunthorpe 
---
 drivers/gpu/drm/nouveau/nouveau_svm.c | 95 ++-
 1 file changed, 63 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 668d4bd0c118f1..577f8811925a59 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -88,6 +88,7 @@ nouveau_ivmm_find(struct nouveau_svm *svm, u64 inst)
 }
 
 struct nouveau_svmm {
+   struct mmu_notifier notifier;
struct nouveau_vmm *vmm;
struct {
unsigned long start;
@@ -96,7 +97,6 @@ struct nouveau_svmm {
 
struct mutex mutex;
 
-   struct mm_struct *mm;
struct hmm_mirror mirror;
 };
 
@@ -251,10 +251,11 @@ nouveau_svmm_invalidate(struct nouveau_svmm *svmm, u64 
start, u64 limit)
 }
 
 static int
-nouveau_svmm_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
-   const struct mmu_notifier_range *update)
+nouveau_svmm_invalidate_range_start(struct mmu_notifier *mn,
+   const struct mmu_notifier_range *update)
 {
-   struct nouveau_svmm *svmm = container_of(mirror, typeof(*svmm), mirror);
+   struct nouveau_svmm *svmm =
+   container_of(mn, struct nouveau_svmm, notifier);
unsigned long start = update->start;
unsigned long limit = update->end;
 
@@ -264,6 +265,9 @@ nouveau_svmm_sync_cpu_device_pagetables(struct hmm_mirror 
*mirror,
SVMM_DBG(svmm, "invalidate %016lx-%016lx", start, limit);
 
mutex_lock(>mutex);
+   if (unlikely(!svmm->vmm))
+   goto out;
+
if (limit > svmm->unmanaged.start && start < svmm->unmanaged.limit) {
if (start < svmm->unmanaged.start) {
nouveau_svmm_invalidate(svmm, start,
@@ -273,19 +277,31 @@ nouveau_svmm_sync_cpu_device_pagetables(struct hmm_mirror 
*mirror,
}
 
nouveau_svmm_invalidate(svmm, start, limit);
+
+out:
mutex_unlock(>mutex);
return 0;
 }
 
-static void
-nouveau_svmm_release(struct hmm_mirror *mirror)
+static void nouveau_svmm_free_notifier(struct mmu_notifier *mn)
+{
+   kfree(container_of(mn, struct nouveau_svmm, notifier));
+}
+
+static const struct mmu_notifier_ops nouveau_mn_ops = {
+   .invalidate_range_start = nouveau_svmm_invalidate_range_start,
+   .free_notifier = nouveau_svmm_free_notifier,
+};
+
+static int
+nouveau_svmm_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
+   const struct mmu_notifier_range *update)
 {
+   return 0;
 }
 
-static const struct hmm_mirror_ops
-nouveau_svmm = {
+static const struct hmm_mirror_ops nouveau_svmm = {
.sync_cpu_device_pagetables = nouveau_svmm_sync_cpu_device_pagetables,
-   .release = nouveau_svmm_release,
 };
 
 void
@@ -294,7 +310,10 @@ nouveau_svmm_fini(struct nouveau_svmm **psvmm)
struct nouveau_svmm *svmm = *psvmm;
if (svmm) {
hmm_mirror_unregister(>mirror);
-   kfree(*psvmm);
+   mutex_lock(>mutex);
+   svmm->vmm = NULL;
+   mutex_unlock(>mutex);
+   mmu_notifier_put(>notifier);
*psvmm = NULL;
}
 }
@@ -320,7 +339,7 @@ nouveau_svmm_init(struct drm_device *dev, void *data,
mutex_lock(>mutex);
if (cli->svm.cli) {
ret = -EBUSY;
-   goto done;
+   goto out_free;
}
 
/* Allocate a new GPU VMM that can support SVM (managed by the
@@ -335,24 +354,33 @@ nouveau_svmm_init(struct drm_device *dev, void *data,
.fault_replay = true,
}, sizeof(struct gp100_vmm_v0), >svm.vmm);
if (ret)
-   goto done;
+   goto out_free;
 
-   /* Enable HMM mirroring of CPU address-space to VMM. */
-   svmm->mm = get_task_mm(current);
-   down_write(>mm->mmap_sem);
+   down_write(>mm->mmap_sem);
svmm->mirror.ops = _svmm;
-   ret = hmm_mirror_register(>mirror, svmm->mm);
-   if (ret == 0) {
-   cli->svm.svmm = svmm;
-   cli->svm.cli = cli;
-   }
-   up_write(>mm->mmap_sem);
-   mmput(svmm->mm);
+   ret = hmm_mirror_register(>mirror, current->mm);
+   if (ret)
+   goto out_mm_unlock;
 
-done:
+   svmm->notifier.ops = _mn_ops;
+   ret = __mmu_notifier_register(>notifier, current->mm);
if (ret)
-   nouveau_svmm_fini();
+   goto out_hmm_unregister;
+   /* Note, ownership of svmm transfers to 

[Xen-devel] [PATCH v3 07/14] drm/radeon: use mmu_interval_notifier_insert

2019-11-12 Thread Jason Gunthorpe
From: Jason Gunthorpe 

The new API is an exact match for the needs of radeon.

For some reason radeon tries to remove overlapping ranges from the
interval tree, but interval trees (and mmu_interval_notifier_insert())
support overlapping ranges directly. Simply delete all this code.

Since this driver is missing a invalidate_range_end callback, but
still calls get_user_pages(), it cannot be correct against all races.

Reviewed-by: Christian König 
Signed-off-by: Jason Gunthorpe 
---
 drivers/gpu/drm/radeon/radeon.h|   9 +-
 drivers/gpu/drm/radeon/radeon_mn.c | 218 ++---
 2 files changed, 51 insertions(+), 176 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index d59b004f669583..30e32adc1fc666 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -68,6 +68,10 @@
 #include 
 #include 
 
+#ifdef CONFIG_MMU_NOTIFIER
+#include 
+#endif
+
 #include 
 #include 
 #include 
@@ -509,8 +513,9 @@ struct radeon_bo {
struct ttm_bo_kmap_obj  dma_buf_vmap;
pid_t   pid;
 
-   struct radeon_mn*mn;
-   struct list_headmn_list;
+#ifdef CONFIG_MMU_NOTIFIER
+   struct mmu_interval_notifiernotifier;
+#endif
 };
 #define gem_to_radeon_bo(gobj) container_of((gobj), struct radeon_bo, tbo.base)
 
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c 
b/drivers/gpu/drm/radeon/radeon_mn.c
index dbab9a3a969b9e..f93829f08a4dc1 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -36,131 +36,51 @@
 
 #include "radeon.h"
 
-struct radeon_mn {
-   struct mmu_notifier mn;
-
-   /* objects protected by lock */
-   struct mutexlock;
-   struct rb_root_cached   objects;
-};
-
-struct radeon_mn_node {
-   struct interval_tree_node   it;
-   struct list_headbos;
-};
-
 /**
- * radeon_mn_invalidate_range_start - callback to notify about mm change
+ * radeon_mn_invalidate - callback to notify about mm change
  *
  * @mn: our notifier
- * @mn: the mm this callback is about
- * @start: start of updated range
- * @end: end of updated range
+ * @range: the VMA under invalidation
  *
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
-static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
-   const struct mmu_notifier_range *range)
+static bool radeon_mn_invalidate(struct mmu_interval_notifier *mn,
+const struct mmu_notifier_range *range,
+unsigned long cur_seq)
 {
-   struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
+   struct radeon_bo *bo = container_of(mn, struct radeon_bo, notifier);
struct ttm_operation_ctx ctx = { false, false };
-   struct interval_tree_node *it;
-   unsigned long end;
-   int ret = 0;
-
-   /* notification is exclusive, but interval is inclusive */
-   end = range->end - 1;
-
-   /* TODO we should be able to split locking for interval tree and
-* the tear down.
-*/
-   if (mmu_notifier_range_blockable(range))
-		mutex_lock(&rmn->lock);
-	else if (!mutex_trylock(&rmn->lock))
-		return -EAGAIN;
-
-	it = interval_tree_iter_first(&rmn->objects, range->start, end);
-   while (it) {
-   struct radeon_mn_node *node;
-   struct radeon_bo *bo;
-   long r;
-
-   if (!mmu_notifier_range_blockable(range)) {
-   ret = -EAGAIN;
-   goto out_unlock;
-   }
-
-   node = container_of(it, struct radeon_mn_node, it);
-   it = interval_tree_iter_next(it, range->start, end);
+   long r;
 
-		list_for_each_entry(bo, &node->bos, mn_list) {
+   if (!bo->tbo.ttm || bo->tbo.ttm->state != tt_bound)
+   return true;
 
-   if (!bo->tbo.ttm || bo->tbo.ttm->state != tt_bound)
-   continue;
+   if (!mmu_notifier_range_blockable(range))
+   return false;
 
-   r = radeon_bo_reserve(bo, true);
-   if (r) {
-   DRM_ERROR("(%ld) failed to reserve user bo\n", 
r);
-   continue;
-   }
-
-   r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv,
-   true, false, MAX_SCHEDULE_TIMEOUT);
-   if (r <= 0)
-   DRM_ERROR("(%ld) failed to wait for user bo\n", 
r);
-
-   radeon_ttm_placement_from_domain(bo, 
RADEON_GEM_DOMAIN_CPU);
-			r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
-

[Xen-devel] [PATCH v3 01/14] mm/mmu_notifier: define the header pre-processor parts even if disabled

2019-11-12 Thread Jason Gunthorpe
From: Jason Gunthorpe 

Now that we have KERNEL_HEADER_TEST all headers are generally compile
tested, so relying on makefile tricks to avoid compiling code that depends
on CONFIG_MMU_NOTIFIER is more annoying.

Instead follow the usual pattern and provide most of the header with only
the functions stubbed out when CONFIG_MMU_NOTIFIER is disabled. This
ensures code compiles no matter what the config setting is.
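
The usual shape being followed is roughly this (illustrative names, not
the actual mmu_notifier declarations):

	#ifdef CONFIG_MMU_NOTIFIER
	void foo_notify_release(struct mm_struct *mm);
	#else
	static inline void foo_notify_release(struct mm_struct *mm)
	{
	}
	#endif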

While here, struct mmu_notifier_mm is private to mmu_notifier.c, move it.

Reviewed-by: Jérôme Glisse 
Tested-by: Ralph Campbell 
Reviewed-by: John Hubbard 
Signed-off-by: Jason Gunthorpe 
---
 include/linux/mmu_notifier.h | 46 +---
 mm/mmu_notifier.c| 13 ++
 2 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 1bd8e6a09a3c27..12bd603d318ce7 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -7,8 +7,9 @@
 #include 
 #include 
 
+struct mmu_notifier_mm;
 struct mmu_notifier;
-struct mmu_notifier_ops;
+struct mmu_notifier_range;
 
 /**
  * enum mmu_notifier_event - reason for the mmu notifier callback
@@ -40,36 +41,8 @@ enum mmu_notifier_event {
MMU_NOTIFY_SOFT_DIRTY,
 };
 
-#ifdef CONFIG_MMU_NOTIFIER
-
-#ifdef CONFIG_LOCKDEP
-extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;
-#endif
-
-/*
- * The mmu notifier_mm structure is allocated and installed in
- * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
- * critical section and it's released only when mm_count reaches zero
- * in mmdrop().
- */
-struct mmu_notifier_mm {
-   /* all mmu notifiers registerd in this mm are queued in this list */
-   struct hlist_head list;
-   /* to serialize the list modifications and hlist_unhashed */
-   spinlock_t lock;
-};
-
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
 
-struct mmu_notifier_range {
-   struct vm_area_struct *vma;
-   struct mm_struct *mm;
-   unsigned long start;
-   unsigned long end;
-   unsigned flags;
-   enum mmu_notifier_event event;
-};
-
 struct mmu_notifier_ops {
/*
 * Called either by mmu_notifier_unregister or when the mm is
@@ -249,6 +222,21 @@ struct mmu_notifier {
unsigned int users;
 };
 
+#ifdef CONFIG_MMU_NOTIFIER
+
+#ifdef CONFIG_LOCKDEP
+extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;
+#endif
+
+struct mmu_notifier_range {
+   struct vm_area_struct *vma;
+   struct mm_struct *mm;
+   unsigned long start;
+   unsigned long end;
+   unsigned flags;
+   enum mmu_notifier_event event;
+};
+
 static inline int mm_has_notifiers(struct mm_struct *mm)
 {
return unlikely(mm->mmu_notifier_mm);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 7fde88695f35d6..367670cfd02b7b 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -27,6 +27,19 @@ struct lockdep_map __mmu_notifier_invalidate_range_start_map 
= {
 };
 #endif
 
+/*
+ * The mmu notifier_mm structure is allocated and installed in
+ * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
+ * critical section and it's released only when mm_count reaches zero
+ * in mmdrop().
+ */
+struct mmu_notifier_mm {
+   /* all mmu notifiers registered in this mm are queued in this list */
+   struct hlist_head list;
+   /* to serialize the list modifications and hlist_unhashed */
+   spinlock_t lock;
+};
+
 /*
  * This function can't run concurrently against mmu_notifier_register
  * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
-- 
2.24.0


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [PATCH v3 03/14] mm/hmm: allow hmm_range to be used with a mmu_interval_notifier or hmm_mirror

2019-11-12 Thread Jason Gunthorpe
From: Jason Gunthorpe 

hmm_mirror's handling of ranges does not use a sequence count which
results in this bug:

 CPU0   CPU1
 hmm_range_wait_until_valid(range)
 valid == true
 hmm_range_fault(range)
hmm_invalidate_range_start()
   range->valid = false
hmm_invalidate_range_end()
   range->valid = true
 hmm_range_valid(range)
  valid == true

Where the hmm_range_valid() should not have succeeded.

Adding the required sequence count would make it nearly identical to the
new mmu_interval_notifier. Instead replace the hmm_mirror stuff with
mmu_interval_notifier.

Co-existence of the two APIs is the first step.
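
With the two new fields a caller drives hmm_range_fault() roughly like
this (a sketch of the intended usage; the notifier and mm variables are
illustrative):

	range.notifier = &mni;
	range.notifier_seq = mmu_interval_read_begin(&mni);
	down_read(&mm->mmap_sem);
	ret = hmm_range_fault(&range, 0);
	up_read(&mm->mmap_sem);
	/* the result is only usable if a later mmu_interval_read_retry(),
	 * done under the driver lock, still matches notifier_seq */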

Reviewed-by: Jérôme Glisse 
Tested-by: Philip Yang 
Tested-by: Ralph Campbell 
Signed-off-by: Jason Gunthorpe 
---
 include/linux/hmm.h |  5 +
 mm/hmm.c| 25 +++--
 2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 3fec513b9c00f1..fbb35c78637e57 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -145,6 +145,9 @@ enum hmm_pfn_value_e {
 /*
  * struct hmm_range - track invalidation lock on virtual address range
  *
+ * @notifier: an optional mmu_interval_notifier
+ * @notifier_seq: when notifier is used this is the result of
+ *mmu_interval_read_begin()
  * @hmm: the core HMM structure this range is active against
  * @vma: the vm area struct for the range
  * @list: all range lock are on a list
@@ -159,6 +162,8 @@ enum hmm_pfn_value_e {
  * @valid: pfns array did not change since it has been fill by an HMM function
  */
 struct hmm_range {
+   struct mmu_interval_notifier *notifier;
+   unsigned long   notifier_seq;
struct hmm  *hmm;
struct list_headlist;
unsigned long   start;
diff --git a/mm/hmm.c b/mm/hmm.c
index 6b0136665407a3..8d060c5dabe37b 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -858,6 +858,14 @@ void hmm_range_unregister(struct hmm_range *range)
 }
 EXPORT_SYMBOL(hmm_range_unregister);
 
+static bool needs_retry(struct hmm_range *range)
+{
+   if (range->notifier)
+   return mmu_interval_check_retry(range->notifier,
+   range->notifier_seq);
+   return !range->valid;
+}
+
 static const struct mm_walk_ops hmm_walk_ops = {
.pud_entry  = hmm_vma_walk_pud,
.pmd_entry  = hmm_vma_walk_pmd,
@@ -898,18 +906,23 @@ long hmm_range_fault(struct hmm_range *range, unsigned 
int flags)
const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
unsigned long start = range->start, end;
struct hmm_vma_walk hmm_vma_walk;
-   struct hmm *hmm = range->hmm;
+   struct mm_struct *mm;
struct vm_area_struct *vma;
int ret;
 
-	lockdep_assert_held(&hmm->mmu_notifier.mm->mmap_sem);
+   if (range->notifier)
+   mm = range->notifier->mm;
+   else
+   mm = range->hmm->mmu_notifier.mm;
+
+	lockdep_assert_held(&mm->mmap_sem);
 
do {
/* If range is no longer valid force retry. */
-   if (!range->valid)
+   if (needs_retry(range))
return -EBUSY;
 
-   vma = find_vma(hmm->mmu_notifier.mm, start);
+   vma = find_vma(mm, start);
if (vma == NULL || (vma->vm_flags & device_vma))
return -EFAULT;
 
@@ -939,7 +952,7 @@ long hmm_range_fault(struct hmm_range *range, unsigned int 
flags)
start = hmm_vma_walk.last;
 
/* Keep trying while the range is valid. */
-   } while (ret == -EBUSY && range->valid);
+   } while (ret == -EBUSY && !needs_retry(range));
 
if (ret) {
unsigned long i;
@@ -997,7 +1010,7 @@ long hmm_range_dma_map(struct hmm_range *range, struct 
device *device,
continue;
 
/* Check if range is being invalidated */
-   if (!range->valid) {
+   if (needs_retry(range)) {
ret = -EBUSY;
goto unmap;
}
-- 
2.24.0


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [PATCH hmm v3 00/14] Consolidate the mmu notifier interval_tree and locking

2019-11-12 Thread Jason Gunthorpe
From: Jason Gunthorpe 

Eight of the mmu_notifier-using drivers (i915_gem, radeon_mn, umem_odp, hfi1,
scif_dma, vhost, gntdev, hmm) use a common pattern where
they only use invalidate_range_start/end and immediately check the
invalidating range against some driver data structure to tell if the
driver is interested. Half of them use an interval_tree, the others are
simple linear search lists.

Of the ones I checked they largely seem to have various kinds of races,
bugs and poor implementation. This is a result of the complexity in how
the notifier interacts with get_user_pages(). It is extremely difficult to
use it correctly.

Consolidate all of this code together into the core mmu_notifier and
provide a locking scheme similar to hmm_mirror that allows the user to
safely use get_user_pages() and reliably know if the page list still
matches the mm.

This new arrangement plays nicely with the !blockable mode for
OOM. Scanning the interval tree is done such that the intersection test
will always succeed, and since there is no invalidate_range_end exposed to
drivers the scheme safely allows multiple drivers to be subscribed.
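
For a driver the resulting usage pattern is roughly (an illustrative
sketch; the lock helpers are placeholders for whatever lock the driver
already uses to serialize its device page table updates):

	mmu_interval_notifier_insert(&mni, mm, start, length, &driver_ops);

again:
	seq = mmu_interval_read_begin(&mni);
	/* ... hmm_range_fault()/get_user_pages() to build the page list ... */
	driver_lock_page_tables();
	if (mmu_interval_read_retry(&mni, seq)) {
		driver_unlock_page_tables();
		goto again;	/* an invalidation ran concurrently */
	}
	/* establish the device mapping from the page list */
	driver_unlock_page_tables();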

Four places are converted as an example of how the new API is used.
Four are left for future patches:
 - i915_gem has complex locking around destruction of a registration,
   needs more study
 - hfi1 (2nd user) needs access to the rbtree
 - scif_dma has a complicated logic flow
 - vhost's mmu notifiers are already being rewritten

This is already in linux-next, a git tree is available here:

 https://github.com/jgunthorpe/linux/commits/mmu_notifier

v3:
- Rename mmu_range_notifier to mmu_interval_notifier for clarity
  Avoids confusion with struct mmu_notifier_range
- Fix bugs in odp, amdgpu and xen gntdev from testing
- Make ops an argument to mmu_interval_notifier_insert() to make it
  harder to misuse
- Update many comments
- Add testing of mm_count during insertion

v2: https://lore.kernel.org/r/20191028201032.6352-1-...@ziepe.ca
v1: https://lore.kernel.org/r/20191015181242.8343-1-...@ziepe.ca

Absent any new discussion I think this will go to Linus at the next merge
window.

Thanks to everyone to helped!

Jason Gunthorpe (14):
  mm/mmu_notifier: define the header pre-processor parts even if
disabled
  mm/mmu_notifier: add an interval tree notifier
  mm/hmm: allow hmm_range to be used with a mmu_interval_notifier or
hmm_mirror
  mm/hmm: define the pre-processor related parts of hmm.h even if
disabled
  RDMA/odp: Use mmu_interval_notifier_insert()
  RDMA/hfi1: Use mmu_interval_notifier_insert for user_exp_rcv
  drm/radeon: use mmu_interval_notifier_insert
  nouveau: use mmu_notifier directly for invalidate_range_start
  nouveau: use mmu_interval_notifier instead of hmm_mirror
  drm/amdgpu: Call find_vma under mmap_sem
  drm/amdgpu: Use mmu_interval_insert instead of hmm_mirror
  drm/amdgpu: Use mmu_interval_notifier instead of hmm_mirror
  mm/hmm: remove hmm_mirror and related
  xen/gntdev: use mmu_interval_notifier_insert

 Documentation/vm/hmm.rst  | 105 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h   |   2 +
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   9 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c|  14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c| 443 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h|  53 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h|  13 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   | 145 +++--
 drivers/gpu/drm/nouveau/nouveau_svm.c | 230 ---
 drivers/gpu/drm/radeon/radeon.h   |   9 +-
 drivers/gpu/drm/radeon/radeon_mn.c| 218 ++-
 drivers/infiniband/core/device.c  |   1 -
 drivers/infiniband/core/umem_odp.c| 303 ++
 drivers/infiniband/hw/hfi1/file_ops.c |   2 +-
 drivers/infiniband/hw/hfi1/hfi.h  |   2 +-
 drivers/infiniband/hw/hfi1/user_exp_rcv.c | 146 ++---
 drivers/infiniband/hw/hfi1/user_exp_rcv.h |   3 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h  |   7 +-
 drivers/infiniband/hw/mlx5/mr.c   |   3 +-
 drivers/infiniband/hw/mlx5/odp.c  |  50 +-
 drivers/xen/gntdev-common.h   |   8 +-
 drivers/xen/gntdev.c  | 179 ++
 include/linux/hmm.h   | 195 +-
 include/linux/mmu_notifier.h  | 147 -
 include/rdma/ib_umem_odp.h|  68 +--
 include/rdma/ib_verbs.h   |   2 -
 kernel/fork.c |   1 -
 mm/Kconfig|   2 +-
 mm/hmm.c  | 276 +
 mm/mmu_notifier.c | 565 +-
 31 files changed, 1271 insertions(+), 1931 deletions(-)

-- 
2.24.0


___
Xen-devel mailing list
Xen-devel@lists.xenproject.org

Re: [Xen-devel] [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier

2019-11-08 Thread Jason Gunthorpe
On Thu, Nov 07, 2019 at 09:00:34PM -0500, Jerome Glisse wrote:
> On Fri, Nov 08, 2019 at 12:32:25AM +0000, Jason Gunthorpe wrote:
> > On Thu, Nov 07, 2019 at 04:04:08PM -0500, Jerome Glisse wrote:
> > > On Thu, Nov 07, 2019 at 08:11:06PM +0000, Jason Gunthorpe wrote:
> > > > On Wed, Nov 06, 2019 at 09:08:07PM -0500, Jerome Glisse wrote:
> > > > 
> > > > > > 
> > > > > > Extra credit: IMHO, this clearly deserves to all be in a new 
> > > > > > mmu_range_notifier.h
> > > > > > header file, but I know that's extra work. Maybe later as a 
> > > > > > follow-up patch,
> > > > > > if anyone has the time.
> > > > > 
> > > > > The range notifier should get the event too, it would be a waste, i 
> > > > > think it is
> > > > > an oversight here. The release event is fine so NAK to you separate 
> > > > > event. Event
> > > > > is really an helper for notifier i had a set of patch for nouveau to 
> > > > > leverage
> > > > > this i need to resucite them. So no need to split thing, i would just 
> > > > > forward
> > > > > the event ie add event to mmu_range_notifier_ops.invalidate() i 
> > > > > failed to catch
> > > > > that in v1 sorry.
> > > > 
> > > > I think what you mean is already done?
> > > > 
> > > > struct mmu_range_notifier_ops {
> > > > bool (*invalidate)(struct mmu_range_notifier *mrn,
> > > >const struct mmu_notifier_range *range,
> > > >unsigned long cur_seq);
> > > 
> > > Yes it is sorry, i got confuse with mmu_range_notifier and 
> > > mmu_notifier_range :)
> > > It is almost a palyndrome structure ;)
> > 
> > Lets change the name then, this is clearly not working. I'll reflow
> > everything tomorrow
> 
> Semantic patch to do that run from your linux kernel directory with your patch
> applied (you can run it one patch after the other and the git commit -a 
> --fixup HEAD)
> 
> spatch --sp-file name-of-the-file-below --dir . --all-includes --in-place
> 
> %< --
> @@
> @@
> struct
> -mmu_range_notifier
> +mmu_interval_notifier
> 
> @@
> @@
> struct
> -mmu_range_notifier
> +mmu_interval_notifier
> {...};
> 
> // Change mrn name to mmu_in
> @@
> struct mmu_interval_notifier *mrn;
> @@
> -mrn
> +mmu_in
> 
> @@
> identifier fn;
> @@
> fn(..., 
> -struct mmu_interval_notifier *mrn,
> +struct mmu_interval_notifier *mmu_in,
> ...) {...}
> 
> You need coccinelle (which provides spatch). It is untested but it should work
> also i could not come up with a nice name to update mrn as min is way too
> confusing. If you have better name feel free to use it.

I used 'mni' as we already use 'mn' to refer to the notifier, and
'mmu_in' looks like some input parameter or something

It mostly worked, lots of comments to fix manually though:

https://github.com/jgunthorpe/linux/commits/mmu_notifier

Thanks,
Jason

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier

2019-11-08 Thread Jason Gunthorpe
On Thu, Nov 07, 2019 at 12:53:56PM -0800, John Hubbard wrote:
> > > > +/**
> > > > + * struct mmu_range_notifier_ops
> > > > + * @invalidate: Upon return the caller must stop using any SPTEs 
> > > > within this
> > > > + *  range, this function can sleep. Return false if 
> > > > blocking was
> > > > + *  required but range is non-blocking
> > > > + */
> > > 
> > > How about this (I'm not sure I fully understand the return value, though):
> > > 
> > > /**
> > >   * struct mmu_range_notifier_ops
> > >   * @invalidate: Upon return the caller must stop using any SPTEs within 
> > > this
> > >   *   range.
> > >   *
> > >   *   This function is permitted to sleep.
> > >   *
> > >   *   @Return: false if blocking was required, but @range is
> > >   *   non-blocking.
> > >   *
> > >   */
> > 
> > Is this kdoc format for function pointers?
> 
> heh, I'm sort of winging it, I'm not sure how function pointers are supposed
> to be documented in kdoc. Actually the only key take-away here is to write
> 
> "This function can sleep"
> 
> as a separate sentence..

Sure

> > This odd duality has already cause some confusion, but names here are
> > hard.  mmu_interval_notifier is the best alternative I've heard.
> > 
> > Changing this name is a lot of work - are we happy
> > 'mmu_interval_notifier' is the right choice? 
> 
> Yes, it's my favorite too. I'd vote for going with that.

Okay, lets give it a go

> Very nice, would you be open to putting that into (any) one of the comment
> headers? That's an unusually clear and concise description:

Yep, done

> > > > +int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
> > > > + unsigned long start, unsigned long length,
> > > > + struct mm_struct *mm)
> > > > +{
> > > > +   struct mmu_notifier_mm *mmn_mm;
> > > > +   int ret;
> > > 
> > > Hmmm, I think a later patch improperly changes the above to "int ret = 
> > > 0;".
> > > I'll check on that. It's correct here, though.
> > 
> > Looks OK in my tree?
> 
> Nope, that's how I found it. The top of your mmu_notifier branch has this:
> 
> int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
> {
> struct mmu_notifier_mm *mmn_mm = range->mm->mmu_notifier_mm;
> int ret = 0;
> 
> if (mmn_mm->has_interval) {
> ret = mn_itree_invalidate(mmn_mm, range);
> if (ret)
> return ret;
> }
> 	if (!hlist_empty(&mmn_mm->list))
> return mn_hlist_invalidate_range_start(mmn_mm, range);
> return 0;
> }

Ah, that is a different function :) Fixed

> Looks good. We're just polishing up minor points now, so you can add:
> 
> Reviewed-by: John Hubbard 

Great, thanks, I'll post a v3 with the rename

Jason

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert

2019-11-08 Thread Jason Gunthorpe
On Thu, Nov 07, 2019 at 05:54:52PM -0500, Boris Ostrovsky wrote:
> On 11/7/19 3:36 PM, Jason Gunthorpe wrote:
> > On Tue, Nov 05, 2019 at 10:16:46AM -0500, Boris Ostrovsky wrote:
> >
> >>> So, I suppose it can be relaxed to a null test and a WARN_ON that it
> >>> hasn't changed?
> >> You mean
> >>
> >> if (use_ptemod) {
> >>     WARN_ON(map->vma != vma);
> >>     ...
> >>
> >>
> >> Yes, that sounds good.
> > I amended my copy of the patch with the above, has this rework shown
> > signs of working?
> 
> Yes, it works fine.
> 
> But please don't forget notifier ops initialization.
> 
> With those two changes,
> 
> Reviewed-by: Boris Ostrovsky 

Thanks, I got both things. I'll forward this toward linux-next and
repost a v3 

Jason

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier

2019-11-07 Thread Jason Gunthorpe
On Thu, Nov 07, 2019 at 04:04:08PM -0500, Jerome Glisse wrote:
> On Thu, Nov 07, 2019 at 08:11:06PM +0000, Jason Gunthorpe wrote:
> > On Wed, Nov 06, 2019 at 09:08:07PM -0500, Jerome Glisse wrote:
> > 
> > > > 
> > > > Extra credit: IMHO, this clearly deserves to all be in a new 
> > > > mmu_range_notifier.h
> > > > header file, but I know that's extra work. Maybe later as a follow-up 
> > > > patch,
> > > > if anyone has the time.
> > > 
> > > The range notifier should get the event too, it would be a waste, i think 
> > > it is
> > > an oversight here. The release event is fine so NAK to you separate 
> > > event. Event
> > > is really an helper for notifier i had a set of patch for nouveau to 
> > > leverage
> > > this i need to resucite them. So no need to split thing, i would just 
> > > forward
> > > the event ie add event to mmu_range_notifier_ops.invalidate() i failed to 
> > > catch
> > > that in v1 sorry.
> > 
> > I think what you mean is already done?
> > 
> > struct mmu_range_notifier_ops {
> > bool (*invalidate)(struct mmu_range_notifier *mrn,
> >const struct mmu_notifier_range *range,
> >unsigned long cur_seq);
> 
> Yes it is sorry, i got confuse with mmu_range_notifier and mmu_notifier_range 
> :)
> It is almost a palyndrome structure ;)

Lets change the name then, this is clearly not working. I'll reflow
everything tomorrow

Jason

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert

2019-11-07 Thread Jason Gunthorpe
On Tue, Nov 05, 2019 at 10:16:46AM -0500, Boris Ostrovsky wrote:

> > So, I suppose it can be relaxed to a null test and a WARN_ON that it
> > hasn't changed?
> 
> You mean
> 
> if (use_ptemod) {
>     WARN_ON(map->vma != vma);
>     ...
> 
> 
> Yes, that sounds good.

I amended my copy of the patch with the above, has this rework shown
signs of working?

@@ -436,7 +436,8 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
struct gntdev_priv *priv = file->private_data;
 
pr_debug("gntdev_vma_close %p\n", vma);
-   if (use_ptemod && map->vma == vma) {
+   if (use_ptemod) {
+   WARN_ON(map->vma != vma);
 		mmu_range_notifier_remove(&map->notifier);
map->vma = NULL;
}

Jason

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier

2019-11-07 Thread Jason Gunthorpe
On Wed, Nov 06, 2019 at 09:08:07PM -0500, Jerome Glisse wrote:

> > 
> > Extra credit: IMHO, this clearly deserves to all be in a new 
> > mmu_range_notifier.h
> > header file, but I know that's extra work. Maybe later as a follow-up patch,
> > if anyone has the time.
> 
> The range notifier should get the event too, it would be a waste, i think it 
> is
> an oversight here. The release event is fine so NAK to you separate event. 
> Event
> is really an helper for notifier i had a set of patch for nouveau to leverage
> this i need to resucite them. So no need to split thing, i would just forward
> the event ie add event to mmu_range_notifier_ops.invalidate() i failed to 
> catch
> that in v1 sorry.

I think what you mean is already done?

struct mmu_range_notifier_ops {
bool (*invalidate)(struct mmu_range_notifier *mrn,
   const struct mmu_notifier_range *range,
   unsigned long cur_seq);

> No it is always odd, you must call mmu_range_set_seq() only from the
> op->invalidate_range() callback at which point the seq is odd. As well
> when mrn is added and its seq first set it is set to an odd value
> always. Maybe the comment, should read:
> 
>  * mrn->invalidate_seq is always, yes always, set to an odd value. This 
> ensures
> 
> To stress that it is not an error.

I went with this:

/*
 * mrn->invalidate_seq must always be set to an odd value via
 * mmu_range_set_seq() using the provided cur_seq from
 * mn_itree_inv_start_range(). This ensures that if seq does wrap we
 * will always clear the below sleep in some reasonable time as
 * mmn_mm->invalidate_seq is even in the idle state.
 */

> > > +	spin_lock(&mmn_mm->lock);
> > > +	if (mmn_mm->active_invalidate_ranges) {
> > > +		if (mn_itree_is_invalidating(mmn_mm))
> > > +			hlist_add_head(&mrn->deferred_item,
> > > +				       &mmn_mm->deferred_list);
> > > +		else {
> > > +			mmn_mm->invalidate_seq |= 1;
> > > +			interval_tree_insert(&mrn->interval_tree,
> > > +					     &mmn_mm->itree);
> > > + }
> > > + mrn->invalidate_seq = mmn_mm->invalidate_seq;
> > > + } else {
> > > + WARN_ON(mn_itree_is_invalidating(mmn_mm));
> > > + mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;
> > 
> > Ohhh, checkmate. I lose. Why is *subtracting* the right thing to do
> > for seq numbers here?  I'm acutely unhappy trying to figure this out.
> > I suspect it's another unfortunate side effect of trying to use the
> > lower bit of the seq number (even/odd) for something else.
> 
> If there is no mmn_mm->active_invalidate_ranges then it means that
> mmn_mm->invalidate_seq is even and thus mmn_mm->invalidate_seq - 1
> is an odd number which means that mrn->invalidate_seq is initialized
> to odd value and if you follow the rule for calling mmu_range_set_seq()
> then it will _always_ be an odd number and this close the loop with
> the above comments :)

The key thing is that it is an odd value that will take a long time
before mmn_mm->invalidate_seq reaches it.

> > > +	might_lock(&mm->mmap_sem);
> > > +
> > > +	mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);
> > 
> > What does the above pair with? Should have a comment that specifies that.
> 
> It was discussed in v1 but maybe a comment of what was said back then would
> be helpful. Something like:
> 
> /*
>  * We need to insure that all writes to mm->mmu_notifier_mm are visible before
>  * any checks we do on mmn_mm below as otherwise CPU might re-order write done
>  * by another CPU core to mm->mmu_notifier_mm structure fields after the read
>  * belows.
>  */

This comment made it, just at the store side:

/*
 * Serialize the update against mmu_notifier_unregister. A
 * side note: mmu_notifier_release can't run concurrently with
 * us because we hold the mm_users pin (either implicitly as
 * current->mm or explicitly with get_task_mm() or similar).
 * We can't race against any other mmu notifier method either
 * thanks to mm_take_all_locks().
 *
 * release semantics on the initialization of the mmu_notifier_mm's
 * contents are provided for unlocked readers.  acquire can only be
 * used while holding the mmgrab or mmget, and is safe because once
 * created the mmu_notififer_mm is not freed until the mm is
 * destroyed.  As above, users holding the mmap_sem or one of the
 * mm_take_all_locks() do not need to use acquire semantics.
 */
if (mmu_notifier_mm)
 		smp_store_release(&mm->mmu_notifier_mm, mmu_notifier_mm);

Which I think is really overly belaboring the typical smp
store/release pattern, but people do seem unfamiliar with them...
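
For reference, the bare pattern being belabored is just (illustrative,
not the mmu_notifier code itself):

	/* writer: fully initialize the object, then publish the pointer */
	p = kzalloc(sizeof(*p), GFP_KERNEL);
	p->a = 1;
	smp_store_release(&shared, p);

	/* lock-free reader: the acquire pairs with the release above, so a
	 * non-NULL pointer guarantees the initialized contents are visible */
	q = smp_load_acquire(&shared);
	if (q)
		use(q->a);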

Thanks,
Jason

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org

Re: [Xen-devel] [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier

2019-11-07 Thread Jason Gunthorpe
On Wed, Nov 06, 2019 at 04:23:21PM -0800, John Hubbard wrote:
 
> Nice design, I love the seq foundation! So far, I'm not able to spot anything
> actually wrong with the implementation, sorry about that. 

Alas :( I feel there must be a bug in here still, but onwards!

One of the main sad points was that it didn't make sense to use the
existing seqlock/seqcount primitives, as they have both the wrong write
concurrency model and extra barriers that are not needed when everything
is always manipulated under a spinlock.
 
> 1. There is a rather severe naming overlap (not technically a naming conflict,
> but still) with existing mmn work, which already has, for example:
> 
> struct mmu_notifier_range
> 
> ...and you're adding:
> 
> struct mmu_range_notifier
> 
> ...so I'll try to help sort that out.

Yes, I've been sad about this too.

> So this should read:
> 
> enum mmu_range_notifier_event {
>   MMU_NOTIFY_RELEASE,
> };
> 
> ...assuming that we stay with "mmu_range_notifier" as a core name for this 
> whole thing.
> 
> Also, it is best moved down to be next to the new MNR structs, so that all the
> MNR stuff is in one group.

I agree with Jerome, this enum is part of the 'struct
mmu_notifier_range' (ie the description of the invalidation) and it
doesn't really matter that only these new notifiers can be called with
this type, it is still part of the mmu_notifier_range.

The comment already says it only applies to the mmu_range_notifier
scheme..

> >  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> > @@ -222,6 +228,26 @@ struct mmu_notifier {
> > unsigned int users;
> >  };
> >  
> 
> That should also be moved down, next to the new structs.

Which this?

> > +/**
> > + * struct mmu_range_notifier_ops
> > + * @invalidate: Upon return the caller must stop using any SPTEs within 
> > this
> > + *  range, this function can sleep. Return false if blocking 
> > was
> > + *  required but range is non-blocking
> > + */
> 
> How about this (I'm not sure I fully understand the return value, though):
> 
> /**
>  * struct mmu_range_notifier_ops
>  * @invalidate: Upon return the caller must stop using any SPTEs within this
>  *range.
>  *
>  *This function is permitted to sleep.
>  *
>  *@Return: false if blocking was required, but @range is
>  *non-blocking.
>  *
>  */

Is this kdoc format for function pointers?
 
> 
> > +struct mmu_range_notifier_ops {
> > +   bool (*invalidate)(struct mmu_range_notifier *mrn,
> > +  const struct mmu_notifier_range *range,
> > +  unsigned long cur_seq);
> > +};
> > +
> > +struct mmu_range_notifier {
> > +   struct interval_tree_node interval_tree;
> > +   const struct mmu_range_notifier_ops *ops;
> > +   struct hlist_node deferred_item;
> > +   unsigned long invalidate_seq;
> > +   struct mm_struct *mm;
> > +};
> > +
> 
> Again, now we have the new struct mmu_range_notifier, and the old 
> struct mmu_notifier_range, and it's not good.
> 
> Ideas:
> 
> a) Live with it.
> 
> b) (Discarded, too many callers): rename old one. Nope.
> 
> c) Rename new one. Ideas:
> 
> struct mmu_interval_notifier
> struct mmu_range_intersection
> ...other ideas?

This odd duality has already cause some confusion, but names here are
hard.  mmu_interval_notifier is the best alternative I've heard.

Changing this name is a lot of work - are we happy
'mmu_interval_notifier' is the right choice?

> > +/**
> > + * mmu_range_set_seq - Save the invalidation sequence
> 
> How about:
> 
>  * mmu_range_set_seq - Set the .invalidate_seq to a new value.

It is not a 'new value', it is a value that is provided to the
invalidate callback
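
ie the callback is expected to look roughly like this (a sketch; 'drv'
and its lock stand in for whatever driver lock also serializes
mmu_range_read_retry()):

	static bool drv_invalidate(struct mmu_range_notifier *mrn,
				   const struct mmu_notifier_range *range,
				   unsigned long cur_seq)
	{
		if (mmu_notifier_range_blockable(range))
			mutex_lock(&drv->lock);
		else if (!mutex_trylock(&drv->lock))
			return false;
		mmu_range_set_seq(mrn, cur_seq);	/* store the provided value */
		/* ... tear down device mappings for [range->start, range->end) ... */
		mutex_unlock(&drv->lock);
		return true;
	}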

> 
> > + * @mrn - The mrn passed to invalidate
> > + * @cur_seq - The cur_seq passed to invalidate
> > + *
> > + * This must be called unconditionally from the invalidate callback of a
> > + * struct mmu_range_notifier_ops under the same lock that is used to call
> > + * mmu_range_read_retry(). It updates the sequence number for later use by
> > + * mmu_range_read_retry().
> > + *
> > + * If the user does not call mmu_range_read_begin() or 
> > mmu_range_read_retry()
> 
> nit: "caller" is better than "user", when referring to...well, callers. 
> "user" 
> most often refers to user space, whereas a call stack and function calling is 
> clearly what you're referring to here (and in other places, especially "user 
> lock").

Done

> > +/**
> > + * mmu_range_check_retry - Test if a collision has occurred
> > + * mrn: The range under lock
> > + * seq: The return of the matching mmu_range_read_begin()
> > + *
> > + * This can be used in the critical section between mmu_range_read_begin() 
> > and
> > + * mmu_range_read_retry().  A return of true indicates an invalidation has
> > + * collided with this lock and a future mmu_range_read_retry() will return
> > + * true.
> > + *
> > + * False is not reliable and only suggests a collision has not happened. It
> 
> 

Re: [Xen-devel] [PATCH v2 01/15] mm/mmu_notifier: define the header pre-processor parts even if disabled

2019-11-06 Thread Jason Gunthorpe
On Tue, Nov 05, 2019 at 01:23:46PM -0800, John Hubbard wrote:
> On 10/28/19 1:10 PM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe 
> > 
> > Now that we have KERNEL_HEADER_TEST all headers are generally compile
> > tested, so relying on makefile tricks to avoid compiling code that depends
> > on CONFIG_MMU_NOTIFIER is more annoying.
> > 
> > Instead follow the usual pattern and provide most of the header with only
> > the functions stubbed out when CONFIG_MMU_NOTIFIER is disabled. This
> > ensures code compiles no matter what the config setting is.
> > 
> > While here, struct mmu_notifier_mm is private to mmu_notifier.c, move it.
> 
> and correct a minor spelling error in a comment. Good. :)
> 
> > 
> > Reviewed-by: Jérôme Glisse 
> > Signed-off-by: Jason Gunthorpe 
> >  include/linux/mmu_notifier.h | 46 +---
> >  mm/mmu_notifier.c| 13 ++
> >  2 files changed, 30 insertions(+), 29 deletions(-)
> > 
> 
> Because this is correct as-is, you can add:
> 
> Reviewed-by: John Hubbard 
> 

Thanks

> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 2b7485919ecf..107f9406a92d 100644
> +++ b/mm/mmu_notifier.c
> @@ -47,6 +47,16 @@ struct mmu_notifier_mm {
> struct hlist_head deferred_list;
>  };
>  
> +int mm_has_notifiers(struct mm_struct *mm)
> +{
> +   return unlikely(mm->mmu_notifier_mm);
> +}

This inline is performance sensitive, it needs to stay inlined..

Jason
___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert

2019-11-04 Thread Jason Gunthorpe
On Mon, Nov 04, 2019 at 05:03:31PM -0500, Boris Ostrovsky wrote:
> On 10/28/19 4:10 PM, Jason Gunthorpe wrote:
> > @@ -445,17 +438,9 @@ static void gntdev_vma_close(struct vm_area_struct 
> > *vma)
> > struct gntdev_priv *priv = file->private_data;
> >  
> > pr_debug("gntdev_vma_close %p\n", vma);
> > -   if (use_ptemod) {
> > -   /* It is possible that an mmu notifier could be running
> > -* concurrently, so take priv->lock to ensure that the vma won't
> > -* vanishing during the unmap_grant_pages call, since we will
> > -* spin here until that completes. Such a concurrent call will
> > -* not do any unmapping, since that has been done prior to
> > -* closing the vma, but it may still iterate the unmap_ops list.
> > -*/
> > -	mutex_lock(&priv->lock);
> > +   if (use_ptemod && map->vma == vma) {
> 
> 
> Is it possible for map->vma not to be equal to vma?

It could be NULL at least if use_ptemod is not set.

Otherwise, I'm not sure, the confusing bit is that the map comes from
here:

map = gntdev_find_map_index(priv, index, count);

It looks like the intent is that the map->vma is always set to the
only vma that has the map as private_data.

So, I suppose it can be relaxed to a null test and a WARN_ON that it
hasn't changed?

Jason

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking

2019-11-04 Thread Jason Gunthorpe
On Fri, Nov 01, 2019 at 01:54:45PM -0700, Ralph Campbell wrote:
> You can add my Tested-by for the mm and nouveau changes.
> IOW, patches 1-4, 10-11, and 15.
> 
> Tested-by: Ralph Campbell 

Got it, thanks

Jason

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking

2019-11-01 Thread Jason Gunthorpe
On Mon, Oct 28, 2019 at 05:10:17PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe 
> 
> 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, hfi1,
> scif_dma, vhost, gntdev, hmm) drivers are using a common pattern where
> they only use invalidate_range_start/end and immediately check the
> invalidating range against some driver data structure to tell if the
> driver is interested. Half of them use an interval_tree, the others are
> simple linear search lists.

Now that we have the most of the driver changes tested and reviewed
I'm going to move this series into linux-next via the hmm tree, minus
the xen gntdev patches as they are not working yet.

I will keep collecting acks and any additional changes.

Thanks,
Jason

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror

2019-11-01 Thread Jason Gunthorpe
On Fri, Nov 01, 2019 at 07:45:22PM +, Yang, Philip wrote:

> > This must be done inside the notifier_lock, after checking
> > mmu_range_read_retry(), all handling of the struct page must be
> > structured like that.
> > 
> Below change will fix this, then driver will call mmu_range_read_retry 
> second time using same range->notifier_seq to check if range is 
> invalidated inside amdgpu_cs_submit, this looks ok for me.

Lets defer this to some patch trying to fix it, I find it hard to
follow..

> @@ -868,6 +869,13 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo 
> *bo, struct page **pages)
>  goto out_free_pfns;
>  }
> 
> +   mutex_lock(&adev->notifier_lock);
> +
> +   if (mmu_range_read_retry(&bo->notifier, range->notifier_seq)) {
> +   mutex_unlock(&adev->notifier_lock);
> +   goto retry;
> +   }
> +
>  for (i = 0; i < ttm->num_pages; i++) {
>  pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
>  if (unlikely(!pages[i])) {
> @@ -875,10 +883,12 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo 
> *bo, struct page **pages)
> i, range->pfns[i]);
>  r = -ENOMEM;
> 
> +   mutex_unlock(&adev->notifier_lock);
>  goto out_free_pfns;
>  }
>  }

Well, maybe? 

The question now is what happens to 'pages' ? With this arrangment the
driver cannot touch 'pages' without also again going under the lock
and checking retry. 

If it doesn't touch it, then lets just move this device_entry_to_page
to a more appropriate place?

I'd prefer it if the driver could be structured in the normal way,
with a clear locked region where the page list is handled..

Jason

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert

2019-11-01 Thread Jason Gunthorpe
On Fri, Nov 01, 2019 at 02:51:46PM -0400, Boris Ostrovsky wrote:
> On 11/1/19 1:48 PM, Jason Gunthorpe wrote:
> > On Wed, Oct 30, 2019 at 12:55:37PM -0400, Boris Ostrovsky wrote:
> >> On 10/28/19 4:10 PM, Jason Gunthorpe wrote:
> >>> From: Jason Gunthorpe 
> >>>
> >>> gntdev simply wants to monitor a specific VMA for any notifier events,
> >>> this can be done straightforwardly using mmu_range_notifier_insert() over
> >>> the VMA's VA range.
> >>>
> >>> The notifier should be attached until the original VMA is destroyed.
> >>>
> >>> It is unclear if any of this is even sane, but at least a lot of duplicate
> >>> code is removed.
> >> I didn't have a chance to look at the patch itself yet but as a heads-up
> > Thanks Boris. I spent a bit of time and got a VM running with a xen
> > 4.9 hypervisor and a kernel with this patch series. It a ubuntu bionic
> > VM with the distro's xen stuff.
> >
> > Can you give some guidance how you made it crash? 
> 
> It crashes trying to dereference mrn->ops->invalidate in
> mn_itree_invalidate() when a guest exits.
> 
> I don't think you've initialized notifier ops. I don't see you using
> gntdev_mmu_ops anywhere.

So weird the compiler didn't complain about an unused static...

But yes, this is a mistake, it should be:

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 37b278857ad807..0ca35485fd3865 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -1011,6 +1011,7 @@ static int gntdev_mmap(struct file *flip, struct 
vm_area_struct *vma)
 
if (use_ptemod) {
map->vma = vma;
+		map->notifier.ops = &gntdev_mmu_ops;
 		err = mmu_range_notifier_insert_locked(
 			&map->notifier, vma->vm_start,
vma->vm_end - vma->vm_start, vma->vm_mm);

Thanks,
Jason

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

[Xen-devel] [PATCH v2a 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror

2019-11-01 Thread Jason Gunthorpe
Convert the collision-retry lock around hmm_range_fault to use the one now
provided by the mmu_range notifier.

Although this driver does not seem to use the collision retry lock that
hmm provides correctly, it can still be converted over to use the
mmu_range_notifier api instead of hmm_mirror without too much trouble.

This also deletes another place where a driver is associating additional
data (struct amdgpu_mn) with a mmu_struct.

Signed-off-by: Philip Yang 
Reviewed-by: Philip Yang 
Tested-by: Philip Yang 
Signed-off-by: Jason Gunthorpe 
---
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c|  14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c| 150 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h|  49 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   | 116 +-
 5 files changed, 96 insertions(+), 237 deletions(-)

Philip, here is what it loos like after combining the two patches, thanks

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 47700302a08b7f..1bcedb9b477dce 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1738,6 +1738,10 @@ static int update_invalid_user_pages(struct 
amdkfd_process_info *process_info,
return ret;
}
 
+   /*
+* FIXME: Cannot ignore the return code, must hold
+* notifier_lock
+*/
amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm);
 
/* Mark the BO as valid unless it was invalidated
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 82823d9a8ba887..22c989bca7514c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -603,8 +603,6 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
e->tv.num_shared = 2;
 
 	amdgpu_bo_list_get_list(p->bo_list, &p->validated);
-   if (p->bo_list->first_userptr != p->bo_list->num_entries)
-   p->mn = amdgpu_mn_get(p->adev, AMDGPU_MN_TYPE_GFX);
 
 	INIT_LIST_HEAD(&duplicates);
 	amdgpu_vm_get_pd_bo(&fpriv->vm, &p->validated, &p->vm_pd);
@@ -1287,11 +1285,11 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
if (r)
goto error_unlock;
 
-   /* No memory allocation is allowed while holding the mn lock.
-* p->mn is hold until amdgpu_cs_submit is finished and fence is added
-* to BOs.
+   /* No memory allocation is allowed while holding the notifier lock.
+* The lock is held until amdgpu_cs_submit is finished and fence is
+* added to BOs.
 */
-   amdgpu_mn_lock(p->mn);
+   mutex_lock(>adev->notifier_lock);
 
/* If userptr are invalidated after amdgpu_cs_parser_bos(), return
 * -EAGAIN, drmIoctl in libdrm will restart the amdgpu_cs_ioctl.
@@ -1334,13 +1332,13 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 	amdgpu_vm_move_to_lru_tail(p->adev, &fpriv->vm);
 
 	ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence);
-	amdgpu_mn_unlock(p->mn);
+	mutex_unlock(&p->adev->notifier_lock);
 
return 0;
 
 error_abort:
drm_sched_job_cleanup(>base);
-   amdgpu_mn_unlock(p->mn);
+	mutex_unlock(&p->adev->notifier_lock);
 
 error_unlock:
amdgpu_job_free(job);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index ac74320b71e4e7..f7be34907e54f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -50,28 +50,6 @@
 #include "amdgpu.h"
 #include "amdgpu_amdkfd.h"
 
-/**
- * amdgpu_mn_lock - take the write side lock for this notifier
- *
- * @mn: our notifier
- */
-void amdgpu_mn_lock(struct amdgpu_mn *mn)
-{
-   if (mn)
-		down_write(&mn->lock);
-}
-
-/**
- * amdgpu_mn_unlock - drop the write side lock for this notifier
- *
- * @mn: our notifier
- */
-void amdgpu_mn_unlock(struct amdgpu_mn *mn)
-{
-   if (mn)
-		up_write(&mn->lock);
-}
-
 /**
  * amdgpu_mn_invalidate_gfx - callback to notify about mm change
  *
@@ -82,16 +60,20 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  * potentially dirty.
  */
 static bool amdgpu_mn_invalidate_gfx(struct mmu_range_notifier *mrn,
-const struct mmu_notifier_range *range)
+const struct mmu_notifier_range *range,
+unsigned long cur_seq)
 {
struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
long r;
 
-   if (!mmu_notifier_range_blockable(range))
+   

Re: [Xen-devel] [PATCH v2 08/15] xen/gntdev: Use select for DMA_SHARED_BUFFER

2019-11-01 Thread Jason Gunthorpe
On Mon, Oct 28, 2019 at 05:10:25PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe 
> 
> DMA_SHARED_BUFFER can not be enabled by the user (it represents a library
> set in the kernel). The kconfig convention is to use select for such
> symbols so they are turned on implicitly when the user enables a kconfig
> that needs them.
> 
> Otherwise the XEN_GNTDEV_DMABUF kconfig is overly difficult to enable.
> 
> Fixes: 932d6562179e ("xen/gntdev: Add initial support for dma-buf UAPI")
> Cc: Oleksandr Andrushchenko 
> Cc: Boris Ostrovsky 
> Cc: xen-devel@lists.xenproject.org
> Cc: Juergen Gross 
> Cc: Stefano Stabellini 
> Reviewed-by: Juergen Gross 
> Reviewed-by: Oleksandr Andrushchenko 
> Signed-off-by: Jason Gunthorpe 
> ---
>  drivers/xen/Kconfig | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Juergen/Oleksandr/Xen Maintainers:

Would you take this patch through a xen related tree? The only reason
I had in this series is to make it easier to compile-test the gntdev
changes.

Since it is looking like the gntdev rework might not make it this
cycle it is probably best for you to take it.

Thanks,
Jason

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror

2019-11-01 Thread Jason Gunthorpe
On Fri, Nov 01, 2019 at 02:44:51PM +, Yang, Philip wrote:
> @@ -854,12 +853,20 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, 
> struct page **pages)
>   r = -EPERM;
>   goto out_unlock;
>   }
> +	up_read(&mm->mmap_sem);
> + timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> +
> +retry:
> +	range->notifier_seq = mmu_range_read_begin(&bo->notifier);
>  
> +	down_read(&mm->mmap_sem);
>  	r = hmm_range_fault(range, 0);
>  	up_read(&mm->mmap_sem);
> -
> - if (unlikely(r < 0))
> + if (unlikely(r <= 0)) {
> + if ((r == 0 || r == -EBUSY) && !time_after(jiffies, timeout))
> + goto retry;
>   goto out_free_pfns;
> + }

I was reflecting on why this suddenly became necessary, and I think
what might be happening is that hmm_range_fault() is triggering
invalidations as it runs (ie it is faulting in pages or something) and
that in turn causes the mrn to need retry.

The hmm version of this had a bug where a full
invalidate_range_start/end pair would not trigger retry, so this this
didn't happen.

This is unfortunate as the retry is unnecessary, but at this time I
can't think of a good way to separate an ignorable synchronous
invalidation caused by hmm_range_fault from an async one that cannot
be ignored..

A basic fix would be to not update the mrn seq in the notifier if
the invalidate is triggered by hmm_range_fault, but that seems
difficult to determine...

Any thoughts Jerome?

Jason

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert

2019-11-01 Thread Jason Gunthorpe
On Wed, Oct 30, 2019 at 12:55:37PM -0400, Boris Ostrovsky wrote:
> On 10/28/19 4:10 PM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe 
> >
> > gntdev simply wants to monitor a specific VMA for any notifier events,
> > this can be done straightforwardly using mmu_range_notifier_insert() over
> > the VMA's VA range.
> >
> > The notifier should be attached until the original VMA is destroyed.
> >
> > It is unclear if any of this is even sane, but at least a lot of duplicate
> > code is removed.
> 
> I didn't have a chance to look at the patch itself yet but as a heads-up
> --- it crashes dom0.

Thanks Boris. I spent a bit of time and got a VM running with a xen
4.9 hypervisor and a kernel with this patch series. It a ubuntu bionic
VM with the distro's xen stuff.

Can you give some guidance how you made it crash? I see the VM
autoloaded gntdev:

Module  Size  Used by
xen_gntdev 24576  2
xen_evtchn 16384  1
xenfs  16384  1
xen_privcmd24576  16 xenfs

And lsof says several xen processes have the chardev open:

xenstored  819 root   13u  CHR  10,53  0t0  
19595 /dev/xen/gntdev
xenconsol  857 root8u  CHR  10,53  0t0  
19595 /dev/xen/gntdev
xenconsol  857 860 root8u  CHR  10,53  0t0  
19595 /dev/xen/gntdev

But no crashing..

However, I wasn't able to get my usual debug kernel .config to boot
with the xen hypervisor, it crashes on early boot with:

(XEN) Dom0 has maximum 8 VCPUs
(XEN) Scrubbing Free RAM on 1 nodes using 8 CPUs
(XEN) .done.
(XEN) Initial low memory virq threshold set at 0x1000 pages.
(XEN) Std. Loglevel: All
(XEN) Guest Loglevel: All
(XEN) *** Serial input -> DOM0 (type 'CTRL-a' three times to switch input to 
Xen)
(XEN) Freed 468kB init memory
(XEN) d0v0 Unhandled page fault fault/trap [#14, ec=0002]
(XEN) Pagetable walk from fbfff0480fbe:
(XEN)  L4[0x1f7] =  
(XEN) domain_crash_sync called from entry.S: fault at 82d080348a06 
entry.o#create_bounce_frame+0x135/0x15f
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
(XEN) [ Xen-4.9.2  x86_64  debug=n   Not tainted ]
(XEN) CPU:0
(XEN) RIP:e033:[]
(XEN) RFLAGS: 0296   EM: 1   CONTEXT: pv guest (d0v0)
(XEN) rax: fbfff0480fbe   rbx:    rcx: c101
(XEN) rdx:    rsi: 84026000   rdi: 82cb4a20
(XEN) rbp: 82407ff8   rsp: 82407da0   r8:  
(XEN) r9:     r10:    r11: 
(XEN) r12:    r13: 10480fbe   r14: 
(XEN) r15:    cr0: 8005003b   cr4: 003506e0
(XEN) cr3: 34027000   cr2: fbfff0480fbe
(XEN) fsb:    gsb: 82b61000   gss: 
(XEN) ds:    es:    fs:    gs:    ss: e02b   cs: e033

Which is surely some .config issue, but I didn't figure out what.

Jason

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror

2019-11-01 Thread Jason Gunthorpe
On Fri, Nov 01, 2019 at 03:59:26PM +, Yang, Philip wrote:
> > This test for range_blockable should be before mutex_lock, I can move
> > it up
> > 
> yes, thanks.

Okay, I wrote it like this:

if (mmu_notifier_range_blockable(range))
		mutex_lock(&adev->notifier_lock);
	else if (!mutex_trylock(&adev->notifier_lock))
return false;

> > Also, do you know if notifier_lock is held while calling
> > amdgpu_ttm_tt_get_user_pages_done()? Can we add a 'lock assert held'
> > to amdgpu_ttm_tt_get_user_pages_done()?
> 
> gpu side hold notifier_lock but kfd side doesn't. kfd side doesn't check 
> amdgpu_ttm_tt_get_user_pages_done/mmu_range_read_retry return value but 
> check mem->invalid flag which is updated from invalidate callback. It 
> takes more time to change, I will come to another patch to fix it later.

Ah.. confusing, OK, I'll let you sort that

> > However, this is all pre-existing bugs, so I'm OK go ahead with this
> > patch as modified. I advise AMD to make a followup patch ..
> > 
> yes, I will.

While you are here, this is also wrong:

int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
{
	down_read(&mm->mmap_sem);
	r = hmm_range_fault(range, 0);
	up_read(&mm->mmap_sem);
if (unlikely(r <= 0)) {
if ((r == 0 || r == -EBUSY) && !time_after(jiffies, timeout))
goto retry;
goto out_free_pfns;
}

for (i = 0; i < ttm->num_pages; i++) {
pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);

It is not allowed to read the results of hmm_range_fault() outside
locking, and in particular, we can't convert to a struct page.

This must be done inside the notifier_lock, after checking
mmu_range_read_retry(), all handling of the struct page must be
structured like that.
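
Something along these lines (a sketch of the required structure, not a
drop-in change):

	mutex_lock(&adev->notifier_lock);
	if (mmu_range_read_retry(&bo->notifier, range->notifier_seq)) {
		mutex_unlock(&adev->notifier_lock);
		goto retry;
	}
	for (i = 0; i < ttm->num_pages; i++)
		pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
	/* pages[] is only trustworthy while the lock is held */
	mutex_unlock(&adev->notifier_lock);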

> >> @@ -997,10 +1004,18 @@ static void amdgpu_ttm_tt_unpin_userptr(struct 
> >> ttm_tt *ttm)
> >>sg_free_table(ttm->sg);
> >>   
> >>   #if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR)
> >> -  if (gtt->range &&
> >> -  ttm->pages[0] == hmm_device_entry_to_page(gtt->range,
> >> -gtt->range->pfns[0]))
> >> -  WARN_ONCE(1, "Missing get_user_page_done\n");
> >> +  if (gtt->range) {
> >> +  unsigned long i;
> >> +
> >> +  for (i = 0; i < ttm->num_pages; i++) {
> >> +  if (ttm->pages[i] !=
> >> +  hmm_device_entry_to_page(gtt->range,
> >> +gtt->range->pfns[i]))
> >> +  break;
> >> +  }
> >> +
> >> +  WARN((i == ttm->num_pages), "Missing get_user_page_done\n");
> >> +  }
> > 
> > Is this related/necessary? I can put it in another patch if it is just
> > debugging improvement? Please advise
> > 
> I see this WARN backtrace now, but I didn't see it before. This is 
> somehow related.

Hm, might be instructive to learn what is going on..

Thanks,
Jason

___
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror

2019-11-01 Thread Jason Gunthorpe
On Fri, Nov 01, 2019 at 02:44:51PM +, Yang, Philip wrote:
> 
> 
> On 2019-10-29 3:25 p.m., Jason Gunthorpe wrote:
> > On Tue, Oct 29, 2019 at 07:22:37PM +, Yang, Philip wrote:
> >> Hi Jason,
> >>
> >> I did quick test after merging amd-staging-drm-next with the
> >> mmu_notifier branch, which includes this set changes. The test result
> >> has different failures, app stuck intermittently, GUI no display etc. I
> >> am understanding the changes and will try to figure out the cause.
> > 
> > Thanks! I'm not surprised by this given how difficult this patch was
> > to make. Let me know if I can assist in any way
> > 
> > Please ensure to run with lockdep enabled.. Your symptops sounds sort
> > of like deadlocking?
> > 
> Hi Jason,
> 
> Attached patch fix several issues in amdgpu driver, maybe you can squash 
> this into patch 14. With this is done, patch 12, 13, 14 is Reviewed-by 
> and Tested-by Philip Yang 

Wow, this is great thanks! Can you clarify what the problems you found
were? Was the bug the 'return !r' below?

I'll also add your signed off by

Here are some remarks:

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index cb718a064eb4..c8bbd06f1009 100644
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -67,21 +67,15 @@ static bool amdgpu_mn_invalidate_gfx(struct 
> mmu_range_notifier *mrn,
>   struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
>   long r;
>  
> - /*
> -  * FIXME: Must hold some lock shared with
> -  * amdgpu_ttm_tt_get_user_pages_done()
> -  */
> - mmu_range_set_seq(mrn, cur_seq);
> + mutex_lock(&adev->notifier_lock);
>  
> - /* FIXME: Is this necessary? */
> - if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
> -   range->end))
> - return true;
> + mmu_range_set_seq(mrn, cur_seq);
>  
> - if (!mmu_notifier_range_blockable(range))
> + if (!mmu_notifier_range_blockable(range)) {
> + mutex_unlock(&adev->notifier_lock);
>   return false;

This test for range_blockable should be before mutex_lock, I can move
it up

Also, do you know if notifier_lock is held while calling
amdgpu_ttm_tt_get_user_pages_done()? Can we add a 'lock assert held'
to amdgpu_ttm_tt_get_user_pages_done()?
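
Concretely I mean something like this at the top of that function (sketch
only, assuming the adev pointer is reachable there):

        lockdep_assert_held(&adev->notifier_lock);

so lockdep complains if a caller forgets the lock.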

> @@ -854,12 +853,20 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, 
> struct page **pages)
>   r = -EPERM;
>   goto out_unlock;
>   }
> + up_read(&mm->mmap_sem);
> + timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> +
> +retry:
> + range->notifier_seq = mmu_range_read_begin(&bo->notifier);
>  
> + down_read(&mm->mmap_sem);
>   r = hmm_range_fault(range, 0);
>   up_read(&mm->mmap_sem);
> -
> - if (unlikely(r < 0))
> + if (unlikely(r <= 0)) {
> + if ((r == 0 || r == -EBUSY) && !time_after(jiffies, timeout))
> + goto retry;
>   goto out_free_pfns;
> + }

This isn't really right: a retry loop like this needs to go all the
way to mmu_range_read_retry() and be done under the notifier_lock, i.e.
mmu_range_read_retry() can fail just as likely as hmm_range_fault()
can, and drivers are supposed to retry in both cases, with a single
timeout.

AFAICT it is a major bug that many places ignore the return code of
amdgpu_ttm_tt_get_user_pages_done() ???

However, this is all pre-existing bugs, so I'm OK go ahead with this
patch as modified. I advise AMD to make a followup patch ..

I'll add a FIXME note to this effect.
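
For reference, the shape I have in mind is roughly this (untested sketch
using the names from this thread; one timeout covers both the fault and
the retry check):

        timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);

retry:
        range->notifier_seq = mmu_range_read_begin(&bo->notifier);

        down_read(&mm->mmap_sem);
        r = hmm_range_fault(range, 0);
        up_read(&mm->mmap_sem);
        if (r <= 0) {
                if ((r == 0 || r == -EBUSY) && !time_after(jiffies, timeout))
                        goto retry;
                goto out_free_pfns;
        }

        mutex_lock(&adev->notifier_lock);
        if (mmu_range_read_retry(&bo->notifier, range->notifier_seq)) {
                mutex_unlock(&adev->notifier_lock);
                if (!time_after(jiffies, timeout))
                        goto retry;
                r = -EBUSY;
                goto out_free_pfns;
        }
        /* range->pfns[] is stable here, program it under the lock */
        mutex_unlock(&adev->notifier_lock);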

>   for (i = 0; i < ttm->num_pages; i++) {
>   pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
> @@ -916,7 +923,7 @@ bool amdgpu_ttm_tt_get_user_pages_done(struct ttm_tt *ttm)
>   gtt->range = NULL;
>   }
>  
> - return r;
> + return !r;

Ah is this the major error? hmm_range_valid() is inverted vs
mmu_range_read_retry()?

>  }
>  #endif
>  
> @@ -997,10 +1004,18 @@ static void amdgpu_ttm_tt_unpin_userptr(struct ttm_tt 
> *ttm)
>   sg_free_table(ttm->sg);
>  
>  #if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR)
> - if (gtt->range &&
> - ttm->pages[0] == hmm_device_entry_to_page(gtt->range,
> -   gtt->range->pfns[0]))
> - WARN_ONCE(1, "Missing get_user_page_done\n");
> + if (gtt->range) {
> + unsigned long i;
> +
> + for (i = 0; i < ttm->num_pages; i++) {
> + if (ttm->pages[i] !=
> +

Re: [Xen-devel] [PATCH v2 13/15] drm/amdgpu: Use mmu_range_insert instead of hmm_mirror

2019-10-29 Thread Jason Gunthorpe
On Tue, Oct 29, 2019 at 10:14:29PM +, Kuehling, Felix wrote:

> > +static const struct mmu_range_notifier_ops amdgpu_mn_hsa_ops = {
> > +   .invalidate = amdgpu_mn_invalidate_hsa,
> > +};
> > +
> > +static int amdgpu_mn_sync_pagetables(struct hmm_mirror *mirror,
> > +const struct mmu_notifier_range *update)
> >   {
> > struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
> > -   unsigned long start = update->start;
> > -   unsigned long end = update->end;
> > -   bool blockable = mmu_notifier_range_blockable(update);
> > -   struct interval_tree_node *it;
> >   
> > -   /* notification is exclusive, but interval is inclusive */
> > -   end -= 1;
> > -
> > -   if (amdgpu_mn_read_lock(amn, blockable))
> > -   return -EAGAIN;
> > -
> > -   it = interval_tree_iter_first(&amn->objects, start, end);
> > -   while (it) {
> > -   struct amdgpu_mn_node *node;
> > -   struct amdgpu_bo *bo;
> > -
> > -   if (!blockable) {
> > -   amdgpu_mn_read_unlock(amn);
> > -   return -EAGAIN;
> > -   }
> > -
> > -   node = container_of(it, struct amdgpu_mn_node, it);
> > -   it = interval_tree_iter_next(it, start, end);
> > -
> > -   list_for_each_entry(bo, &node->bos, mn_list) {
> > -   struct kgd_mem *mem = bo->kfd_bo;
> > -
> > -   if (amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm,
> > -start, end))
> > -   amdgpu_amdkfd_evict_userptr(mem, amn->mm);
> > -   }
> > -   }
> > -
> > -   amdgpu_mn_read_unlock(amn);
> > +   if (!mmu_notifier_range_blockable(update))
> > +   return false;
> 
> This should return -EAGAIN. Not sure it matters much, because this whole 
> function disappears in the next commit in the series. It seems to be 
> only vestigial at this point.

Right, the only reason it is still here is that I couldn't really tell
if this:

> > +   down_read(&amn->lock);
> > +   up_read(&amn->lock);
> > return 0;
> >   }

Was serving as the 'driver lock' in the hmm scheme... If not then the
whole thing should just be deleted at this point.

I fixed the EAGAIN though

Jason


Re: [Xen-devel] [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier

2019-10-29 Thread Jason Gunthorpe
On Tue, Oct 29, 2019 at 10:04:45PM +, Kuehling, Felix wrote:

> >* because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
> > @@ -52,17 +286,24 @@ struct mmu_notifier_mm {
> >* can't go away from under us as exit_mmap holds an mm_count pin
> >* itself.
> >*/
> > -void __mmu_notifier_release(struct mm_struct *mm)
> > +static void mn_hlist_release(struct mmu_notifier_mm *mmn_mm,
> > +struct mm_struct *mm)
> >   {
> > struct mmu_notifier *mn;
> > int id;
> >   
> > +   if (mmn_mm->has_interval)
> > +   mn_itree_release(mmn_mm, mm);
> > +
> > +   if (hlist_empty(&mmn_mm->list))
> > +   return;
> 
> This seems to duplicate the conditions in __mmu_notifier_release. See my 
> comments below, I think one of them is wrong. I suspect this one, 
> because __mmu_notifier_release follows the same pattern as the other 
> notifiers.

Yep, this is a rebasing error from an earlier version; the above two
lines should be deleted.

I think it is harmless so it should not impact any testing.

Thanks,
Jason


Re: [Xen-devel] [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror

2019-10-29 Thread Jason Gunthorpe
On Tue, Oct 29, 2019 at 07:22:37PM +, Yang, Philip wrote:
> Hi Jason,
> 
> I did quick test after merging amd-staging-drm-next with the 
> mmu_notifier branch, which includes this set changes. The test result 
> has different failures, app stuck intermittently, GUI no display etc. I 
> am understanding the changes and will try to figure out the cause.

Thanks! I'm not surprised by this given how difficult this patch was
to make. Let me know if I can assist in any way

Please ensure you run with lockdep enabled.. Your symptoms sound sort
of like deadlocking?

Regards,
Jason


Re: [Xen-devel] [PATCH v2 12/15] drm/amdgpu: Call find_vma under mmap_sem

2019-10-29 Thread Jason Gunthorpe
On Tue, Oct 29, 2019 at 04:28:43PM +, Kuehling, Felix wrote:
> On 2019-10-28 4:10 p.m., Jason Gunthorpe wrote:
> > From: Jason Gunthorpe 
> >
> > find_vma() must be called under the mmap_sem, reorganize this code to
> > do the vma check after entering the lock.
> >
> > Further, fix the unlocked use of struct task_struct's mm, instead use
> > the mm from hmm_mirror which has an active mm_grab. Also the mm_grab
> > must be converted to a mm_get before acquiring mmap_sem or calling
> > find_vma().
> >
> > Fixes: 66c45500bfdc ("drm/amdgpu: use new HMM APIs and helpers")
> > Fixes: 0919195f2b0d ("drm/amdgpu: Enable amdgpu_ttm_tt_get_user_pages in 
> > worker threads")
> > Cc: Alex Deucher 
> > Cc: Christian König 
> > Cc: David (ChunMing) Zhou 
> > Cc: amd-...@lists.freedesktop.org
> > Signed-off-by: Jason Gunthorpe 
> 
> One question inline to confirm my understanding. Otherwise this patch is
> 
> Reviewed-by: Felix Kuehling 

Thanks

> > -   if (unlikely((gtt->userflags & AMDGPU_GEM_USERPTR_ANONONLY) &&
> > -   vma->vm_file)) {
> > -   r = -EPERM;
> > -   goto out;
> > -   }
> > +   mm = mirror->hmm->mmu_notifier.mm;
> > +   if (!mmget_not_zero(mm)) /* Happens during process shutdown */
> 
> This works because mirror->hmm->mmu_notifier holds an mmgrab reference 
> to the mm?

Yes, this makes sure the mm pointer remains valid

> So the MM will not just go away, but if the mmget refcount is 0, it
> means the mm is marked for destruction and shouldn't be used any
> more.

Not just marked for destruction: it means another thread is progressing
through, or has already finished, release().

The other detail here is that in general you can't take the mmap_sem
without also holding a mmget, as exit_mmap() does not take the mmap_sem
in some places where it alters the data structures, i.e. racing
find_vma() with exit_mmap() is not allowed.

This means we have to hold the mmget across the hmm_range_fault(), but
we can drop the mmget and then test mmu_range_read_retry() under the
driver lock. It will return true if the mmget refcount has gone to
zero in the mean time.

But I think this is probably a poor driver design; a driver should
just hold the mmget() until it has completed establishing the shadow
PTEs, as it is hard to see a reason not to..
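
ie the lifetime rules look roughly like this (sketch):

        mm = mirror->hmm->mmu_notifier.mm;      /* kept alive only by mmgrab */
        if (!mmget_not_zero(mm))                /* exit_mmap() already started */
                return -ESRCH;

        down_read(&mm->mmap_sem);               /* only legal under the mmget */
        r = hmm_range_fault(range, 0);
        up_read(&mm->mmap_sem);

        /*
         * The mmget could be dropped here; mmu_range_read_retry() under the
         * driver lock would catch a release that raced with us. Simpler is
         * to hold it until the shadow PTEs are fully established.
         */
        mmput(mm);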

Jason

Re: [Xen-devel] [PATCH v2 13/15] drm/amdgpu: Use mmu_range_insert instead of hmm_mirror

2019-10-29 Thread Jason Gunthorpe
On Tue, Oct 29, 2019 at 07:51:30AM +, Koenig, Christian wrote:
> > +static bool amdgpu_mn_invalidate_gfx(struct mmu_range_notifier *mrn,
> > +const struct mmu_notifier_range *range)
> >   {
> > -   struct amdgpu_bo *bo;
> > +   struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
> > +   struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
> > long r;
> >   
> > -   list_for_each_entry(bo, &node->bos, mn_list) {
> > -
> > -   if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, start, end))
> > -   continue;
> > -
> > -   r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv,
> > -   true, false, MAX_SCHEDULE_TIMEOUT);
> > -   if (r <= 0)
> > -   DRM_ERROR("(%ld) failed to wait for user bo\n", r);
> > -   }
> > +   /* FIXME: Is this necessary? */
> 
> Most likely not.
> 
> Christian.
> 
> > +   if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
> > + range->end))
> > +   return true;

So, is bo->tbo.mem.num_pages always equal to bo->tbo.ttm.num_pages?

And userptr can't be zero here, or at least it doesn't matter if it is?

> > +static bool amdgpu_mn_invalidate_hsa(struct mmu_range_notifier *mrn,
> > +const struct mmu_notifier_range *range)
> >   {
> > -   struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
> > -   unsigned long start = update->start;
> > -   unsigned long end = update->end;
> > -   bool blockable = mmu_notifier_range_blockable(update);
> > -   struct interval_tree_node *it;
> > -
> > -   /* notification is exclusive, but interval is inclusive */
> > -   end -= 1;
> > -
> > -   /* TODO we should be able to split locking for interval tree and
> > -* amdgpu_mn_invalidate_node
> > -*/
> > -   if (amdgpu_mn_read_lock(amn, blockable))
> > -   return -EAGAIN;
> > -
> > -   it = interval_tree_iter_first(&amn->objects, start, end);
> > -   while (it) {
> > -   struct amdgpu_mn_node *node;
> > -
> > -   if (!blockable) {
> > -   amdgpu_mn_read_unlock(amn);
> > -   return -EAGAIN;
> > -   }
> > +   struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
> > +   struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
> >   
> > -   node = container_of(it, struct amdgpu_mn_node, it);
> > -   it = interval_tree_iter_next(it, start, end);
> > +   /* FIXME: Is this necessary? */
> > +   if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
> > + range->end))
> > +   return true;
> >   
> > -   amdgpu_mn_invalidate_node(node, start, end);
> > -   }

This one too right?

Jason


Re: [Xen-devel] [PATCH v2 06/15] RDMA/hfi1: Use mmu_range_notifier_inset for user_exp_rcv

2019-10-29 Thread Jason Gunthorpe
On Tue, Oct 29, 2019 at 08:19:20AM -0400, Dennis Dalessandro wrote:
> On 10/28/2019 4:10 PM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe 
> > 
> > This converts one of the two users of mmu_notifiers to use the new API.
> > The conversion is fairly straightforward, however the existing use of
> > notifiers here seems to be racey.
> > 
> > Cc: Mike Marciniszyn 
> > Cc: Dennis Dalessandro 
> > Signed-off-by: Jason Gunthorpe 
> 
> I tested v1, and replied to it [1]. I can re-test with this version if you
> like as well.
> 
> [1] https://marc.info/?l=linux-rdma=157235130606412=2

I think it is fine, nothing really changed in v2, thanks

Jason


[Xen-devel] [PATCH v2 13/15] drm/amdgpu: Use mmu_range_insert instead of hmm_mirror

2019-10-28 Thread Jason Gunthorpe
From: Jason Gunthorpe 

Remove the interval tree in the driver and rely on the tree maintained by
the mmu_notifier for delivering mmu_notifier invalidation callbacks.

For some reason amdgpu has a very complicated arrangement where it tries
to prevent duplicate entries in the interval_tree; this is not necessary,
as each amdgpu_bo can be its own stand-alone entry. interval_tree already
allows duplicates and overlaps in the tree.

Also, there is no need to remove entries upon a release callback, the
mmu_range API safely allows objects to remain registered beyond the
lifetime of the mm. The driver only has to stop touching the pages during
release.
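
In other words, per-BO it boils down to roughly this (illustrative sketch,
not the literal code below):

static bool amdgpu_mn_invalidate_gfx(struct mmu_range_notifier *mrn,
                                     const struct mmu_notifier_range *range,
                                     unsigned long cur_seq)
{
        struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
        long r;

        if (!mmu_notifier_range_blockable(range))
                return false;

        mmu_range_set_seq(mrn, cur_seq);

        /* just stop touching the pages; the notifier stays registered */
        r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv, true, false,
                                      MAX_SCHEDULE_TIMEOUT);
        if (r <= 0)
                DRM_ERROR("(%ld) failed to wait for user bo\n", r);
        return true;
}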

Cc: Alex Deucher 
Cc: Christian König 
Cc: David (ChunMing) Zhou 
Cc: amd-...@lists.freedesktop.org
Signed-off-by: Jason Gunthorpe 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h   |   2 +
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   5 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c| 341 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h|   4 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h|  13 +-
 6 files changed, 84 insertions(+), 282 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index bd37df5dd6d048..60591a5d420021 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1006,6 +1006,8 @@ struct amdgpu_device {
struct mutex  lock_reset;
struct amdgpu_doorbell_index doorbell_index;
 
+   struct mutexnotifier_lock;
+
int asic_reset_res;
struct work_struct  xgmi_reset_work;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 6d021ecc8d598f..47700302a08b7f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -481,8 +481,7 @@ static void remove_kgd_mem_from_kfd_bo_list(struct kgd_mem 
*mem,
  *
  * Returns 0 for success, negative errno for errors.
  */
-static int init_user_pages(struct kgd_mem *mem, struct mm_struct *mm,
-  uint64_t user_addr)
+static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr)
 {
struct amdkfd_process_info *process_info = mem->process_info;
struct amdgpu_bo *bo = mem->bo;
@@ -1195,7 +1194,7 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
add_kgd_mem_to_kfd_bo_list(*mem, avm->process_info, user_addr);
 
if (user_addr) {
-   ret = init_user_pages(*mem, current->mm, user_addr);
+   ret = init_user_pages(*mem, user_addr);
if (ret)
goto allocate_init_user_pages_failed;
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5a1939dbd4e3e6..38f97998aaddb2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2633,6 +2633,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
mutex_init(>virt.vf_errors.lock);
hash_init(adev->mn_hash);
mutex_init(>lock_reset);
+   mutex_init(>notifier_lock);
mutex_init(>virt.dpm_mutex);
mutex_init(>psp.mutex);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 31d4deb5d29484..4ffd7b90f4d907 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -50,66 +50,6 @@
 #include "amdgpu.h"
 #include "amdgpu_amdkfd.h"
 
-/**
- * struct amdgpu_mn_node
- *
- * @it: interval node defining start-last of the affected address range
- * @bos: list of all BOs in the affected address range
- *
- * Manages all BOs which are affected of a certain range of address space.
- */
-struct amdgpu_mn_node {
-   struct interval_tree_node   it;
-   struct list_headbos;
-};
-
-/**
- * amdgpu_mn_destroy - destroy the HMM mirror
- *
- * @work: previously sheduled work item
- *
- * Lazy destroys the notifier from a work item
- */
-static void amdgpu_mn_destroy(struct work_struct *work)
-{
-   struct amdgpu_mn *amn = container_of(work, struct amdgpu_mn, work);
-   struct amdgpu_device *adev = amn->adev;
-   struct amdgpu_mn_node *node, *next_node;
-   struct amdgpu_bo *bo, *next_bo;
-
-   mutex_lock(>mn_lock);
-   down_write(>lock);
-   hash_del(>node);
-   rbtree_postorder_for_each_entry_safe(node, next_node,
->objects.rb_root, it.rb) {
-   list_for_each_entry_safe(bo, next_bo, >bos, mn_list) {
-   bo->mn = NULL;
-   list_del_init(>mn_list);
-   }
-   kfree(node);
-   }
-   up_write(>lock);
-   mutex_

[Xen-devel] [PATCH v2 15/15] mm/hmm: remove hmm_mirror and related

2019-10-28 Thread Jason Gunthorpe
From: Jason Gunthorpe 

The only two users of this are now converted to use mmu_range_notifier,
delete all the code and update hmm.rst.

Reviewed-by: Jérôme Glisse 
Signed-off-by: Jason Gunthorpe 
---
 Documentation/vm/hmm.rst | 105 ---
 include/linux/hmm.h  | 183 +
 mm/Kconfig   |   1 -
 mm/hmm.c | 284 +--
 4 files changed, 33 insertions(+), 540 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 0a5960beccf76d..a247643035c4e2 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -147,49 +147,16 @@ Address space mirroring implementation and API
 Address space mirroring's main objective is to allow duplication of a range of
 CPU page table into a device page table; HMM helps keep both synchronized. A
 device driver that wants to mirror a process address space must start with the
-registration of an hmm_mirror struct::
-
- int hmm_mirror_register(struct hmm_mirror *mirror,
- struct mm_struct *mm);
-
-The mirror struct has a set of callbacks that are used
-to propagate CPU page tables::
-
- struct hmm_mirror_ops {
- /* release() - release hmm_mirror
-  *
-  * @mirror: pointer to struct hmm_mirror
-  *
-  * This is called when the mm_struct is being released.  The callback
-  * must ensure that all access to any pages obtained from this mirror
-  * is halted before the callback returns. All future access should
-  * fault.
-  */
- void (*release)(struct hmm_mirror *mirror);
-
- /* sync_cpu_device_pagetables() - synchronize page tables
-  *
-  * @mirror: pointer to struct hmm_mirror
-  * @update: update information (see struct mmu_notifier_range)
-  * Return: -EAGAIN if update.blockable false and callback need to
-  * block, 0 otherwise.
-  *
-  * This callback ultimately originates from mmu_notifiers when the CPU
-  * page table is updated. The device driver must update its page table
-  * in response to this callback. The update argument tells what action
-  * to perform.
-  *
-  * The device driver must not return from this callback until the device
-  * page tables are completely updated (TLBs flushed, etc); this is a
-  * synchronous call.
-  */
- int (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
-   const struct hmm_update *update);
- };
-
-The device driver must perform the update action to the range (mark range
-read only, or fully unmap, etc.). The device must complete the update before
-the driver callback returns.
+registration of a mmu_range_notifier::
+
+ mrn->ops = _ops;
+ int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
+ unsigned long start, unsigned long length,
+ struct mm_struct *mm);
+
+During the driver_ops->invalidate() callback the device driver must perform
+the update action to the range (mark range read only, or fully unmap,
+etc.). The device must complete the update before the driver callback returns.
 
 When the device driver wants to populate a range of virtual addresses, it can
 use::
@@ -216,70 +183,46 @@ The usage pattern is::
   struct hmm_range range;
   ...
 
+  range.notifier = 
   range.start = ...;
   range.end = ...;
   range.pfns = ...;
   range.flags = ...;
   range.values = ...;
   range.pfn_shift = ...;
-  hmm_range_register(, mirror);
 
-  /*
-   * Just wait for range to be valid, safe to ignore return value as we
-   * will use the return value of hmm_range_fault() below under the
-   * mmap_sem to ascertain the validity of the range.
-   */
-  hmm_range_wait_until_valid(, TIMEOUT_IN_MSEC);
+  if (!mmget_not_zero(mrn->notifier.mm))
+  return -EFAULT;
 
  again:
+  range.notifier_seq = mmu_range_read_begin();
   down_read(>mmap_sem);
   ret = hmm_range_fault(, HMM_RANGE_SNAPSHOT);
   if (ret) {
   up_read(>mmap_sem);
-  if (ret == -EBUSY) {
-/*
- * No need to check hmm_range_wait_until_valid() return value
- * on retry we will get proper error with hmm_range_fault()
- */
-hmm_range_wait_until_valid(, TIMEOUT_IN_MSEC);
-goto again;
-  }
-  hmm_range_unregister();
+  if (ret == -EBUSY)
+ goto again;
   return ret;
   }
+  up_read(>mmap_sem);
+
   take_lock(driver->update);
-  if (!hmm_range_valid()) {
+  if (mmu_range_read_retry(, range.notifier_seq) {
   release_lock(driver->update);
-  up_read(>mmap_sem);
   goto again;
   }
 
-  // Use pfns array content to update device page table
+  /* Use pfns array content to update device page table,
+   

[Xen-devel] [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror

2019-10-28 Thread Jason Gunthorpe
From: Jason Gunthorpe 

Convert the collision-retry lock around hmm_range_fault to use the one now
provided by the mmu_range notifier.

Although this driver does not seem to use the collision retry lock that
hmm provides correctly, it can still be converted over to use the
mmu_range_notifier api instead of hmm_mirror without too much trouble.

This also deletes another place where a driver is associating additional
data (struct amdgpu_mn) with an mm_struct.

Cc: Alex Deucher 
Cc: Christian König 
Cc: David (ChunMing) Zhou 
Cc: amd-...@lists.freedesktop.org
Signed-off-by: Jason Gunthorpe 
---
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c|  14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c| 148 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h|  49 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   |  76 -
 5 files changed, 66 insertions(+), 225 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 47700302a08b7f..1bcedb9b477dce 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1738,6 +1738,10 @@ static int update_invalid_user_pages(struct 
amdkfd_process_info *process_info,
return ret;
}
 
+   /*
+* FIXME: Cannot ignore the return code, must hold
+* notifier_lock
+*/
amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm);
 
/* Mark the BO as valid unless it was invalidated
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 2e53feed40e230..76771f5f0b60ab 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -607,8 +607,6 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
e->tv.num_shared = 2;
 
amdgpu_bo_list_get_list(p->bo_list, >validated);
-   if (p->bo_list->first_userptr != p->bo_list->num_entries)
-   p->mn = amdgpu_mn_get(p->adev, AMDGPU_MN_TYPE_GFX);
 
INIT_LIST_HEAD();
amdgpu_vm_get_pd_bo(>vm, >validated, >vm_pd);
@@ -1291,11 +1289,11 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
if (r)
goto error_unlock;
 
-   /* No memory allocation is allowed while holding the mn lock.
-* p->mn is hold until amdgpu_cs_submit is finished and fence is added
-* to BOs.
+   /* No memory allocation is allowed while holding the notifier lock.
+* The lock is held until amdgpu_cs_submit is finished and fence is
+* added to BOs.
 */
-   amdgpu_mn_lock(p->mn);
+   mutex_lock(>adev->notifier_lock);
 
/* If userptr are invalidated after amdgpu_cs_parser_bos(), return
 * -EAGAIN, drmIoctl in libdrm will restart the amdgpu_cs_ioctl.
@@ -1338,13 +1336,13 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
amdgpu_vm_move_to_lru_tail(p->adev, >vm);
 
ttm_eu_fence_buffer_objects(>ticket, >validated, p->fence);
-   amdgpu_mn_unlock(p->mn);
+   mutex_unlock(>adev->notifier_lock);
 
return 0;
 
 error_abort:
drm_sched_job_cleanup(>base);
-   amdgpu_mn_unlock(p->mn);
+   mutex_unlock(>adev->notifier_lock);
 
 error_unlock:
amdgpu_job_free(job);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 4ffd7b90f4d907..cb718a064eb491 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -50,28 +50,6 @@
 #include "amdgpu.h"
 #include "amdgpu_amdkfd.h"
 
-/**
- * amdgpu_mn_lock - take the write side lock for this notifier
- *
- * @mn: our notifier
- */
-void amdgpu_mn_lock(struct amdgpu_mn *mn)
-{
-   if (mn)
-   down_write(>lock);
-}
-
-/**
- * amdgpu_mn_unlock - drop the write side lock for this notifier
- *
- * @mn: our notifier
- */
-void amdgpu_mn_unlock(struct amdgpu_mn *mn)
-{
-   if (mn)
-   up_write(>lock);
-}
-
 /**
  * amdgpu_mn_invalidate_gfx - callback to notify about mm change
  *
@@ -82,12 +60,19 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  * potentially dirty.
  */
 static bool amdgpu_mn_invalidate_gfx(struct mmu_range_notifier *mrn,
-const struct mmu_notifier_range *range)
+const struct mmu_notifier_range *range,
+unsigned long cur_seq)
 {
struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
long r;
 
+   /*
+* FIXME: Must hold some lock shared with
+* amdgpu_ttm

[Xen-devel] [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier

2019-10-28 Thread Jason Gunthorpe
From: Jason Gunthorpe 

Of the 13 users of mmu_notifiers, 8 of them use only
invalidate_range_start/end() and immediately intersect the
mmu_notifier_range with some kind of internal list of VAs.  4 use an
interval tree (i915_gem, radeon_mn, umem_odp, hfi1). 4 use a linked list
of some kind (scif_dma, vhost, gntdev, hmm).

And the remaining 5 either don't use invalidate_range_start() or do some
special thing with it.

It turns out that building a correct scheme with an interval tree is
pretty complicated, particularly if the use case is synchronizing against
another thread doing get_user_pages().  Many of these implementations have
various subtle and difficult to fix races.

This approach puts the interval tree as common code at the top of the mmu
notifier call tree and implements a shareable locking scheme.

It includes:
 - An interval tree tracking VA ranges, with per-range callbacks
 - A read/write locking scheme for the interval tree that avoids
   sleeping in the notifier path (for OOM killer)
 - A sequence counter based collision-retry locking scheme to tell
   device page fault that a VA range is being concurrently invalidated.

This is based on various ideas:
- hmm accumulates invalidated VA ranges and releases them when all
  invalidates are done, via active_invalidate_ranges count.
  This approach avoids having to intersect the interval tree twice (as
  umem_odp does) at the potential cost of a longer device page fault.

- kvm/umem_odp use a sequence counter to drive the collision retry,
  via invalidate_seq

- a deferred work todo list on unlock scheme like RTNL, via deferred_list.
  This makes adding/removing interval tree members more deterministic

- seqlock, except this version makes the seqlock idea multi-holder on the
  write side by protecting it with active_invalidate_ranges and a spinlock

To minimize MM overhead when only the interval tree is being used, the
entire SRCU and hlist overheads are dropped using some simple
branches. Similarly the interval tree overhead is dropped when in hlist
mode.

The overhead from the mandatory spinlock is broadly the same as most of
existing users which already had a lock (or two) of some sort on the
invalidation path.
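
For a driver the resulting usage looks roughly like this (sketch only,
using the names added below; the "driver lock" is whatever lock the fault
path shares with the invalidate callback):

struct my_range {
        struct mmu_range_notifier notifier;
        struct mutex lock;                      /* the "driver lock" */
};

static bool my_invalidate(struct mmu_range_notifier *mrn,
                          const struct mmu_notifier_range *range,
                          unsigned long cur_seq)
{
        struct my_range *mr = container_of(mrn, struct my_range, notifier);

        if (!mmu_notifier_range_blockable(range))
                return false;

        mutex_lock(&mr->lock);
        mmu_range_set_seq(mrn, cur_seq);        /* marks in-flight faults stale */
        /* ... tear down the device mapping covered by range ... */
        mutex_unlock(&mr->lock);
        return true;
}

static const struct mmu_range_notifier_ops my_ops = {
        .invalidate = my_invalidate,
};

static int my_range_track(struct my_range *mr, struct mm_struct *mm,
                          unsigned long start, unsigned long length)
{
        mutex_init(&mr->lock);
        mr->notifier.ops = &my_ops;
        return mmu_range_notifier_insert(&mr->notifier, start, length, mm);
}

The fault side then brackets its work with mmu_range_read_begin() and,
under mr->lock, mmu_range_read_retry(), restarting if the latter returns
true.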

Cc: Andrea Arcangeli 
Cc: Michal Hocko 
Acked-by: Christian König 
Signed-off-by: Jason Gunthorpe 
---
 include/linux/mmu_notifier.h |  98 +++
 mm/Kconfig   |   1 +
 mm/mmu_notifier.c| 533 +--
 3 files changed, 607 insertions(+), 25 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 12bd603d318ce7..51b92ba013ddce 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -6,10 +6,12 @@
 #include 
 #include 
 #include 
+#include 
 
 struct mmu_notifier_mm;
 struct mmu_notifier;
 struct mmu_notifier_range;
+struct mmu_range_notifier;
 
 /**
  * enum mmu_notifier_event - reason for the mmu notifier callback
@@ -32,6 +34,9 @@ struct mmu_notifier_range;
  * access flags). User should soft dirty the page in the end callback to make
  * sure that anyone relying on soft dirtyness catch pages that might be written
  * through non CPU mappings.
+ *
+ * @MMU_NOTIFY_RELEASE: used during mmu_range_notifier invalidate to signal 
that
+ * the mm refcount is zero and the range is no longer accessible.
  */
 enum mmu_notifier_event {
MMU_NOTIFY_UNMAP = 0,
@@ -39,6 +44,7 @@ enum mmu_notifier_event {
MMU_NOTIFY_PROTECTION_VMA,
MMU_NOTIFY_PROTECTION_PAGE,
MMU_NOTIFY_SOFT_DIRTY,
+   MMU_NOTIFY_RELEASE,
 };
 
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
@@ -222,6 +228,26 @@ struct mmu_notifier {
unsigned int users;
 };
 
+/**
+ * struct mmu_range_notifier_ops
+ * @invalidate: Upon return the caller must stop using any SPTEs within this
+ *  range, this function can sleep. Return false if blocking was
+ *  required but range is non-blocking
+ */
+struct mmu_range_notifier_ops {
+   bool (*invalidate)(struct mmu_range_notifier *mrn,
+  const struct mmu_notifier_range *range,
+  unsigned long cur_seq);
+};
+
+struct mmu_range_notifier {
+   struct interval_tree_node interval_tree;
+   const struct mmu_range_notifier_ops *ops;
+   struct hlist_node deferred_item;
+   unsigned long invalidate_seq;
+   struct mm_struct *mm;
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 #ifdef CONFIG_LOCKDEP
@@ -263,6 +289,78 @@ extern int __mmu_notifier_register(struct mmu_notifier *mn,
   struct mm_struct *mm);
 extern void mmu_notifier_unregister(struct mmu_notifier *mn,
struct mm_struct *mm);
+
+unsigned long mmu_range_read_begin(struct mmu_range_notifier *mrn);
+int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
+ unsigned long start, unsigned long length,
+ struct mm_struct *mm)

[Xen-devel] [PATCH v2 07/15] drm/radeon: use mmu_range_notifier_insert

2019-10-28 Thread Jason Gunthorpe
From: Jason Gunthorpe 

The new API is an exact match for the needs of radeon.

For some reason radeon tries to remove overlapping ranges from the
interval tree, but interval trees (and mmu_range_notifier_insert)
support overlapping ranges directly. Simply delete all this code.

Since this driver is missing an invalidate_range_end callback, but
still calls get_user_pages(), it cannot be correct against all races.

Cc: Alex Deucher 
Cc: Christian König 
Cc: David (ChunMing) Zhou 
Cc: amd-...@lists.freedesktop.org
Cc: Petr Cvek 
Signed-off-by: Jason Gunthorpe 
---
 drivers/gpu/drm/radeon/radeon.h|   9 +-
 drivers/gpu/drm/radeon/radeon_mn.c | 219 ++---
 2 files changed, 52 insertions(+), 176 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index d59b004f669583..27959f3ace1152 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -68,6 +68,10 @@
 #include 
 #include 
 
+#ifdef CONFIG_MMU_NOTIFIER
+#include 
+#endif
+
 #include 
 #include 
 #include 
@@ -509,8 +513,9 @@ struct radeon_bo {
struct ttm_bo_kmap_obj  dma_buf_vmap;
pid_t   pid;
 
-   struct radeon_mn*mn;
-   struct list_headmn_list;
+#ifdef CONFIG_MMU_NOTIFIER
+   struct mmu_range_notifier   notifier;
+#endif
 };
 #define gem_to_radeon_bo(gobj) container_of((gobj), struct radeon_bo, tbo.base)
 
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c 
b/drivers/gpu/drm/radeon/radeon_mn.c
index dbab9a3a969b9e..d3d41e20a64922 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -36,131 +36,51 @@
 
 #include "radeon.h"
 
-struct radeon_mn {
-   struct mmu_notifier mn;
-
-   /* objects protected by lock */
-   struct mutexlock;
-   struct rb_root_cached   objects;
-};
-
-struct radeon_mn_node {
-   struct interval_tree_node   it;
-   struct list_headbos;
-};
-
 /**
- * radeon_mn_invalidate_range_start - callback to notify about mm change
+ * radeon_mn_invalidate - callback to notify about mm change
  *
  * @mn: our notifier
- * @mn: the mm this callback is about
- * @start: start of updated range
- * @end: end of updated range
+ * @range: the VMA under invalidation
  *
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
-static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
-   const struct mmu_notifier_range *range)
+static bool radeon_mn_invalidate(struct mmu_range_notifier *mn,
+const struct mmu_notifier_range *range,
+unsigned long cur_seq)
 {
-   struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
+   struct radeon_bo *bo = container_of(mn, struct radeon_bo, notifier);
struct ttm_operation_ctx ctx = { false, false };
-   struct interval_tree_node *it;
-   unsigned long end;
-   int ret = 0;
-
-   /* notification is exclusive, but interval is inclusive */
-   end = range->end - 1;
-
-   /* TODO we should be able to split locking for interval tree and
-* the tear down.
-*/
-   if (mmu_notifier_range_blockable(range))
-   mutex_lock(>lock);
-   else if (!mutex_trylock(>lock))
-   return -EAGAIN;
-
-   it = interval_tree_iter_first(>objects, range->start, end);
-   while (it) {
-   struct radeon_mn_node *node;
-   struct radeon_bo *bo;
-   long r;
-
-   if (!mmu_notifier_range_blockable(range)) {
-   ret = -EAGAIN;
-   goto out_unlock;
-   }
-
-   node = container_of(it, struct radeon_mn_node, it);
-   it = interval_tree_iter_next(it, range->start, end);
+   long r;
 
-   list_for_each_entry(bo, >bos, mn_list) {
+   if (!bo->tbo.ttm || bo->tbo.ttm->state != tt_bound)
+   return true;
 
-   if (!bo->tbo.ttm || bo->tbo.ttm->state != tt_bound)
-   continue;
+   if (!mmu_notifier_range_blockable(range))
+   return false;
 
-   r = radeon_bo_reserve(bo, true);
-   if (r) {
-   DRM_ERROR("(%ld) failed to reserve user bo\n", 
r);
-   continue;
-   }
-
-   r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv,
-   true, false, MAX_SCHEDULE_TIMEOUT);
-   if (r <= 0)
-   DRM_ERROR("(%ld) failed to wait for user bo\n", 
r);
-
-   radeon_ttm_placement_from_domain(bo, 
RADEON_GE

[Xen-devel] [PATCH v2 08/15] xen/gntdev: Use select for DMA_SHARED_BUFFER

2019-10-28 Thread Jason Gunthorpe
From: Jason Gunthorpe 

DMA_SHARED_BUFFER cannot be enabled by the user (it represents a library
set in the kernel). The kconfig convention is to use select for such
symbols so they are turned on implicitly when the user enables a kconfig
that needs them.

Otherwise the XEN_GNTDEV_DMABUF kconfig is overly difficult to enable.

Fixes: 932d6562179e ("xen/gntdev: Add initial support for dma-buf UAPI")
Cc: Oleksandr Andrushchenko 
Cc: Boris Ostrovsky 
Cc: xen-devel@lists.xenproject.org
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Reviewed-by: Juergen Gross 
Reviewed-by: Oleksandr Andrushchenko 
Signed-off-by: Jason Gunthorpe 
---
 drivers/xen/Kconfig | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
index 79cc75096f4232..a50dadd0109336 100644
--- a/drivers/xen/Kconfig
+++ b/drivers/xen/Kconfig
@@ -141,7 +141,8 @@ config XEN_GNTDEV
 
 config XEN_GNTDEV_DMABUF
bool "Add support for dma-buf grant access device driver extension"
-   depends on XEN_GNTDEV && XEN_GRANT_DMA_ALLOC && DMA_SHARED_BUFFER
+   depends on XEN_GNTDEV && XEN_GRANT_DMA_ALLOC
+   select DMA_SHARED_BUFFER
help
  Allows userspace processes and kernel modules to use Xen backed
  dma-buf implementation. With this extension grant references to
-- 
2.23.0



[Xen-devel] [PATCH v2 05/15] RDMA/odp: Use mmu_range_notifier_insert()

2019-10-28 Thread Jason Gunthorpe
From: Jason Gunthorpe 

Replace the internal interval tree based mmu notifier with the new common
mmu_range_notifier_insert() API. This removes a lot of code and fixes a
deadlock that can be triggered in ODP:

 zap_page_range()
  mmu_notifier_invalidate_range_start()
   [..]
    ib_umem_notifier_invalidate_range_start()
       down_read(&per_mm->umem_rwsem)
  unmap_single_vma()
    [..]
      __split_huge_page_pmd()
        mmu_notifier_invalidate_range_start()
        [..]
         ib_umem_notifier_invalidate_range_start()
            down_read(&per_mm->umem_rwsem)   // DEADLOCK

        mmu_notifier_invalidate_range_end()
           up_read(&per_mm->umem_rwsem)
  mmu_notifier_invalidate_range_end()
 up_read(&per_mm->umem_rwsem)

The umem_rwsem is held across the range_start/end as the ODP algorithm for
invalidate_range_end cannot tolerate changes to the interval
tree. However, due to the nested invalidation regions the second
down_read() can deadlock if there are competing writers. The new core code
provides an alternative scheme to solve this problem.
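
The reason the nested down_read() is fatal is that rwsem readers queue
behind a waiting writer; boiled down (sketch, not code from this patch):

static DECLARE_RWSEM(umem_rwsem);

static void invalidation_path(void)            /* thread A, as in the trace */
{
        down_read(&umem_rwsem);                 /* outer range_start */
        /* a writer calls down_write(&umem_rwsem) here and starts waiting */
        down_read(&umem_rwsem);                 /* nested range_start: queues
                                                 * behind the waiting writer
                                                 * and blocks forever */
        up_read(&umem_rwsem);
        up_read(&umem_rwsem);
}

static void writer_path(void)                   /* thread B */
{
        down_write(&umem_rwsem);                /* waits for thread A's first
                                                 * read lock -> deadlock */
        up_write(&umem_rwsem);
}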

Fixes: ca748c39ea3f ("RDMA/umem: Get rid of per_mm->notifier_count")
Signed-off-by: Jason Gunthorpe 
---
 drivers/infiniband/core/device.c |   1 -
 drivers/infiniband/core/umem_odp.c   | 288 +++
 drivers/infiniband/hw/mlx5/mlx5_ib.h |   7 +-
 drivers/infiniband/hw/mlx5/mr.c  |   3 +-
 drivers/infiniband/hw/mlx5/odp.c |  50 +++--
 include/rdma/ib_umem_odp.h   |  65 ++
 include/rdma/ib_verbs.h  |   2 -
 7 files changed, 69 insertions(+), 347 deletions(-)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 2dd2cfe9b56136..ac7924b3c73abe 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -2617,7 +2617,6 @@ void ib_set_device_ops(struct ib_device *dev, const 
struct ib_device_ops *ops)
SET_DEVICE_OP(dev_ops, get_vf_config);
SET_DEVICE_OP(dev_ops, get_vf_stats);
SET_DEVICE_OP(dev_ops, init_port);
-   SET_DEVICE_OP(dev_ops, invalidate_range);
SET_DEVICE_OP(dev_ops, iw_accept);
SET_DEVICE_OP(dev_ops, iw_add_ref);
SET_DEVICE_OP(dev_ops, iw_connect);
diff --git a/drivers/infiniband/core/umem_odp.c 
b/drivers/infiniband/core/umem_odp.c
index d7d5fadf0899ad..6132b8127e8435 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -48,197 +48,32 @@
 
 #include "uverbs.h"
 
-static void ib_umem_notifier_start_account(struct ib_umem_odp *umem_odp)
-{
-   mutex_lock(_odp->umem_mutex);
-   if (umem_odp->notifiers_count++ == 0)
-   /*
-* Initialize the completion object for waiting on
-* notifiers. Since notifier_count is zero, no one should be
-* waiting right now.
-*/
-   reinit_completion(_odp->notifier_completion);
-   mutex_unlock(_odp->umem_mutex);
-}
-
-static void ib_umem_notifier_end_account(struct ib_umem_odp *umem_odp)
-{
-   mutex_lock(_odp->umem_mutex);
-   /*
-* This sequence increase will notify the QP page fault that the page
-* that is going to be mapped in the spte could have been freed.
-*/
-   ++umem_odp->notifiers_seq;
-   if (--umem_odp->notifiers_count == 0)
-   complete_all(_odp->notifier_completion);
-   mutex_unlock(_odp->umem_mutex);
-}
-
-static void ib_umem_notifier_release(struct mmu_notifier *mn,
-struct mm_struct *mm)
-{
-   struct ib_ucontext_per_mm *per_mm =
-   container_of(mn, struct ib_ucontext_per_mm, mn);
-   struct rb_node *node;
-
-   down_read(_mm->umem_rwsem);
-   if (!per_mm->mn.users)
-   goto out;
-
-   for (node = rb_first_cached(_mm->umem_tree); node;
-node = rb_next(node)) {
-   struct ib_umem_odp *umem_odp =
-   rb_entry(node, struct ib_umem_odp, interval_tree.rb);
-
-   /*
-* Increase the number of notifiers running, to prevent any
-* further fault handling on this MR.
-*/
-   ib_umem_notifier_start_account(umem_odp);
-   complete_all(_odp->notifier_completion);
-   umem_odp->umem.ibdev->ops.invalidate_range(
-   umem_odp, ib_umem_start(umem_odp),
-   ib_umem_end(umem_odp));
-   }
-
-out:
-   up_read(_mm->umem_rwsem);
-}
-
-static int invalidate_range_start_trampoline(struct ib_umem_odp *item,
-u64 start, u64 end, void *cookie)
-{
-   ib_umem_notifier_start_account(item);
-   item->umem.ibdev->ops.invalidate_range(item, start, end);
-   return 0;
-}
-
-static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
-  

[Xen-devel] [PATCH v2 01/15] mm/mmu_notifier: define the header pre-processor parts even if disabled

2019-10-28 Thread Jason Gunthorpe
From: Jason Gunthorpe 

Now that we have KERNEL_HEADER_TEST all headers are generally compile
tested, so relying on makefile tricks to avoid compiling code that depends
on CONFIG_MMU_NOTIFIER is more annoying.

Instead follow the usual pattern and provide most of the header with only
the functions stubbed out when CONFIG_MMU_NOTIFIER is disabled. This
ensures code compiles no matter what the config setting is.
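
ie the usual shape is (sketch with made-up "foo" names, not the real
mmu_notifier API):

/* visible no matter what the config is: */
struct foo_notifier {
        struct list_head list;
};

#ifdef CONFIG_FOO
int foo_notifier_register(struct foo_notifier *fn, struct mm_struct *mm);
#else
static inline int foo_notifier_register(struct foo_notifier *fn,
                                        struct mm_struct *mm)
{
        return 0;       /* feature compiled out, nothing to do */
}
#endif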

While here, struct mmu_notifier_mm is private to mmu_notifier.c, so move it there.

Reviewed-by: Jérôme Glisse 
Signed-off-by: Jason Gunthorpe 
---
 include/linux/mmu_notifier.h | 46 +---
 mm/mmu_notifier.c| 13 ++
 2 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 1bd8e6a09a3c27..12bd603d318ce7 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -7,8 +7,9 @@
 #include 
 #include 
 
+struct mmu_notifier_mm;
 struct mmu_notifier;
-struct mmu_notifier_ops;
+struct mmu_notifier_range;
 
 /**
  * enum mmu_notifier_event - reason for the mmu notifier callback
@@ -40,36 +41,8 @@ enum mmu_notifier_event {
MMU_NOTIFY_SOFT_DIRTY,
 };
 
-#ifdef CONFIG_MMU_NOTIFIER
-
-#ifdef CONFIG_LOCKDEP
-extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;
-#endif
-
-/*
- * The mmu notifier_mm structure is allocated and installed in
- * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
- * critical section and it's released only when mm_count reaches zero
- * in mmdrop().
- */
-struct mmu_notifier_mm {
-   /* all mmu notifiers registerd in this mm are queued in this list */
-   struct hlist_head list;
-   /* to serialize the list modifications and hlist_unhashed */
-   spinlock_t lock;
-};
-
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
 
-struct mmu_notifier_range {
-   struct vm_area_struct *vma;
-   struct mm_struct *mm;
-   unsigned long start;
-   unsigned long end;
-   unsigned flags;
-   enum mmu_notifier_event event;
-};
-
 struct mmu_notifier_ops {
/*
 * Called either by mmu_notifier_unregister or when the mm is
@@ -249,6 +222,21 @@ struct mmu_notifier {
unsigned int users;
 };
 
+#ifdef CONFIG_MMU_NOTIFIER
+
+#ifdef CONFIG_LOCKDEP
+extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;
+#endif
+
+struct mmu_notifier_range {
+   struct vm_area_struct *vma;
+   struct mm_struct *mm;
+   unsigned long start;
+   unsigned long end;
+   unsigned flags;
+   enum mmu_notifier_event event;
+};
+
 static inline int mm_has_notifiers(struct mm_struct *mm)
 {
return unlikely(mm->mmu_notifier_mm);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 7fde88695f35d6..367670cfd02b7b 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -27,6 +27,19 @@ struct lockdep_map __mmu_notifier_invalidate_range_start_map 
= {
 };
 #endif
 
+/*
+ * The mmu notifier_mm structure is allocated and installed in
+ * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
+ * critical section and it's released only when mm_count reaches zero
+ * in mmdrop().
+ */
+struct mmu_notifier_mm {
+   /* all mmu notifiers registered in this mm are queued in this list */
+   struct hlist_head list;
+   /* to serialize the list modifications and hlist_unhashed */
+   spinlock_t lock;
+};
+
 /*
  * This function can't run concurrently against mmu_notifier_register
  * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
-- 
2.23.0



[Xen-devel] [PATCH v2 12/15] drm/amdgpu: Call find_vma under mmap_sem

2019-10-28 Thread Jason Gunthorpe
From: Jason Gunthorpe 

find_vma() must be called under the mmap_sem, reorganize this code to
do the vma check after entering the lock.

Further, fix the unlocked use of struct task_struct's mm, instead use
the mm from hmm_mirror which has an active mm_grab. Also the mm_grab
must be converted to a mm_get before acquiring mmap_sem or calling
find_vma().

Fixes: 66c45500bfdc ("drm/amdgpu: use new HMM APIs and helpers")
Fixes: 0919195f2b0d ("drm/amdgpu: Enable amdgpu_ttm_tt_get_user_pages in worker 
threads")
Cc: Alex Deucher 
Cc: Christian König 
Cc: David (ChunMing) Zhou 
Cc: amd-...@lists.freedesktop.org
Signed-off-by: Jason Gunthorpe 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 37 ++---
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index dff41d0a85fe96..c0e41f1f0c2365 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -788,7 +789,7 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, 
struct page **pages)
struct hmm_mirror *mirror = bo->mn ? >mn->mirror : NULL;
struct ttm_tt *ttm = bo->tbo.ttm;
struct amdgpu_ttm_tt *gtt = (void *)ttm;
-   struct mm_struct *mm = gtt->usertask->mm;
+   struct mm_struct *mm;
unsigned long start = gtt->userptr;
struct vm_area_struct *vma;
struct hmm_range *range;
@@ -796,25 +797,14 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, 
struct page **pages)
uint64_t *pfns;
int r = 0;
 
-   if (!mm) /* Happens during process shutdown */
-   return -ESRCH;
-
if (unlikely(!mirror)) {
DRM_DEBUG_DRIVER("Failed to get hmm_mirror\n");
-   r = -EFAULT;
-   goto out;
+   return -EFAULT;
}
 
-   vma = find_vma(mm, start);
-   if (unlikely(!vma || start < vma->vm_start)) {
-   r = -EFAULT;
-   goto out;
-   }
-   if (unlikely((gtt->userflags & AMDGPU_GEM_USERPTR_ANONONLY) &&
-   vma->vm_file)) {
-   r = -EPERM;
-   goto out;
-   }
+   mm = mirror->hmm->mmu_notifier.mm;
+   if (!mmget_not_zero(mm)) /* Happens during process shutdown */
+   return -ESRCH;
 
range = kzalloc(sizeof(*range), GFP_KERNEL);
if (unlikely(!range)) {
@@ -847,6 +837,17 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, 
struct page **pages)
hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT);
 
down_read(>mmap_sem);
+   vma = find_vma(mm, start);
+   if (unlikely(!vma || start < vma->vm_start)) {
+   r = -EFAULT;
+   goto out_unlock;
+   }
+   if (unlikely((gtt->userflags & AMDGPU_GEM_USERPTR_ANONONLY) &&
+   vma->vm_file)) {
+   r = -EPERM;
+   goto out_unlock;
+   }
+
r = hmm_range_fault(range, 0);
up_read(>mmap_sem);
 
@@ -865,15 +866,19 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, 
struct page **pages)
}
 
gtt->range = range;
+   mmput(mm);
 
return 0;
 
+out_unlock:
+   up_read(>mmap_sem);
 out_free_pfns:
hmm_range_unregister(range);
kvfree(pfns);
 out_free_ranges:
kfree(range);
 out:
+   mmput(mm);
return r;
 }
 
-- 
2.23.0



[Xen-devel] [PATCH v2 10/15] nouveau: use mmu_notifier directly for invalidate_range_start

2019-10-28 Thread Jason Gunthorpe
From: Jason Gunthorpe 

There is no reason to get the invalidate_range_start() callback via an
indirection through hmm_mirror, just register a normal notifier directly.

Cc: Ben Skeggs 
Cc: dri-de...@lists.freedesktop.org
Cc: nouv...@lists.freedesktop.org
Cc: Ralph Campbell 
Signed-off-by: Jason Gunthorpe 
---
 drivers/gpu/drm/nouveau/nouveau_svm.c | 95 ++-
 1 file changed, 63 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 668d4bd0c118f1..577f8811925a59 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -88,6 +88,7 @@ nouveau_ivmm_find(struct nouveau_svm *svm, u64 inst)
 }
 
 struct nouveau_svmm {
+   struct mmu_notifier notifier;
struct nouveau_vmm *vmm;
struct {
unsigned long start;
@@ -96,7 +97,6 @@ struct nouveau_svmm {
 
struct mutex mutex;
 
-   struct mm_struct *mm;
struct hmm_mirror mirror;
 };
 
@@ -251,10 +251,11 @@ nouveau_svmm_invalidate(struct nouveau_svmm *svmm, u64 
start, u64 limit)
 }
 
 static int
-nouveau_svmm_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
-   const struct mmu_notifier_range *update)
+nouveau_svmm_invalidate_range_start(struct mmu_notifier *mn,
+   const struct mmu_notifier_range *update)
 {
-   struct nouveau_svmm *svmm = container_of(mirror, typeof(*svmm), mirror);
+   struct nouveau_svmm *svmm =
+   container_of(mn, struct nouveau_svmm, notifier);
unsigned long start = update->start;
unsigned long limit = update->end;
 
@@ -264,6 +265,9 @@ nouveau_svmm_sync_cpu_device_pagetables(struct hmm_mirror 
*mirror,
SVMM_DBG(svmm, "invalidate %016lx-%016lx", start, limit);
 
mutex_lock(>mutex);
+   if (unlikely(!svmm->vmm))
+   goto out;
+
if (limit > svmm->unmanaged.start && start < svmm->unmanaged.limit) {
if (start < svmm->unmanaged.start) {
nouveau_svmm_invalidate(svmm, start,
@@ -273,19 +277,31 @@ nouveau_svmm_sync_cpu_device_pagetables(struct hmm_mirror 
*mirror,
}
 
nouveau_svmm_invalidate(svmm, start, limit);
+
+out:
mutex_unlock(>mutex);
return 0;
 }
 
-static void
-nouveau_svmm_release(struct hmm_mirror *mirror)
+static void nouveau_svmm_free_notifier(struct mmu_notifier *mn)
+{
+   kfree(container_of(mn, struct nouveau_svmm, notifier));
+}
+
+static const struct mmu_notifier_ops nouveau_mn_ops = {
+   .invalidate_range_start = nouveau_svmm_invalidate_range_start,
+   .free_notifier = nouveau_svmm_free_notifier,
+};
+
+static int
+nouveau_svmm_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
+   const struct mmu_notifier_range *update)
 {
+   return 0;
 }
 
-static const struct hmm_mirror_ops
-nouveau_svmm = {
+static const struct hmm_mirror_ops nouveau_svmm = {
.sync_cpu_device_pagetables = nouveau_svmm_sync_cpu_device_pagetables,
-   .release = nouveau_svmm_release,
 };
 
 void
@@ -294,7 +310,10 @@ nouveau_svmm_fini(struct nouveau_svmm **psvmm)
struct nouveau_svmm *svmm = *psvmm;
if (svmm) {
hmm_mirror_unregister(>mirror);
-   kfree(*psvmm);
+   mutex_lock(>mutex);
+   svmm->vmm = NULL;
+   mutex_unlock(>mutex);
+   mmu_notifier_put(>notifier);
*psvmm = NULL;
}
 }
@@ -320,7 +339,7 @@ nouveau_svmm_init(struct drm_device *dev, void *data,
mutex_lock(>mutex);
if (cli->svm.cli) {
ret = -EBUSY;
-   goto done;
+   goto out_free;
}
 
/* Allocate a new GPU VMM that can support SVM (managed by the
@@ -335,24 +354,33 @@ nouveau_svmm_init(struct drm_device *dev, void *data,
.fault_replay = true,
}, sizeof(struct gp100_vmm_v0), >svm.vmm);
if (ret)
-   goto done;
+   goto out_free;
 
-   /* Enable HMM mirroring of CPU address-space to VMM. */
-   svmm->mm = get_task_mm(current);
-   down_write(>mm->mmap_sem);
+   down_write(>mm->mmap_sem);
svmm->mirror.ops = _svmm;
-   ret = hmm_mirror_register(>mirror, svmm->mm);
-   if (ret == 0) {
-   cli->svm.svmm = svmm;
-   cli->svm.cli = cli;
-   }
-   up_write(>mm->mmap_sem);
-   mmput(svmm->mm);
+   ret = hmm_mirror_register(>mirror, current->mm);
+   if (ret)
+   goto out_mm_unlock;
 
-done:
+   svmm->notifier.ops = _mn_ops;
+   ret = __mmu_notifier_register(>notifier, current->mm);
if (ret)
-   nouveau_svmm_fini();
+ 

[Xen-devel] [PATCH v2 03/15] mm/hmm: allow hmm_range to be used with a mmu_range_notifier or hmm_mirror

2019-10-28 Thread Jason Gunthorpe
From: Jason Gunthorpe 

hmm_mirror's handling of ranges does not use a sequence count which
results in this bug:

     CPU0                                   CPU1
                                        hmm_range_wait_until_valid(range)
                                            valid == true
                                        hmm_range_fault(range)
    hmm_invalidate_range_start()
        range->valid = false
    hmm_invalidate_range_end()
        range->valid = true
                                        hmm_range_valid(range)
                                            valid == true

Where the hmm_range_valid should not have succeeded.

Adding the required sequence count would make it nearly identical to the
new mmu_range_notifier. Instead replace the hmm_mirror stuff with
mmu_range_notifier.

Co-existence of the two APIs is the first step.
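
For background, the sequence count closes the window roughly like this
(sketch of the idea only, not the final mmu_range_* API):

static DEFINE_MUTEX(drv_lock);
static unsigned long invalidate_seq;

/* invalidate side, called from the mmu notifier: */
static void drv_invalidate(void)
{
        mutex_lock(&drv_lock);
        invalidate_seq++;       /* any fault that sampled an older value is stale */
        /* ... tear down device PTEs ... */
        mutex_unlock(&drv_lock);
}

/* fault side: */
static int drv_fault(void)
{
        unsigned long seq;

again:
        seq = READ_ONCE(invalidate_seq);
        /* ... snapshot the CPU page tables (e.g. hmm_range_fault()) ... */

        mutex_lock(&drv_lock);
        if (READ_ONCE(invalidate_seq) != seq) {
                mutex_unlock(&drv_lock);
                goto again;     /* raced with an invalidation, snapshot again */
        }
        /* program the device page table from the snapshot */
        mutex_unlock(&drv_lock);
        return 0;
}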

Reviewed-by: Jérôme Glisse 
Signed-off-by: Jason Gunthorpe 
---
 include/linux/hmm.h |  5 +
 mm/hmm.c| 25 +++--
 2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 3fec513b9c00f1..8ac1fd6a81af8f 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -145,6 +145,9 @@ enum hmm_pfn_value_e {
 /*
  * struct hmm_range - track invalidation lock on virtual address range
  *
+ * @notifier: an optional mmu_range_notifier
+ * @notifier_seq: when notifier is used this is the result of
+ *mmu_range_read_begin()
  * @hmm: the core HMM structure this range is active against
  * @vma: the vm area struct for the range
  * @list: all range lock are on a list
@@ -159,6 +162,8 @@ enum hmm_pfn_value_e {
  * @valid: pfns array did not change since it has been fill by an HMM function
  */
 struct hmm_range {
+   struct mmu_range_notifier *notifier;
+   unsigned long   notifier_seq;
struct hmm  *hmm;
struct list_headlist;
unsigned long   start;
diff --git a/mm/hmm.c b/mm/hmm.c
index 902f5fa6bf93ad..22ac3595771feb 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -852,6 +852,14 @@ void hmm_range_unregister(struct hmm_range *range)
 }
 EXPORT_SYMBOL(hmm_range_unregister);
 
+static bool needs_retry(struct hmm_range *range)
+{
+   if (range->notifier)
+   return mmu_range_check_retry(range->notifier,
+range->notifier_seq);
+   return !range->valid;
+}
+
 static const struct mm_walk_ops hmm_walk_ops = {
.pud_entry  = hmm_vma_walk_pud,
.pmd_entry  = hmm_vma_walk_pmd,
@@ -892,18 +900,23 @@ long hmm_range_fault(struct hmm_range *range, unsigned 
int flags)
const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
unsigned long start = range->start, end;
struct hmm_vma_walk hmm_vma_walk;
-   struct hmm *hmm = range->hmm;
+   struct mm_struct *mm;
struct vm_area_struct *vma;
int ret;
 
-   lockdep_assert_held(>mmu_notifier.mm->mmap_sem);
+   if (range->notifier)
+   mm = range->notifier->mm;
+   else
+   mm = range->hmm->mmu_notifier.mm;
+
+   lockdep_assert_held(>mmap_sem);
 
do {
/* If range is no longer valid force retry. */
-   if (!range->valid)
+   if (needs_retry(range))
return -EBUSY;
 
-   vma = find_vma(hmm->mmu_notifier.mm, start);
+   vma = find_vma(mm, start);
if (vma == NULL || (vma->vm_flags & device_vma))
return -EFAULT;
 
@@ -933,7 +946,7 @@ long hmm_range_fault(struct hmm_range *range, unsigned int 
flags)
start = hmm_vma_walk.last;
 
/* Keep trying while the range is valid. */
-   } while (ret == -EBUSY && range->valid);
+   } while (ret == -EBUSY && !needs_retry(range));
 
if (ret) {
unsigned long i;
@@ -991,7 +1004,7 @@ long hmm_range_dma_map(struct hmm_range *range, struct 
device *device,
continue;
 
/* Check if range is being invalidated */
-   if (!range->valid) {
+   if (needs_retry(range)) {
ret = -EBUSY;
goto unmap;
}
-- 
2.23.0



[Xen-devel] [PATCH v2 11/15] nouveau: use mmu_range_notifier instead of hmm_mirror

2019-10-28 Thread Jason Gunthorpe
From: Jason Gunthorpe 

Remove the hmm_mirror object and use the mmu_range_notifier API instead
for the range, and use the normal mmu_notifier API for the general
invalidation callback.

While here re-organize the pagefault path so the locking pattern is clear.

nouveau is the only driver that uses a temporary range object and instead
forwards nearly every invalidation range directly to the HW. While this is
not how the mmu_range_notifier was intended to be used, the overheads on
the pagefaulting path are similar to the existing hmm_mirror version,
particularly since the interval tree will be small.

Cc: Ben Skeggs 
Cc: dri-de...@lists.freedesktop.org
Cc: nouv...@lists.freedesktop.org
Cc: Ralph Campbell 
Signed-off-by: Jason Gunthorpe 
---
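For reference, a simplified sketch (the function name and argument plumbing
are illustrative; the real code is in the diff below) of the temporary range
object lifetime described above: the svm_notifier lives on the stack for a
single fault window only.

static int nouveau_svm_fault_window(struct nouveau_svmm *svmm,
                                    struct nouveau_drm *drm,
                                    struct mm_struct *mm, void *data,
                                    u32 size, unsigned long addr,
                                    unsigned long len, u64 *pfns)
{
        struct svm_notifier notifier = { .svmm = svmm };
        int ret;

        notifier.notifier.ops = &nouveau_svm_mrn_ops;
        ret = mmu_range_notifier_insert(&notifier.notifier, addr, len, mm);
        if (ret)
                return ret;

        /*
         * nouveau_range_fault() snapshots the sequence count with
         * mmu_range_read_begin(), runs hmm_range_fault(), and only programs
         * the GPU page tables while holding svmm->mutex, so it cannot race
         * with nouveau_svm_range_invalidate().
         */
        ret = nouveau_range_fault(svmm, drm, data, size, pfns, &notifier);

        mmu_range_notifier_remove(&notifier.notifier);
        return ret;
}
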
 drivers/gpu/drm/nouveau/nouveau_svm.c | 180 ++
 1 file changed, 100 insertions(+), 80 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 577f8811925a59..f27317fbe36f45 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -96,8 +96,6 @@ struct nouveau_svmm {
} unmanaged;
 
struct mutex mutex;
-
-   struct hmm_mirror mirror;
 };
 
 #define SVMM_DBG(s,f,a...) 
\
@@ -293,23 +291,11 @@ static const struct mmu_notifier_ops nouveau_mn_ops = {
.free_notifier = nouveau_svmm_free_notifier,
 };
 
-static int
-nouveau_svmm_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
-   const struct mmu_notifier_range *update)
-{
-   return 0;
-}
-
-static const struct hmm_mirror_ops nouveau_svmm = {
-   .sync_cpu_device_pagetables = nouveau_svmm_sync_cpu_device_pagetables,
-};
-
 void
 nouveau_svmm_fini(struct nouveau_svmm **psvmm)
 {
struct nouveau_svmm *svmm = *psvmm;
if (svmm) {
-   hmm_mirror_unregister(&svmm->mirror);
mutex_lock(&svmm->mutex);
svmm->vmm = NULL;
mutex_unlock(&svmm->mutex);
@@ -357,15 +343,10 @@ nouveau_svmm_init(struct drm_device *dev, void *data,
goto out_free;
 
down_write(&current->mm->mmap_sem);
-   svmm->mirror.ops = &nouveau_svmm;
-   ret = hmm_mirror_register(&svmm->mirror, current->mm);
-   if (ret)
-   goto out_mm_unlock;
-
svmm->notifier.ops = &nouveau_mn_ops;
ret = __mmu_notifier_register(&svmm->notifier, current->mm);
if (ret)
-   goto out_hmm_unregister;
+   goto out_mm_unlock;
/* Note, ownership of svmm transfers to mmu_notifier */
 
cli->svm.svmm = svmm;
@@ -374,8 +355,6 @@ nouveau_svmm_init(struct drm_device *dev, void *data,
mutex_unlock(&cli->mutex);
return 0;
 
-out_hmm_unregister:
-   hmm_mirror_unregister(&svmm->mirror);
 out_mm_unlock:
up_write(&current->mm->mmap_sem);
 out_free:
@@ -503,43 +482,91 @@ nouveau_svm_fault_cache(struct nouveau_svm *svm,
fault->inst, fault->addr, fault->access);
 }
 
-static inline bool
-nouveau_range_done(struct hmm_range *range)
+struct svm_notifier {
+   struct mmu_range_notifier notifier;
+   struct nouveau_svmm *svmm;
+};
+
+static bool nouveau_svm_range_invalidate(struct mmu_range_notifier *mrn,
+const struct mmu_notifier_range *range,
+unsigned long cur_seq)
 {
-   bool ret = hmm_range_valid(range);
+   struct svm_notifier *sn =
+   container_of(mrn, struct svm_notifier, notifier);
 
-   hmm_range_unregister(range);
-   return ret;
+   /*
+* serializes the update to mrn->invalidate_seq done by caller and
+* prevents invalidation of the PTE from progressing while HW is being
+* programmed. This is very hacky and only works because the normal
+* notifier that does invalidation is always called after the range
+* notifier.
+*/
+   if (mmu_notifier_range_blockable(range))
+   mutex_lock(&sn->svmm->mutex);
+   else if (!mutex_trylock(&sn->svmm->mutex))
+   return false;
+   mmu_range_set_seq(mrn, cur_seq);
+   mutex_unlock(&sn->svmm->mutex);
+   return true;
 }
 
-static int
-nouveau_range_fault(struct nouveau_svmm *svmm, struct hmm_range *range)
+static const struct mmu_range_notifier_ops nouveau_svm_mrn_ops = {
+   .invalidate = nouveau_svm_range_invalidate,
+};
+
+static int nouveau_range_fault(struct nouveau_svmm *svmm,
+  struct nouveau_drm *drm, void *data, u32 size,
+  u64 *pfns,
+  struct svm_notifier *notifier)
 {
+   unsigned long timeout =
+   jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+   /* Have HMM fault pages within the fault window to the GPU. */
+   struct hmm_range range = {
+   .noti

[Xen-devel] [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking

2019-10-28 Thread Jason Gunthorpe
From: Jason Gunthorpe 

8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, hfi1,
scif_dma, vhost, gntdev, hmm) drivers are using a common pattern where
they only use invalidate_range_start/end and immediately check the
invalidating range against some driver data structure to tell if the
driver is interested. Half of them use an interval_tree, the others are
simple linear search lists.

Of the ones I checked, they largely seem to have various kinds of races,
bugs and poor implementations. This is a result of the complexity in how
the notifier interacts with get_user_pages(). It is extremely difficult to
use it correctly.

Consolidate all of this code together into the core mmu_notifier and
provide a locking scheme similar to hmm_mirror that allows the user to
safely use get_user_pages() and reliably know if the page list still
matches the mm.

This new arrangement plays nicely with the !blockable mode for
OOM. Scanning the interval tree is done such that the intersection test
will always succeed, and since there is no invalidate_range_end exposed to
drivers the scheme safely allows multiple drivers to be subscribed.
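
To make the pattern concrete, a minimal sketch of a consolidated user (the
drv_* names, the 16-page buffer and the locking granularity are illustrative
only, not taken from any of the converted drivers): subscribe to the VA
range, then bracket get_user_pages() with the sequence count so an
invalidation that races with the pin is caught under the driver lock.

struct drv_buf {
        struct mmu_range_notifier mrn;
        struct mutex lock;              /* also taken by the invalidate callback */
        struct page *pages[16];
};

static bool drv_invalidate(struct mmu_range_notifier *mrn,
                           const struct mmu_notifier_range *range,
                           unsigned long cur_seq)
{
        struct drv_buf *buf = container_of(mrn, struct drv_buf, mrn);

        if (!mmu_notifier_range_blockable(range))
                return false;
        mutex_lock(&buf->lock);
        mmu_range_set_seq(mrn, cur_seq);
        /* ... tear down any device mapping of buf->pages ... */
        mutex_unlock(&buf->lock);
        return true;
}

static const struct mmu_range_notifier_ops drv_mrn_ops = {
        .invalidate = drv_invalidate,
};

static int drv_pin(struct drv_buf *buf, unsigned long start)
{
        unsigned long seq;
        int ret;

        mutex_init(&buf->lock);
        buf->mrn.ops = &drv_mrn_ops;
        ret = mmu_range_notifier_insert(&buf->mrn, start, 16 * PAGE_SIZE,
                                        current->mm);
        if (ret)
                return ret;

again:
        seq = mmu_range_read_begin(&buf->mrn);
        ret = get_user_pages_fast(start, 16, FOLL_WRITE, buf->pages);
        if (ret != 16) {
                mmu_range_notifier_remove(&buf->mrn);
                return ret < 0 ? ret : -EFAULT;
        }

        mutex_lock(&buf->lock);
        if (mmu_range_check_retry(&buf->mrn, seq)) {
                /* invalidated while we were pinning, drop and retry */
                mutex_unlock(&buf->lock);
                release_pages(buf->pages, 16);
                goto again;
        }
        /* ... hand buf->pages to the device while holding buf->lock ... */
        mutex_unlock(&buf->lock);
        return 0;
}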

Four places are converted as an example of how the new API is used.
Four are left for future patches:
 - i915_gem has complex locking around destruction of a registration,
   needs more study
 - hfi1 (2nd user) needs access to the rbtree
 - scif_dma has a complicated logic flow
 - vhost's mmu notifiers are already being rewritten

This series, and the other code it depends on is available on my github:

https://github.com/jgunthorpe/linux/commits/mmu_notifier

v2 changes:
- Add mmu_range_set_seq() to set the mrn sequence number under the driver
  lock and make the locking more understandable
- Add some additional comments around locking/READ_ONCE
- Make the WARN_ON flow in mn_itree_invalidate a bit easier to follow
- Fix wrong WARN_ON

Jason Gunthorpe (15):
  mm/mmu_notifier: define the header pre-processor parts even if
disabled
  mm/mmu_notifier: add an interval tree notifier
  mm/hmm: allow hmm_range to be used with a mmu_range_notifier or
hmm_mirror
  mm/hmm: define the pre-processor related parts of hmm.h even if
disabled
  RDMA/odp: Use mmu_range_notifier_insert()
  RDMA/hfi1: Use mmu_range_notifier_insert for user_exp_rcv
  drm/radeon: use mmu_range_notifier_insert
  xen/gntdev: Use select for DMA_SHARED_BUFFER
  xen/gntdev: use mmu_range_notifier_insert
  nouveau: use mmu_notifier directly for invalidate_range_start
  nouveau: use mmu_range_notifier instead of hmm_mirror
  drm/amdgpu: Call find_vma under mmap_sem
  drm/amdgpu: Use mmu_range_insert instead of hmm_mirror
  drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror
  mm/hmm: remove hmm_mirror and related

 Documentation/vm/hmm.rst  | 105 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h   |   2 +
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   9 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c|  14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c|   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c| 457 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h|  53 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h|  13 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   | 111 ++--
 drivers/gpu/drm/nouveau/nouveau_svm.c | 231 +---
 drivers/gpu/drm/radeon/radeon.h   |   9 +-
 drivers/gpu/drm/radeon/radeon_mn.c| 219 ++-
 drivers/infiniband/core/device.c  |   1 -
 drivers/infiniband/core/umem_odp.c| 288 +
 drivers/infiniband/hw/hfi1/file_ops.c |   2 +-
 drivers/infiniband/hw/hfi1/hfi.h  |   2 +-
 drivers/infiniband/hw/hfi1/user_exp_rcv.c | 146 ++---
 drivers/infiniband/hw/hfi1/user_exp_rcv.h |   3 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h  |   7 +-
 drivers/infiniband/hw/mlx5/mr.c   |   3 +-
 drivers/infiniband/hw/mlx5/odp.c  |  50 +-
 drivers/xen/Kconfig   |   3 +-
 drivers/xen/gntdev-common.h   |   8 +-
 drivers/xen/gntdev.c  | 180 ++
 include/linux/hmm.h   | 195 +--
 include/linux/mmu_notifier.h  | 144 -
 include/rdma/ib_umem_odp.h|  65 +--
 include/rdma/ib_verbs.h   |   2 -
 kernel/fork.c |   1 -
 mm/Kconfig|   2 +-
 mm/hmm.c  | 275 +
 mm/mmu_notifier.c | 546 +-
 32 files changed, 1225 insertions(+), 1922 deletions(-)

-- 
2.23.0



[Xen-devel] [PATCH v2 04/15] mm/hmm: define the pre-processor related parts of hmm.h even if disabled

2019-10-28 Thread Jason Gunthorpe
From: Jason Gunthorpe 

Only the function calls are stubbed out with static inlines that always
fail. This is the standard way to write a header for an optional component
and makes it easier for drivers that only optionally need HMM_MIRROR.

Reviewed-by: Jérôme Glisse 
Signed-off-by: Jason Gunthorpe 
---
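As an aside, the practical effect for a driver that treats HMM_MIRROR as
optional is that it can call the API unconditionally and key off the stub
return value instead of carrying its own #ifdefs. A contrived sketch (the
drv_dev type and use_mirror flag are made up):

static int drv_enable_mirror(struct drv_dev *dev, struct mm_struct *mm)
{
        int ret;

        ret = hmm_mirror_register(&dev->mirror, mm);
        if (ret == -EOPNOTSUPP) {
                /* kernel built without CONFIG_HMM_MIRROR, run degraded */
                dev->use_mirror = false;
                return 0;
        }
        if (ret)
                return ret;

        dev->use_mirror = true;
        return 0;
}
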
 include/linux/hmm.h | 59 -
 kernel/fork.c   |  1 -
 2 files changed, 47 insertions(+), 13 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 8ac1fd6a81af8f..2666eb08a40615 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -62,8 +62,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_HMM_MIRROR
-
 #include 
 #include 
 #include 
@@ -374,6 +372,15 @@ struct hmm_mirror {
struct list_headlist;
 };
 
+/*
+ * Retry fault if non-blocking, drop mmap_sem and return -EAGAIN in that case.
+ */
+#define HMM_FAULT_ALLOW_RETRY  (1 << 0)
+
+/* Don't fault in missing PTEs, just snapshot the current state. */
+#define HMM_FAULT_SNAPSHOT (1 << 1)
+
+#ifdef CONFIG_HMM_MIRROR
 int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
 
@@ -383,14 +390,6 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
 int hmm_range_register(struct hmm_range *range, struct hmm_mirror *mirror);
 void hmm_range_unregister(struct hmm_range *range);
 
-/*
- * Retry fault if non-blocking, drop mmap_sem and return -EAGAIN in that case.
- */
-#define HMM_FAULT_ALLOW_RETRY  (1 << 0)
-
-/* Don't fault in missing PTEs, just snapshot the current state. */
-#define HMM_FAULT_SNAPSHOT (1 << 1)
-
 long hmm_range_fault(struct hmm_range *range, unsigned int flags);
 
 long hmm_range_dma_map(struct hmm_range *range,
@@ -401,6 +400,44 @@ long hmm_range_dma_unmap(struct hmm_range *range,
 struct device *device,
 dma_addr_t *daddrs,
 bool dirty);
+#else
+static inline int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
+{
+   return -EOPNOTSUPP;
+}
+
+static inline void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+}
+
+static inline int hmm_range_register(struct hmm_range *range, struct hmm_mirror *mirror)
+{
+   return -EOPNOTSUPP;
+}
+
+static inline void hmm_range_unregister(struct hmm_range *range)
+{
+}
+
+static inline long hmm_range_fault(struct hmm_range *range, unsigned int flags)
+{
+   return -EOPNOTSUPP;
+}
+
+static inline long hmm_range_dma_map(struct hmm_range *range,
+struct device *device, dma_addr_t *daddrs,
+unsigned int flags)
+{
+   return -EOPNOTSUPP;
+}
+
+static inline long hmm_range_dma_unmap(struct hmm_range *range,
+  struct device *device,
+  dma_addr_t *daddrs, bool dirty)
+{
+   return -EOPNOTSUPP;
+}
+#endif
 
 /*
  * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
@@ -411,6 +448,4 @@ long hmm_range_dma_unmap(struct hmm_range *range,
  */
 #define HMM_RANGE_DEFAULT_TIMEOUT 1000
 
-#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-
 #endif /* LINUX_HMM_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index f9572f41612628..4561a65d19db88 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -40,7 +40,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
-- 
2.23.0



[Xen-devel] [PATCH v2 06/15] RDMA/hfi1: Use mmu_range_notifier_insert for user_exp_rcv

2019-10-28 Thread Jason Gunthorpe
From: Jason Gunthorpe 

This converts one of the two users of mmu_notifiers to use the new API.
The conversion is fairly straightforward; however, the existing use of
notifiers here seems to be racy.

Cc: Mike Marciniszyn 
Cc: Dennis Dalessandro 
Signed-off-by: Jason Gunthorpe 
---
 drivers/infiniband/hw/hfi1/file_ops.c |   2 +-
 drivers/infiniband/hw/hfi1/hfi.h  |   2 +-
 drivers/infiniband/hw/hfi1/user_exp_rcv.c | 146 +-
 drivers/infiniband/hw/hfi1/user_exp_rcv.h |   3 +-
 4 files changed, 60 insertions(+), 93 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/file_ops.c 
b/drivers/infiniband/hw/hfi1/file_ops.c
index f9a7e9d29c8ba2..7c5e3fb224139a 100644
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -1138,7 +1138,7 @@ static int get_ctxt_info(struct hfi1_filedata *fd, 
unsigned long arg, u32 len)
HFI1_CAP_UGET_MASK(uctxt->flags, MASK) |
HFI1_CAP_KGET_MASK(uctxt->flags, K2U);
/* adjust flag if this fd is not able to cache */
-   if (!fd->handler)
+   if (!fd->use_mn)
cinfo.runtime_flags |= HFI1_CAP_TID_UNMAP; /* no caching */
 
cinfo.num_active = hfi1_count_active_units();
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index fa45350a9a1d32..fc10d65fc3e13c 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1444,7 +1444,7 @@ struct hfi1_filedata {
/* for cpu affinity; -1 if none */
int rec_cpu_num;
u32 tid_n_pinned;
-   struct mmu_rb_handler *handler;
+   bool use_mn;
struct tid_rb_node **entry_to_rb;
spinlock_t tid_lock; /* protect tid_[limit,used] counters */
u32 tid_limit;
diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.c 
b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
index 3592a9ec155e85..a1ab3bd334f89e 100644
--- a/drivers/infiniband/hw/hfi1/user_exp_rcv.c
+++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
@@ -59,11 +59,11 @@ static int set_rcvarray_entry(struct hfi1_filedata *fd,
  struct tid_user_buf *tbuf,
  u32 rcventry, struct tid_group *grp,
  u16 pageidx, unsigned int npages);
-static int tid_rb_insert(void *arg, struct mmu_rb_node *node);
 static void cacheless_tid_rb_remove(struct hfi1_filedata *fdata,
struct tid_rb_node *tnode);
-static void tid_rb_remove(void *arg, struct mmu_rb_node *node);
-static int tid_rb_invalidate(void *arg, struct mmu_rb_node *mnode);
+static bool tid_rb_invalidate(struct mmu_range_notifier *mrn,
+ const struct mmu_notifier_range *range,
+ unsigned long cur_seq);
 static int program_rcvarray(struct hfi1_filedata *fd, struct tid_user_buf *,
struct tid_group *grp,
unsigned int start, u16 count,
@@ -73,10 +73,8 @@ static int unprogram_rcvarray(struct hfi1_filedata *fd, u32 
tidinfo,
  struct tid_group **grp);
 static void clear_tid_node(struct hfi1_filedata *fd, struct tid_rb_node *node);
 
-static struct mmu_rb_ops tid_rb_ops = {
-   .insert = tid_rb_insert,
-   .remove = tid_rb_remove,
-   .invalidate = tid_rb_invalidate
+static const struct mmu_range_notifier_ops tid_mn_ops = {
+   .invalidate = tid_rb_invalidate,
 };
 
 /*
@@ -87,7 +85,6 @@ static struct mmu_rb_ops tid_rb_ops = {
 int hfi1_user_exp_rcv_init(struct hfi1_filedata *fd,
   struct hfi1_ctxtdata *uctxt)
 {
-   struct hfi1_devdata *dd = uctxt->dd;
int ret = 0;
 
spin_lock_init(&fd->tid_lock);
@@ -109,20 +106,7 @@ int hfi1_user_exp_rcv_init(struct hfi1_filedata *fd,
fd->entry_to_rb = NULL;
return -ENOMEM;
}
-
-   /*
-* Register MMU notifier callbacks. If the registration
-* fails, continue without TID caching for this context.
-*/
-   ret = hfi1_mmu_rb_register(fd, fd->mm, &tid_rb_ops,
-  dd->pport->hfi1_wq,
-  &fd->handler);
-   if (ret) {
-   dd_dev_info(dd,
-   "Failed MMU notifier registration %d\n",
-   ret);
-   ret = 0;
-   }
+   fd->use_mn = true;
}
 
/*
@@ -139,7 +123,7 @@ int hfi1_user_exp_rcv_init(struct hfi1_filedata *fd,
 * init.
 */
spin_lock(&fd->tid_lock);
-   if (uctxt->subctxt_cnt && fd->handler) {
+   if (uctxt->subctxt_cnt && fd->use_mn) {
u16 remainder;
 
fd->tid_limit = uctxt->expected_c

[Xen-devel] [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert

2019-10-28 Thread Jason Gunthorpe
From: Jason Gunthorpe 

gntdev simply wants to monitor a specific VMA for any notifier events;
this can be done straightforwardly using mmu_range_notifier_insert() over
the VMA's VA range.

The notifier should be attached until the original VMA is destroyed.

It is unclear if any of this is even sane, but at least a lot of duplicate
code is removed.

Cc: Oleksandr Andrushchenko 
Cc: Boris Ostrovsky 
Cc: xen-devel@lists.xenproject.org
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Signed-off-by: Jason Gunthorpe 
---
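The core of the conversion boils down to the two calls sketched below (the
helper and ops names are illustrative, the real hook-up and the invalidate
callback are in the diff; locking details around mmap_sem are omitted):
attach a notifier covering exactly the VMA's VA range when it is mapped and
detach it when the VMA goes away.

static int gntdev_attach_notifier(struct gntdev_grant_map *map,
                                  struct vm_area_struct *vma)
{
        map->notifier.ops = &gntdev_mmu_ops; /* .invalidate = gntdev_invalidate */
        return mmu_range_notifier_insert(&map->notifier, vma->vm_start,
                                         vma->vm_end - vma->vm_start,
                                         vma->vm_mm);
}

static void gntdev_detach_notifier(struct gntdev_grant_map *map)
{
        /* from the vm_ops->close handler, see gntdev_vma_close() below */
        mmu_range_notifier_remove(&map->notifier);
        map->vma = NULL;
}
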
 drivers/xen/gntdev-common.h |   8 +-
 drivers/xen/gntdev.c| 180 ++--
 2 files changed, 49 insertions(+), 139 deletions(-)

diff --git a/drivers/xen/gntdev-common.h b/drivers/xen/gntdev-common.h
index 2f8b949c3eeb14..b201fdd20b667b 100644
--- a/drivers/xen/gntdev-common.h
+++ b/drivers/xen/gntdev-common.h
@@ -21,15 +21,8 @@ struct gntdev_dmabuf_priv;
 struct gntdev_priv {
/* Maps with visible offsets in the file descriptor. */
struct list_head maps;
-   /*
-* Maps that are not visible; will be freed on munmap.
-* Only populated if populate_freeable_maps == 1
-*/
-   struct list_head freeable_maps;
/* lock protects maps and freeable_maps. */
struct mutex lock;
-   struct mm_struct *mm;
-   struct mmu_notifier mn;
 
 #ifdef CONFIG_XEN_GRANT_DMA_ALLOC
/* Device for which DMA memory is allocated. */
@@ -49,6 +42,7 @@ struct gntdev_unmap_notify {
 };
 
 struct gntdev_grant_map {
+   struct mmu_range_notifier notifier;
struct list_head next;
struct vm_area_struct *vma;
int index;
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index a446a7221e13e9..12d626670bebbc 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -65,7 +65,6 @@ MODULE_PARM_DESC(limit, "Maximum number of grants that may be 
mapped by "
 static atomic_t pages_mapped = ATOMIC_INIT(0);
 
 static int use_ptemod;
-#define populate_freeable_maps use_ptemod
 
 static int unmap_grant_pages(struct gntdev_grant_map *map,
 int offset, int pages);
@@ -251,12 +250,6 @@ void gntdev_put_map(struct gntdev_priv *priv, struct 
gntdev_grant_map *map)
evtchn_put(map->notify.event);
}
 
-   if (populate_freeable_maps && priv) {
-   mutex_lock(&priv->lock);
-   list_del(&map->next);
-   mutex_unlock(&priv->lock);
-   }
-
if (map->pages && !use_ptemod)
unmap_grant_pages(map, 0, map->count);
gntdev_free_map(map);
@@ -445,17 +438,9 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
struct gntdev_priv *priv = file->private_data;
 
pr_debug("gntdev_vma_close %p\n", vma);
-   if (use_ptemod) {
-   /* It is possible that an mmu notifier could be running
-* concurrently, so take priv->lock to ensure that the vma won't
-* vanishing during the unmap_grant_pages call, since we will
-* spin here until that completes. Such a concurrent call will
-* not do any unmapping, since that has been done prior to
-* closing the vma, but it may still iterate the unmap_ops list.
-*/
-   mutex_lock(&priv->lock);
+   if (use_ptemod && map->vma == vma) {
+   mmu_range_notifier_remove(&map->notifier);
map->vma = NULL;
-   mutex_unlock(&priv->lock);
}
vma->vm_private_data = NULL;
gntdev_put_map(priv, map);
@@ -477,109 +462,44 @@ static const struct vm_operations_struct gntdev_vmops = {
 
 /* -- */
 
-static bool in_range(struct gntdev_grant_map *map,
- unsigned long start, unsigned long end)
-{
-   if (!map->vma)
-   return false;
-   if (map->vma->vm_start >= end)
-   return false;
-   if (map->vma->vm_end <= start)
-   return false;
-
-   return true;
-}
-
-static int unmap_if_in_range(struct gntdev_grant_map *map,
- unsigned long start, unsigned long end,
- bool blockable)
+static bool gntdev_invalidate(struct mmu_range_notifier *mn,
+ const struct mmu_notifier_range *range,
+ unsigned long cur_seq)
 {
+   struct gntdev_grant_map *map =
+   container_of(mn, struct gntdev_grant_map, notifier);
unsigned long mstart, mend;
int err;
 
-   if (!in_range(map, start, end))
-   return 0;
+   if (!mmu_notifier_range_blockable(range))
+   return false;
 
-   if (!blockable)
-   return -EAGAIN;
+   /*
+* If the VMA is split or otherwise changed the notifier is not
+* u

Re: [Xen-devel] [PATCH hmm 08/15] xen/gntdev: Use select for DMA_SHARED_BUFFER

2019-10-21 Thread Jason Gunthorpe
On Wed, Oct 16, 2019 at 06:35:15AM +, Oleksandr Andrushchenko wrote:
> On 10/16/19 8:11 AM, Jürgen Groß wrote:
> > On 15.10.19 20:12, Jason Gunthorpe wrote:
> >> From: Jason Gunthorpe 
> >>
> >> DMA_SHARED_BUFFER can not be enabled by the user (it represents a 
> >> library
> >> set in the kernel). The kconfig convention is to use select for such
> >> symbols so they are turned on implicitly when the user enables a kconfig
> >> that needs them.
> >>
> >> Otherwise the XEN_GNTDEV_DMABUF kconfig is overly difficult to enable.
> >>
> >> Fixes: 932d6562179e ("xen/gntdev: Add initial support for dma-buf UAPI")
> >> Cc: Oleksandr Andrushchenko 
> >> Cc: Boris Ostrovsky 
> >> Cc: xen-devel@lists.xenproject.org
> >> Cc: Juergen Gross 
> >> Cc: Stefano Stabellini 
> >> Signed-off-by: Jason Gunthorpe 
> >
> > Reviewed-by: Juergen Gross 
> >
> Reviewed-by: Oleksandr Andrushchenko 

Thanks Oleksandr and Juergen. Can you also give me some advice on how
to progress the more complex patch:

https://patchwork.kernel.org/patch/11191369/

Is this gntdev stuff still in use? I struggled a bit to understand
what it is doing, but I think I made a reasonable guess?

Jason