Re: [Nouveau] [PATCH drm-misc-next 2/3] drm/gpuva_mgr: generalize dma_resv/extobj handling and GEM validation

2023-10-12 Thread Christoph Hellwig
On Thu, Oct 12, 2023 at 02:35:15PM +0200, Christian König wrote:
> In addition to that, from the software side, Felix summarized it quite well
> in the HMM peer2peer discussion thread recently.

Do you have a pointer to that discussion?



Re: [Nouveau] [PATCH drm-next v5 03/14] drm: manager to keep track of GPUs VA mappings

2023-06-19 Thread Christoph Hellwig
Why are none of these EXPORT_SYMBOL_GPL as it's very linux-internal
stuff?



Re: [Nouveau] [PATCH 2/4] x86: always initialize xen-swiotlb when xen-pcifront is enabled

2023-06-12 Thread Christoph Hellwig
Thank you.  I'll queue it up as a separate patch.



Re: [Nouveau] [PATCH 2/4] x86: always initialize xen-swiotlb when xen-pcifront is enabled

2023-06-12 Thread Christoph Hellwig
On Fri, Jun 09, 2023 at 05:38:28PM +0200, Juergen Gross wrote:
>>> guest started with e820_host=1 even if no PCI passthrough was planned.
>>> But this should be rather rare (at least I hope so).
>>
>> So is this an ACK for the patch and can we go ahead with it?
>
> As long as above mentioned check of the E820 map is done, yes.
>
> If you want I can send a diff to be folded into your patch on Monday.

Yes, that would be great!



Re: [Nouveau] [PATCH 2/4] x86: always initialize xen-swiotlb when xen-pcifront is enabled

2023-06-07 Thread Christoph Hellwig
On Mon, May 22, 2023 at 10:37:09AM +0200, Juergen Gross wrote:
> In normal cases PCI passthrough in PV guests requires starting the guest
> with e820_host=1. So it should be rather easy to limit allocating the
> 64MB in PV guests to the cases where the memory map has non-RAM regions
> especially in the first 1MB of the memory.
>
> This will cover even hotplug cases. The only case not covered would be a
> guest started with e820_host=1 even if no PCI passthrough was planned.
> But this should be rather rare (at least I hope so).

So is this an ACK for the patch and can we go ahead with it?

(I'd still like to merge swiotlb-xen into swiotlb eventually, but it's
probably not going to happen this merge window)


Re: [Nouveau] [PATCH 3/4] drm/nouveau: stop using is_swiotlb_active

2023-06-07 Thread Christoph Hellwig
On Thu, May 18, 2023 at 04:30:49PM -0400, Lyude Paul wrote:
> Reviewed-by: Lyude Paul 
> 
> Thanks for getting to this!

I've tentatively queued this up in the dma-mapping for-next tree.
Let me know if you'd prefer it to go through the nouveau tree.


Re: [Nouveau] [PATCH 2/4] x86: always initialize xen-swiotlb when xen-pcifront is enabled

2023-05-20 Thread Christoph Hellwig
On Fri, May 19, 2023 at 02:58:57PM +0200, Christoph Hellwig wrote:
> On Fri, May 19, 2023 at 01:49:46PM +0100, Andrew Cooper wrote:
> > > The alternative would be to finally merge swiotlb-xen into swiotlb, in
> > > which case we might be able to do this later.  Let me see what I can
> > > do there.
> > 
> > If that is an option, it would be great to reduce the special-casing.
> 
> I think it's doable, and I've been wanting it for a while.  I just
> need motivated testers, but it seems like I just found at least two :)

So looking at swiotlb-xen, it does these odd things where it takes a value
originally generated by xen_phys_to_dma, then only does a dma_to_phys
to go back and calls pfn_valid on the result.  Does this make sense, or
is it wrong and just works by accident?  I.e. is the patch below correct?


diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 67aa74d201627d..3396c5766f0dd8 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -90,9 +90,7 @@ static inline int range_straddles_page_boundary(phys_addr_t 
p, size_t size)
 
 static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr)
 {
-   unsigned long bfn = XEN_PFN_DOWN(dma_to_phys(dev, dma_addr));
-   unsigned long xen_pfn = bfn_to_local_pfn(bfn);
-   phys_addr_t paddr = (phys_addr_t)xen_pfn << XEN_PAGE_SHIFT;
+   phys_addr_t paddr = xen_dma_to_phys(dev, dma_addr);
 
/* If the address is outside our domain, it CAN
 * have the same virtual address as another address
@@ -234,7 +232,7 @@ static dma_addr_t xen_swiotlb_map_page(struct device *dev, 
struct page *page,
 
 done:
if (!dev_is_dma_coherent(dev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
-   if (pfn_valid(PFN_DOWN(dma_to_phys(dev, dev_addr))))
+   if (pfn_valid(PFN_DOWN(phys)))
arch_sync_dma_for_device(phys, size, dir);
else
xen_dma_sync_for_device(dev, dev_addr, size, dir);
@@ -258,7 +256,7 @@ static void xen_swiotlb_unmap_page(struct device *hwdev, 
dma_addr_t dev_addr,
BUG_ON(dir == DMA_NONE);
 
if (!dev_is_dma_coherent(hwdev) && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) {
-   if (pfn_valid(PFN_DOWN(dma_to_phys(hwdev, dev_addr))))
+   if (pfn_valid(PFN_DOWN(paddr)))
arch_sync_dma_for_cpu(paddr, size, dir);
else
xen_dma_sync_for_cpu(hwdev, dev_addr, size, dir);
@@ -276,7 +274,7 @@ xen_swiotlb_sync_single_for_cpu(struct device *dev, 
dma_addr_t dma_addr,
phys_addr_t paddr = xen_dma_to_phys(dev, dma_addr);
 
if (!dev_is_dma_coherent(dev)) {
-   if (pfn_valid(PFN_DOWN(dma_to_phys(dev, dma_addr))))
+   if (pfn_valid(PFN_DOWN(paddr)))
arch_sync_dma_for_cpu(paddr, size, dir);
else
xen_dma_sync_for_cpu(dev, dma_addr, size, dir);
@@ -296,7 +294,7 @@ xen_swiotlb_sync_single_for_device(struct device *dev, 
dma_addr_t dma_addr,
swiotlb_sync_single_for_device(dev, paddr, size, dir);
 
if (!dev_is_dma_coherent(dev)) {
-   if (pfn_valid(PFN_DOWN(dma_to_phys(dev, dma_addr))))
+   if (pfn_valid(PFN_DOWN(paddr)))
arch_sync_dma_for_device(paddr, size, dir);
else
xen_dma_sync_for_device(dev, dma_addr, size, dir);


Re: [Nouveau] [PATCH 2/4] x86: always initialize xen-swiotlb when xen-pcifront is enabled

2023-05-19 Thread Christoph Hellwig
On Fri, May 19, 2023 at 01:49:46PM +0100, Andrew Cooper wrote:
> > The alternative would be to finally merge swiotlb-xen into swiotlb, in
> > which case we might be able to do this later.  Let me see what I can
> > do there.
> 
> If that is an option, it would be great to reduce the special-casing.

I think it's doable, and I've been wanting it for a while.  I just
need motivated testers, but it seems like I just found at least two :)


Re: [Nouveau] [PATCH 2/4] x86: always initialize xen-swiotlb when xen-pcifront is enabled

2023-05-19 Thread Christoph Hellwig
On Fri, May 19, 2023 at 12:10:26PM +0200, Marek Marczykowski-Górecki wrote:
> While I would say PCI passthrough is not very common for PV guests, can
> the decision about xen-swiotlb be delayed until you can enumerate
> xenstore to check if there are any PCI devices connected (and not
> allocate xen-swiotlb by default if there are none)? This would
> still not cover the hotplug case (in which case, you'd need to force it
> with a cmdline), but at least you wouldn't lose much memory just
> because one of your VMs may use PCI passthrough (so, you have it enabled
> in your kernel).

How early can we query xenstore?  We'd need to do this before setting
up DMA for any device.

The alternative would be to finally merge swiotlb-xen into swiotlb, in
which case we might be able to do this later.  Let me see what I can
do there.


Re: [Nouveau] [PATCH 2/4] x86: always initialize xen-swiotlb when xen-pcifront is enabled

2023-05-18 Thread Christoph Hellwig
On Thu, May 18, 2023 at 08:18:39PM +0200, Marek Marczykowski-Górecki wrote:
> On Thu, May 18, 2023 at 03:42:51PM +0200, Christoph Hellwig wrote:
> > Remove the dangerous late initialization of xen-swiotlb in
> > pci_xen_swiotlb_init_late and instead just always initialize
> > xen-swiotlb in the boot code if CONFIG_XEN_PCIDEV_FRONTEND is enabled.
> > 
> > Signed-off-by: Christoph Hellwig 
> 
> Doesn't it mean all the PV guests will basically waste 64MB of RAM
> by default each if they don't really have PCI devices?

If CONFIG_XEN_PCIDEV_FRONTEND is enabled, and the kernel isn't booted
with swiotlb=noforce, yes.



[Nouveau] [PATCH 3/4] drm/nouveau: stop using is_swiotlb_active

2023-05-18 Thread Christoph Hellwig
Drivers have no business looking into dma-mapping internals and checking
what backend is used.  Unfortunately the DRM core is still broken and
tries to do plain page allocations instead of using the DMA API allocators
by default, and uses various band-aids to decide when to use dma_alloc_coherent.

Switch nouveau to use the same (broken) scheme as amdgpu and radeon
to remove the last driver user of is_swiotlb_active.

Signed-off-by: Christoph Hellwig 
---
 drivers/gpu/drm/nouveau/nouveau_ttm.c | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c 
b/drivers/gpu/drm/nouveau/nouveau_ttm.c
index 1469a88910e45d..486f39f31a38df 100644
--- a/drivers/gpu/drm/nouveau/nouveau_ttm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c
@@ -24,9 +24,9 @@
  */
 
 #include <linux/limits.h>
-#include <linux/swiotlb.h>
 
 #include <drm/ttm/ttm_range_manager.h>
+#include <drm/drm_cache.h>
 
 #include "nouveau_drv.h"
 #include "nouveau_gem.h"
@@ -265,7 +265,6 @@ nouveau_ttm_init(struct nouveau_drm *drm)
struct nvkm_pci *pci = device->pci;
struct nvif_mmu *mmu = &drm->client.mmu;
struct drm_device *dev = drm->dev;
-   bool need_swiotlb = false;
int typei, ret;
 
ret = nouveau_ttm_init_host(drm, 0);
@@ -300,13 +299,10 @@ nouveau_ttm_init(struct nouveau_drm *drm)
drm->agp.cma = pci->agp.cma;
}
 
-#if IS_ENABLED(CONFIG_SWIOTLB) && IS_ENABLED(CONFIG_X86)
-   need_swiotlb = is_swiotlb_active(dev->dev);
-#endif
-
ret = ttm_device_init(&drm->ttm.bdev, &nouveau_bo_driver, drm->dev->dev,
  dev->anon_inode->i_mapping,
- dev->vma_offset_manager, need_swiotlb,
+ dev->vma_offset_manager,
+ drm_need_swiotlb(drm->client.mmu.dmabits),
  drm->client.mmu.dmabits <= 32);
if (ret) {
NV_ERROR(drm, "error initialising bo driver, %d\n", ret);
-- 
2.39.2



[Nouveau] unexport swiotlb_active

2023-05-18 Thread Christoph Hellwig
Hi all,

this little series removes the last swiotlb API exposed to modules.

Diffstat:
 arch/x86/include/asm/xen/swiotlb-xen.h |6 --
 arch/x86/kernel/pci-dma.c  |   28 
 drivers/gpu/drm/nouveau/nouveau_ttm.c  |   10 +++---
 drivers/pci/xen-pcifront.c |6 --
 kernel/dma/swiotlb.c   |1 -
 5 files changed, 7 insertions(+), 44 deletions(-)


[Nouveau] [PATCH 2/4] x86: always initialize xen-swiotlb when xen-pcifront is enabled

2023-05-18 Thread Christoph Hellwig
Remove the dangerous late initialization of xen-swiotlb in
pci_xen_swiotlb_init_late and instead just always initialize
xen-swiotlb in the boot code if CONFIG_XEN_PCIDEV_FRONTEND is enabled.

Signed-off-by: Christoph Hellwig 
---
 arch/x86/include/asm/xen/swiotlb-xen.h |  6 --
 arch/x86/kernel/pci-dma.c  | 25 +++--
 drivers/pci/xen-pcifront.c |  6 --
 3 files changed, 3 insertions(+), 34 deletions(-)

diff --git a/arch/x86/include/asm/xen/swiotlb-xen.h 
b/arch/x86/include/asm/xen/swiotlb-xen.h
index 77a2d19cc9909e..abde0f44df57dc 100644
--- a/arch/x86/include/asm/xen/swiotlb-xen.h
+++ b/arch/x86/include/asm/xen/swiotlb-xen.h
@@ -2,12 +2,6 @@
 #ifndef _ASM_X86_SWIOTLB_XEN_H
 #define _ASM_X86_SWIOTLB_XEN_H
 
-#ifdef CONFIG_SWIOTLB_XEN
-extern int pci_xen_swiotlb_init_late(void);
-#else
-static inline int pci_xen_swiotlb_init_late(void) { return -ENXIO; }
-#endif
-
 int xen_swiotlb_fixup(void *buf, unsigned long nslabs);
 int xen_create_contiguous_region(phys_addr_t pstart, unsigned int order,
unsigned int address_bits,
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index f887b08ac5ffe4..c4a7ead9eb674e 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -81,27 +81,6 @@ static void __init pci_xen_swiotlb_init(void)
if (IS_ENABLED(CONFIG_PCI))
pci_request_acs();
 }
-
-int pci_xen_swiotlb_init_late(void)
-{
-   if (dma_ops == &xen_swiotlb_dma_ops)
-   return 0;
-
-   /* we can work with the default swiotlb */
-   if (!io_tlb_default_mem.nslabs) {
-   int rc = swiotlb_init_late(swiotlb_size_or_default(),
-  GFP_KERNEL, xen_swiotlb_fixup);
-   if (rc < 0)
-   return rc;
-   }
-
-   /* XXX: this switches the dma ops under live devices! */
-   dma_ops = &xen_swiotlb_dma_ops;
-   if (IS_ENABLED(CONFIG_PCI))
-   pci_request_acs();
-   return 0;
-}
-EXPORT_SYMBOL_GPL(pci_xen_swiotlb_init_late);
 #else
 static inline void __init pci_xen_swiotlb_init(void)
 {
@@ -111,7 +90,9 @@ static inline void __init pci_xen_swiotlb_init(void)
 void __init pci_iommu_alloc(void)
 {
if (xen_pv_domain()) {
-   if (xen_initial_domain() || x86_swiotlb_enable)
+   if (xen_initial_domain() ||
+   IS_ENABLED(CONFIG_XEN_PCIDEV_FRONTEND) ||
+   x86_swiotlb_enable)
pci_xen_swiotlb_init();
return;
}
diff --git a/drivers/pci/xen-pcifront.c b/drivers/pci/xen-pcifront.c
index 83c0ab50676dff..11636634ae512f 100644
--- a/drivers/pci/xen-pcifront.c
+++ b/drivers/pci/xen-pcifront.c
@@ -22,7 +22,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #include 
@@ -669,11 +668,6 @@ static int pcifront_connect_and_init_dma(struct 
pcifront_device *pdev)
 
spin_unlock(&pcifront_dev_lock);
 
-   if (!err && !is_swiotlb_active(&pdev->xdev->dev)) {
-   err = pci_xen_swiotlb_init_late();
-   if (err)
-   dev_err(&pdev->xdev->dev, "Could not setup SWIOTLB!\n");
-   }
return err;
 }
 
-- 
2.39.2



[Nouveau] [PATCH 1/4] x86: move a check out of pci_xen_swiotlb_init

2023-05-18 Thread Christoph Hellwig
Move the exact checks when to initialize the Xen swiotlb code out
of pci_xen_swiotlb_init and into the caller so that it uses readable
positive checks, rather than negative ones that will get even more
confusing with another addition.

Signed-off-by: Christoph Hellwig 
---
 arch/x86/kernel/pci-dma.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index de6be0a3965ee4..f887b08ac5ffe4 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -74,8 +74,6 @@ static inline void __init pci_swiotlb_detect(void)
 #ifdef CONFIG_SWIOTLB_XEN
 static void __init pci_xen_swiotlb_init(void)
 {
-   if (!xen_initial_domain() && !x86_swiotlb_enable)
-   return;
x86_swiotlb_enable = true;
x86_swiotlb_flags |= SWIOTLB_ANY;
swiotlb_init_remap(true, x86_swiotlb_flags, xen_swiotlb_fixup);
@@ -113,7 +111,8 @@ static inline void __init pci_xen_swiotlb_init(void)
 void __init pci_iommu_alloc(void)
 {
if (xen_pv_domain()) {
-   pci_xen_swiotlb_init();
+   if (xen_initial_domain() || x86_swiotlb_enable)
+   pci_xen_swiotlb_init();
return;
}
pci_swiotlb_detect();
-- 
2.39.2



[Nouveau] [PATCH 4/4] swiotlb: unexport is_swiotlb_active

2023-05-18 Thread Christoph Hellwig
Drivers have no business looking at dma-mapping or swiotlb internals.

Signed-off-by: Christoph Hellwig 
---
 kernel/dma/swiotlb.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index af2e304c672c43..9f1fd28264a067 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -921,7 +921,6 @@ bool is_swiotlb_active(struct device *dev)
 
return mem && mem->nslabs;
 }
-EXPORT_SYMBOL_GPL(is_swiotlb_active);
 
 #ifdef CONFIG_DEBUG_FS
 
-- 
2.39.2



Re: [Nouveau] [PATCH v2] mm: Take a page reference when removing device exclusive entries

2023-03-29 Thread Christoph Hellwig
s/page/folio/ in the entire commit log?


Re: [Nouveau] [RFC] drm/nouveau/ttm: Stop calling into swiotlb

2022-07-29 Thread Christoph Hellwig
Hi Lyude, and thanks for taking a look.

> -#if IS_ENABLED(CONFIG_SWIOTLB) && IS_ENABLED(CONFIG_X86)
> - need_swiotlb = is_swiotlb_active(dev->dev);
> -#endif
> -
>   ret = ttm_device_init(&drm->ttm.bdev, &nouveau_bo_driver, drm->dev->dev,
> -   dev->anon_inode->i_mapping,
> -   dev->vma_offset_manager, need_swiotlb,
> -   drm->client.mmu.dmabits <= 32);
> +   dev->anon_inode->i_mapping,
> +   dev->vma_offset_manager,
> +   nouveau_drm_use_coherent_gpu_mapping(drm),
> +   drm->client.mmu.dmabits <= 32);

This will break setups for two reasons:

 - swiotlb is not only used to work around device addressing limitations, so
   this will not catch the case of interconnect addressing limitations
   or forced bounce buffering, which is used e.g. in secure VMs.
 - we might need bouncing for any DMA address below the physical
   address limit of the CPU

But more fundamentally the use_dma32 argument to ttm_device_init
is rather broken, as the only way to get a memory allocation that
fits the DMA addressing needs of a device is to use the proper
DMA mapping helpers. i.e. ttm_pool_alloc_page really needs to use
dma_alloc_pages instead of alloc_pages as a first step.  That way
all users of the TTM pool will always get dma addressable pages
and there is no need to guess the addressing limitations.

The use_dma_alloc is then only needed for users that require coherent
memory and are willing to deal with the limitations that this entails
(e.g. no way to get at the page struct).

>   if (ret) {
>   NV_ERROR(drm, "error initialising bo driver, %d\n", ret);
>   return ret;
> -- 
> 2.35.3
---end quoted text---
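As a rough sketch of the allocation change suggested above (illustrative only;
the helper name and context are assumptions, not the actual ttm_pool code),
the first step would look something like this:

#include <linux/dma-mapping.h>

/*
 * Illustrative sketch: allocate through the DMA API so the returned pages
 * always fit the device's addressing limits, instead of a plain
 * alloc_pages() call that ignores the DMA mask.
 */
static struct page *example_pool_alloc(struct device *dev, size_t size,
				       dma_addr_t *dma_addr)
{
	return dma_alloc_pages(dev, size, dma_addr, DMA_BIDIRECTIONAL,
			       GFP_KERNEL);
}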


Re: [Nouveau] sunsetting the remaining swiotlb coupling in DRM

2022-07-11 Thread Christoph Hellwig
On Mon, Jul 11, 2022 at 04:31:49PM -0400, Rodrigo Vivi wrote:
> On Mon, Jul 11, 2022 at 10:26:14AM +0200, Christoph Hellwig wrote:
> > Hi i915 and nouveau maintainers,
> > 
> > any chance I could get some help to remove the remaining direct
> > driver calls into swiotlb, namely swiotlb_max_segment and
> > is_swiotlb_active.  Neither should matter to a driver, as drivers
> > should be written to the DMA API.
> 
> Hi Christoph,
> 
> while we take a look here, could you please share the reasons
> behind sunsetting these calls?

Because they are a completely broken layering violation.  A driver has
absolutely no business knowing the dma-mapping implementation.  The DMA
API reports what we think are all the useful constraints (e.g.
dma_max_mapping_size()), and provides useful APIs (e.g.
dma_alloc_noncoherent or dma_alloc_noncontiguous) to allocate pages
that can be mapped without bounce buffering, and drivers should use
the proper API instead of poking into one particular implementation
and restricting it from changing.

swiotlb_max_segment in particular returns a value that isn't actually
correct (a driver can't just use all of swiotlb) AND actually doesn't
work as-is in various scenarios that are becoming more common,
most notably hosts with memory encryption schemes that always require
bounce buffering.
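A minimal sketch of what replacing swiotlb_max_segment with the DMA API's
constraint reporting could look like (the helper name and the driver_cap
parameter are illustrative assumptions, not an existing interface):

#include <linux/dma-mapping.h>
#include <linux/minmax.h>

/*
 * Illustrative sketch: cap a driver's segment size via the DMA API rather
 * than peeking at swiotlb internals; dma_max_mapping_size() already
 * accounts for bounce-buffering limits.
 */
static size_t example_max_segment(struct device *dev, size_t driver_cap)
{
	return min(driver_cap, dma_max_mapping_size(dev));
}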


[Nouveau] sunsetting the remaining swiotlb coupling in DRM

2022-07-11 Thread Christoph Hellwig
Hi i915 and nouveau maintainers,

any chance I could get some help to remove the remaining direct
driver calls into swiotlb, namely swiotlb_max_segment and
is_swiotlb_active.  Neither should matter to a driver, as drivers
should be written to the DMA API.

In the i915 case it seems like the driver should use
dma_alloc_noncontiguous and/or dma_alloc_noncoherent to allocate
DMAable memory instead of using alloc_page and the streaming
dma mapping helpers.

For the latter it seems like it should just stop passing
use_dma_alloc == true to ttm_device_init and/or that function
should switch to use dma_alloc_noncoherent.
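A hedged sketch of the dma_alloc_noncontiguous approach suggested for i915
above (names and error handling are illustrative, not the actual driver code):

#include <linux/dma-mapping.h>

/*
 * Illustrative sketch: let the DMA API allocate pages that can be mapped
 * without bounce buffering, instead of alloc_page() plus streaming mappings.
 */
static struct sg_table *example_alloc_dmaable(struct device *dev, size_t size,
					      void **vaddr)
{
	struct sg_table *sgt;

	sgt = dma_alloc_noncontiguous(dev, size, DMA_BIDIRECTIONAL,
				      GFP_KERNEL, 0);
	if (!sgt)
		return NULL;

	/* Optional kernel mapping of the backing pages. */
	*vaddr = dma_vmap_noncontiguous(dev, size, sgt);
	if (!*vaddr) {
		dma_free_noncontiguous(dev, size, sgt, DMA_BIDIRECTIONAL);
		return NULL;
	}
	return sgt;
}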


Re: [Nouveau] [PATCH 13/27] mm: move the migrate_vma_* device migration code into it's own file

2022-02-10 Thread Christoph Hellwig
On Thu, Feb 10, 2022 at 09:35:10PM +1100, Alistair Popple wrote:
> I got the following build error:
> 
> /data/source/linux/mm/migrate_device.c: In function ‘migrate_vma_collect_pmd’:
> /data/source/linux/mm/migrate_device.c:242:3: error: implicit declaration of 
> function ‘flush_tlb_range’; did you mean ‘flush_pmd_tlb_range’? 
> [-Werror=implicit-function-declaration]
>   242 |   flush_tlb_range(walk->vma, start, end);
>   |   ^~~
>   |   flush_pmd_tlb_range
> 
> Including asm/tlbflush.h in migrate_device.c fixed it for me.

Yes, the buildbot also complained about this, but somehow in my test
configs it got pulled in implicitly.


[Nouveau] [PATCH 27/27] tools: add hmm gup test for long term pinned device pages

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

The intention is to test device coherent type pages that have been
pinned through get_user_pages() with the PIN_LONGTERM flag set. These pages
should get migrated back to normal system memory.

Signed-off-by: Alex Sierra 
Signed-off-by: Alistair Popple 
Reviewed-by: Felix Kuehling 
Signed-off-by: Christoph Hellwig 
---
 tools/testing/selftests/vm/Makefile|  2 +-
 tools/testing/selftests/vm/hmm-tests.c | 81 ++
 2 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vm/Makefile 
b/tools/testing/selftests/vm/Makefile
index 1607322a112c91..58c8427114f0c2 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -142,7 +142,7 @@ $(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS 
+= -lcap
 
 $(OUTPUT)/gup_test: ../../../../mm/gup_test.h
 
-$(OUTPUT)/hmm-tests: local_config.h
+$(OUTPUT)/hmm-tests: local_config.h ../../../../mm/gup_test.h
 
 # HMM_EXTRA_LIBS may get set in local_config.mk, or it may be left empty.
 $(OUTPUT)/hmm-tests: LDLIBS += $(HMM_EXTRA_LIBS)
diff --git a/tools/testing/selftests/vm/hmm-tests.c 
b/tools/testing/selftests/vm/hmm-tests.c
index 84ec8c4a1dc7b6..11b83a8084fee2 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -36,6 +36,7 @@
  * in the usual include/uapi/... directory.
  */
 #include "../../../../lib/test_hmm_uapi.h"
+#include "../../../../mm/gup_test.h"
 
 struct hmm_buffer {
void*ptr;
@@ -60,6 +61,8 @@ enum {
 #define NTIMES 10
 
 #define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1)))
+/* Just the flags we need, copied from mm.h: */
+#define FOLL_WRITE 0x01/* check pte is writable */
 
 FIXTURE(hmm)
 {
@@ -1766,4 +1769,82 @@ TEST_F(hmm, exclusive_cow)
hmm_buffer_free(buffer);
 }
 
+/*
+ * Test get user device pages through gup_test. Setting PIN_LONGTERM flag.
+ * This should trigger a migration back to system memory for both, private
+ * and coherent type pages.
+ * This test makes use of gup_test module. Make sure GUP_TEST_CONFIG is added
+ * to your configuration before you run it.
+ */
+TEST_F(hmm, hmm_gup_test)
+{
+   struct hmm_buffer *buffer;
+   struct gup_test gup;
+   int gup_fd;
+   unsigned long npages;
+   unsigned long size;
+   unsigned long i;
+   int *ptr;
+   int ret;
+   unsigned char *m;
+
+   gup_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
+   if (gup_fd == -1)
+   SKIP(return, "Skipping test, could not find gup_test driver");
+
+   npages = 4;
+   ASSERT_NE(npages, 0);
+   size = npages << self->page_shift;
+
+   buffer = malloc(sizeof(*buffer));
+   ASSERT_NE(buffer, NULL);
+
+   buffer->fd = -1;
+   buffer->size = size;
+   buffer->mirror = malloc(size);
+   ASSERT_NE(buffer->mirror, NULL);
+
+   buffer->ptr = mmap(NULL, size,
+  PROT_READ | PROT_WRITE,
+  MAP_PRIVATE | MAP_ANONYMOUS,
+  buffer->fd, 0);
+   ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+   /* Initialize buffer in system memory. */
+   for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+   ptr[i] = i;
+
+   /* Migrate memory to device. */
+   ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+   ASSERT_EQ(ret, 0);
+   ASSERT_EQ(buffer->cpages, npages);
+   /* Check what the device read. */
+   for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+   ASSERT_EQ(ptr[i], i);
+
+   gup.nr_pages_per_call = npages;
+   gup.addr = (unsigned long)buffer->ptr;
+   gup.gup_flags = FOLL_WRITE;
+   gup.size = size;
+   /*
+* Calling gup_test ioctl. It will try to PIN_LONGTERM these device 
pages
+* causing a migration back to system memory for both, private and 
coherent
+* type pages.
+*/
+   if (ioctl(gup_fd, PIN_LONGTERM_BENCHMARK, &gup)) {
+   perror("ioctl on PIN_LONGTERM_BENCHMARK\n");
+   goto out_test;
+   }
+
+   /* Take snapshot to make sure pages have been migrated to sys memory */
+   ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_SNAPSHOT, buffer, npages);
+   ASSERT_EQ(ret, 0);
+   ASSERT_EQ(buffer->cpages, npages);
+   m = buffer->mirror;
+   for (i = 0; i < npages; i++)
+   ASSERT_EQ(m[i], HMM_DMIRROR_PROT_WRITE);
+out_test:
+   close(gup_fd);
+   hmm_buffer_free(buffer);
+}
 TEST_HARNESS_MAIN
-- 
2.30.2



[Nouveau] [PATCH 26/27] mm/gup: migrate device coherent pages when pinning instead of failing

2022-02-09 Thread Christoph Hellwig
From: Alistair Popple 

Currently any attempts to pin a device coherent page will fail. This is
because device coherent pages need to be managed by a device driver, and
pinning them would prevent a driver from migrating them off the device.

However this is no reason to fail pinning of these pages. These are
coherent and accessible from the CPU so can be migrated just like
pinning ZONE_MOVABLE pages. So instead of failing all attempts to pin
them first try migrating them out of ZONE_DEVICE.

Signed-off-by: Alistair Popple 
Acked-by: Felix Kuehling 
[hch: rebased to the split device memory checks,
  moved migrate_device_page to migrate_device.c]
Signed-off-by: Christoph Hellwig 
---
 mm/gup.c| 37 ++-
 mm/internal.h   |  1 +
 mm/migrate_device.c | 53 +
 3 files changed, 85 insertions(+), 6 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 39b23ad39a7bde..41349b685eafb4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1889,9 +1889,31 @@ static long check_and_migrate_movable_pages(unsigned 
long nr_pages,
ret = -EFAULT;
goto unpin_pages;
}
+
+   /*
+* Device coherent pages are managed by a driver and should not
+* be pinned indefinitely as it prevents the driver moving the
+* page. So when trying to pin with FOLL_LONGTERM instead try
+* to migrate the page out of device memory.
+*/
if (is_device_coherent_page(head)) {
-   ret = -EFAULT;
-   goto unpin_pages;
+   WARN_ON_ONCE(PageCompound(head));
+
+   /*
+* Migration will fail if the page is pinned, so convert
+* the pin on the source page to a normal reference.
+*/
+   if (gup_flags & FOLL_PIN) {
+   get_page(head);
+   unpin_user_page(head);
+   }
+
+   pages[i] = migrate_device_page(head, gup_flags);
+   if (!pages[i]) {
+   ret = -EBUSY;
+   goto unpin_pages;
+   }
+   continue;
}
 
if (is_pinnable_page(head))
@@ -1931,10 +1953,13 @@ static long check_and_migrate_movable_pages(unsigned 
long nr_pages,
return nr_pages;
 
 unpin_pages:
-   if (gup_flags & FOLL_PIN) {
-   unpin_user_pages(pages, nr_pages);
-   } else {
-   for (i = 0; i < nr_pages; i++)
+   for (i = 0; i < nr_pages; i++) {
+   if (!pages[i])
+   continue;
+
+   if (gup_flags & FOLL_PIN)
+   unpin_user_page(pages[i]);
+   else
put_page(pages[i]);
}
 
diff --git a/mm/internal.h b/mm/internal.h
index a67222d17e5987..1bded5d7f41a9d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -719,5 +719,6 @@ int numa_migrate_prep(struct page *page, struct 
vm_area_struct *vma,
  unsigned long addr, int page_nid, int *flags);
 
 void free_zone_device_page(struct page *page);
+struct page *migrate_device_page(struct page *page, unsigned int gup_flags);
 
 #endif /* __MM_INTERNAL_H */
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 03e182f9fc7865..3373b535d5c9d9 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -767,3 +767,56 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
}
 }
 EXPORT_SYMBOL(migrate_vma_finalize);
+
+/*
+ * Migrate a device coherent page back to normal memory.  The caller should 
have
+ * a reference on page which will be copied to the new page if migration is
+ * successful or dropped on failure.
+ */
+struct page *migrate_device_page(struct page *page, unsigned int gup_flags)
+{
+   unsigned long src_pfn, dst_pfn = 0;
+   struct migrate_vma args;
+   struct page *dpage;
+
+   lock_page(page);
+   src_pfn = migrate_pfn(page_to_pfn(page)) | MIGRATE_PFN_MIGRATE;
+   args.src = &src_pfn;
+   args.dst = &dst_pfn;
+   args.cpages = 1;
+   args.npages = 1;
+   args.vma = NULL;
+   migrate_vma_setup(&args);
+   if (!(src_pfn & MIGRATE_PFN_MIGRATE))
+   return NULL;
+
+   dpage = alloc_pages(GFP_USER | __GFP_NOWARN, 0);
+
+   /*
+* get/pin the new page now so we don't have to retry gup after
+* migrating. We already have a reference so this should never fail.
+*/
+   if (dpage && WARN_ON_ONCE(!try_grab_page(dpage, gup_flags))) {
+   __free_pages(dpage, 0);
+   dpage = NULL;
+   }
+
+   if (dpage) {
+   lock_page(dpage);
+   dst_pfn =

[Nouveau] [PATCH 23/27] tools: update hmm-test to support device coherent type

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

Test cases such as migrate_fault and migrate_multiple were modified to
explicitly migrate from device to sys memory without the need of page
faults, when using device coherent type.

The snapshot test case was updated to read the memory device type first
and, based on that, get the proper returned results.  A migrate_ping_pong
test case was added to test explicit migration from device to sys memory
for both private and coherent zone types.

Helpers to migrate from device to sys memory and vice versa
were also added.

Signed-off-by: Alex Sierra 
Acked-by: Felix Kuehling 
Reviewed-by: Alistair Popple 
Signed-off-by: Christoph Hellwig 
---
 tools/testing/selftests/vm/hmm-tests.c | 123 -
 1 file changed, 102 insertions(+), 21 deletions(-)

diff --git a/tools/testing/selftests/vm/hmm-tests.c 
b/tools/testing/selftests/vm/hmm-tests.c
index 203323967b507a..84ec8c4a1dc7b6 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -44,6 +44,14 @@ struct hmm_buffer {
int fd;
uint64_tcpages;
uint64_tfaults;
+   int zone_device_type;
+};
+
+enum {
+   HMM_PRIVATE_DEVICE_ONE,
+   HMM_PRIVATE_DEVICE_TWO,
+   HMM_COHERENCE_DEVICE_ONE,
+   HMM_COHERENCE_DEVICE_TWO,
 };
 
 #define TWOMEG (1 << 21)
@@ -60,6 +68,21 @@ FIXTURE(hmm)
unsigned intpage_shift;
 };
 
+FIXTURE_VARIANT(hmm)
+{
+   int device_number;
+};
+
+FIXTURE_VARIANT_ADD(hmm, hmm_device_private)
+{
+   .device_number = HMM_PRIVATE_DEVICE_ONE,
+};
+
+FIXTURE_VARIANT_ADD(hmm, hmm_device_coherent)
+{
+   .device_number = HMM_COHERENCE_DEVICE_ONE,
+};
+
 FIXTURE(hmm2)
 {
int fd0;
@@ -68,6 +91,24 @@ FIXTURE(hmm2)
unsigned intpage_shift;
 };
 
+FIXTURE_VARIANT(hmm2)
+{
+   int device_number0;
+   int device_number1;
+};
+
+FIXTURE_VARIANT_ADD(hmm2, hmm2_device_private)
+{
+   .device_number0 = HMM_PRIVATE_DEVICE_ONE,
+   .device_number1 = HMM_PRIVATE_DEVICE_TWO,
+};
+
+FIXTURE_VARIANT_ADD(hmm2, hmm2_device_coherent)
+{
+   .device_number0 = HMM_COHERENCE_DEVICE_ONE,
+   .device_number1 = HMM_COHERENCE_DEVICE_TWO,
+};
+
 static int hmm_open(int unit)
 {
char pathname[HMM_PATH_MAX];
@@ -81,12 +122,19 @@ static int hmm_open(int unit)
return fd;
 }
 
+static bool hmm_is_coherent_type(int dev_num)
+{
+   return (dev_num >= HMM_COHERENCE_DEVICE_ONE);
+}
+
 FIXTURE_SETUP(hmm)
 {
self->page_size = sysconf(_SC_PAGE_SIZE);
self->page_shift = ffs(self->page_size) - 1;
 
-   self->fd = hmm_open(0);
+   self->fd = hmm_open(variant->device_number);
+   if (self->fd < 0 && hmm_is_coherent_type(variant->device_number))
+   SKIP(exit(0), "DEVICE_COHERENT not available");
ASSERT_GE(self->fd, 0);
 }
 
@@ -95,9 +143,11 @@ FIXTURE_SETUP(hmm2)
self->page_size = sysconf(_SC_PAGE_SIZE);
self->page_shift = ffs(self->page_size) - 1;
 
-   self->fd0 = hmm_open(0);
+   self->fd0 = hmm_open(variant->device_number0);
+   if (self->fd0 < 0 && hmm_is_coherent_type(variant->device_number0))
+   SKIP(exit(0), "DEVICE_COHERENT not available");
ASSERT_GE(self->fd0, 0);
-   self->fd1 = hmm_open(1);
+   self->fd1 = hmm_open(variant->device_number1);
ASSERT_GE(self->fd1, 0);
 }
 
@@ -144,6 +194,7 @@ static int hmm_dmirror_cmd(int fd,
}
buffer->cpages = cmd.cpages;
buffer->faults = cmd.faults;
+   buffer->zone_device_type = cmd.zone_device_type;
 
return 0;
 }
@@ -211,6 +262,20 @@ static void hmm_nanosleep(unsigned int n)
nanosleep(&t, NULL);
 }
 
+static int hmm_migrate_sys_to_dev(int fd,
+  struct hmm_buffer *buffer,
+  unsigned long npages)
+{
+   return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_DEV, buffer, npages);
+}
+
+static int hmm_migrate_dev_to_sys(int fd,
+  struct hmm_buffer *buffer,
+  unsigned long npages)
+{
+   return hmm_dmirror_cmd(fd, HMM_DMIRROR_MIGRATE_TO_SYS, buffer, npages);
+}
+
 /*
  * Simple NULL test of device open/close.
  */
@@ -875,7 +940,7 @@ TEST_F(hmm, migrate)
ptr[i] = i;
 
/* Migrate memory to device. */
-   ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+   ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
ASSERT_EQ(ret, 0);
ASSERT_EQ(buffer->cpages, npages);
 
@@ -923,7 +988,7 @@ TEST_F(hmm, migrate_fault)
ptr[i] = i;
 
/* Migrate memory to device. */
-   ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+   ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);

[Nouveau] [PATCH 25/27] mm: remove the vma check in migrate_vma_setup()

2022-02-09 Thread Christoph Hellwig
From: Alistair Popple 

migrate_vma_setup() checks that a valid vma is passed so that the page
tables can be walked to find the pfns associated with a given address
range. However in some cases the pfns are already known, such as when
migrating device coherent pages during pin_user_pages() meaning a valid
vma isn't required.

Signed-off-by: Alistair Popple 
Acked-by: Felix Kuehling 
Signed-off-by: Christoph Hellwig 
---
 mm/migrate_device.c | 34 +-
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 0b295594e7626d..03e182f9fc7865 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -462,24 +462,24 @@ int migrate_vma_setup(struct migrate_vma *args)
 
args->start &= PAGE_MASK;
args->end &= PAGE_MASK;
-   if (!args->vma || is_vm_hugetlb_page(args->vma) ||
-   (args->vma->vm_flags & VM_SPECIAL) || vma_is_dax(args->vma))
-   return -EINVAL;
-   if (nr_pages <= 0)
-   return -EINVAL;
-   if (args->start < args->vma->vm_start ||
-   args->start >= args->vma->vm_end)
-   return -EINVAL;
-   if (args->end <= args->vma->vm_start || args->end > args->vma->vm_end)
-   return -EINVAL;
if (!args->src || !args->dst)
return -EINVAL;
-
-   memset(args->src, 0, sizeof(*args->src) * nr_pages);
-   args->cpages = 0;
-   args->npages = 0;
-
-   migrate_vma_collect(args);
+   if (args->vma) {
+   if (is_vm_hugetlb_page(args->vma) ||
+   (args->vma->vm_flags & VM_SPECIAL) || vma_is_dax(args->vma))
+   return -EINVAL;
+   if (args->start < args->vma->vm_start ||
+   args->start >= args->vma->vm_end)
+   return -EINVAL;
+   if (args->end <= args->vma->vm_start ||
+   args->end > args->vma->vm_end)
+   return -EINVAL;
+   memset(args->src, 0, sizeof(*args->src) * nr_pages);
+   args->cpages = 0;
+   args->npages = 0;
+
+   migrate_vma_collect(args);
+   }
 
if (args->cpages)
migrate_vma_unmap(args);
@@ -661,7 +661,7 @@ void migrate_vma_pages(struct migrate_vma *migrate)
continue;
}
 
-   if (!page) {
+   if (!page && migrate->vma) {
if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
continue;
if (!notified) {
-- 
2.30.2



[Nouveau] [PATCH 24/27] tools: update test_hmm script to support SP config

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

Add two more parameters to set spm_addr_dev0 & spm_addr_dev1
addresses. These two parameters configure the start SP
addresses for each device in the test_hmm driver.
Consequently, this configures the zone device type as coherent.

Signed-off-by: Alex Sierra 
Acked-by: Felix Kuehling 
Reviewed-by: Alistair Popple 
Signed-off-by: Christoph Hellwig 
---
 tools/testing/selftests/vm/test_hmm.sh | 24 +---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/vm/test_hmm.sh 
b/tools/testing/selftests/vm/test_hmm.sh
index 0647b525a62564..539c9371e592a1 100755
--- a/tools/testing/selftests/vm/test_hmm.sh
+++ b/tools/testing/selftests/vm/test_hmm.sh
@@ -40,11 +40,26 @@ check_test_requirements()
 
 load_driver()
 {
-   modprobe $DRIVER > /dev/null 2>&1
+   if [ $# -eq 0 ]; then
+   modprobe $DRIVER > /dev/null 2>&1
+   else
+   if [ $# -eq 2 ]; then
+   modprobe $DRIVER spm_addr_dev0=$1 spm_addr_dev1=$2
+   > /dev/null 2>&1
+   else
+   echo "Missing module parameters. Make sure pass"\
+   "spm_addr_dev0 and spm_addr_dev1"
+   usage
+   fi
+   fi
if [ $? == 0 ]; then
major=$(awk "\$2==\"HMM_DMIRROR\" {print \$1}" /proc/devices)
mknod /dev/hmm_dmirror0 c $major 0
mknod /dev/hmm_dmirror1 c $major 1
+   if [ $# -eq 2 ]; then
+   mknod /dev/hmm_dmirror2 c $major 2
+   mknod /dev/hmm_dmirror3 c $major 3
+   fi
fi
 }
 
@@ -58,7 +73,7 @@ run_smoke()
 {
echo "Running smoke test. Note, this test provides basic coverage."
 
-   load_driver
+   load_driver $1 $2
$(dirname "${BASH_SOURCE[0]}")/hmm-tests
unload_driver
 }
@@ -75,6 +90,9 @@ usage()
echo "# Smoke testing"
echo "./${TEST_NAME}.sh smoke"
echo
+   echo "# Smoke testing with SPM enabled"
+   echo "./${TEST_NAME}.sh smoke  "
+   echo
exit 0
 }
 
@@ -84,7 +102,7 @@ function run_test()
usage
else
if [ "$1" = "smoke" ]; then
-   run_smoke
+   run_smoke $2 $3
else
usage
fi
-- 
2.30.2



[Nouveau] [PATCH 22/27] lib: add support for device coherent type in test_hmm

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

Device Coherent type uses device memory that is coherently accessible by
the CPU. This could be shown as an SP (special purpose) memory range
in the BIOS-e820 memory enumeration. If no SP memory is supported in
the system, this could be faked by setting CONFIG_EFI_FAKE_MEMMAP.

Currently, test_hmm only supports two different SP ranges of at least
256MB size. This could be specified in the kernel parameter variable
efi_fake_mem. Ex. Two SP ranges of 1GB starting at 0x100000000 &
0x140000000 physical address. Ex.
efi_fake_mem=1G@0x100000000:0x40000,1G@0x140000000:0x40000

Private and coherent device mirror instances can be created in the same
probe. This is done by passing the module parameters spm_addr_dev0 &
spm_addr_dev1. In this case, it will create four instances of
device_mirror. The first two correspond to private device type, the
last two to coherent type. Then, they can be easily accessed from user
space through /dev/hmm_mirror<num_device>. Usually num_device 0 and 1
are for private, and 2 and 3 for coherent types. If no module
parameters are passed, only two instances of private type device_mirror
will be created.

Signed-off-by: Alex Sierra 
Acked-by: Felix Kuehling 
Reviewed-by: Alistair Popple 
---
 lib/test_hmm.c  | 253 +---
 lib/test_hmm_uapi.h |  15 ++-
 2 files changed, 202 insertions(+), 66 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 15747f70c5bc9a..361a026c5d2126 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -32,11 +32,22 @@
 
 #include "test_hmm_uapi.h"
 
-#define DMIRROR_NDEVICES   2
+#define DMIRROR_NDEVICES   4
 #define DMIRROR_RANGE_FAULT_TIMEOUT1000
 #define DEVMEM_CHUNK_SIZE  (256 * 1024 * 1024U)
 #define DEVMEM_CHUNKS_RESERVE  16
 
+/*
+ * For device_private pages, dpage is just a dummy struct page
+ * representing a piece of device memory. dmirror_devmem_alloc_page
+ * allocates a real system memory page as backing storage to fake a
+ * real device. zone_device_data points to that backing page. But
+ * for device_coherent memory, the struct page represents real
+ * physical CPU-accessible memory that we can use directly.
+ */
+#define BACKING_PAGE(page) (is_device_private_page((page)) ? \
+  (page)->zone_device_data : (page))
+
 static unsigned long spm_addr_dev0;
 module_param(spm_addr_dev0, long, 0644);
 MODULE_PARM_DESC(spm_addr_dev0,
@@ -125,6 +136,21 @@ static int dmirror_bounce_init(struct dmirror_bounce 
*bounce,
return 0;
 }
 
+static bool dmirror_is_private_zone(struct dmirror_device *mdevice)
+{
+   return (mdevice->zone_device_type ==
+   HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ? true : false;
+}
+
+static enum migrate_vma_direction
+dmirror_select_device(struct dmirror *dmirror)
+{
+   return (dmirror->mdevice->zone_device_type ==
+   HMM_DMIRROR_MEMORY_DEVICE_PRIVATE) ?
+   MIGRATE_VMA_SELECT_DEVICE_PRIVATE :
+   MIGRATE_VMA_SELECT_DEVICE_COHERENT;
+}
+
 static void dmirror_bounce_fini(struct dmirror_bounce *bounce)
 {
vfree(bounce->ptr);
@@ -575,16 +601,19 @@ static int dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
 static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
 {
struct page *dpage = NULL;
-   struct page *rpage;
+   struct page *rpage = NULL;
 
/*
-* This is a fake device so we alloc real system memory to store
-* our device memory.
+* For ZONE_DEVICE private type, this is a fake device so we allocate
+* real system memory to store our device memory.
+* For ZONE_DEVICE coherent type we use the actual dpage to store the
+* data and ignore rpage.
 */
-   rpage = alloc_page(GFP_HIGHUSER);
-   if (!rpage)
-   return NULL;
-
+   if (dmirror_is_private_zone(mdevice)) {
+   rpage = alloc_page(GFP_HIGHUSER);
+   if (!rpage)
+   return NULL;
+   }
spin_lock(&mdevice->lock);
 
if (mdevice->free_pages) {
@@ -603,7 +632,8 @@ static struct page *dmirror_devmem_alloc_page(struct 
dmirror_device *mdevice)
return dpage;
 
 error:
-   __free_page(rpage);
+   if (rpage)
+   __free_page(rpage);
return NULL;
 }
 
@@ -629,12 +659,16 @@ static void dmirror_migrate_alloc_and_copy(struct 
migrate_vma *args,
 * unallocated pte_none() or read-only zero page.
 */
spage = migrate_pfn_to_page(*src);
+   if (WARN(spage && is_zone_device_page(spage),
+"page already in device spage pfn: 0x%lx\n",
+page_to_pfn(spage)))
+   continue;
 
dpage = dmirror_devmem_alloc_page(mdevice);
if (!dpage)
continue;
 
-   rpage = dpage->zone_device_data;
+   rpage = 

[Nouveau] [PATCH 21/27] lib: test_hmm add module param for zone device type

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

In order to configure device coherent in test_hmm, two module parameters
should be passed, which correspond to the SP start addresses of the two
devices: spm_addr_dev0 & spm_addr_dev1. If no parameters are passed,
the private device type is configured.

Signed-off-by: Alex Sierra 
Acked-by: Felix Kuehling 
Reviewed-by: Alistair Popple 
Signed-off-by: Christoph Hellwig 
---
 lib/test_hmm.c  | 73 -
 lib/test_hmm_uapi.h |  1 +
 2 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 7a27584484ce0f..15747f70c5bc9a 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -37,6 +37,16 @@
 #define DEVMEM_CHUNK_SIZE  (256 * 1024 * 1024U)
 #define DEVMEM_CHUNKS_RESERVE  16
 
+static unsigned long spm_addr_dev0;
+module_param(spm_addr_dev0, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev0,
+   "Specify start address for SPM (special purpose memory) used 
for device 0. By setting this Coherent device type will be used. Make sure 
spm_addr_dev1 is set too. Minimum SPM size should be DEVMEM_CHUNK_SIZE.");
+
+static unsigned long spm_addr_dev1;
+module_param(spm_addr_dev1, long, 0644);
+MODULE_PARM_DESC(spm_addr_dev1,
+   "Specify start address for SPM (special purpose memory) used 
for device 1. By setting this Coherent device type will be used. Make sure 
spm_addr_dev0 is set too. Minimum SPM size should be DEVMEM_CHUNK_SIZE.");
+
 static const struct dev_pagemap_ops dmirror_devmem_ops;
 static const struct mmu_interval_notifier_ops dmirror_min_ops;
 static dev_t dmirror_dev;
@@ -455,28 +465,44 @@ static int dmirror_write(struct dmirror *dmirror, struct 
hmm_dmirror_cmd *cmd)
return ret;
 }
 
-static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
+static int dmirror_allocate_chunk(struct dmirror_device *mdevice,
   struct page **ppage)
 {
struct dmirror_chunk *devmem;
-   struct resource *res;
+   struct resource *res = NULL;
unsigned long pfn;
unsigned long pfn_first;
unsigned long pfn_last;
void *ptr;
+   int ret = -ENOMEM;
 
devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
if (!devmem)
-   return false;
+   return ret;
 
-   res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
- "hmm_dmirror");
-   if (IS_ERR(res))
+   switch (mdevice->zone_device_type) {
+   case HMM_DMIRROR_MEMORY_DEVICE_PRIVATE:
+   res = request_free_mem_region(&iomem_resource,
DEVMEM_CHUNK_SIZE,
+ "hmm_dmirror");
+   if (IS_ERR_OR_NULL(res))
+   goto err_devmem;
+   devmem->pagemap.range.start = res->start;
+   devmem->pagemap.range.end = res->end;
+   devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+   break;
+   case HMM_DMIRROR_MEMORY_DEVICE_COHERENT:
+   devmem->pagemap.range.start = (MINOR(mdevice->cdevice.dev) - 2) 
?
+   spm_addr_dev0 :
+   spm_addr_dev1;
+   devmem->pagemap.range.end = devmem->pagemap.range.start +
+   DEVMEM_CHUNK_SIZE - 1;
+   devmem->pagemap.type = MEMORY_DEVICE_COHERENT;
+   break;
+   default:
+   ret = -EINVAL;
goto err_devmem;
+   }
 
-   devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
-   devmem->pagemap.range.start = res->start;
-   devmem->pagemap.range.end = res->end;
devmem->pagemap.nr_range = 1;
devmem->pagemap.ops = _devmem_ops;
devmem->pagemap.owner = mdevice;
@@ -497,10 +523,14 @@ static bool dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
mdevice->devmem_capacity = new_capacity;
mdevice->devmem_chunks = new_chunks;
}
-
ptr = memremap_pages(&devmem->pagemap, numa_node_id());
-   if (IS_ERR(ptr))
+   if (IS_ERR_OR_NULL(ptr)) {
+   if (ptr)
+   ret = PTR_ERR(ptr);
+   else
+   ret = -EFAULT;
goto err_release;
+   }
 
devmem->mdevice = mdevice;
pfn_first = devmem->pagemap.range.start >> PAGE_SHIFT;
@@ -529,15 +559,17 @@ static bool dmirror_allocate_chunk(struct dmirror_device 
*mdevice,
}
spin_unlock(&mdevice->lock);
 
-   return true;
+   return 0;
 
 err_release:
mutex_unlock(&mdevice->devmem_lock);
-   release_mem_region(devmem->pagemap.range.start, 
range_len(&devmem->pagemap.range));
+   if (res && devmem->pagemap.type == MEMORY_DEVICE_PRIVATE)
+  

[Nouveau] [PATCH 20/27] lib: test_hmm add ioctl to get zone device type

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

A new ioctl command is added to query the zone device type. This will be
used once test_hmm adds the zone device coherent type.

Signed-off-by: Alex Sierra 
Acked-by: Felix Kuehling 
Reviewed-by: Alistair Popple 
Signed-off-by: Christoph Hellwig 
---
 lib/test_hmm.c  | 23 +--
 lib/test_hmm_uapi.h |  8 
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index cfe63204783918..7a27584484ce0f 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -87,6 +87,7 @@ struct dmirror_chunk {
 struct dmirror_device {
struct cdev cdevice;
struct hmm_devmem   *devmem;
+   unsigned intzone_device_type;
 
unsigned intdevmem_capacity;
unsigned intdevmem_count;
@@ -1026,6 +1027,15 @@ static int dmirror_snapshot(struct dmirror *dmirror,
return ret;
 }
 
+static int dmirror_get_device_type(struct dmirror *dmirror,
+   struct hmm_dmirror_cmd *cmd)
+{
+   mutex_lock(&dmirror->mutex);
+   cmd->zone_device_type = dmirror->mdevice->zone_device_type;
+   mutex_unlock(&dmirror->mutex);
+
+   return 0;
+}
 static long dmirror_fops_unlocked_ioctl(struct file *filp,
unsigned int command,
unsigned long arg)
@@ -1076,6 +1086,9 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
ret = dmirror_snapshot(dmirror, &cmd);
break;
 
+   case HMM_DMIRROR_GET_MEM_DEV_TYPE:
+   ret = dmirror_get_device_type(dmirror, &cmd);
+   break;
default:
return -EINVAL;
}
@@ -1260,14 +1273,20 @@ static void dmirror_device_remove(struct dmirror_device 
*mdevice)
 static int __init hmm_dmirror_init(void)
 {
int ret;
-   int id;
+   int id = 0;
+   int ndevices = 0;
 
ret = alloc_chrdev_region(&dmirror_dev, 0, DMIRROR_NDEVICES,
  "HMM_DMIRROR");
if (ret)
goto err_unreg;
 
-   for (id = 0; id < DMIRROR_NDEVICES; id++) {
+   memset(dmirror_devices, 0, DMIRROR_NDEVICES * 
sizeof(dmirror_devices[0]));
+   dmirror_devices[ndevices++].zone_device_type =
+   HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
+   dmirror_devices[ndevices++].zone_device_type =
+   HMM_DMIRROR_MEMORY_DEVICE_PRIVATE;
+   for (id = 0; id < ndevices; id++) {
ret = dmirror_device_init(dmirror_devices + id, id);
if (ret)
goto err_chrdev;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index f14dea5dcd062b..17f842f1aa02c7 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -19,6 +19,7 @@
  * @npages: (in) number of pages to read/write
  * @cpages: (out) number of pages copied
  * @faults: (out) number of device page faults seen
+ * @zone_device_type: (out) zone device memory type
  */
 struct hmm_dmirror_cmd {
__u64   addr;
@@ -26,6 +27,7 @@ struct hmm_dmirror_cmd {
__u64   npages;
__u64   cpages;
__u64   faults;
+   __u64   zone_device_type;
 };
 
 /* Expose the address space of the calling process through hmm device file */
@@ -35,6 +37,7 @@ struct hmm_dmirror_cmd {
 #define HMM_DMIRROR_SNAPSHOT   _IOWR('H', 0x03, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_EXCLUSIVE  _IOWR('H', 0x04, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_CHECK_EXCLUSIVE_IOWR('H', 0x05, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_GET_MEM_DEV_TYPE   _IOWR('H', 0x06, struct hmm_dmirror_cmd)
 
 /*
  * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
@@ -62,4 +65,9 @@ enum {
HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE = 0x30,
 };
 
+enum {
+   /* 0 is reserved to catch uninitialized type fields */
+   HMM_DMIRROR_MEMORY_DEVICE_PRIVATE = 1,
+};
+
 #endif /* _LIB_TEST_HMM_UAPI_H */
-- 
2.30.2
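For reference, a minimal user-space sketch (in the style of hmm-tests.c; the
device path, error handling and helper name are illustrative assumptions) of
how the new query could be issued:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include "test_hmm_uapi.h"

/* Illustrative sketch: query the zone device type of a dmirror device. */
static int query_zone_device_type(const char *path)
{
	struct hmm_dmirror_cmd cmd = { 0 };
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return -1;
	/* the common ioctl entry point sanity-checks addr/npages */
	cmd.npages = 1;
	if (ioctl(fd, HMM_DMIRROR_GET_MEM_DEV_TYPE, &cmd) == 0)
		printf("%s: zone device type %llu\n", path,
		       (unsigned long long)cmd.zone_device_type);
	close(fd);
	return 0;
}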



[Nouveau] [PATCH 19/27] drm/amdkfd: coherent type as sys mem on migration to ram

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

During VRAM to RAM migration, coherent device type memory has similar
access as system RAM from the CPU's point of view.  The migration source
flag therefore has to be set by the sender, which in the coherent type
case should be MIGRATE_VMA_SELECT_DEVICE_COHERENT.

Signed-off-by: Alex Sierra 
Reviewed-by: Felix Kuehling 
Signed-off-by: Christoph Hellwig 
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index 2c51f2ac3b46ac..6646291d75d574 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -659,9 +659,12 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct 
svm_range *prange,
migrate.vma = vma;
migrate.start = start;
migrate.end = end;
-   migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev);
 
+   if (adev->gmc.xgmi.connected_to_cpu)
+   migrate.flags = MIGRATE_VMA_SELECT_DEVICE_COHERENT;
+   else
+   migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
size = 2 * sizeof(*migrate.src) + sizeof(uint64_t) + sizeof(dma_addr_t);
size *= npages;
buf = kvmalloc(size, GFP_KERNEL | __GFP_ZERO);
-- 
2.30.2



[Nouveau] [PATCH 18/27] drm/amdkfd: add SPM support for SVM

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

When the CPU is connected through XGMI, it has coherent
access to the VRAM resource. In this case that resource
is taken from a table in the device gmc aperture base.
This resource is used along with the device type, which could
be DEVICE_PRIVATE or DEVICE_COHERENT, to create the device
page map region.

Signed-off-by: Alex Sierra 
Reviewed-by: Felix Kuehling 
Signed-off-by: Christoph Hellwig 
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 28 ++--
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index e27ca375876230..2c51f2ac3b46ac 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -933,7 +933,7 @@ int svm_migrate_init(struct amdgpu_device *adev)
 {
struct kfd_dev *kfddev = adev->kfd.dev;
struct dev_pagemap *pgmap;
-   struct resource *res;
+   struct resource *res = NULL;
unsigned long size;
void *r;
 
@@ -948,28 +948,34 @@ int svm_migrate_init(struct amdgpu_device *adev)
 * should remove reserved size
 */
size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20);
-   res = devm_request_free_mem_region(adev->dev, &iomem_resource, size);
-   if (IS_ERR(res))
-   return -ENOMEM;
+   if (adev->gmc.xgmi.connected_to_cpu) {
+   pgmap->range.start = adev->gmc.aper_base;
+   pgmap->range.end = adev->gmc.aper_base + adev->gmc.aper_size - 
1;
+   pgmap->type = MEMORY_DEVICE_COHERENT;
+   } else {
+   res = devm_request_free_mem_region(adev->dev, &iomem_resource,
size);
+   if (IS_ERR(res))
+   return -ENOMEM;
+   pgmap->range.start = res->start;
+   pgmap->range.end = res->end;
+   pgmap->type = MEMORY_DEVICE_PRIVATE;
+   }
 
-   pgmap->type = MEMORY_DEVICE_PRIVATE;
pgmap->nr_range = 1;
-   pgmap->range.start = res->start;
-   pgmap->range.end = res->end;
pgmap->ops = &svm_migrate_pgmap_ops;
pgmap->owner = SVM_ADEV_PGMAP_OWNER(adev);
-   pgmap->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
-
+   pgmap->flags = 0;
/* Device manager releases device-specific resources, memory region and
 * pgmap when driver disconnects from device.
 */
r = devm_memremap_pages(adev->dev, pgmap);
if (IS_ERR(r)) {
pr_err("failed to register HMM device memory\n");
-
/* Disable SVM support capability */
pgmap->type = 0;
-   devm_release_mem_region(adev->dev, res->start, 
resource_size(res));
+   if (pgmap->type == MEMORY_DEVICE_PRIVATE)
+   devm_release_mem_region(adev->dev, res->start,
+   res->end - res->start + 1);
return PTR_ERR(r);
}
 
-- 
2.30.2



[Nouveau] [PATCH 16/27] mm: add device coherent vma selection for memory migration

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

This case is used to migrate pages from device memory back to system
memory. Device coherent type memory is cache coherent from both the device
and CPU point of view.

Signed-off-by: Alex Sierra 
Acked-by: Felix Kuehling 
Reviewed-by: Alistair Popple 
Signed-off-by: Christoph Hellwig 
---
 include/linux/migrate.h |  1 +
 mm/migrate_device.c | 12 +---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index db96e10eb8da22..66a34eae8cb635 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -130,6 +130,7 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
 enum migrate_vma_direction {
MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
+   MIGRATE_VMA_SELECT_DEVICE_COHERENT = 1 << 2,
 };
 
 struct migrate_vma {
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index bfd66e7d830b02..0b295594e7626d 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -147,15 +147,21 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (is_writable_device_private_entry(entry))
mpfn |= MIGRATE_PFN_WRITE;
} else {
-   if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
-   goto next;
pfn = pte_pfn(pte);
-   if (is_zero_pfn(pfn)) {
+   if (is_zero_pfn(pfn) &&
+   (migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
mpfn = MIGRATE_PFN_MIGRATE;
migrate->cpages++;
goto next;
}
page = vm_normal_page(migrate->vma, addr, pte);
+   if (page && !is_zone_device_page(page) &&
+   !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
+   goto next;
+   else if (page && is_device_coherent_page(page) &&
+   (!(migrate->flags & 
MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
+page->pgmap->owner != migrate->pgmap_owner))
+   goto next;
mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
}
-- 
2.30.2



[Nouveau] [PATCH 17/27] mm/gup: fail get_user_pages for LONGTERM dev coherent type

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

Avoid long-term pinning for coherent device type pages, as this could
interfere with their own device memory manager. For now, we are just
returning an error for PIN_LONGTERM coherent device type pages. Eventually,
these types of pages will get migrated to system memory, once device
page migration support is added.

Signed-off-by: Alex Sierra 
Acked-by: Felix Kuehling 
Reviewed-by: Alistair Popple 
[hch: rebased on previous cleanups, split the two checks]
Signed-off-by: Christoph Hellwig 
---
 mm/gup.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index 37d6c24ca71225..39b23ad39a7bde 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1881,6 +1881,19 @@ static long check_and_migrate_movable_pages(unsigned 
long nr_pages,
continue;
prev_head = head;
 
+   /*
+* Device private pages will get faulted in during gup so it
+* shouldn't be possible to see one here.
+*/
+   if (WARN_ON_ONCE(is_device_private_page(head))) {
+   ret = -EFAULT;
+   goto unpin_pages;
+   }
+   if (is_device_coherent_page(head)) {
+   ret = -EFAULT;
+   goto unpin_pages;
+   }
+
if (is_pinnable_page(head))
continue;
 
@@ -1925,7 +1938,7 @@ static long check_and_migrate_movable_pages(unsigned long 
nr_pages,
put_page(pages[i]);
}
 
-   if (!list_empty(&movable_page_list)) {
+   if (!ret && !list_empty(&movable_page_list)) {
struct migration_target_control mtc = {
.nid = NUMA_NO_NODE,
.gfp_mask = GFP_USER | __GFP_NOWARN,
-- 
2.30.2



[Nouveau] [PATCH 13/27] mm: move the migrate_vma_* device migration code into it's own file

2022-02-09 Thread Christoph Hellwig
Split the code used to migrate to and from ZONE_DEVICE memory from
migrate.c into a new file.

Signed-off-by: Christoph Hellwig 
---
 mm/Kconfig  |   3 +
 mm/Makefile |   1 +
 mm/migrate.c| 753 ---
 mm/migrate_device.c | 765 
 4 files changed, 769 insertions(+), 753 deletions(-)
 create mode 100644 mm/migrate_device.c

diff --git a/mm/Kconfig b/mm/Kconfig
index a1901ae6d06293..6391d8d3a616f3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -249,6 +249,9 @@ config MIGRATION
  pages as migration can relocate pages to satisfy a huge page
  allocation instead of reclaiming.
 
+config DEVICE_MIGRATION
+   def_bool MIGRATION && DEVICE_PRIVATE
+
 config ARCH_ENABLE_HUGEPAGE_MIGRATION
bool
 
diff --git a/mm/Makefile b/mm/Makefile
index 70d4309c9ce338..4cc13f3179a518 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST)  += memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
diff --git a/mm/migrate.c b/mm/migrate.c
index 746e1230886ddb..c31d04b46a5e17 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -38,12 +38,10 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -2125,757 +2123,6 @@ int migrate_misplaced_page(struct page *page, struct 
vm_area_struct *vma,
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_NUMA */
 
-#ifdef CONFIG_DEVICE_PRIVATE
-static int migrate_vma_collect_skip(unsigned long start,
-   unsigned long end,
-   struct mm_walk *walk)
-{
-   struct migrate_vma *migrate = walk->private;
-   unsigned long addr;
-
-   for (addr = start; addr < end; addr += PAGE_SIZE) {
-   migrate->dst[migrate->npages] = 0;
-   migrate->src[migrate->npages++] = 0;
-   }
-
-   return 0;
-}
-
-static int migrate_vma_collect_hole(unsigned long start,
-   unsigned long end,
-   __always_unused int depth,
-   struct mm_walk *walk)
-{
-   struct migrate_vma *migrate = walk->private;
-   unsigned long addr;
-
-   /* Only allow populating anonymous memory. */
-   if (!vma_is_anonymous(walk->vma))
-   return migrate_vma_collect_skip(start, end, walk);
-
-   for (addr = start; addr < end; addr += PAGE_SIZE) {
-   migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
-   migrate->dst[migrate->npages] = 0;
-   migrate->npages++;
-   migrate->cpages++;
-   }
-
-   return 0;
-}
-
-static int migrate_vma_collect_pmd(pmd_t *pmdp,
-  unsigned long start,
-  unsigned long end,
-  struct mm_walk *walk)
-{
-   struct migrate_vma *migrate = walk->private;
-   struct vm_area_struct *vma = walk->vma;
-   struct mm_struct *mm = vma->vm_mm;
-   unsigned long addr = start, unmapped = 0;
-   spinlock_t *ptl;
-   pte_t *ptep;
-
-again:
-   if (pmd_none(*pmdp))
-   return migrate_vma_collect_hole(start, end, -1, walk);
-
-   if (pmd_trans_huge(*pmdp)) {
-   struct page *page;
-
-   ptl = pmd_lock(mm, pmdp);
-   if (unlikely(!pmd_trans_huge(*pmdp))) {
-   spin_unlock(ptl);
-   goto again;
-   }
-
-   page = pmd_page(*pmdp);
-   if (is_huge_zero_page(page)) {
-   spin_unlock(ptl);
-   split_huge_pmd(vma, pmdp, addr);
-   if (pmd_trans_unstable(pmdp))
-   return migrate_vma_collect_skip(start, end,
-   walk);
-   } else {
-   int ret;
-
-   get_page(page);
-   spin_unlock(ptl);
-   if (unlikely(!trylock_page(page)))
-   return migrate_vma_collect_skip(start, end,
-   walk);
-   ret = split_huge_page(page);
-   unlock_page(page);
-   put_page(page);
-   if (ret)
-   return migrate_vma_collect_skip(start, end,
-   walk);
-

[Nouveau] [PATCH 14/27] mm: build migrate_vma_* for all configs with ZONE_DEVICE support

2022-02-09 Thread Christoph Hellwig
This code will be used for device coherent memory as well in a bit,
so relax the ifdef a bit.

Signed-off-by: Christoph Hellwig 
---
 mm/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 6391d8d3a616f3..95d4aa3acaefe0 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -250,7 +250,7 @@ config MIGRATION
  allocation instead of reclaiming.
 
 config DEVICE_MIGRATION
-   def_bool MIGRATION && DEVICE_PRIVATE
+   def_bool MIGRATION && ZONE_DEVICE
 
 config ARCH_ENABLE_HUGEPAGE_MIGRATION
bool
-- 
2.30.2



[Nouveau] [PATCH 15/27] mm: add zone device coherent type memory support

2022-02-09 Thread Christoph Hellwig
From: Alex Sierra 

Device memory that is cache coherent from device and CPU point of view.
This is used on platforms that have an advanced system bus (like CAPI
or CXL). Any page of a process can be migrated to such memory. However,
no one should be allowed to pin such memory so that it can always be
evicted.
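
For illustration, a rough sketch of how a driver might hotplug memory of this
type (the function, ops and resource handling are hypothetical; the
dev_pagemap fields and devm_memremap_pages() are the real interface):

static void example_page_free(struct page *page)
{
	/* hand the page back to the driver's own allocator */
}

static const struct dev_pagemap_ops example_pgmap_ops = {
	.page_free = example_page_free,	/* ->page_free is required for this type */
};

static int example_add_coherent_memory(struct device *dev, struct resource *res)
{
	struct dev_pagemap *pgmap;
	void *addr;

	pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL);
	if (!pgmap)
		return -ENOMEM;

	pgmap->type = MEMORY_DEVICE_COHERENT;
	pgmap->range.start = res->start;
	pgmap->range.end = res->end;
	pgmap->nr_range = 1;
	pgmap->ops = &example_pgmap_ops;
	pgmap->owner = dev;		/* matched against migrate->pgmap_owner */

	addr = devm_memremap_pages(dev, pgmap);
	return PTR_ERR_OR_ZERO(addr);
}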

Signed-off-by: Alex Sierra 
Acked-by: Felix Kuehling 
Reviewed-by: Alistair Popple 
[hch: rebased on top of the refcount changes,
  removed is_dev_private_or_coherent_page]
Signed-off-by: Christoph Hellwig 
---
 include/linux/memremap.h | 14 ++
 mm/memcontrol.c  |  7 ---
 mm/memory-failure.c  |  8 ++--
 mm/memremap.c| 10 ++
 mm/migrate_device.c  | 16 +++-
 mm/rmap.c|  5 +++--
 6 files changed, 44 insertions(+), 16 deletions(-)

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index d6a114dd5ea8b7..eb73630a49da39 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -41,6 +41,13 @@ struct vmem_altmap {
  * A more complete discussion of unaddressable memory may be found in
  * include/linux/hmm.h and Documentation/vm/hmm.rst.
  *
+ * MEMORY_DEVICE_COHERENT:
+ * Device memory that is cache coherent from device and CPU point of view. This
+ * is used on platforms that have an advanced system bus (like CAPI or CXL). A
+ * driver can hotplug the device memory using ZONE_DEVICE and with that memory
+ * type. Any page of a process can be migrated to such memory. However no one
+ * should be allowed to pin such memory so that it can always be evicted.
+ *
  * MEMORY_DEVICE_FS_DAX:
  * Host memory that has similar access semantics as System RAM i.e. DMA
  * coherent and supports page pinning. In support of coordinating page
@@ -61,6 +68,7 @@ struct vmem_altmap {
 enum memory_type {
/* 0 is reserved to catch uninitialized type fields */
MEMORY_DEVICE_PRIVATE = 1,
+   MEMORY_DEVICE_COHERENT,
MEMORY_DEVICE_FS_DAX,
MEMORY_DEVICE_GENERIC,
MEMORY_DEVICE_PCI_P2PDMA,
@@ -138,6 +146,12 @@ static inline bool is_device_private_page(const struct 
page *page)
page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
 
+static inline bool is_device_coherent_page(const struct page *page)
+{
+   return is_zone_device_page(page) &&
+   page->pgmap->type == MEMORY_DEVICE_COHERENT;
+}
+
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 510cbfb82bb62a..10259c35fde20d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5687,8 +5687,8 @@ static int mem_cgroup_move_account(struct page *page,
  *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
  * target for charge migration. if @target is not NULL, the entry is stored
  * in target->ent.
- *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is 
MEMORY_DEVICE_PRIVATE
- * (so ZONE_DEVICE page and thus not on the lru).
+ *   3(MC_TARGET_DEVICE): like MC_TARGET_PAGE  but page is device memory and
+ *   thus not on the lru.
  * For now we such page is charge like a regular page would be as for all
  * intent and purposes it is just special memory taking the place of a
  * regular page.
@@ -5722,7 +5722,8 @@ static enum mc_target_type get_mctgt_type(struct 
vm_area_struct *vma,
 */
if (page_memcg(page) == mc.from) {
ret = MC_TARGET_PAGE;
-   if (is_device_private_page(page))
+   if (is_device_private_page(page) ||
+   is_device_coherent_page(page))
ret = MC_TARGET_DEVICE;
if (target)
target->page = page;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 97a9ed8f87a96a..f498ed3ece79ae 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1617,12 +1617,16 @@ static int memory_failure_dev_pagemap(unsigned long 
pfn, int flags,
goto unlock;
}
 
-   if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
+   switch (pgmap->type) {
+   case MEMORY_DEVICE_PRIVATE:
+   case MEMORY_DEVICE_COHERENT:
/*
-* TODO: Handle HMM pages which may need coordination
+* TODO: Handle device pages which may need coordination
 * with device-side memory.
 */
goto unlock;
+   default:
+   break;
}
 
/*
diff --git a/mm/memremap.c b/mm/memremap.c
index e00ffcdba7b632..d00bb21a0630cd 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -313,6 +313,16 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
return ERR_PTR(-EINVAL);
}
break;
+   case MEMORY_DEVICE_COHERENT:
+ 

[Nouveau] [PATCH 12/27] mm: refactor the ZONE_DEVICE handling in migrate_vma_pages

2022-02-09 Thread Christoph Hellwig
Make the flow a little more clear and prepare for adding a new
ZONE_DEVICE memory type.

Signed-off-by: Christoph Hellwig 
---
 mm/migrate.c | 27 ---
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 30ecd7223656c1..746e1230886ddb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2788,24 +2788,21 @@ void migrate_vma_pages(struct migrate_vma *migrate)
 
mapping = page_mapping(page);
 
-   if (is_zone_device_page(newpage)) {
-   if (is_device_private_page(newpage)) {
-   /*
-* For now only support private anonymous when
-* migrating to un-addressable device memory.
-*/
-   if (mapping) {
-   migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
-   continue;
-   }
-   } else {
-   /*
-* Other types of ZONE_DEVICE page are not
-* supported.
-*/
+   if (is_device_private_page(newpage)) {
+   /*
+* For now only support private anonymous when migrating
+* to un-addressable device memory.
+*/
+   if (mapping) {
migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
continue;
}
+   } else if (is_zone_device_page(newpage)) {
+   /*
+* Other types of ZONE_DEVICE page are not supported.
+*/
+   migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
+   continue;
}
 
r = migrate_page(mapping, newpage, page, MIGRATE_SYNC_NO_COPY);
-- 
2.30.2



[Nouveau] [PATCH 11/27] mm: refactor the ZONE_DEVICE handling in migrate_vma_insert_page

2022-02-09 Thread Christoph Hellwig
Make the flow a little more clear and prepare for adding a new
ZONE_DEVICE memory type.

Signed-off-by: Christoph Hellwig 
---
 mm/migrate.c | 31 +++
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 8e0370a73f8a43..30ecd7223656c1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2670,26 +2670,25 @@ static void migrate_vma_insert_page(struct migrate_vma 
*migrate,
 */
__SetPageUptodate(page);
 
-   if (is_zone_device_page(page)) {
-   if (is_device_private_page(page)) {
-   swp_entry_t swp_entry;
+   if (is_device_private_page(page)) {
+   swp_entry_t swp_entry;
 
-   if (vma->vm_flags & VM_WRITE)
-   swp_entry = make_writable_device_private_entry(
-   page_to_pfn(page));
-   else
-   swp_entry = make_readable_device_private_entry(
-   page_to_pfn(page));
-   entry = swp_entry_to_pte(swp_entry);
-   } else {
-   /*
-* For now we only support migrating to un-addressable
-* device memory.
-*/
+   if (vma->vm_flags & VM_WRITE)
+   swp_entry = make_writable_device_private_entry(
+   page_to_pfn(page));
+   else
+   swp_entry = make_readable_device_private_entry(
+   page_to_pfn(page));
+   entry = swp_entry_to_pte(swp_entry);
+   } else {
+   /*
+* For now we only support migrating to un-addressable device
+* memory.
+*/
+   if (is_zone_device_page(page)) {
pr_warn_once("Unsupported ZONE_DEVICE page type.\n");
goto abort;
}
-   } else {
entry = mk_pte(page, vma->vm_page_prot);
if (vma->vm_flags & VM_WRITE)
entry = pte_mkwrite(pte_mkdirty(entry));
-- 
2.30.2



[Nouveau] [PATCH 10/27] mm: refactor check_and_migrate_movable_pages

2022-02-09 Thread Christoph Hellwig
Remove up to two levels of indentation by using continue statements
and move variables to local scope where possible.

Signed-off-by: Christoph Hellwig 
---
 mm/gup.c | 81 ++--
 1 file changed, 44 insertions(+), 37 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index a9d4d724aef749..37d6c24ca71225 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1868,72 +1868,79 @@ static long check_and_migrate_movable_pages(unsigned 
long nr_pages,
struct page **pages,
unsigned int gup_flags)
 {
-   unsigned long i;
-   unsigned long isolation_error_count = 0;
-   bool drain_allow = true;
-   LIST_HEAD(movable_page_list);
-   long ret = 0;
+   unsigned long isolation_error_count = 0, i;
struct page *prev_head = NULL;
-   struct page *head;
-   struct migration_target_control mtc = {
-   .nid = NUMA_NO_NODE,
-   .gfp_mask = GFP_USER | __GFP_NOWARN,
-   };
+   LIST_HEAD(movable_page_list);
+   bool drain_allow = true;
+   int ret = 0;
 
for (i = 0; i < nr_pages; i++) {
-   head = compound_head(pages[i]);
+   struct page *head = compound_head(pages[i]);
+
if (head == prev_head)
continue;
prev_head = head;
+
+   if (is_pinnable_page(head))
+   continue;
+
/*
-* If we get a movable page, since we are going to be pinning
-* these entries, try to move them out if possible.
+* Try to move out any movable page before pinning the range.
 */
-   if (!is_pinnable_page(head)) {
-   if (PageHuge(head)) {
-   if (!isolate_huge_page(head, &movable_page_list))
-   isolation_error_count++;
-   } else {
-   if (!PageLRU(head) && drain_allow) {
-   lru_add_drain_all();
-   drain_allow = false;
-   }
+   if (PageHuge(head)) {
+   if (!isolate_huge_page(head, &movable_page_list))
+   isolation_error_count++;
+   continue;
+   }
 
-   if (isolate_lru_page(head)) {
-   isolation_error_count++;
-   continue;
-   }
-   list_add_tail(&head->lru, &movable_page_list);
-   mod_node_page_state(page_pgdat(head),
-   NR_ISOLATED_ANON +
-   page_is_file_lru(head),
-   thp_nr_pages(head));
-   }
+   if (!PageLRU(head) && drain_allow) {
+   lru_add_drain_all();
+   drain_allow = false;
+   }
+
+   if (isolate_lru_page(head)) {
+   isolation_error_count++;
+   continue;
}
+   list_add_tail(&head->lru, &movable_page_list);
+   mod_node_page_state(page_pgdat(head),
+   NR_ISOLATED_ANON + page_is_file_lru(head),
+   thp_nr_pages(head));
}
 
+   if (!list_empty(&movable_page_list) || isolation_error_count)
+   goto unpin_pages;
+
/*
 * If list is empty, and no isolation errors, means that all pages are
 * in the correct zone.
 */
-   if (list_empty(&movable_page_list) && !isolation_error_count)
-   return nr_pages;
+   return nr_pages;
 
+unpin_pages:
if (gup_flags & FOLL_PIN) {
unpin_user_pages(pages, nr_pages);
} else {
for (i = 0; i < nr_pages; i++)
put_page(pages[i]);
}
+
if (!list_empty(&movable_page_list)) {
+   struct migration_target_control mtc = {
+   .nid = NUMA_NO_NODE,
+   .gfp_mask = GFP_USER | __GFP_NOWARN,
+   };
+
ret = migrate_pages(&movable_page_list, alloc_migration_target,
NULL, (unsigned long)&mtc, MIGRATE_SYNC,
MR_LONGTERM_PIN, NULL);
-   if (ret && !list_empty(&movable_page_list))
-   putback_movable_pages(&movable_page_list);
+   if (ret > 0) /* number of pages not migrated */
+   ret = -ENOMEM;
}
 
-   return ret > 0 ? -ENOMEM : ret;
+   if (ret && !list_empty(&movable_page_list))
+  

[Nouveau] [PATCH 09/27] mm: generalize the pgmap based page_free infrastructure

2022-02-09 Thread Christoph Hellwig
Key off on the existence of ->page_free to prepare for adding support for
more pgmap types that are device managed and thus need the free callback.
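
Illustratively, a device managed pgmap now just needs to supply ->page_free
(the callback body is hypothetical):

static void example_page_free(struct page *page)
{
	/*
	 * Called once the last reference is dropped; put the page back on
	 * the driver's private free list. page->zone_device_data holds
	 * whatever the driver stored at allocation time.
	 */
}

static const struct dev_pagemap_ops example_pgmap_ops = {
	.page_free = example_page_free,
};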

Signed-off-by: Christoph Hellwig 
---
 mm/memremap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index fef5734d5e4933..e00ffcdba7b632 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -452,7 +452,7 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
 
 void free_zone_device_page(struct page *page)
 {
-   if (WARN_ON_ONCE(!is_device_private_page(page)))
+   if (WARN_ON_ONCE(!page->pgmap->ops || !page->pgmap->ops->page_free))
return;
 
__ClearPageWaiters(page);
@@ -460,7 +460,7 @@ void free_zone_device_page(struct page *page)
mem_cgroup_uncharge(page_folio(page));
 
/*
-* When a device_private page is freed, the page->mapping field
+* When a device managed page is freed, the page->mapping field
 * may still contain a (stale) mapping value. For example, the
 * lower bits of page->mapping may still identify the page as an
 * anonymous page. Ultimately, this entire field is just stale
-- 
2.30.2



[Nouveau] [PATCH 08/27] fsdax: depend on ZONE_DEVICE || FS_DAX_LIMITED

2022-02-09 Thread Christoph Hellwig
Add a depends on ZONE_DEVICE support or the s390-specific limited DAX
support, as one of the two is required at runtime for fsdax code to
actually work.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
---
 fs/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/Kconfig b/fs/Kconfig
index e9433bbc48010a..7f2455e8e18ae2 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -48,6 +48,7 @@ config FS_DAX
bool "File system based Direct Access (DAX) support"
depends on MMU
depends on !(ARM || MIPS || SPARC)
+   depends on ZONE_DEVICE || FS_DAX_LIMITED
select FS_IOMAP
select DAX
help
-- 
2.30.2



[Nouveau] [PATCH 07/27] mm: remove the extra ZONE_DEVICE struct page refcount

2022-02-09 Thread Christoph Hellwig
ZONE_DEVICE struct pages have an extra reference count that complicates
the code for put_page() and several places in the kernel that need to
check the reference count to see that a page is not being used (gup,
compaction, migration, etc.). Clean up the code so the reference count
doesn't need to be treated specially for ZONE_DEVICE pages.

Note that this excludes the special idle page wakeup for fsdax pages,
which still happens at refcount 1.  This is a separate issue and will
be sorted out later.  Given that only fsdax pages require the
notification when the refcount hits 1 now, the PAGEMAP_OPS Kconfig
symbol can go away and be replaced with a FS_DAX check for this hook
in the put_page fastpath.

Based on an earlier patch from Ralph Campbell .
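
A condensed sketch of the resulting free path (not a literal copy of the
patch; the reset to a refcount of one is what "prepare for handing out the
page again" means in practice):

static void example_free_zone_device_page(struct page *page)
{
	/* the last reference is gone, give the page back to the driver */
	page->mapping = NULL;
	page->pgmap->ops->page_free(page);

	/* re-arm the refcount so the driver can hand the page out again */
	set_page_count(page, 1);
}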

Signed-off-by: Christoph Hellwig 
Reviewed-by: Logan Gunthorpe 
Reviewed-by: Ralph Campbell 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Dan Williams 
Acked-by: Felix Kuehling 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c   |  1 -
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  1 -
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |  1 -
 fs/Kconfig   |  1 -
 include/linux/memremap.h | 12 +++--
 include/linux/mm.h   |  6 +--
 lib/test_hmm.c   |  1 -
 mm/Kconfig   |  4 --
 mm/internal.h|  2 +
 mm/memcontrol.c  | 11 ++---
 mm/memremap.c| 57 
 mm/migrate.c |  6 ---
 mm/swap.c| 16 ++-
 13 files changed, 36 insertions(+), 83 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index e414ca44839fd1..8b6438fa18fc2b 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -712,7 +712,6 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
 
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
-   get_page(dpage);
lock_page(dpage);
return dpage;
 out_clear:
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index cb835f95a76e66..e27ca375876230 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -225,7 +225,6 @@ svm_migrate_get_vram_page(struct svm_range *prange, 
unsigned long pfn)
page = pfn_to_page(pfn);
svm_range_bo_ref(prange->svm_bo);
page->zone_device_data = prange->svm_bo;
-   get_page(page);
lock_page(page);
 }
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index a5cdfbe32b5e54..7ba66ad68a8a1e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -326,7 +326,6 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
 
-   get_page(page);
lock_page(page);
return page;
 }
diff --git a/fs/Kconfig b/fs/Kconfig
index 6c7dc1387beb0f..e9433bbc48010a 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -48,7 +48,6 @@ config FS_DAX
bool "File system based Direct Access (DAX) support"
depends on MMU
depends on !(ARM || MIPS || SPARC)
-   select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)
select FS_IOMAP
select DAX
help
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 514ab46f597e5c..d6a114dd5ea8b7 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -68,9 +68,9 @@ enum memory_type {
 
 struct dev_pagemap_ops {
/*
-* Called once the page refcount reaches 1.  (ZONE_DEVICE pages never
-* reach 0 refcount unless there is a refcount bug. This allows the
-* device driver to implement its own memory management.)
+* Called once the page refcount reaches 0.  The reference count will be
+* reset to one by the core code after the method is called to prepare
+* for handing out the page again.
 */
void (*page_free)(struct page *page);
 
@@ -133,16 +133,14 @@ static inline unsigned long pgmap_vmemmap_nr(struct 
dev_pagemap *pgmap)
 
 static inline bool is_device_private_page(const struct page *page)
 {
-   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
-   IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
+   return IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
is_zone_device_page(page) &&
page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
 
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
-   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
-   IS_ENABLED(CONFIG_PCI_P2PDMA) &&
+   return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
is_zone_device_page(page) &&

[Nouveau] [PATCH 06/27] mm: don't include <linux/memremap.h> in <linux/mm.h>

2022-02-09 Thread Christoph Hellwig
Move the check for the actual pgmap types that need the free at refcount
one behavior into the out of line helper, and thus avoid the need to
pull memremap.h into mm.h.
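
The shape of the split, as a sketch assuming the helper names introduced
earlier in the series: the cheap checks stay inline in mm.h, everything that
looks at page->pgmap->type goes out of line into memremap.c:

/* mm.h side: no memremap.h needed for the fast path */
bool __put_devmap_managed_page(struct page *page);
static inline bool put_devmap_managed_page(struct page *page)
{
	if (!static_branch_unlikely(&devmap_managed_key))
		return false;
	if (!is_zone_device_page(page))
		return false;
	return __put_devmap_managed_page(page);	/* pgmap->type checked here */
}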

Signed-off-by: Christoph Hellwig 
Reviewed-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Dan Williams 
Acked-by: Felix Kuehling 
---
 arch/arm64/mm/mmu.c|  1 +
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h  |  1 +
 drivers/gpu/drm/drm_cache.c|  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  1 +
 drivers/gpu/drm/nouveau/nouveau_svm.c  |  1 +
 drivers/infiniband/core/rw.c   |  1 +
 drivers/nvdimm/pmem.h  |  1 +
 drivers/nvme/host/pci.c|  1 +
 drivers/nvme/target/io-cmd-bdev.c  |  1 +
 fs/fuse/virtio_fs.c|  1 +
 include/linux/memremap.h   | 18 ++
 include/linux/mm.h | 20 
 lib/test_hmm.c |  1 +
 mm/memcontrol.c|  1 +
 mm/memremap.c  |  6 +-
 15 files changed, 35 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index acfae9b41cc8c9..580abae6c0b93f 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include <linux/memremap.h>
 #include 
 #include 
 #include 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index ea68f3b3a4e9cb..6d643b4b791d87 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -25,6 +25,7 @@
 
 #include 
 #include 
+#include <linux/memremap.h>
 #include 
 #include 
 #include 
diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
index f19d9acbe95936..50b8a088f763a6 100644
--- a/drivers/gpu/drm/drm_cache.c
+++ b/drivers/gpu/drm/drm_cache.c
@@ -27,11 +27,11 @@
 /*
  * Authors: Thomas Hellström 
  */
-
 #include 
 #include 
 #include 
 #include 
+#include <linux/memremap.h>
 #include 
 
 #include 
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index e886a3b9e08c7d..a5cdfbe32b5e54 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -39,6 +39,7 @@
 
 #include 
 #include 
+#include <linux/memremap.h>
 #include 
 
 /*
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 266809e511e2c1..090b9b47708cca 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include <linux/memremap.h>
 #include 
 
 struct nouveau_svm {
diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 5a3bd41b331c93..4d98f931a13ddd 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -2,6 +2,7 @@
 /*
  * Copyright (c) 2016 HGST, a Western Digital Company.
  */
+#include <linux/memremap.h>
 #include 
 #include 
 #include 
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 59cfe13ea8a85c..1f51a23614299b 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -3,6 +3,7 @@
 #define __NVDIMM_PMEM_H__
 #include 
 #include 
+#include <linux/memremap.h>
 #include 
 #include 
 #include 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6a99ed68091589..ab15bc72710dbe 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include <linux/memremap.h>
 #include 
 #include 
 #include 
diff --git a/drivers/nvme/target/io-cmd-bdev.c 
b/drivers/nvme/target/io-cmd-bdev.c
index 70ca9dfc1771a9..a141446db1bea3 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -6,6 +6,7 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include 
 #include 
+#include <linux/memremap.h>
 #include 
 #include "nvmet.h"
 
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 9d737904d07c0b..86b7dbb6a0d43e 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include <linux/memremap.h>
 #include 
 #include 
 #include 
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 1fafcc38acbad6..514ab46f597e5c 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -1,6 +1,8 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_MEMREMAP_H_
 #define _LINUX_MEMREMAP_H_
+
+#include 
 #include 
 #include 
 #include 
@@ -129,6 +131,22 @@ static inline unsigned long pgmap_vmemmap_nr(struct 
dev_pagemap *pgmap)
return 1 << pgmap->vmemmap_shift;
 }
 
+static inline bool is_device_private_page(const struct page *page)
+{
+   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+   IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
+   is_zone_device_page(page) &&
+   page->pgmap->type == MEMORY_DEVICE_PRIVATE;
+}
+
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+   return IS_ENABLED(CONFIG_DEV_P

[Nouveau] [PATCH 05/27] mm: simplify freeing of devmap managed pages

2022-02-09 Thread Christoph Hellwig
Make put_devmap_managed_page return if it took charge of the page
or not and remove the separate page_is_devmap_managed helper.
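
The caller-side effect of the boolean return, roughly what put_page() and
release_pages() do after this patch:

static void example_release(struct page *page)
{
	/* the devmap ->page_free path consumed the reference */
	if (put_devmap_managed_page(page))
		return;

	/* otherwise fall back to the normal refcounting path */
	if (put_page_testzero(page))
		__put_page(page);
}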

Signed-off-by: Christoph Hellwig 
Reviewed-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
Reviewed-by: Dan Williams 
---
 include/linux/mm.h | 34 ++
 mm/memremap.c  | 20 +---
 mm/swap.c  | 10 +-
 3 files changed, 20 insertions(+), 44 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 91dd0bc786a9ec..26baadcef4556b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1094,33 +1094,24 @@ static inline bool is_zone_movable_page(const struct 
page *page)
 #ifdef CONFIG_DEV_PAGEMAP_OPS
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
 
-static inline bool page_is_devmap_managed(struct page *page)
+bool __put_devmap_managed_page(struct page *page);
+static inline bool put_devmap_managed_page(struct page *page)
 {
if (!static_branch_unlikely(&devmap_managed_key))
return false;
if (!is_zone_device_page(page))
return false;
-   switch (page->pgmap->type) {
-   case MEMORY_DEVICE_PRIVATE:
-   case MEMORY_DEVICE_FS_DAX:
-   return true;
-   default:
-   break;
-   }
-   return false;
+   if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
+   page->pgmap->type != MEMORY_DEVICE_FS_DAX)
+   return false;
+   return __put_devmap_managed_page(page);
 }
 
-void put_devmap_managed_page(struct page *page);
-
 #else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool page_is_devmap_managed(struct page *page)
+static inline bool put_devmap_managed_page(struct page *page)
 {
return false;
 }
-
-static inline void put_devmap_managed_page(struct page *page)
-{
-}
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
 static inline bool is_device_private_page(const struct page *page)
@@ -1220,16 +1211,11 @@ static inline void put_page(struct page *page)
struct folio *folio = page_folio(page);
 
/*
-* For devmap managed pages we need to catch refcount transition from
-* 2 to 1, when refcount reach one it means the page is free and we
-* need to inform the device driver through callback. See
-* include/linux/memremap.h and HMM for details.
+* For some devmap managed pages we need to catch refcount transition
+* from 2 to 1:
 */
-   if (page_is_devmap_managed(&folio->page)) {
-   put_devmap_managed_page(&folio->page);
+   if (put_devmap_managed_page(&folio->page))
return;
-   }
-
folio_put(folio);
 }
 
diff --git a/mm/memremap.c b/mm/memremap.c
index 55d23e9f5c04ec..f41233a67edb12 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -502,24 +502,22 @@ void free_devmap_managed_page(struct page *page)
page->pgmap->ops->page_free(page);
 }
 
-void put_devmap_managed_page(struct page *page)
+bool __put_devmap_managed_page(struct page *page)
 {
-   int count;
-
-   if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
-   return;
-
-   count = page_ref_dec_return(page);
-
/*
 * devmap page refcounts are 1-based, rather than 0-based: if
 * refcount is 1, then the page is free and the refcount is
 * stable because nobody holds a reference on the page.
 */
-   if (count == 1)
+   switch (page_ref_dec_return(page)) {
+   case 1:
free_devmap_managed_page(page);
-   else if (!count)
+   break;
+   case 0:
__put_page(page);
+   break;
+   }
+   return true;
 }
-EXPORT_SYMBOL(put_devmap_managed_page);
+EXPORT_SYMBOL(__put_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/swap.c b/mm/swap.c
index 08058f74cae23e..25b55c56614311 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -930,16 +930,8 @@ void release_pages(struct page **pages, int nr)
unlock_page_lruvec_irqrestore(lruvec, flags);
lruvec = NULL;
}
-   /*
-* ZONE_DEVICE pages that return 'false' from
-* page_is_devmap_managed() do not require special
-* processing, and instead, expect a call to
-* put_page_testzero().
-*/
-   if (page_is_devmap_managed(page)) {
-   put_devmap_managed_page(page);
+   if (put_devmap_managed_page(page))
continue;
-   }
if (put_page_testzero(page))
put_dev_pagemap(page->pgmap);
continue;
-- 
2.30.2



[Nouveau] [PATCH 04/27] mm: move free_devmap_managed_page to memremap.c

2022-02-09 Thread Christoph Hellwig
free_devmap_managed_page has nothing to do with the code in swap.c,
move it to live with the rest of the code for devmap handling.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
Reviewed-by: Muchun Song 
Reviewed-by: Dan Williams 
---
 include/linux/mm.h |  1 -
 mm/memremap.c  | 21 +
 mm/swap.c  | 23 ---
 3 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7b46174989b086..91dd0bc786a9ec 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1092,7 +1092,6 @@ static inline bool is_zone_movable_page(const struct page 
*page)
 }
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
 
 static inline bool page_is_devmap_managed(struct page *page)
diff --git a/mm/memremap.c b/mm/memremap.c
index 5f04a0709e436e..55d23e9f5c04ec 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -501,4 +501,25 @@ void free_devmap_managed_page(struct page *page)
page->mapping = NULL;
page->pgmap->ops->page_free(page);
 }
+
+void put_devmap_managed_page(struct page *page)
+{
+   int count;
+
+   if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
+   return;
+
+   count = page_ref_dec_return(page);
+
+   /*
+* devmap page refcounts are 1-based, rather than 0-based: if
+* refcount is 1, then the page is free and the refcount is
+* stable because nobody holds a reference on the page.
+*/
+   if (count == 1)
+   free_devmap_managed_page(page);
+   else if (!count)
+   __put_page(page);
+}
+EXPORT_SYMBOL(put_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56d5..08058f74cae23e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1153,26 +1153,3 @@ void __init swap_setup(void)
 * _really_ don't want to cluster much more
 */
 }
-
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void put_devmap_managed_page(struct page *page)
-{
-   int count;
-
-   if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
-   return;
-
-   count = page_ref_dec_return(page);
-
-   /*
-* devmap page refcounts are 1-based, rather than 0-based: if
-* refcount is 1, then the page is free and the refcount is
-* stable because nobody holds a reference on the page.
-*/
-   if (count == 1)
-   free_devmap_managed_page(page);
-   else if (!count)
-   __put_page(page);
-}
-EXPORT_SYMBOL(put_devmap_managed_page);
-#endif
-- 
2.30.2



[Nouveau] [PATCH 02/27] mm: remove the __KERNEL__ guard from <linux/mm.h>

2022-02-09 Thread Christoph Hellwig
__KERNEL__ ifdefs don't make sense outside of include/uapi/.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
Reviewed-by: Muchun Song 
Reviewed-by: Dan Williams 
---
 include/linux/mm.h | 4 
 1 file changed, 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 213cc569b19223..7b46174989b086 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3,9 +3,6 @@
 #define _LINUX_MM_H
 
 #include 
-
-#ifdef __KERNEL__
-
 #include 
 #include 
 #include 
@@ -3381,5 +3378,4 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long 
start,
 }
 #endif
 
-#endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
-- 
2.30.2



[Nouveau] [PATCH 03/27] mm: remove pointless includes from <linux/hmm.h>

2022-02-09 Thread Christoph Hellwig
hmm.h pulls in the world for no good reason at all.  Remove the
includes and push a few ones into the users instead.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 +
 drivers/gpu/drm/nouveau/nouveau_dmem.c   | 1 +
 include/linux/hmm.h  | 9 ++---
 lib/test_hmm.c   | 2 ++
 4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index ed5385137f4831..cb835f95a76e66 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "amdgpu_sync.h"
 #include "amdgpu_object.h"
 #include "amdgpu_vm.h"
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 3828aafd3ac46f..e886a3b9e08c7d 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -39,6 +39,7 @@
 
 #include 
 #include 
+#include 
 
 /*
  * FIXME: this is ugly right now we are using TTM to allocate vram and we pin
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 2fd2e91d5107c0..d5a6f101f843e6 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -9,14 +9,9 @@
 #ifndef LINUX_HMM_H
 #define LINUX_HMM_H
 
-#include 
-#include 
+#include 
 
-#include 
-#include 
-#include 
-#include 
-#include 
+struct mmu_interval_notifier;
 
 /*
  * On output:
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 767538089a62e4..396beee6b061d4 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -26,6 +26,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "test_hmm_uapi.h"
 
-- 
2.30.2



[Nouveau] [PATCH 01/27] mm: remove a pointless CONFIG_ZONE_DEVICE check in memremap_pages

2022-02-09 Thread Christoph Hellwig
memremap.c is only built when CONFIG_ZONE_DEVICE is set, so remove
the superfluous extra check.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Logan Gunthorpe 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Chaitanya Kulkarni 
Reviewed-by: Muchun Song 
Reviewed-by: Dan Williams 
---
 mm/memremap.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 6aa5f0c2d11fda..5f04a0709e436e 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -328,8 +328,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
}
break;
case MEMORY_DEVICE_FS_DAX:
-   if (!IS_ENABLED(CONFIG_ZONE_DEVICE) ||
-   IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
+   if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
WARN(1, "File system DAX not supported\n");
return ERR_PTR(-EINVAL);
}
-- 
2.30.2



[Nouveau] start sorting out the ZONE_DEVICE refcount mess v2

2022-02-09 Thread Christoph Hellwig
Hi all,

this series removes the offset-by-one refcount for ZONE_DEVICE pages
that are freed back to the driver owning them, which is just the device
private ones for now, but also covers the planned device coherent pages
and the enhanced p2p ones that are pending.

It does not address the fsdax pages yet, which will be attacked in a
follow on series.

Note that if we want to get the p2p series rebased on top of this
we'll need a git branch for this series.  I could offer to host one.

A git tree is available here:

git://git.infradead.org/users/hch/misc.git pgmap-refcount

Gitweb:


http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/pgmap-refcount

Changes since v1:
 - add a missing memremap.h include in memcontrol.c
 - include rebased versions of the device coherent support and
   device coherent migration support series as well as additional
   cleanup patches

Diffstat:
 arch/arm64/mm/mmu.c  |1 
 arch/powerpc/kvm/book3s_hv_uvmem.c   |1 
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |   35 -
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h|1 
 drivers/gpu/drm/drm_cache.c  |2 
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |3 
 drivers/gpu/drm/nouveau/nouveau_svm.c|1 
 drivers/infiniband/core/rw.c |1 
 drivers/nvdimm/pmem.h|1 
 drivers/nvme/host/pci.c  |1 
 drivers/nvme/target/io-cmd-bdev.c|1 
 fs/Kconfig   |2 
 fs/fuse/virtio_fs.c  |1 
 include/linux/hmm.h  |9 
 include/linux/memremap.h |   36 +
 include/linux/migrate.h  |1 
 include/linux/mm.h   |   59 --
 lib/test_hmm.c   |  353 ++---
 lib/test_hmm_uapi.h  |   22 
 mm/Kconfig   |7 
 mm/Makefile  |1 
 mm/gup.c |  127 +++-
 mm/internal.h|3 
 mm/memcontrol.c  |   19 
 mm/memory-failure.c  |8 
 mm/memremap.c|   75 +-
 mm/migrate.c |  763 
 mm/migrate_device.c  |  822 +++
 mm/rmap.c|5 
 mm/swap.c|   49 -
 tools/testing/selftests/vm/Makefile  |2 
 tools/testing/selftests/vm/hmm-tests.c   |  204 ++-
 tools/testing/selftests/vm/test_hmm.sh   |   24 
 33 files changed, 1552 insertions(+), 1088 deletions(-)


Re: [Nouveau] [PATCH 6/8] mm: don't include <linux/memremap.h> in <linux/mm.h>

2022-02-09 Thread Christoph Hellwig
On Thu, Feb 10, 2022 at 01:10:47PM +1100, Alistair Popple wrote:
> diff --git a/mm/gup.c b/mm/gup.c
> index cbb49abb7992..8e85c9fb8df4 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2007,7 +2007,6 @@ static long check_and_migrate_movable_pages(unsigned 
> long nr_pages,
>   if (!ret && list_empty(_page_list) && !isolation_error_count)
>   return nr_pages;
>  
> - ret = 0;
>  unpin_pages:

This isn't quite correct as ret is initially set to -EFAULT now.  I'll
fix it by removing the early ret initialization and always using the
goto. I've also added another refactoring patch for this messy function.

I've folded the inversion of the is_device_coherent_page check in
migrate.c in as well, thanks!


Re: [Nouveau] [PATCH 6/8] mm: don't include <linux/memremap.h> in <linux/mm.h>

2022-02-09 Thread Christoph Hellwig
On Mon, Feb 07, 2022 at 04:19:29PM -0500, Felix Kuehling wrote:
>
> On 2022-02-07 at 01:32, Christoph Hellwig wrote:
>> Move the check for the actual pgmap types that need the free at refcount
>> one behavior into the out of line helper, and thus avoid the need to
>> pull memremap.h into mm.h.
>>
>> Signed-off-by: Christoph Hellwig 
>
> The amdkfd part looks good to me.
>
> It looks like this patch is not based on Alex Sierra's coherent memory 
> series. He added two new helpers is_device_coherent_page and 
> is_dev_private_or_coherent_page that would need to be moved along with 
> is_device_private_page and is_pci_p2pdma_page.

FYI, here is a branch that contains a rebase of the coherent memory
related patches on top of this series:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/pgmap-refcount

I don't have a good way to test this, but I'll at least let the build bot
finish before sending it out (probably tomorrow).


Re: [Nouveau] [PATCH 7/8] mm: remove the extra ZONE_DEVICE struct page refcount

2022-02-09 Thread Christoph Hellwig
On Wed, Feb 09, 2022 at 08:29:56AM -0400, Jason Gunthorpe wrote:
> It is nice, but the other series are still impacted by the fsdax mess
> - they still stuff pages into ptes without proper refcounts and have
> to carry nonsense to dance around this problem.
> 
> I certainly would be unhappy if the amd driver, for instance, gained
> the fsdax problem as well and started pushing 4k pages into PMDs.

As said before: I think this all needs to be fixed.  But I'd rather
fix it gradually and I think this series is a nice step forward.
After that we can look at the pte mappings.


Re: [Nouveau] [PATCH 7/8] mm: remove the extra ZONE_DEVICE struct page refcount

2022-02-08 Thread Christoph Hellwig
On Tue, Feb 08, 2022 at 07:30:11PM -0800, Dan Williams wrote:
> Interesting. I had expected that to really fix the refcount problem
> that fs/dax.c would need to start taking real page references as pages
> were added to a mapping, just like page cache.

I think we should do that eventually.  But I think this series, which
just attacks the device private type and extends to the device coherent
and p2p enhancements, is a good first step to stop the proliferation of
the one-off refcount and to allow dealing with the fsdax pages in another,
more focused series.


Re: [Nouveau] [PATCH 6/8] mm: don't include <linux/memremap.h> in <linux/mm.h>

2022-02-08 Thread Christoph Hellwig
On Tue, Feb 08, 2022 at 03:53:14PM -0800, Dan Williams wrote:
> Yeah, same as Logan:
> 
> mm/memcontrol.c: In function ‘get_mctgt_type’:
> mm/memcontrol.c:5724:29: error: implicit declaration of function
> ‘is_device_private_page’; did you mean
> ‘is_device_private_entry’? [-Werror=implicit-function-declaration]
>  5724 | if (is_device_private_page(page))
>   | ^~
>   | is_device_private_entry
> 
> ...needs:

Yeah, the buildbot also complained.  I've fixed this up locally now.


Re: [Nouveau] [PATCH 6/8] mm: don't include <linux/memremap.h> in <linux/mm.h>

2022-02-07 Thread Christoph Hellwig
On Mon, Feb 07, 2022 at 04:19:29PM -0500, Felix Kuehling wrote:
>
> On 2022-02-07 at 01:32, Christoph Hellwig wrote:
>> Move the check for the actual pgmap types that need the free at refcount
>> one behavior into the out of line helper, and thus avoid the need to
>> pull memremap.h into mm.h.
>>
>> Signed-off-by: Christoph Hellwig 
>
> The amdkfd part looks good to me.
>
> It looks like this patch is not based on Alex Sierra's coherent memory 
> series. He added two new helpers is_device_coherent_page and 
> is_dev_private_or_coherent_page that would need to be moved along with 
> is_device_private_page and is_pci_p2pdma_page.

Yes.  I NAKed that series because it spreads the mess with the refcount
further in this latest version.  My intent is that it gets rebased
on top of this to avoid that spread.  Same for the p2p series from Logan.


[Nouveau] [PATCH 8/8] fsdax: depend on ZONE_DEVICE || FS_DAX_LIMITED

2022-02-06 Thread Christoph Hellwig
Add a depends on ZONE_DEVICE support or the s390-specific limited DAX
support, as one of the two is required at runtime for fsdax code to
actually work.

Signed-off-by: Christoph Hellwig 
---
 fs/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/Kconfig b/fs/Kconfig
index 05efea674bffa0..6e8818a5e53c45 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -48,6 +48,7 @@ config FS_DAX
bool "File system based Direct Access (DAX) support"
depends on MMU
depends on !(ARM || MIPS || SPARC)
+   depends on ZONE_DEVICE || FS_DAX_LIMITED
select FS_IOMAP
select DAX
help
-- 
2.30.2



[Nouveau] [PATCH 7/8] mm: remove the extra ZONE_DEVICE struct page refcount

2022-02-06 Thread Christoph Hellwig
ZONE_DEVICE struct pages have an extra reference count that complicates
the code for put_page() and several places in the kernel that need to
check the reference count to see that a page is not being used (gup,
compaction, migration, etc.). Clean up the code so the reference count
doesn't need to be treated specially for ZONE_DEVICE pages.

Note that this excludes the special idle page wakeup for fsdax pages,
which still happens at refcount 1.  This is a separate issue and will
be sorted out later.  Given that only fsdax pages require the
notification when the refcount hits 1 now, the PAGEMAP_OPS Kconfig
symbol can go away and be replaced with a FS_DAX check for this hook
in the put_page fastpath.

Based on an earlier patch from Ralph Campbell .

Signed-off-by: Christoph Hellwig 
---
 arch/powerpc/kvm/book3s_hv_uvmem.c   |  1 -
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  1 -
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |  1 -
 fs/Kconfig   |  1 -
 include/linux/memremap.h | 12 +++--
 include/linux/mm.h   |  6 +--
 lib/test_hmm.c   |  1 -
 mm/Kconfig   |  4 --
 mm/internal.h|  2 +
 mm/memcontrol.c  | 11 ++---
 mm/memremap.c| 57 
 mm/migrate.c |  6 ---
 mm/swap.c| 16 ++-
 13 files changed, 36 insertions(+), 83 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
b/arch/powerpc/kvm/book3s_hv_uvmem.c
index e414ca44839fd1..8b6438fa18fc2b 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -712,7 +712,6 @@ static struct page *kvmppc_uvmem_get_page(unsigned long 
gpa, struct kvm *kvm)
 
dpage = pfn_to_page(uvmem_pfn);
dpage->zone_device_data = pvt;
-   get_page(dpage);
lock_page(dpage);
return dpage;
 out_clear:
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index cb835f95a76e66..e27ca375876230 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -225,7 +225,6 @@ svm_migrate_get_vram_page(struct svm_range *prange, 
unsigned long pfn)
page = pfn_to_page(pfn);
svm_range_bo_ref(prange->svm_bo);
page->zone_device_data = prange->svm_bo;
-   get_page(page);
lock_page(page);
 }
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index a5cdfbe32b5e54..7ba66ad68a8a1e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -326,7 +326,6 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
return NULL;
}
 
-   get_page(page);
lock_page(page);
return page;
 }
diff --git a/fs/Kconfig b/fs/Kconfig
index 7a2b11c0b8036d..05efea674bffa0 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -48,7 +48,6 @@ config FS_DAX
bool "File system based Direct Access (DAX) support"
depends on MMU
depends on !(ARM || MIPS || SPARC)
-   select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)
select FS_IOMAP
select DAX
help
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 514ab46f597e5c..d6a114dd5ea8b7 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -68,9 +68,9 @@ enum memory_type {
 
 struct dev_pagemap_ops {
/*
-* Called once the page refcount reaches 1.  (ZONE_DEVICE pages never
-* reach 0 refcount unless there is a refcount bug. This allows the
-* device driver to implement its own memory management.)
+* Called once the page refcount reaches 0.  The reference count will be
+* reset to one by the core code after the method is called to prepare
+* for handing out the page again.
 */
void (*page_free)(struct page *page);
 
@@ -133,16 +133,14 @@ static inline unsigned long pgmap_vmemmap_nr(struct 
dev_pagemap *pgmap)
 
 static inline bool is_device_private_page(const struct page *page)
 {
-   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
-   IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
+   return IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
is_zone_device_page(page) &&
page->pgmap->type == MEMORY_DEVICE_PRIVATE;
 }
 
 static inline bool is_pci_p2pdma_page(const struct page *page)
 {
-   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
-   IS_ENABLED(CONFIG_PCI_P2PDMA) &&
+   return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
is_zone_device_page(page) &&
page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 

[Nouveau] [PATCH 6/8] mm: don't include <linux/memremap.h> in <linux/mm.h>

2022-02-06 Thread Christoph Hellwig
Move the check for the actual pgmap types that need the free at refcount
one behavior into the out of line helper, and thus avoid the need to
pull memremap.h into mm.h.

Signed-off-by: Christoph Hellwig 
---
 arch/arm64/mm/mmu.c|  1 +
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h  |  1 +
 drivers/gpu/drm/drm_cache.c|  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  1 +
 drivers/gpu/drm/nouveau/nouveau_svm.c  |  1 +
 drivers/infiniband/core/rw.c   |  1 +
 drivers/nvdimm/pmem.h  |  1 +
 drivers/nvme/host/pci.c|  1 +
 drivers/nvme/target/io-cmd-bdev.c  |  1 +
 fs/fuse/virtio_fs.c|  1 +
 include/linux/memremap.h   | 18 ++
 include/linux/mm.h | 20 
 lib/test_hmm.c |  1 +
 mm/memremap.c  |  6 +-
 14 files changed, 34 insertions(+), 22 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index acfae9b41cc8c9..580abae6c0b93f 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include <linux/memremap.h>
 #include 
 #include 
 #include 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index ea68f3b3a4e9cb..6d643b4b791d87 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -25,6 +25,7 @@
 
 #include 
 #include 
+#include <linux/memremap.h>
 #include 
 #include 
 #include 
diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
index f19d9acbe95936..50b8a088f763a6 100644
--- a/drivers/gpu/drm/drm_cache.c
+++ b/drivers/gpu/drm/drm_cache.c
@@ -27,11 +27,11 @@
 /*
  * Authors: Thomas Hellström 
  */
-
 #include 
 #include 
 #include 
 #include 
+#include <linux/memremap.h>
 #include 
 
 #include 
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index e886a3b9e08c7d..a5cdfbe32b5e54 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -39,6 +39,7 @@
 
 #include 
 #include 
+#include <linux/memremap.h>
 #include 
 
 /*
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 266809e511e2c1..090b9b47708cca 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include <linux/memremap.h>
 #include 
 
 struct nouveau_svm {
diff --git a/drivers/infiniband/core/rw.c b/drivers/infiniband/core/rw.c
index 5a3bd41b331c93..4d98f931a13ddd 100644
--- a/drivers/infiniband/core/rw.c
+++ b/drivers/infiniband/core/rw.c
@@ -2,6 +2,7 @@
 /*
  * Copyright (c) 2016 HGST, a Western Digital Company.
  */
+#include <linux/memremap.h>
 #include 
 #include 
 #include 
diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h
index 59cfe13ea8a85c..1f51a23614299b 100644
--- a/drivers/nvdimm/pmem.h
+++ b/drivers/nvdimm/pmem.h
@@ -3,6 +3,7 @@
 #define __NVDIMM_PMEM_H__
 #include 
 #include 
+#include <linux/memremap.h>
 #include 
 #include 
 #include 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6a99ed68091589..ab15bc72710dbe 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include <linux/memremap.h>
 #include 
 #include 
 #include 
diff --git a/drivers/nvme/target/io-cmd-bdev.c 
b/drivers/nvme/target/io-cmd-bdev.c
index 70ca9dfc1771a9..a141446db1bea3 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -6,6 +6,7 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 #include 
 #include 
+#include <linux/memremap.h>
 #include 
 #include "nvmet.h"
 
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 9d737904d07c0b..86b7dbb6a0d43e 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include <linux/memremap.h>
 #include 
 #include 
 #include 
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 1fafcc38acbad6..514ab46f597e5c 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -1,6 +1,8 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_MEMREMAP_H_
 #define _LINUX_MEMREMAP_H_
+
+#include 
 #include 
 #include 
 #include 
@@ -129,6 +131,22 @@ static inline unsigned long pgmap_vmemmap_nr(struct 
dev_pagemap *pgmap)
return 1 << pgmap->vmemmap_shift;
 }
 
+static inline bool is_device_private_page(const struct page *page)
+{
+   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+   IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
+   is_zone_device_page(page) &&
+   page->pgmap->type == MEMORY_DEVICE_PRIVATE;
+}
+
+static inline bool is_pci_p2pdma_page(const struct page *page)
+{
+   return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
+   IS_ENABLED(CONFIG_PCI_P2PDMA) &&
+   is_zone_device_page(page) &&
+   page->pgmap->type == MEMORY_DEVICE_PC

[Nouveau] [PATCH 5/8] mm: simplify freeing of devmap managed pages

2022-02-06 Thread Christoph Hellwig
Make put_devmap_managed_page return if it took charge of the page
or not and remove the separate page_is_devmap_managed helper.

Signed-off-by: Christoph Hellwig 
---
 include/linux/mm.h | 34 ++
 mm/memremap.c  | 20 +---
 mm/swap.c  | 10 +-
 3 files changed, 20 insertions(+), 44 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 91dd0bc786a9ec..26baadcef4556b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1094,33 +1094,24 @@ static inline bool is_zone_movable_page(const struct 
page *page)
 #ifdef CONFIG_DEV_PAGEMAP_OPS
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
 
-static inline bool page_is_devmap_managed(struct page *page)
+bool __put_devmap_managed_page(struct page *page);
+static inline bool put_devmap_managed_page(struct page *page)
 {
if (!static_branch_unlikely(&devmap_managed_key))
return false;
if (!is_zone_device_page(page))
return false;
-   switch (page->pgmap->type) {
-   case MEMORY_DEVICE_PRIVATE:
-   case MEMORY_DEVICE_FS_DAX:
-   return true;
-   default:
-   break;
-   }
-   return false;
+   if (page->pgmap->type != MEMORY_DEVICE_PRIVATE &&
+   page->pgmap->type != MEMORY_DEVICE_FS_DAX)
+   return false;
+   return __put_devmap_managed_page(page);
 }
 
-void put_devmap_managed_page(struct page *page);
-
 #else /* CONFIG_DEV_PAGEMAP_OPS */
-static inline bool page_is_devmap_managed(struct page *page)
+static inline bool put_devmap_managed_page(struct page *page)
 {
return false;
 }
-
-static inline void put_devmap_managed_page(struct page *page)
-{
-}
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
 static inline bool is_device_private_page(const struct page *page)
@@ -1220,16 +1211,11 @@ static inline void put_page(struct page *page)
struct folio *folio = page_folio(page);
 
/*
-* For devmap managed pages we need to catch refcount transition from
-* 2 to 1, when refcount reach one it means the page is free and we
-* need to inform the device driver through callback. See
-* include/linux/memremap.h and HMM for details.
+* For some devmap managed pages we need to catch refcount transition
+* from 2 to 1:
 */
-   if (page_is_devmap_managed(&folio->page)) {
-   put_devmap_managed_page(&folio->page);
+   if (put_devmap_managed_page(&folio->page))
return;
-   }
-
folio_put(folio);
 }
 
diff --git a/mm/memremap.c b/mm/memremap.c
index 55d23e9f5c04ec..f41233a67edb12 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -502,24 +502,22 @@ void free_devmap_managed_page(struct page *page)
page->pgmap->ops->page_free(page);
 }
 
-void put_devmap_managed_page(struct page *page)
+bool __put_devmap_managed_page(struct page *page)
 {
-   int count;
-
-   if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
-   return;
-
-   count = page_ref_dec_return(page);
-
/*
 * devmap page refcounts are 1-based, rather than 0-based: if
 * refcount is 1, then the page is free and the refcount is
 * stable because nobody holds a reference on the page.
 */
-   if (count == 1)
+   switch (page_ref_dec_return(page)) {
+   case 1:
free_devmap_managed_page(page);
-   else if (!count)
+   break;
+   case 0:
__put_page(page);
+   break;
+   }
+   return true;
 }
-EXPORT_SYMBOL(put_devmap_managed_page);
+EXPORT_SYMBOL(__put_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/swap.c b/mm/swap.c
index 08058f74cae23e..25b55c56614311 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -930,16 +930,8 @@ void release_pages(struct page **pages, int nr)
unlock_page_lruvec_irqrestore(lruvec, flags);
lruvec = NULL;
}
-   /*
-* ZONE_DEVICE pages that return 'false' from
-* page_is_devmap_managed() do not require special
-* processing, and instead, expect a call to
-* put_page_testzero().
-*/
-   if (page_is_devmap_managed(page)) {
-   put_devmap_managed_page(page);
+   if (put_devmap_managed_page(page))
continue;
-   }
if (put_page_testzero(page))
put_dev_pagemap(page->pgmap);
continue;
-- 
2.30.2



[Nouveau] [PATCH 4/8] mm: move free_devmap_managed_page to memremap.c

2022-02-06 Thread Christoph Hellwig
free_devmap_managed_page has nothing to do with the code in swap.c,
move it to live with the rest of the code for devmap handling.

Signed-off-by: Christoph Hellwig 
---
 include/linux/mm.h |  1 -
 mm/memremap.c  | 21 +
 mm/swap.c  | 23 ---
 3 files changed, 21 insertions(+), 24 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7b46174989b086..91dd0bc786a9ec 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1092,7 +1092,6 @@ static inline bool is_zone_movable_page(const struct page 
*page)
 }
 
 #ifdef CONFIG_DEV_PAGEMAP_OPS
-void free_devmap_managed_page(struct page *page);
 DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
 
 static inline bool page_is_devmap_managed(struct page *page)
diff --git a/mm/memremap.c b/mm/memremap.c
index 5f04a0709e436e..55d23e9f5c04ec 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -501,4 +501,25 @@ void free_devmap_managed_page(struct page *page)
page->mapping = NULL;
page->pgmap->ops->page_free(page);
 }
+
+void put_devmap_managed_page(struct page *page)
+{
+   int count;
+
+   if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
+   return;
+
+   count = page_ref_dec_return(page);
+
+   /*
+* devmap page refcounts are 1-based, rather than 0-based: if
+* refcount is 1, then the page is free and the refcount is
+* stable because nobody holds a reference on the page.
+*/
+   if (count == 1)
+   free_devmap_managed_page(page);
+   else if (!count)
+   __put_page(page);
+}
+EXPORT_SYMBOL(put_devmap_managed_page);
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56d5..08058f74cae23e 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1153,26 +1153,3 @@ void __init swap_setup(void)
 * _really_ don't want to cluster much more
 */
 }
-
-#ifdef CONFIG_DEV_PAGEMAP_OPS
-void put_devmap_managed_page(struct page *page)
-{
-   int count;
-
-   if (WARN_ON_ONCE(!page_is_devmap_managed(page)))
-   return;
-
-   count = page_ref_dec_return(page);
-
-   /*
-* devmap page refcounts are 1-based, rather than 0-based: if
-* refcount is 1, then the page is free and the refcount is
-* stable because nobody holds a reference on the page.
-*/
-   if (count == 1)
-   free_devmap_managed_page(page);
-   else if (!count)
-   __put_page(page);
-}
-EXPORT_SYMBOL(put_devmap_managed_page);
-#endif
-- 
2.30.2



[Nouveau] [PATCH 3/8] mm: remove pointless includes from

2022-02-06 Thread Christoph Hellwig
hmm.h pulls in the world for no good reason at all.  Remove the
includes and push a few of them into the users instead.

Signed-off-by: Christoph Hellwig 
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 1 +
 drivers/gpu/drm/nouveau/nouveau_dmem.c   | 1 +
 include/linux/hmm.h  | 9 ++---
 lib/test_hmm.c   | 2 ++
 4 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
index ed5385137f4831..cb835f95a76e66 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "amdgpu_sync.h"
 #include "amdgpu_object.h"
 #include "amdgpu_vm.h"
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c 
b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 3828aafd3ac46f..e886a3b9e08c7d 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -39,6 +39,7 @@
 
 #include 
 #include 
+#include 
 
 /*
  * FIXME: this is ugly right now we are using TTM to allocate vram and we pin
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 2fd2e91d5107c0..d5a6f101f843e6 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -9,14 +9,9 @@
 #ifndef LINUX_HMM_H
 #define LINUX_HMM_H
 
-#include 
-#include 
+#include 
 
-#include 
-#include 
-#include 
-#include 
-#include 
+struct mmu_interval_notifier;
 
 /*
  * On output:
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 767538089a62e4..396beee6b061d4 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -26,6 +26,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "test_hmm_uapi.h"
 
-- 
2.30.2



[Nouveau] [PATCH 2/8] mm: remove the __KERNEL__ guard from

2022-02-06 Thread Christoph Hellwig
__KERNEL__ ifdefs don't make sense outside of include/uapi/.

Signed-off-by: Christoph Hellwig 
---
 include/linux/mm.h | 4 
 1 file changed, 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 213cc569b19223..7b46174989b086 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3,9 +3,6 @@
 #define _LINUX_MM_H
 
 #include 
-
-#ifdef __KERNEL__
-
 #include 
 #include 
 #include 
@@ -3381,5 +3378,4 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long 
start,
 }
 #endif
 
-#endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
-- 
2.30.2



[Nouveau] start sorting out the ZONE_DEVICE refcount mess

2022-02-06 Thread Christoph Hellwig
Hi all,

this series removes the offset by one refcount for ZONE_DEVICE pages
that are freed back to the driver owning them, which is just device
private ones for now, but also the planned device coherent pages
and the enhanced p2p ones pending.

It does not address the fsdax pages yet, which will be attacked in a
follow on series.

Diffstat:
 arch/arm64/mm/mmu.c  |1 
 arch/powerpc/kvm/book3s_hv_uvmem.c   |1 
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |2 
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h|1 
 drivers/gpu/drm/drm_cache.c  |2 
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |3 -
 drivers/gpu/drm/nouveau/nouveau_svm.c|1 
 drivers/infiniband/core/rw.c |1 
 drivers/nvdimm/pmem.h|1 
 drivers/nvme/host/pci.c  |1 
 drivers/nvme/target/io-cmd-bdev.c|1 
 fs/Kconfig   |2 
 fs/fuse/virtio_fs.c  |1 
 include/linux/hmm.h  |9 
 include/linux/memremap.h |   22 +-
 include/linux/mm.h   |   59 -
 lib/test_hmm.c   |4 +
 mm/Kconfig   |4 -
 mm/internal.h|2 
 mm/memcontrol.c  |   11 +
 mm/memremap.c|   63 ---
 mm/migrate.c |6 --
 mm/swap.c|   49 ++--
 23 files changed, 90 insertions(+), 157 deletions(-)


[Nouveau] [PATCH 1/8] mm: remove a pointless CONFIG_ZONE_DEVICE check in memremap_pages

2022-02-06 Thread Christoph Hellwig
memremap.c is only built when CONFIG_ZONE_DEVICE is set, so remove
the superfluous extra check.

Signed-off-by: Christoph Hellwig 
---
 mm/memremap.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memremap.c b/mm/memremap.c
index 6aa5f0c2d11fda..5f04a0709e436e 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -328,8 +328,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
}
break;
case MEMORY_DEVICE_FS_DAX:
-   if (!IS_ENABLED(CONFIG_ZONE_DEVICE) ||
-   IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
+   if (IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
WARN(1, "File system DAX not supported\n");
return ERR_PTR(-EINVAL);
}
-- 
2.30.2



[Nouveau] [PATCH 7/7] vgaarb: don't pass a cookie to vga_client_register

2021-07-16 Thread Christoph Hellwig
The VGA arbitration is entirely based on pci_dev structures, so just pass
that back to the set_vga_decode callback.

Signed-off-by: Christoph Hellwig 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  9 
 drivers/gpu/drm/i915/display/intel_vga.c   |  7 ---
 drivers/gpu/drm/nouveau/nouveau_vga.c  |  6 +++---
 drivers/gpu/drm/radeon/radeon_device.c |  9 
 drivers/gpu/vga/vgaarb.c   | 24 +-
 drivers/vfio/pci/vfio_pci.c|  9 
 include/linux/vgaarb.h | 10 -
 7 files changed, 36 insertions(+), 38 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e433fab6bcf6..8398daa0c06a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1266,15 +1266,16 @@ bool amdgpu_device_need_post(struct amdgpu_device *adev)
 /**
  * amdgpu_device_vga_set_decode - enable/disable vga decode
  *
- * @cookie: amdgpu_device pointer
+ * @pdev: PCI device pointer
  * @state: enable/disable vga decode
  *
  * Enable/disable vga decode (all asics).
  * Returns VGA resource flags.
  */
-static unsigned int amdgpu_device_vga_set_decode(void *cookie, bool state)
+static unsigned int amdgpu_device_vga_set_decode(struct pci_dev *pdev,
+   bool state)
 {
-   struct amdgpu_device *adev = cookie;
+   struct amdgpu_device *adev = drm_to_adev(pci_get_drvdata(pdev));
amdgpu_asic_set_vga_state(adev, state);
if (state)
return VGA_RSRC_LEGACY_IO | VGA_RSRC_LEGACY_MEM |
@@ -3715,7 +3716,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
/* this will fail for cards that aren't VGA class devices, just
 * ignore it */
if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
-   vga_client_register(adev->pdev, adev, 
amdgpu_device_vga_set_decode);
+   vga_client_register(adev->pdev, amdgpu_device_vga_set_decode);
 
if (amdgpu_device_supports_px(ddev)) {
px = true;
diff --git a/drivers/gpu/drm/i915/display/intel_vga.c 
b/drivers/gpu/drm/i915/display/intel_vga.c
index 0222719e0824..16c250700985 100644
--- a/drivers/gpu/drm/i915/display/intel_vga.c
+++ b/drivers/gpu/drm/i915/display/intel_vga.c
@@ -121,9 +121,9 @@ intel_vga_set_state(struct drm_i915_private *i915, bool 
enable_decode)
 }
 
 static unsigned int
-intel_vga_set_decode(void *cookie, bool enable_decode)
+intel_vga_set_decode(struct pci_dev *pdev, bool enable_decode)
 {
-   struct drm_i915_private *i915 = cookie;
+   struct drm_i915_private *i915 = pdev_to_i915(pdev);
 
intel_vga_set_state(i915, enable_decode);
 
@@ -136,6 +136,7 @@ intel_vga_set_decode(void *cookie, bool enable_decode)
 
 int intel_vga_register(struct drm_i915_private *i915)
 {
+
struct pci_dev *pdev = to_pci_dev(i915->drm.dev);
int ret;
 
@@ -147,7 +148,7 @@ int intel_vga_register(struct drm_i915_private *i915)
 * then we do not take part in VGA arbitration and the
 * vga_client_register() fails with -ENODEV.
 */
-   ret = vga_client_register(pdev, i915, intel_vga_set_decode);
+   ret = vga_client_register(pdev, intel_vga_set_decode);
if (ret && ret != -ENODEV)
return ret;
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_vga.c 
b/drivers/gpu/drm/nouveau/nouveau_vga.c
index d071c11249a3..60cd8c0463df 100644
--- a/drivers/gpu/drm/nouveau/nouveau_vga.c
+++ b/drivers/gpu/drm/nouveau/nouveau_vga.c
@@ -11,9 +11,9 @@
 #include "nouveau_vga.h"
 
 static unsigned int
-nouveau_vga_set_decode(void *priv, bool state)
+nouveau_vga_set_decode(struct pci_dev *pdev, bool state)
 {
-   struct nouveau_drm *drm = nouveau_drm(priv);
+   struct nouveau_drm *drm = nouveau_drm(pci_get_drvdata(pdev));
struct nvif_object *device = >client.device.object;
 
if (drm->client.device.info.family == NV_DEVICE_INFO_V0_CURIE &&
@@ -94,7 +94,7 @@ nouveau_vga_init(struct nouveau_drm *drm)
return;
pdev = to_pci_dev(dev->dev);
 
-   vga_client_register(pdev, dev, nouveau_vga_set_decode);
+   vga_client_register(pdev, nouveau_vga_set_decode);
 
/* don't register Thunderbolt eGPU with vga_switcheroo */
if (pci_is_thunderbolt_attached(pdev))
diff --git a/drivers/gpu/drm/radeon/radeon_device.c 
b/drivers/gpu/drm/radeon/radeon_device.c
index 11e8e42d99b3..cec03238e14d 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -1067,15 +1067,16 @@ void radeon_combios_fini(struct radeon_device *rdev)
 /**
  * radeon_vga_set_decode - enable/disable vga decode
  *
- * @cookie: radeon_device pointer
+ * @pdev: PCI device
  * @state: enable/disable vga decode
  *
  * Enable/disable vga decode (all asics).
  * Returns VGA resource flags.
  */
-static unsigned int radeon_v

[Nouveau] [PATCH 6/7] vgaarb: remove the unused irq_set_state argument to vga_client_register

2021-07-16 Thread Christoph Hellwig
All callers pass NULL as the irq_set_state argument, so remove it and
the ->irq_set_state member in struct vga_device.

Signed-off-by: Christoph Hellwig 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
 drivers/gpu/drm/i915/display/intel_vga.c   |  2 +-
 drivers/gpu/drm/nouveau/nouveau_vga.c  |  2 +-
 drivers/gpu/drm/radeon/radeon_device.c |  2 +-
 drivers/gpu/vga/vgaarb.c   | 23 +-
 drivers/vfio/pci/vfio_pci.c|  2 +-
 include/linux/vgaarb.h |  4 +---
 7 files changed, 7 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 53afe0198e52..e433fab6bcf6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3715,7 +3715,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
/* this will fail for cards that aren't VGA class devices, just
 * ignore it */
if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
-   vga_client_register(adev->pdev, adev, NULL, 
amdgpu_device_vga_set_decode);
+   vga_client_register(adev->pdev, adev, 
amdgpu_device_vga_set_decode);
 
if (amdgpu_device_supports_px(ddev)) {
px = true;
diff --git a/drivers/gpu/drm/i915/display/intel_vga.c 
b/drivers/gpu/drm/i915/display/intel_vga.c
index 833f9ec14493..0222719e0824 100644
--- a/drivers/gpu/drm/i915/display/intel_vga.c
+++ b/drivers/gpu/drm/i915/display/intel_vga.c
@@ -147,7 +147,7 @@ int intel_vga_register(struct drm_i915_private *i915)
 * then we do not take part in VGA arbitration and the
 * vga_client_register() fails with -ENODEV.
 */
-   ret = vga_client_register(pdev, i915, NULL, intel_vga_set_decode);
+   ret = vga_client_register(pdev, i915, intel_vga_set_decode);
if (ret && ret != -ENODEV)
return ret;
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_vga.c 
b/drivers/gpu/drm/nouveau/nouveau_vga.c
index de7a3a860139..d071c11249a3 100644
--- a/drivers/gpu/drm/nouveau/nouveau_vga.c
+++ b/drivers/gpu/drm/nouveau/nouveau_vga.c
@@ -94,7 +94,7 @@ nouveau_vga_init(struct nouveau_drm *drm)
return;
pdev = to_pci_dev(dev->dev);
 
-   vga_client_register(pdev, dev, NULL, nouveau_vga_set_decode);
+   vga_client_register(pdev, dev, nouveau_vga_set_decode);
 
/* don't register Thunderbolt eGPU with vga_switcheroo */
if (pci_is_thunderbolt_attached(pdev))
diff --git a/drivers/gpu/drm/radeon/radeon_device.c 
b/drivers/gpu/drm/radeon/radeon_device.c
index d781914f8bcb..11e8e42d99b3 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -1434,7 +1434,7 @@ int radeon_device_init(struct radeon_device *rdev,
/* if we have > 1 VGA cards, then disable the radeon VGA resources */
/* this will fail for cards that aren't VGA class devices, just
 * ignore it */
-   vga_client_register(rdev->pdev, rdev, NULL, radeon_vga_set_decode);
+   vga_client_register(rdev->pdev, rdev, radeon_vga_set_decode);
 
if (rdev->flags & RADEON_IS_PX)
runtime = true;
diff --git a/drivers/gpu/vga/vgaarb.c b/drivers/gpu/vga/vgaarb.c
index 85b765b80abf..4bde017f6f22 100644
--- a/drivers/gpu/vga/vgaarb.c
+++ b/drivers/gpu/vga/vgaarb.c
@@ -72,9 +72,7 @@ struct vga_device {
unsigned int io_norm_cnt;   /* normal IO count */
unsigned int mem_norm_cnt;  /* normal MEM count */
bool bridge_has_one_vga;
-   /* allow IRQ enable/disable hook */
void *cookie;
-   void (*irq_set_state)(void *cookie, bool enable);
unsigned int (*set_vga_decode)(void *cookie, bool decode);
 };
 
@@ -218,13 +216,6 @@ int vga_remove_vgacon(struct pci_dev *pdev)
 #endif
 EXPORT_SYMBOL(vga_remove_vgacon);
 
-static inline void vga_irq_set_state(struct vga_device *vgadev, bool state)
-{
-   if (vgadev->irq_set_state)
-   vgadev->irq_set_state(vgadev->cookie, state);
-}
-
-
 /* If we don't ever use VGA arb we should avoid
turning off anything anywhere due to old X servers getting
confused about the boot device not being VGA */
@@ -325,10 +316,8 @@ static struct vga_device *__vga_tryget(struct vga_device 
*vgadev,
if ((match & conflict->decodes) & VGA_RSRC_LEGACY_IO)
pci_bits |= PCI_COMMAND_IO;
 
-   if (pci_bits) {
-   vga_irq_set_state(conflict, false);
+   if (pci_bits)
flags |= PCI_VGA_STATE_CHANGE_DECODES;
-   }
}
 
if (change_bridge)
@@ -365,9 +354,6 @@ static struct vga_device *__vga_tryget(struct vga_device 
*vgadev,
 
pci_set_vga_state(vgadev->

[Nouveau] [PATCH 5/7] vgaarb: provide a vga_client_unregister wrapper

2021-07-16 Thread Christoph Hellwig
Add a trivial wrapper for the unregister case that sets all fields to
NULL.

Signed-off-by: Christoph Hellwig 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 +-
 drivers/gpu/drm/drm_irq.c  | 4 ++--
 drivers/gpu/drm/i915/display/intel_vga.c   | 2 +-
 drivers/gpu/drm/nouveau/nouveau_vga.c  | 2 +-
 drivers/gpu/drm/radeon/radeon_device.c | 2 +-
 drivers/gpu/vga/vgaarb.c   | 3 +--
 drivers/vfio/pci/vfio_pci.c| 2 +-
 include/linux/vgaarb.h | 5 +
 8 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d303e88e3c23..53afe0198e52 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3838,7 +3838,7 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
vga_switcheroo_fini_domain_pm_ops(adev->dev);
}
if ((adev->pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA)
-   vga_client_register(adev->pdev, NULL, NULL, NULL);
+   vga_client_unregister(adev->pdev);
 
if (IS_ENABLED(CONFIG_PERF_EVENTS))
amdgpu_pmu_fini(adev);
diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
index c3bd664ea733..c87b0fb384e4 100644
--- a/drivers/gpu/drm/drm_irq.c
+++ b/drivers/gpu/drm/drm_irq.c
@@ -140,7 +140,7 @@ int drm_irq_install(struct drm_device *dev, int irq)
if (ret < 0) {
dev->irq_enabled = false;
if (drm_core_check_feature(dev, DRIVER_LEGACY))
-   vga_client_register(to_pci_dev(dev->dev), NULL, NULL, 
NULL);
+   vga_client_unregister(to_pci_dev(dev->dev));
free_irq(irq, dev);
} else {
dev->irq = irq;
@@ -203,7 +203,7 @@ int drm_irq_uninstall(struct drm_device *dev)
DRM_DEBUG("irq=%d\n", dev->irq);
 
if (drm_core_check_feature(dev, DRIVER_LEGACY))
-   vga_client_register(to_pci_dev(dev->dev), NULL, NULL, NULL);
+   vga_client_unregister(to_pci_dev(dev->dev));
 
if (dev->driver->irq_uninstall)
dev->driver->irq_uninstall(dev);
diff --git a/drivers/gpu/drm/i915/display/intel_vga.c 
b/drivers/gpu/drm/i915/display/intel_vga.c
index f002b82ba9c0..833f9ec14493 100644
--- a/drivers/gpu/drm/i915/display/intel_vga.c
+++ b/drivers/gpu/drm/i915/display/intel_vga.c
@@ -158,5 +158,5 @@ void intel_vga_unregister(struct drm_i915_private *i915)
 {
struct pci_dev *pdev = to_pci_dev(i915->drm.dev);
 
-   vga_client_register(pdev, NULL, NULL, NULL);
+   vga_client_unregister(pdev);
 }
diff --git a/drivers/gpu/drm/nouveau/nouveau_vga.c 
b/drivers/gpu/drm/nouveau/nouveau_vga.c
index 7c4b374b3eca..de7a3a860139 100644
--- a/drivers/gpu/drm/nouveau/nouveau_vga.c
+++ b/drivers/gpu/drm/nouveau/nouveau_vga.c
@@ -118,7 +118,7 @@ nouveau_vga_fini(struct nouveau_drm *drm)
return;
pdev = to_pci_dev(dev->dev);
 
-   vga_client_register(pdev, NULL, NULL, NULL);
+   vga_client_unregister(pdev);
 
if (pci_is_thunderbolt_attached(pdev))
return;
diff --git a/drivers/gpu/drm/radeon/radeon_device.c 
b/drivers/gpu/drm/radeon/radeon_device.c
index 46eea01950cb..d781914f8bcb 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -1530,7 +1530,7 @@ void radeon_device_fini(struct radeon_device *rdev)
vga_switcheroo_unregister_client(rdev->pdev);
if (rdev->flags & RADEON_IS_PX)
vga_switcheroo_fini_domain_pm_ops(rdev->dev);
-   vga_client_register(rdev->pdev, NULL, NULL, NULL);
+   vga_client_unregister(rdev->pdev);
if (rdev->rio_mem)
pci_iounmap(rdev->pdev, rdev->rio_mem);
rdev->rio_mem = NULL;
diff --git a/drivers/gpu/vga/vgaarb.c b/drivers/gpu/vga/vgaarb.c
index 3ed3734f66d9..85b765b80abf 100644
--- a/drivers/gpu/vga/vgaarb.c
+++ b/drivers/gpu/vga/vgaarb.c
@@ -877,8 +877,7 @@ EXPORT_SYMBOL(vga_set_legacy_decoding);
  * This function does not check whether a client for @pdev has been registered
  * already.
  *
- * To unregister just call this function with @irq_set_state and 
@set_vga_decode
- * both set to NULL for the same @pdev as originally used to register them.
+ * To unregister just call vga_client_unregister().
  *
  * Returns: 0 on success, -1 on failure
  */
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 318864d52837..47d13a1fb7fb 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1967,7 +1967,7 @@ static void vfio_pci_vga_uninit(struct vfio_pci_device 
*vdev)
 
if (!vfio_pci_is_vga(pdev))
return;
-   vga_client_register(pdev, NULL, NULL, NULL);
+   vga

[Nouveau] [PATCH 4/7] vgaarb: cleanup vgaarb.h

2021-07-16 Thread Christoph Hellwig
Merge the different CONFIG_VGA_ARB ifdef blocks, remove superfluous
externs, and regularize the stubs for !CONFIG_VGA_ARB.

Signed-off-by: Christoph Hellwig 
---
 include/linux/vgaarb.h | 90 --
 1 file changed, 42 insertions(+), 48 deletions(-)

diff --git a/include/linux/vgaarb.h b/include/linux/vgaarb.h
index fdce9007d57e..05171fc7e26a 100644
--- a/include/linux/vgaarb.h
+++ b/include/linux/vgaarb.h
@@ -33,6 +33,8 @@
 
 #include 
 
+struct pci_dev;
+
 /* Legacy VGA regions */
 #define VGA_RSRC_NONE 0x00
 #define VGA_RSRC_LEGACY_IO 0x01
@@ -42,23 +44,47 @@
 #define VGA_RSRC_NORMAL_IO 0x04
 #define VGA_RSRC_NORMAL_MEM0x08
 
-struct pci_dev;
-
-/* For use by clients */
-
-#if defined(CONFIG_VGA_ARB)
-extern void vga_set_legacy_decoding(struct pci_dev *pdev,
-   unsigned int decodes);
-#else
+#ifdef CONFIG_VGA_ARB
+void vga_set_legacy_decoding(struct pci_dev *pdev, unsigned int decodes);
+int vga_get(struct pci_dev *pdev, unsigned int rsrc, int interruptible);
+void vga_put(struct pci_dev *pdev, unsigned int rsrc);
+struct pci_dev *vga_default_device(void);
+void vga_set_default_device(struct pci_dev *pdev);
+int vga_remove_vgacon(struct pci_dev *pdev);
+int vga_client_register(struct pci_dev *pdev, void *cookie,
+   void (*irq_set_state)(void *cookie, bool state),
+   unsigned int (*set_vga_decode)(void *cookie, bool 
state));
+#else /* CONFIG_VGA_ARB */
 static inline void vga_set_legacy_decoding(struct pci_dev *pdev,
-  unsigned int decodes) { };
-#endif
-
-#if defined(CONFIG_VGA_ARB)
-extern int vga_get(struct pci_dev *pdev, unsigned int rsrc, int interruptible);
-#else
-static inline int vga_get(struct pci_dev *pdev, unsigned int rsrc, int 
interruptible) { return 0; }
-#endif
+   unsigned int decodes)
+{
+};
+static inline int vga_get(struct pci_dev *pdev, unsigned int rsrc,
+   int interruptible)
+{
+   return 0;
+}
+static inline void vga_put(struct pci_dev *pdev, unsigned int rsrc)
+{
+}
+static inline struct pci_dev *vga_default_device(void)
+{
+   return NULL;
+}
+static inline void vga_set_default_device(struct pci_dev *pdev)
+{
+}
+static inline int vga_remove_vgacon(struct pci_dev *pdev)
+{
+   return 0;
+}
+static inline int vga_client_register(struct pci_dev *pdev, void *cookie,
+ void (*irq_set_state)(void *cookie, bool 
state),
+ unsigned int (*set_vga_decode)(void 
*cookie, bool state))
+{
+   return 0;
+}
+#endif /* CONFIG_VGA_ARB */
 
 /**
  * vga_get_interruptible
@@ -90,36 +116,4 @@ static inline int vga_get_uninterruptible(struct pci_dev 
*pdev,
return vga_get(pdev, rsrc, 0);
 }
 
-#if defined(CONFIG_VGA_ARB)
-extern void vga_put(struct pci_dev *pdev, unsigned int rsrc);
-#else
-static inline void vga_put(struct pci_dev *pdev, unsigned int rsrc)
-{
-}
-#endif
-
-
-#ifdef CONFIG_VGA_ARB
-extern struct pci_dev *vga_default_device(void);
-extern void vga_set_default_device(struct pci_dev *pdev);
-extern int vga_remove_vgacon(struct pci_dev *pdev);
-#else
-static inline struct pci_dev *vga_default_device(void) { return NULL; }
-static inline void vga_set_default_device(struct pci_dev *pdev) { }
-static inline int vga_remove_vgacon(struct pci_dev *pdev) { return 0; }
-#endif
-
-#if defined(CONFIG_VGA_ARB)
-int vga_client_register(struct pci_dev *pdev, void *cookie,
-   void (*irq_set_state)(void *cookie, bool state),
-   unsigned int (*set_vga_decode)(void *cookie, bool 
state));
-#else
-static inline int vga_client_register(struct pci_dev *pdev, void *cookie,
- void (*irq_set_state)(void *cookie, bool 
state),
- unsigned int (*set_vga_decode)(void 
*cookie, bool state))
-{
-   return 0;
-}
-#endif
-
 #endif /* LINUX_VGA_H */
-- 
2.30.2



[Nouveau] [PATCH 3/7] vgaarb: move the kerneldoc for vga_set_legacy_decoding to vgaarb.c

2021-07-16 Thread Christoph Hellwig
Kerneldoc comments should be at the implementation side, not in the
header just declaring the prototype.

Signed-off-by: Christoph Hellwig 
---
 drivers/gpu/vga/vgaarb.c | 11 +++
 include/linux/vgaarb.h   | 13 -
 2 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/vga/vgaarb.c b/drivers/gpu/vga/vgaarb.c
index fccc7ef5153a..3ed3734f66d9 100644
--- a/drivers/gpu/vga/vgaarb.c
+++ b/drivers/gpu/vga/vgaarb.c
@@ -834,6 +834,17 @@ static void __vga_set_legacy_decoding(struct pci_dev *pdev,
spin_unlock_irqrestore(_lock, flags);
 }
 
+/**
+ * vga_set_legacy_decoding
+ * @pdev: pci device of the VGA card
+ * @decodes: bit mask of what legacy regions the card decodes
+ *
+ * Indicates to the arbiter if the card decodes legacy VGA IOs, legacy VGA
+ * Memory, both, or none. All cards default to both, the card driver (fbdev for
+ * example) should tell the arbiter if it has disabled legacy decoding, so the
+ * card can be left out of the arbitration process (and can be safe to take
+ * interrupts at any time.
+ */
 void vga_set_legacy_decoding(struct pci_dev *pdev, unsigned int decodes)
 {
__vga_set_legacy_decoding(pdev, decodes, false);
diff --git a/include/linux/vgaarb.h b/include/linux/vgaarb.h
index ca5160218538..fdce9007d57e 100644
--- a/include/linux/vgaarb.h
+++ b/include/linux/vgaarb.h
@@ -46,19 +46,6 @@ struct pci_dev;
 
 /* For use by clients */
 
-/**
- * vga_set_legacy_decoding
- *
- * @pdev: pci device of the VGA card
- * @decodes: bit mask of what legacy regions the card decodes
- *
- * Indicates to the arbiter if the card decodes legacy VGA IOs,
- * legacy VGA Memory, both, or none. All cards default to both,
- * the card driver (fbdev for example) should tell the arbiter
- * if it has disabled legacy decoding, so the card can be left
- * out of the arbitration process (and can be safe to take
- * interrupts at any time.
- */
 #if defined(CONFIG_VGA_ARB)
 extern void vga_set_legacy_decoding(struct pci_dev *pdev,
unsigned int decodes);
-- 
2.30.2



[Nouveau] [PATCH 2/7] vgaarb: remove vga_conflicts

2021-07-16 Thread Christoph Hellwig
vga_conflicts only has a single caller and none of the arch overrides
mentioned in the comment.  Just remove it and the thus dead check in the
caller.

Signed-off-by: Christoph Hellwig 
---
 drivers/gpu/vga/vgaarb.c |  6 --
 include/linux/vgaarb.h   | 12 
 2 files changed, 18 deletions(-)

diff --git a/drivers/gpu/vga/vgaarb.c b/drivers/gpu/vga/vgaarb.c
index 949fde433ea2..fccc7ef5153a 100644
--- a/drivers/gpu/vga/vgaarb.c
+++ b/drivers/gpu/vga/vgaarb.c
@@ -284,12 +284,6 @@ static struct vga_device *__vga_tryget(struct vga_device 
*vgadev,
if (vgadev == conflict)
continue;
 
-   /* Check if the architecture allows a conflict between those
-* 2 devices or if they are on separate domains
-*/
-   if (!vga_conflicts(vgadev->pdev, conflict->pdev))
-   continue;
-
/* We have a possible conflict. before we go further, we must
 * check if we sit on the same bus as the conflicting device.
 * if we don't, then we must tie both IO and MEM resources
diff --git a/include/linux/vgaarb.h b/include/linux/vgaarb.h
index 26ec8a057d2a..ca5160218538 100644
--- a/include/linux/vgaarb.h
+++ b/include/linux/vgaarb.h
@@ -122,18 +122,6 @@ static inline void vga_set_default_device(struct pci_dev 
*pdev) { }
 static inline int vga_remove_vgacon(struct pci_dev *pdev) { return 0; }
 #endif
 
-/*
- * Architectures should define this if they have several
- * independent PCI domains that can afford concurrent VGA
- * decoding
- */
-#ifndef __ARCH_HAS_VGA_CONFLICT
-static inline int vga_conflicts(struct pci_dev *p1, struct pci_dev *p2)
-{
-   return 1;
-}
-#endif
-
 #if defined(CONFIG_VGA_ARB)
 int vga_client_register(struct pci_dev *pdev, void *cookie,
void (*irq_set_state)(void *cookie, bool state),
-- 
2.30.2



[Nouveau] [PATCH 1/7] vgaarb: remove VGA_DEFAULT_DEVICE

2021-07-16 Thread Christoph Hellwig
The define is entirely unused.

Signed-off-by: Christoph Hellwig 
---
 include/linux/vgaarb.h | 6 --
 1 file changed, 6 deletions(-)

diff --git a/include/linux/vgaarb.h b/include/linux/vgaarb.h
index dc6ddce92066..26ec8a057d2a 100644
--- a/include/linux/vgaarb.h
+++ b/include/linux/vgaarb.h
@@ -42,12 +42,6 @@
 #define VGA_RSRC_NORMAL_IO 0x04
 #define VGA_RSRC_NORMAL_MEM0x08
 
-/* Passing that instead of a pci_dev to use the system "default"
- * device, that is the one used by vgacon. Archs will probably
- * have to provide their own vga_default_device();
- */
-#define VGA_DEFAULT_DEVICE (NULL)
-
 struct pci_dev;
 
 /* For use by clients */
-- 
2.30.2



[Nouveau] misc vgaarb cleanups

2021-07-16 Thread Christoph Hellwig
Hi all,

this series cleans up a bunch of loose ends in the vgaarb code.

Diffstat:
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   11 +-
 drivers/gpu/drm/drm_irq.c  |4 
 drivers/gpu/drm/i915/display/intel_vga.c   |9 +-
 drivers/gpu/drm/nouveau/nouveau_vga.c  |8 -
 drivers/gpu/drm/radeon/radeon_device.c |   11 +-
 drivers/gpu/vga/vgaarb.c   |   67 +---
 drivers/vfio/pci/vfio_pci.c|   11 +-
 include/linux/vgaarb.h |  118 ++---
 8 files changed, 93 insertions(+), 146 deletions(-)


Re: [Nouveau] [PATCH v6 01/15] swiotlb: Refactor swiotlb init functions

2021-05-11 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [Nouveau] [PATCH v6 04/15] swiotlb: Add restricted DMA pool initialization

2021-05-11 Thread Christoph Hellwig
> +#ifdef CONFIG_DMA_RESTRICTED_POOL
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#endif

I don't think any of this belongs into swiotlb.c.  Marking
swiotlb_init_io_tlb_mem non-static and having all this code in a separate
file is probably a better idea.

> +#ifdef CONFIG_DMA_RESTRICTED_POOL
> +static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
> + struct device *dev)
> +{
> + struct io_tlb_mem *mem = rmem->priv;
> + unsigned long nslabs = rmem->size >> IO_TLB_SHIFT;
> +
> + if (dev->dma_io_tlb_mem)
> + return 0;
> +
> + /* Since multiple devices can share the same pool, the private data,
> +  * io_tlb_mem struct, will be initialized by the first device attached
> +  * to it.
> +  */

This is not the normal kernel comment style.

> +#ifdef CONFIG_ARM
> + if (!PageHighMem(pfn_to_page(PHYS_PFN(rmem->base {
> + kfree(mem);
> + return -EINVAL;
> + }
> +#endif /* CONFIG_ARM */

And this is weird.  Why would ARM have such a restriction?  And if we have
such restrictions it absolutely belongs in an arch helper.

> + swiotlb_init_io_tlb_mem(mem, rmem->base, nslabs, false);
> +
> + rmem->priv = mem;
> +
> +#ifdef CONFIG_DEBUG_FS
> + if (!debugfs_dir)
> + debugfs_dir = debugfs_create_dir("swiotlb", NULL);
> +
> + swiotlb_create_debugfs(mem, rmem->name, debugfs_dir);

Doesn't the debugfs_create_dir belong into swiotlb_create_debugfs?  Also
please use IS_ENABLED or a stub to avoid ifdefs like this.
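
Something like this maybe (untested sketch; assumes swiotlb_create_debugfs
grows the "swiotlb" directory handling itself and gets a !CONFIG_DEBUG_FS
stub):

#ifdef CONFIG_DEBUG_FS
void swiotlb_create_debugfs(struct io_tlb_mem *mem, const char *name);
#else
static inline void swiotlb_create_debugfs(struct io_tlb_mem *mem,
		const char *name)
{
}
#endif

so that rmem_swiotlb_device_init() can just call
swiotlb_create_debugfs(mem, rmem->name) without any ifdef.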


Re: [Nouveau] [PATCH v6 02/15] swiotlb: Refactor swiotlb_create_debugfs

2021-05-11 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [Nouveau] [PATCH v6 08/15] swiotlb: Bounce data from/to restricted DMA pool if available

2021-05-11 Thread Christoph Hellwig
> +static inline bool is_dev_swiotlb_force(struct device *dev)
> +{
> +#ifdef CONFIG_DMA_RESTRICTED_POOL
> + if (dev->dma_io_tlb_mem)
> + return true;
> +#endif /* CONFIG_DMA_RESTRICTED_POOL */
> + return false;
> +}
> +

>   /* If SWIOTLB is active, use its maximum mapping size */
>   if (is_swiotlb_active(dev) &&
> - (dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE))
> + (dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE ||
> +  is_dev_swiotlb_force(dev)))

This is a mess.  I think the right way is to have an always_bounce flag
in the io_tlb_mem structure instead.  Then the global swiotlb_force can
go away and be replaced with this, plus the fact that having no
io_tlb_mem structure at all means forcing no buffering (after a little
refactoring).
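
As a rough sketch (untested; assumes dev->dma_io_tlb_mem always points to
the active pool as suggested for patch 05, and that io_tlb_mem grows a
bool always_bounce member):

static inline bool is_swiotlb_force_bounce(struct device *dev)
{
        struct io_tlb_mem *mem = dev->dma_io_tlb_mem;

        /* bounce everything through this pool, not just by address */
        return mem && mem->always_bounce;
}

Then the mapping paths can check this helper instead of poking at the
global swiotlb_force and dev->dma_io_tlb_mem directly.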


Re: [Nouveau] [PATCH v6 05/15] swiotlb: Add a new get_io_tlb_mem getter

2021-05-11 Thread Christoph Hellwig
> +static inline struct io_tlb_mem *get_io_tlb_mem(struct device *dev)
> +{
> +#ifdef CONFIG_DMA_RESTRICTED_POOL
> + if (dev && dev->dma_io_tlb_mem)
> + return dev->dma_io_tlb_mem;
> +#endif /* CONFIG_DMA_RESTRICTED_POOL */
> +
> + return io_tlb_default_mem;

Given that we're also looking into a restricted pool that is not about
addressing limitations, I'd rather always assign the active pool to
dev->dma_io_tlb_mem and
do away with this helper.


Re: [Nouveau] [PATCH] drm/ttm: use dma_alloc_pages for the page pool

2021-05-11 Thread Christoph Hellwig
On Tue, May 11, 2021 at 09:35:20AM +0200, Christian König wrote:
> We certainly going to need the drm_need_swiotlb() for userptr support 
> (unless we add some approach for drivers to opt out of swiotlb).

swiotlb use is driven by three things:

 1) addressing limitations of the device
 2) addressing limitations of the interconnect
 3) virtualization modes that require it

not sure how the driver could opt out.  What is the problem with userptr
support?

> Then while I really want to get rid of GFP_DMA32 as well I'm not 100% sure 
> if we can handle this without the flag.

Note that this is still using GFP_DMA32 underneath where required,
just in a layer that can decide that sensibly.

> And last we need something better to store the DMA address and order than 
> allocating a separate memory object for each page.

Yeah.  If you use __GFP_COMP for the allocations we can find the order
from the page itself, which might be useful.  For 64-bit platforms
the dma address could be store in page->private, or depending on how
the page gets used the dma_addr field in struct page that overloads
the lru field and is used by the networking page pool could be used.
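
To make that concrete, roughly (untested, 64-bit only, helper names made
up for illustration):

/* stash the dma address when the __GFP_COMP page is allocated: */
static void ttm_pool_set_dma_addr(struct page *p, dma_addr_t dma)
{
        set_page_private(p, dma);
}

/* and recover both order and dma address from the page itself later: */
static unsigned int ttm_pool_page_order(struct page *p)
{
        return compound_order(p);
}

static dma_addr_t ttm_pool_page_dma_addr(struct page *p)
{
        return page_private(p);
}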

Maybe we could even have a common page pool between net and drm, but
I don't want to go there myself, not being an expert on either subsystem.


[Nouveau] RFC: use dma_alloc_noncoherent in ttm_pool_alloc_page

2021-05-11 Thread Christoph Hellwig
Hi all,

the memory allocation for the TTM pool is a big mess with two allocation
methods that both have issues, a layering violation and odd guessing of
pools in the callers.

This patch switches to the dma_alloc_noncoherent API instead, fixing all
of the above issues.

Warning:  I don't have any of the relevant hardware, so this is a compile
tested request for comments only!

Diffstat:
 drivers/gpu/drm/amd/amdgpu/amdgpu.h |1 
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c |4 
 drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c   |1 
 drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c   |1 
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c   |1 
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c   |1 
 drivers/gpu/drm/drm_cache.c |   31 -
 drivers/gpu/drm/drm_gem_vram_helper.c   |3 
 drivers/gpu/drm/nouveau/nouveau_ttm.c   |8 -
 drivers/gpu/drm/qxl/qxl_ttm.c   |3 
 drivers/gpu/drm/radeon/radeon.h |1 
 drivers/gpu/drm/radeon/radeon_device.c  |1 
 drivers/gpu/drm/radeon/radeon_ttm.c |4 
 drivers/gpu/drm/ttm/ttm_device.c|7 -
 drivers/gpu/drm/ttm/ttm_pool.c  |  178 
 drivers/gpu/drm/ttm/ttm_tt.c|   25 
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.c |4 
 include/drm/drm_cache.h |1 
 include/drm/ttm/ttm_device.h|3 
 include/drm/ttm/ttm_pool.h  |9 -
 20 files changed, 41 insertions(+), 246 deletions(-)


[Nouveau] [PATCH] drm/ttm: use dma_alloc_pages for the page pool

2021-05-11 Thread Christoph Hellwig
Use the dma_alloc_pages allocator for the TTM pool allocator.
This allocator is a front end to the page allocator which takes the
DMA mask of the device into account, thus offering the best of both
worlds of the two existing allocator versions.  This conversion also
removes the ugly layering violation where the TTM pool assumes what
kind of virtual address dma_alloc_attrs can return.

Signed-off-by: Christoph Hellwig 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h |   1 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c |   4 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c   |   1 -
 drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c   |   1 -
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c   |   1 -
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c   |   1 -
 drivers/gpu/drm/drm_cache.c |  31 -
 drivers/gpu/drm/drm_gem_vram_helper.c   |   3 +-
 drivers/gpu/drm/nouveau/nouveau_ttm.c   |   8 +-
 drivers/gpu/drm/qxl/qxl_ttm.c   |   3 +-
 drivers/gpu/drm/radeon/radeon.h |   1 -
 drivers/gpu/drm/radeon/radeon_device.c  |   1 -
 drivers/gpu/drm/radeon/radeon_ttm.c |   4 +-
 drivers/gpu/drm/ttm/ttm_device.c|   7 +-
 drivers/gpu/drm/ttm/ttm_pool.c  | 178 
 drivers/gpu/drm/ttm/ttm_tt.c|  25 +---
 drivers/gpu/drm/vmwgfx/vmwgfx_drv.c |   4 +-
 include/drm/drm_cache.h |   1 -
 include/drm/ttm/ttm_device.h|   3 +-
 include/drm/ttm/ttm_pool.h  |   9 +-
 20 files changed, 41 insertions(+), 246 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index dc3a69296321b3..5f40527eeef1ff 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -819,7 +819,6 @@ struct amdgpu_device {
int usec_timeout;
const struct amdgpu_asic_funcs  *asic_funcs;
boolshutdown;
-   boolneed_swiotlb;
boolaccel_working;
struct notifier_block   acpi_nb;
struct amdgpu_i2c_chan  *i2c_bus[AMDGPU_MAX_I2C_BUS];
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 3bef0432cac2f7..9bf17b44cba6fe 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1705,9 +1705,7 @@ int amdgpu_ttm_init(struct amdgpu_device *adev)
/* No others user of address space so set it to 0 */
r = ttm_device_init(>mman.bdev, _bo_driver, adev->dev,
   adev_to_drm(adev)->anon_inode->i_mapping,
-  adev_to_drm(adev)->vma_offset_manager,
-  adev->need_swiotlb,
-  dma_addressing_limited(adev->dev));
+  adev_to_drm(adev)->vma_offset_manager);
if (r) {
DRM_ERROR("failed initializing buffer object driver(%d).\n", r);
return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
index 405d6ad09022ca..2d4fa754513033 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c
@@ -846,7 +846,6 @@ static int gmc_v6_0_sw_init(void *handle)
dev_warn(adev->dev, "No suitable DMA available.\n");
return r;
}
-   adev->need_swiotlb = drm_need_swiotlb(44);
 
r = gmc_v6_0_init_microcode(adev);
if (r) {
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
index 210ada2289ec9c..a504db24f4c2a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c
@@ -1025,7 +1025,6 @@ static int gmc_v7_0_sw_init(void *handle)
pr_warn("No suitable DMA available\n");
return r;
}
-   adev->need_swiotlb = drm_need_swiotlb(40);
 
r = gmc_v7_0_init_microcode(adev);
if (r) {
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
index e4f27b3f28fb58..42e7b1eb84b3bc 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c
@@ -1141,7 +1141,6 @@ static int gmc_v8_0_sw_init(void *handle)
pr_warn("No suitable DMA available\n");
return r;
}
-   adev->need_swiotlb = drm_need_swiotlb(40);
 
r = gmc_v8_0_init_microcode(adev);
if (r) {
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 455bb91060d0bc..f74784b3423740 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -1548,7 +1548,6 @@ static int gmc_v9_0_sw_init(void *handle)
printk(KERN_WARNING "amdgpu: No suitable DMA available.\n");
ret

Re: [Nouveau] [PATCH v5 01/16] swiotlb: Fix the type of index

2021-04-23 Thread Christoph Hellwig
On Thu, Apr 22, 2021 at 04:14:53PM +0800, Claire Chang wrote:
> Fix the type of index from unsigned int to int since find_slots() might
> return -1.
> 
> Fixes: 0774983bc923 ("swiotlb: refactor swiotlb_tbl_map_single")
> Signed-off-by: Claire Chang 

Looks good:

Reviewed-by: Christoph Hellwig 

It really should go into 5.12.  I'm not sure if Konrad is going to
be able to queue this up due to his vacation, so I'm tempted to just
queue it up in the dma-mapping tree.


Re: [Nouveau] [PATCH v6 8/8] nouveau/svm: Implement atomic SVM access

2021-03-15 Thread Christoph Hellwig
> - /*XXX: atomic? */
> - return (fa->access == 0 || fa->access == 3) -
> -(fb->access == 0 || fb->access == 3);
> + /* Atomic access (2) has highest priority */
> + return (-1*(fa->access == 2) + (fa->access == 0 || fa->access == 3)) -
> +(-1*(fb->access == 2) + (fb->access == 0 || fb->access == 3));

This looks really unreadable.  If the magic values 0, 2 and 3 had names
it might become a little more understandable.  Then factor the duplicated
calculation of the priority value into a helper and we'll have code that
mere humans can understand.
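
Something like this maybe (names invented, only the 0/2/3 grouping is
taken from the existing code):

enum {
        FAULT_ACCESS_READ       = 0,
        FAULT_ACCESS_ATOMIC     = 2,
        FAULT_ACCESS_PREFETCH   = 3,
};

/* lowest value == highest priority, so atomics sort first */
static int nouveau_svm_fault_priority(u8 access)
{
        switch (access) {
        case FAULT_ACCESS_ATOMIC:
                return 0;
        case FAULT_ACCESS_READ:
        case FAULT_ACCESS_PREFETCH:
                return 2;
        default:
                return 1;
        }
}

and then the comparison simply becomes:

        return nouveau_svm_fault_priority(fa->access) -
               nouveau_svm_fault_priority(fb->access);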

> + mutex_lock(>mutex);
> + if (mmu_interval_read_retry(>notifier,
> + notifier_seq)) {
> + mutex_unlock(>mutex);
> + continue;
> + }
> + break;
> + }

This looks good, why not:

mutex_lock(>mutex);
if (!mmu_interval_read_retry(>notifier,
 notifier_seq))
break;
mutex_unlock(>mutex);
}


Re: [Nouveau] [PATCH v6 5/8] mm: Device exclusive memory access

2021-03-15 Thread Christoph Hellwig
> +Not all devices support atomic access to system memory. To support atomic
> +operations to a shared virtual memory page such a device needs access to that
> +page which is exclusive of any userspace access from the CPU. The
> +``make_device_exclusive_range()`` function can be used to make a memory range
> +inaccessible from userspace.

s/Not all devices/Some devices/ ?

>  static inline int mm_has_notifiers(struct mm_struct *mm)
> @@ -528,7 +534,17 @@ static inline void mmu_notifier_range_init_migrate(
>  {
>   mmu_notifier_range_init(range, MMU_NOTIFY_MIGRATE, flags, vma, mm,
>   start, end);
> - range->migrate_pgmap_owner = pgmap;
> + range->owner = pgmap;
> +}
> +
> +static inline void mmu_notifier_range_init_exclusive(
> + struct mmu_notifier_range *range, unsigned int flags,
> + struct vm_area_struct *vma, struct mm_struct *mm,
> + unsigned long start, unsigned long end, void *owner)
> +{
> + mmu_notifier_range_init(range, MMU_NOTIFY_EXCLUSIVE, flags, vma, mm,
> + start, end);
> + range->owner = owner;

Maybe just replace mmu_notifier_range_init_migrate with a
mmu_notifier_range_init_owner helper that takes the owner but does
not hard code a type?
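
I.e. something like (sketch only):

static inline void mmu_notifier_range_init_owner(
                        struct mmu_notifier_range *range,
                        enum mmu_notifier_event event, unsigned int flags,
                        struct vm_area_struct *vma, struct mm_struct *mm,
                        unsigned long start, unsigned long end, void *owner)
{
        mmu_notifier_range_init(range, event, flags, vma, mm, start, end);
        range->owner = owner;
}

and the migrate and exclusive callers pass MMU_NOTIFY_MIGRATE or
MMU_NOTIFY_EXCLUSIVE plus their owner directly.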

>   }
> + } else if (is_device_exclusive_entry(entry)) {
> + page = pfn_swap_entry_to_page(entry);
> +
> + get_page(page);
> + rss[mm_counter(page)]++;
> +
> + if (is_writable_device_exclusive_entry(entry) &&
> + is_cow_mapping(vm_flags)) {
> + /*
> +  * COW mappings require pages in both
> +  * parent and child to be set to read.
> +  */
> + entry = make_readable_device_exclusive_entry(
> + swp_offset(entry));
> + pte = swp_entry_to_pte(entry);
> + if (pte_swp_soft_dirty(*src_pte))
> + pte = pte_swp_mksoft_dirty(pte);
> + if (pte_swp_uffd_wp(*src_pte))
> + pte = pte_swp_mkuffd_wp(pte);
> + set_pte_at(src_mm, addr, src_pte, pte);
> + }

Just cosmetic, but I wonder if we should factor this code block into
a little helper.
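
Something like this perhaps (helper name made up, only factoring out the
COW downgrade from the block quoted above):

static void copy_device_exclusive_entry(struct mm_struct *src_mm,
                pte_t *src_pte, unsigned long addr, swp_entry_t entry,
                unsigned long vm_flags)
{
        pte_t pte;

        if (!is_writable_device_exclusive_entry(entry) ||
            !is_cow_mapping(vm_flags))
                return;

        /* COW mappings require pages in both parent and child to be read */
        entry = make_readable_device_exclusive_entry(swp_offset(entry));
        pte = swp_entry_to_pte(entry);
        if (pte_swp_soft_dirty(*src_pte))
                pte = pte_swp_mksoft_dirty(pte);
        if (pte_swp_uffd_wp(*src_pte))
                pte = pte_swp_mkuffd_wp(pte);
        set_pte_at(src_mm, addr, src_pte, pte);
}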

> +
> +static bool try_to_protect_one(struct page *page, struct vm_area_struct *vma,
> + unsigned long address, void *arg)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + struct page_vma_mapped_walk pvmw = {
> + .page = page,
> + .vma = vma,
> + .address = address,
> + };
> + struct ttp_args *ttp = (struct ttp_args *) arg;

This cast should not be needed.

> + return ttp.valid && (!page_mapcount(page) ? true : false);

This can be simplified to:

return ttp.valid && !page_mapcount(page);

> + npages = get_user_pages_remote(mm, start, npages,
> +FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
> +pages, NULL, NULL);
> + for (i = 0; i < npages; i++, start += PAGE_SIZE) {
> + if (!trylock_page(pages[i])) {
> + put_page(pages[i]);
> + pages[i] = NULL;
> + continue;
> + }
> +
> + if (!try_to_protect(pages[i], mm, start, arg)) {
> + unlock_page(pages[i]);
> + put_page(pages[i]);
> + pages[i] = NULL;
> + }

Should the trylock_page go into try_to_protect to simplify the loop
a little?  Also I wonder if we need make_device_exclusive_range or
should just open code the get_user_pages_remote + try_to_protect
loop in the callers, as that might allow them to also deduce other
information about the found pages.

Otherwise looks good:

Reviewed-by: Christoph Hellwig 


Re: [Nouveau] [PATCH v6 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-03-15 Thread Christoph Hellwig
On Fri, Mar 12, 2021 at 07:38:46PM +1100, Alistair Popple wrote:
> The behaviour of try_to_unmap_one() is difficult to follow because it
> performs different operations based on a fairly large set of flags used
> in different combinations.
> 
> TTU_MUNLOCK is one such flag. However it is exclusively used by
> try_to_munlock() which specifies no other flags. Therefore rather than
> overload try_to_unmap_one() with unrelated behaviour split this out into
> it's own function and remove the flag.
> 
> Signed-off-by: Alistair Popple 
> Reviewed-by: Ralph Campbell 
> 
> ---
> 
> Christoph - I didn't add your Reviewed-by from v3 because removal of the
> extra VM_LOCKED check in v4 changed things slightly. Let me know if
> you're still ok for me to add it. Thanks.

Still looks good to me:

Reviewed-by: Christoph Hellwig 


Re: [Nouveau] [PATCH v6 1/8] mm: Remove special swap entry functions

2021-03-15 Thread Christoph Hellwig
On Fri, Mar 12, 2021 at 07:38:44PM +1100, Alistair Popple wrote:
> Remove the migration and device private entry_to_page() and
> entry_to_pfn() inline functions and instead open code them directly.
> This results in shorter code which is easier to understand.

I think this commit log should mention pfn_swap_entry_to_page() now.

Otherwise looks good:

Reviewed-by: Christoph Hellwig 


Re: [Nouveau] [PATCH v3 4/8] mm/rmap: Split migration into its own function

2021-02-26 Thread Christoph Hellwig
Nice cleanup!

Reviewed-by: Christoph Hellwig 


Re: [Nouveau] [PATCH v3 3/8] mm/rmap: Split try_to_munlock from try_to_unmap

2021-02-26 Thread Christoph Hellwig
> + while (page_vma_mapped_walk()) {
> + /*
> +  * If the page is mlock()d, we cannot swap it out.
> +  * If it's recently referenced (perhaps page_referenced
> +  * skipped over this mm) then we should reactivate it.
> +  */
> + if (vma->vm_flags & VM_LOCKED) {
> + /* PTE-mapped THP are never mlocked */
> + if (!PageTransCompound(page)) {
> + /*
> +  * Holding pte lock, we do *not* need
> +  * mmap_lock here
> +  */
> + mlock_vma_page(page);
> + }
> + ret = false;
> + page_vma_mapped_walk_done();
> + break;

Just return false here directly and remove the ret variable?

Very nice cleanup!

Reviewed-by: Christoph Hellwig 


Re: [Nouveau] [PATCH v3 2/8] mm/swapops: Rework swap entry manipulation code

2021-02-26 Thread Christoph Hellwig
On Fri, Feb 26, 2021 at 06:18:26PM +1100, Alistair Popple wrote:
> Both migration and device private pages use special swap entries which
> are manipluated by a range of inline functions. The arguments to these
> are somewhat inconsitent so rework them to remove flag type arguments
> and to make the arguments similar for both a read and write entry
> creation.
> 
> Signed-off-by: Alistair Popple 

Looks good,

Reviewed-by: Christoph Hellwig 


Re: [Nouveau] [PATCH v3 1/8] mm: Remove special swap entry functions

2021-02-26 Thread Christoph Hellwig
> - struct page *page = migration_entry_to_page(entry);
> + struct page *page = pfn_to_page(swp_offset(entry));

I wonder if keeping a single special_entry_to_page() helper would still
be useful.  But I'm not entirely sure.  There are also two more open
coded copies of this in the THP migration code.

> -#define free_swap_and_cache(e) ({(is_migration_entry(e) || 
> is_device_private_entry(e));})
> -#define swapcache_prepare(e) ({(is_migration_entry(e) || 
> is_device_private_entry(e));})
> +#define free_swap_and_cache(e) is_special_entry(e)
> +#define swapcache_prepare(e) is_special_entry(e)

Staring at this I'm really, really confused at what this is doing.

Looking a little closer these are the !CONFIG_SWAP stubs, but it could
probably use a comment or two.

>   } else if (is_migration_entry(entry)) {
> - page = migration_entry_to_page(entry);
> + page = pfn_to_page(swp_offset(entry));
>  
>   rss[mm_counter(page)]++;
>  
> @@ -737,7 +737,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct 
> mm_struct *src_mm,
>   set_pte_at(src_mm, addr, src_pte, pte);
>   }
>   } else if (is_device_private_entry(entry)) {
> - page = device_private_entry_to_page(entry);
> + page = pfn_to_page(swp_offset(entry));
>  
>   /*
>* Update rss count even for unaddressable pages, as
> @@ -1274,7 +1274,7 @@ static unsigned long zap_pte_range(struct mmu_gather 
> *tlb,
>  
>   entry = pte_to_swp_entry(ptent);
>   if (is_device_private_entry(entry)) {
> - struct page *page = device_private_entry_to_page(entry);
> + struct page *page = pfn_to_page(swp_offset(entry));
>  
>   if (unlikely(details && details->check_mapping)) {
>   /*
> @@ -1303,7 +1303,7 @@ static unsigned long zap_pte_range(struct mmu_gather 
> *tlb,
>   else if (is_migration_entry(entry)) {
>   struct page *page;
>  
> - page = migration_entry_to_page(entry);
> + page = pfn_to_page(swp_offset(entry));
>   rss[mm_counter(page)]--;
>   }
>   if (unlikely(!free_swap_and_cache(entry)))
> @@ -3271,7 +3271,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   migration_entry_wait(vma->vm_mm, vmf->pmd,
>vmf->address);
>   } else if (is_device_private_entry(entry)) {
> - vmf->page = device_private_entry_to_page(entry);
> + vmf->page = pfn_to_page(swp_offset(entry));
>   ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
>   } else if (is_hwpoison_entry(entry)) {
>   ret = VM_FAULT_HWPOISON;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 20ca887ea769..72adcc3d8f5b 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -321,7 +321,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t 
> *ptep,
>   if (!is_migration_entry(entry))
>   goto out;
>  
> - page = migration_entry_to_page(entry);
> + page = pfn_to_page(swp_offset(entry));
>  
>   /*
>* Once page cache replacement of page migration started, page_count
> @@ -361,7 +361,7 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t 
> *pmd)
>   ptl = pmd_lock(mm, pmd);
>   if (!is_pmd_migration_entry(*pmd))
>   goto unlock;
> - page = migration_entry_to_page(pmd_to_swp_entry(*pmd));
> + page = pfn_to_page(swp_offset(pmd_to_swp_entry(*pmd)));
>   if (!get_page_unless_zero(page))
>   goto unlock;
>   spin_unlock(ptl);
> @@ -2437,7 +2437,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   if (!is_device_private_entry(entry))
>   goto next;
>  
> - page = device_private_entry_to_page(entry);
> + page = pfn_to_page(swp_offset(entry));
>   if (!(migrate->flags &
>   MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
>   page->pgmap->owner != migrate->pgmap_owner)
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 86e3a3688d59..34230d08556a 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -96,7 +96,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>   if (!is_migration_entry(entry))
>   return false;
>  
> - pfn = migration_entry_to_pfn(entry);
> + pfn = swp_offset(entry);
>   } else if (is_swap_pte(*pvmw->pte)) {
>   swp_entry_t entry;
>  
> @@ -105,7 +105,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>   if (!is_device_private_entry(entry))
>   return false;
>  
> - 

Re: [Nouveau] [PATCH v2 1/4] hmm: Device exclusive memory access

2021-02-19 Thread Christoph Hellwig
>   page = migration_entry_to_page(swpent);
>   else if (is_device_private_entry(swpent))
>   page = device_private_entry_to_page(swpent);
> + else if (is_device_exclusive_entry(swpent))
> + page = device_exclusive_entry_to_page(swpent);

>   page = migration_entry_to_page(swpent);
>   else if (is_device_private_entry(swpent))
>   page = device_private_entry_to_page(swpent);
> + else if (is_device_exclusive_entry(swpent))
> + page = device_exclusive_entry_to_page(swpent);

>   if (is_device_private_entry(entry))
>   page = device_private_entry_to_page(entry);
> +
> + if (is_device_exclusive_entry(entry))
> + page = device_exclusive_entry_to_page(entry);

Any chance we can come up with a clever scheme to avoid all this
boilerplate code (and maybe also what it gets compiled to)?

> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 866a0fa104c4..5d28ff6d4d80 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -109,6 +109,10 @@ struct hmm_range {
>   */
>  int hmm_range_fault(struct hmm_range *range);
>  
> +int hmm_exclusive_range(struct mm_struct *mm, unsigned long start,
> + unsigned long end, struct page **pages);
> +vm_fault_t hmm_remove_exclusive_entry(struct vm_fault *vmf);

Can we avoid the hmm naming for new code (we should probably also kill
it off for the existing code)?

> +#define free_swap_and_cache(e) ({(is_migration_entry(e) || 
> is_device_private_entry(e) \
> + || is_device_exclusive_entry(e)); })
> +#define swapcache_prepare(e) ({(is_migration_entry(e) || 
> is_device_private_entry(e) \
> + || is_device_exclusive_entry(e)); })

Can you turn these into properly formatted inline functions?  As-is this
becomes pretty unreadable.
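
E.g. (sketch, same logic as the macros):

/* !CONFIG_SWAP stubs: just report special (non-swap) entries as handled */
static inline int free_swap_and_cache(swp_entry_t entry)
{
        return is_migration_entry(entry) || is_device_private_entry(entry) ||
               is_device_exclusive_entry(entry);
}

static inline int swapcache_prepare(swp_entry_t entry)
{
        return is_migration_entry(entry) || is_device_private_entry(entry) ||
               is_device_exclusive_entry(entry);
}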

> +static inline void make_device_exclusive_entry_read(swp_entry_t *entry)
> +{
> + *entry = swp_entry(SWP_DEVICE_EXCLUSIVE_READ, swp_offset(*entry));
> +}

s/make_device_exclusive_entry_read/mark_device_exclusive_entry_readable/
??

> +
> +static inline swp_entry_t make_device_exclusive_entry(struct page *page, 
> bool write)
> +{
> + return swp_entry(write ? SWP_DEVICE_EXCLUSIVE_WRITE : 
> SWP_DEVICE_EXCLUSIVE_READ,
> +  page_to_pfn(page));
> +}

I'd split this into two helpers, which is easier to follow and avoids
the pointlessly overlong lines.
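
E.g. (sketch):

static inline swp_entry_t make_readable_device_exclusive_entry(struct page *page)
{
        return swp_entry(SWP_DEVICE_EXCLUSIVE_READ, page_to_pfn(page));
}

static inline swp_entry_t make_writable_device_exclusive_entry(struct page *page)
{
        return swp_entry(SWP_DEVICE_EXCLUSIVE_WRITE, page_to_pfn(page));
}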

> +static inline bool is_device_exclusive_entry(swp_entry_t entry)
> +{
> + int type = swp_type(entry);
> + return type == SWP_DEVICE_EXCLUSIVE_READ || type == 
> SWP_DEVICE_EXCLUSIVE_WRITE;
> +}

Another overly long line.  I also wouldn't bother with the local
variable:

return swp_type(entry) == SWP_DEVICE_EXCLUSIVE_READ ||
swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE;


> +static inline bool is_write_device_exclusive_entry(swp_entry_t entry)
> +{
> + return swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE;
> +}

Or reuse these kind of helpers..

> +
> +static inline unsigned long device_exclusive_entry_to_pfn(swp_entry_t entry)
> +{
> + return swp_offset(entry);
> +}
> +
> +static inline struct page *device_exclusive_entry_to_page(swp_entry_t entry)
> +{
> + return pfn_to_page(swp_offset(entry));
> +}

I'd rather open code these two, and as a prep patch also kill off the
equivalents for the migration and device private entries, which would
actually clean up a lot of the mess mentioned in my first comment above.

> +static int hmm_exclusive_skip(unsigned long start,
> +   unsigned long end,
> +   __always_unused int depth,
> +   struct mm_walk *walk)
> +{
> + struct hmm_exclusive_walk *hmm_exclusive_walk = walk->private;
> + unsigned long addr;
> +
> + for (addr = start; addr < end; addr += PAGE_SIZE)
> + hmm_exclusive_walk->pages[hmm_exclusive_walk->npages++] = NULL;
> +
> + return 0;
> +}

Wouldn't pre-zeroing the array be simpler and more efficient?
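
I.e. zero the result array up front in hmm_exclusive_range() and drop the
skip callback entirely, e.g. (untested):

	memset(pages, 0, ((end - start) >> PAGE_SHIFT) * sizeof(*pages));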

> +int hmm_exclusive_range(struct mm_struct *mm, unsigned long start,
> + unsigned long end, struct page **pages)
> +{
> + struct hmm_exclusive_walk hmm_exclusive_walk = { .pages = pages, 
> .npages = 0 };
> + int i;
> +
> + /* Collect and lock candidate pages */
> + walk_page_range(mm, start, end, &hmm_exclusive_walk_ops,
> &hmm_exclusive_walk);

Please avoid the overly long lines.

But more importantly:  Unless I'm missing something obvious, this
walk_page_range call just open codes get_user_pages_fast, so why can't you
use that?
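
I.e. something like (untested; FOLL_WRITE assumed since the point is
exclusive access for atomic writes, and this only works if the range is
always in the current process' address space):

	int npages = (end - start) >> PAGE_SHIFT;
	int ret;

	ret = get_user_pages_fast(start, npages, FOLL_WRITE, pages);
	if (ret <= 0)
		return ret;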

> +#if defined(CONFIG_ARCH_ENABLE_THP_MIGRATION) || defined(CONFIG_HUGETLB)
> + if (PageTransHuge(page)) {
> + VM_BUG_ON_PAGE(1, page);
> + continue;
> + 

Re: [Nouveau] [PATCH 0/9] Add support for SVM atomics in Nouveau

2021-02-10 Thread Christoph Hellwig
On Wed, Feb 10, 2021 at 01:59:13PM -0400, Jason Gunthorpe wrote:
> Really what you want to do here is leave the CPU page in the VMA and
> the page tables where it started and deny CPU access to the page. Then
> all the proper machinery will continue to work.
> 
> IMHO "migration" is the wrong idea if the data isn't actually moving.

Agreed.


Re: [Nouveau] [PATCH RFC v1 5/6] xen-swiotlb: convert variables to arrays

2021-02-07 Thread Christoph Hellwig
On Thu, Feb 04, 2021 at 09:40:23AM +0100, Christoph Hellwig wrote:
> So one thing that has been on my mind for a while:  I'd really like
> to kill the separate dma ops in Xen swiotlb.  If we compare xen-swiotlb
> to swiotlb, the main differences seem to be:
> 
>  - additional reasons to bounce I/O vs the plain DMA capable
>  - the possibility to do a hypercall on arm/arm64
>  - an extra translation layer before doing the phys_to_dma and vice
>versa
>  - a special memory allocator
> 
> I wonder if, between a few jump labels or other no-overhead enablement
> options and possibly better use of the dma_range_map, we could kill
> off most of swiotlb-xen instead of maintaining all this code duplication?

So I looked at this a bit more.

For x86 with XENFEAT_auto_translated_physmap (how common is that?)
pfn_to_gfn is a nop, so plain phys_to_dma/dma_to_phys do work as-is.

xen_arch_need_swiotlb always returns true for x86, and
range_straddles_page_boundary should never be true for the
XENFEAT_auto_translated_physmap case.

So as far as I can tell the mapping fast path for the
XENFEAT_auto_translated_physmap can be trivially reused from swiotlb.

That leaves us with the next more complicated case, x86 or fully cache
coherent arm{,64} without XENFEAT_auto_translated_physmap.  In that case
we need to patch in a phys_to_dma/dma_to_phys that performs the MFN
lookup, which could be done using alternatives or jump labels.
I think if that is done right we should also be able to let that cover
the foreign pages in is_xen_swiotlb_buffer/is_swiotlb_buffer, but
in the worst case that would need another alternative / jump label.

For non-coherent arm{,64} we'd also need to use alternatives or jump
labels for the cache maintenance ops, but that isn't a hard problem
either.
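
Roughly this kind of thing, just to illustrate the idea (completely
untested, the key and wrapper names are made up, and the key would only
get enabled at boot for the non-auto-translated Xen PV case):

DEFINE_STATIC_KEY_FALSE(xen_pv_dma_translate);

static inline dma_addr_t swiotlb_phys_to_dma(struct device *dev,
		phys_addr_t paddr)
{
	if (static_branch_unlikely(&xen_pv_dma_translate)) {
		/* MFN lookup, like xen_phys_to_bus() does today */
		paddr = ((phys_addr_t)pfn_to_bfn(XEN_PFN_DOWN(paddr))
				<< XEN_PAGE_SHIFT) |
			(paddr & ~XEN_PAGE_MASK);
	}
	return phys_to_dma(dev, paddr);
}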




Re: [Nouveau] [PATCH RFC v1 2/6] swiotlb: convert variables to arrays

2021-02-04 Thread Christoph Hellwig
On Wed, Feb 03, 2021 at 03:37:05PM -0800, Dongli Zhang wrote:
> This patch converts several swiotlb related variables to arrays, in
> order to maintain stat/status for different swiotlb buffers. Here are
> variables involved:
> 
> - io_tlb_start and io_tlb_end
> - io_tlb_nslabs and io_tlb_used
> - io_tlb_list
> - io_tlb_index
> - max_segment
> - io_tlb_orig_addr
> - no_iotlb_memory
> 
> There is no functional change and this is to prepare to enable 64-bit
> swiotlb.

Claire Chang (on Cc) already posted a patch like this a month ago,
which looks much better because it actually uses a struct instead
of all the random variables. 
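
I.e. roughly this shape (a sketch of the idea only, not her actual patch;
field names assumed):

struct io_tlb_mem {
	phys_addr_t	start;
	phys_addr_t	end;
	unsigned long	nslabs;
	unsigned long	used;
	unsigned int	index;
	unsigned int	*list;
	phys_addr_t	*orig_addr;
	unsigned int	max_segment;
	bool		no_iotlb_memory;
	spinlock_t	lock;
};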


Re: [Nouveau] [PATCH RFC v1 5/6] xen-swiotlb: convert variables to arrays

2021-02-04 Thread Christoph Hellwig
So one thing that has been on my mind for a while:  I'd really like
to kill the separate dma ops in Xen swiotlb.  If we compare xen-swiotlb
to swiotlb, the main differences seem to be:

 - additional reasons to bounce I/O vs the plain DMA capable
 - the possibility to do a hypercall on arm/arm64
 - an extra translation layer before doing the phys_to_dma and vice
   versa
 - a special memory allocator

I wonder if, between a few jump labels or other no-overhead enablement
options and possibly better use of the dma_range_map, we could kill
off most of swiotlb-xen instead of maintaining all this code duplication?


Re: [Nouveau] [PATCH v3 3/6] mm: support THP migration to device private memory

2020-12-03 Thread Christoph Hellwig
[adding a few of the usual suspects]

On Wed, Nov 11, 2020 at 03:38:42PM -0800, Ralph Campbell wrote:
> There are 4 types of ZONE_DEVICE struct pages:
> MEMORY_DEVICE_PRIVATE, MEMORY_DEVICE_FS_DAX, MEMORY_DEVICE_GENERIC, and
> MEMORY_DEVICE_PCI_P2PDMA.
>
> Currently, memremap_pages() allocates struct pages for a physical address 
> range
> with a page_ref_count(page) of one and increments the pgmap->ref per CPU
> reference count by the number of pages created since each ZONE_DEVICE struct
> page has a pointer to the pgmap.
>
> The struct pages are not freed until memunmap_pages() is called which
> calls put_page() which calls put_dev_pagemap() which releases a reference to
> pgmap->ref. memunmap_pages() blocks waiting for pgmap->ref reference count
> to be zero. As far as I can tell, the put_page() in memunmap_pages() has to
> be the *last* put_page() (see MEMORY_DEVICE_PCI_P2PDMA).
> My RFC [1] breaks this put_page() -> put_dev_pagemap() connection so that
> the struct page reference count can go to zero and back to non-zero without
> changing the pgmap->ref reference count.
>
> Q1: Is that safe? Is there some code that depends on put_page() dropping
> the pgmap->ref reference count as part of memunmap_pages()?
> My testing of [1] seems OK but I'm sure there are lots of cases I didn't test.

It should be safe, but the audit you've done is important to make sure
we do not miss anything important.

> MEMORY_DEVICE_PCI_P2PDMA:
> Struct pages are created in pci_p2pdma_add_resource() and represent device
> memory accessible by PCIe bar address space. Memory is allocated with
> pci_alloc_p2pmem() based on a byte length but the gen_pool_alloc_owner()
> call will allocate memory in a minimum of PAGE_SIZE units.
> Reference counting is +1 per *allocation* on the pgmap->ref reference count.
> Note that this is not +1 per page which is what put_page() expects. So
> currently, a get_page()/put_page() works OK because the page reference count
> only goes 1->2 and 2->1. If it went to zero, the pgmap->ref reference count
> would be incorrect if the allocation size was greater than one page.
>
> I see pci_alloc_p2pmem() is called by nvme_alloc_sq_cmds() and
> pci_p2pmem_alloc_sgl() to create a command queue and a struct scatterlist *.
> Looks like sg_page(sg) returns the ZONE_DEVICE struct page of the scatterlist.
> There are a huge number of places sg_page() is called so it is hard to tell
> whether or not get_page()/put_page() is ever called on 
> MEMORY_DEVICE_PCI_P2PDMA
> pages.

Nothing should call get_page/put_page on them, as they are not treated
as refcountable memory.  More importantly nothing is allowed to keep
a reference longer than the time of the I/O.

> pci_p2pmem_virt_to_bus() will return the physical address and I guess
> pfn_to_page(physaddr >> PAGE_SHIFT) could return the struct page.
>
> Since there is a clear allocation/free, pci_alloc_p2pmem() can probably be
> modified to increment/decrement the MEMORY_DEVICE_PCI_P2PDMA struct page
> reference count. Or maybe just leave it at one like it is now.

And yes, doing that is probably a sensible safe guard.

> MEMORY_DEVICE_FS_DAX:
> Struct pages are created in pmem_attach_disk() and virtio_fs_setup_dax() with
> an initial reference count of one.
> The problem I see is that there are 3 states that are important:
> a) memory is free and not allocated to any file (page_ref_count() == 0).
> b) memory is allocated to a file and in the page cache (page_ref_count() == 
> 1).
> c) some gup() or I/O has a reference even after calling unmap_mapping_pages()
>(page_ref_count() > 1). ext4_break_layouts() basically waits until the
> page_ref_count() == 1 with put_page() calling wake_up_var(&page->_refcount)
>to wake up ext4_break_layouts().
> The current code doesn't seem to distinguish (a) and (b). If we want to use
> the 0->1 reference count to signal (c), then the page cache would have to hold
> entries with a page_ref_count() == 0 which doesn't match the general page 
> cache

I think the sensible model here is to grab a reference when it is
added to the page cache.  That is exactly how normal system memory pages
work.


Re: [Nouveau] [PATCH v3 3/6] mm: support THP migration to device private memory

2020-12-02 Thread Christoph Hellwig
On Fri, Nov 20, 2020 at 04:01:33PM -0400, Jason Gunthorpe wrote:
> On Wed, Nov 11, 2020 at 03:38:42PM -0800, Ralph Campbell wrote:
> 
> > MEMORY_DEVICE_GENERIC:
> > Struct pages are created in dev_dax_probe() and represent non-volatile 
> > memory.
> > The device can be mmap()'ed which calls dax_mmap() which sets
> > vma->vm_flags | VM_HUGEPAGE.
> > A CPU page fault will result in a PTE, PMD, or PUD sized page
> > (but not compound) to be inserted by vmf_insert_mixed() which will call 
> > either
> > insert_pfn() or insert_page().
> > Neither insert_pfn() nor insert_page() increments the page reference
> > count.
> 
> But why was this done? It seems very strange to put a pfn with a
> struct page into a VMA and then deliberately not take the refcount for
> the duration of that pfn being in the VMA?
> 
> What prevents memunmap_pages() from progressing while VMAs still point
> at the memory?

Agreed.  Adding Roger, who added MEMORY_DEVICE_GENERIC and its only
user.

> > I think just leaving the page reference count at one is better than trying
> > to use the mmu_interval_notifier or changing vmf_insert_mixed() and
> > invalidations of pfn_t_devmap(pfn) to adjust the page reference count.
> 
> Why so? The entire point of getting struct page's for this stuff was
> to be able to follow the struct page flow. I never did learn a reason
> why there is devmap stuff all over the place in the page table code...

Exactly.


Re: [Nouveau] [PATCH v3 3/6] mm: support THP migration to device private memory

2020-11-09 Thread Christoph Hellwig
On Fri, Nov 06, 2020 at 01:26:50PM -0800, Ralph Campbell wrote:
>
> On 11/6/20 12:03 AM, Christoph Hellwig wrote:
>> I hate the extra pin count magic here.  IMHO we really need to finish
>> off the series to get rid of the extra references on the ZONE_DEVICE
>> pages first.
>
> First, thanks for the review comments.
>
> I don't like the extra refcount either, that is why I tried to fix that up
> before resending this series. However, you didn't like me just fixing the
> refcount only for device private pages and I don't know the dax/pmem code
> and peer-to-peer PCIe uses of ZONE_DEVICE pages well enough to say how
> long it will take me to fix all the use cases.
> So I wanted to make progress on the THP migration code in the mean time.

I think P2P is pretty trivial, given that ZONE_DEVICE pages are used like
a normal memory allocator.  DAX is the interesting case; is there any specific
help that you need with that?


Re: [Nouveau] [PATCH v3 3/6] mm: support THP migration to device private memory

2020-11-06 Thread Christoph Hellwig
I hate the extra pin count magic here.  IMHO we really need to finish
off the series to get rid of the extra references on the ZONE_DEVICE
pages first.


Re: [Nouveau] [PATCH v3 4/6] mm/thp: add THP allocation helper

2020-11-06 Thread Christoph Hellwig
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +extern struct page *alloc_transhugepage(struct vm_area_struct *vma,
> + unsigned long addr);

No need for the extern.  And also here: do we actually need the stub,
or can the caller make sure (using IS_ENABLED and similar) that the
compiler knows the code is dead?
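
I.e. drop the stub, declare the function unconditionally (the dead call
gets constant-folded away before it ever reaches the linker), and let the
caller do something like (untested, the want_thp condition is just a
stand-in):

struct page *alloc_transhugepage(struct vm_area_struct *vma,
		unsigned long addr);

	...

	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && want_thp)
		page = alloc_transhugepage(vma, addr);
	else
		page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);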

> +struct page *alloc_transhugepage(struct vm_area_struct *vma,
> +  unsigned long haddr)
> +{
> + gfp_t gfp;
> + struct page *page;
> +
> + gfp = alloc_hugepage_direct_gfpmask(vma);
> + page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
> + if (page)
> + prep_transhuge_page(page);
> + return page;

I think do_huge_pmd_anonymous_page should be switched to use this
helper as well.
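
I.e. do_huge_pmd_anonymous_page() could then start with something like
(untested sketch, ignoring the gfp mask that the current code also passes
on to the memcg charge):

	page = alloc_transhugepage(vma, haddr);
	if (unlikely(!page)) {
		count_vm_event(THP_FAULT_FALLBACK);
		return VM_FAULT_FALLBACK;
	}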


Re: [Nouveau] [PATCH v3 2/6] mm/migrate: move migrate_vma_collect_skip()

2020-11-05 Thread Christoph Hellwig
Looks good:

Reviewed-by: Christoph Hellwig 


Re: [Nouveau] [PATCH v3 2/6] mm/migrate: move migrate_vma_collect_skip()

2020-11-05 Thread Christoph Hellwig
On Thu, Nov 05, 2020 at 04:51:43PM -0800, Ralph Campbell wrote:
> Move the definition of migrate_vma_collect_skip() to make it callable
> by migrate_vma_collect_hole(). This helps make the next patch easier
> to read.
> 
> Signed-off-by: Ralph Campbell 

Looks good,

Reviewed-by: Christoph Hellwig 

