Re: [PATCH -next 00/11] lib/interval-tree: move to half closed intervals

2019-10-04 Thread Jason Gunthorpe
On Fri, Oct 04, 2019 at 06:15:11AM -0700, Michel Lespinasse wrote:
> Hi Jason,
> 
> On Thu, Oct 3, 2019 at 5:26 PM Jason Gunthorpe  wrote:
> > Hurm, this is not entirely accurate. Most users do actually want
> > overlapping and multiple ranges. I just studied this extensively:
> 
> (Just curious, are you the person we discussed this with after the
> Maple Tree talk at LPC 2019 ?)

Possibly!
 
> I think we have two separate API problems there:
> - overlapping vs non-overlapping intervals (the interval tree API
> supports overlapping intervals, but some users are confused about
> this)

I think we just have a bunch of confused drivers, ie the two drm
drivers sure look confused to me.

> - closed vs half-open interval definitions

I'm not sure why this is a big problem..

We may actually just have bugs in handling the '-1' as it is supposed
to be written as start + (size-1) so that start + size == ULONG_MAX+1
works properly.
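
In other words, a conversion from a user supplied (start, size) to a closed
[start, last] interval should look roughly like the sketch below (an
illustration only; the helper name and the -EINVAL policy are mine, not from
any driver):

/* Sketch: overflow-safe conversion of (start, size) to an inclusive
 * [first, last] interval, so a range that ends at ULONG_MAX still works. */
static int range_to_interval(unsigned long start, unsigned long size,
			     unsigned long *first, unsigned long *last)
{
	if (size == 0 || start + (size - 1) < start)	/* empty or wrapping range */
		return -EINVAL;
	*first = start;
	*last = start + (size - 1);	/* inclusive end, may equal ULONG_MAX */
	return 0;
}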

> > hfi1/mmu_rb definitely needs overlapping as it is dealing with
> > userspace VA ranges under control of userspace. As do the other
> > infiniband users.
> 
> Do you have a handle on what usnic is doing with its intervals ?
> usnic_uiom_insert_interval() has some complicated logic to avoid
> having overlapping intervals, which is very confusing to me.

I don't know why it is so complicated, but I can say that it is
storing userspace VA's in that tree.

I have some feeling this driver is trying to use the IOMMU to create a
mirror of the userspace VA

Userspace can request the HW be able to access any set of overlapping
regions and so the driver must intersect all the ranges and compute a
list of VA pages to IOMMU map. Just guessing.

Jason


Re: [PATCH -next 00/11] lib/interval-tree: move to half closed intervals

2019-10-03 Thread Jason Gunthorpe
On Thu, Oct 03, 2019 at 01:18:47PM -0700, Davidlohr Bueso wrote:
> Hi,
> 
> It has been discussed[1,2] that almost all users of interval trees would better
> be served if the intervals were actually not [a,b], but instead [a, b). This
> series attempts to convert all callers by way of transitioning from using
> "interval_tree_generic.h" to "interval_tree_gen.h". Once all users are converted,
> we remove the former.
> 
> Patch 1: adds a call that will make patch 8 easier to review by introducing stab
>  queries for the vma interval tree.
> 
> Patch 2: adds the new interval_tree_gen.h which is the same as the old one but
>  uses [a,b) intervals.
> 
> Patch 3-9: converts, in baby steps (as much as possible), each interval tree to
>  the new [a,b) one. It is done this way also to maintain bisectability.
>  Most conversions are pretty straightforward, however, there are some
>  creative ways in which some callers use the interval 'end' when going
>  through intersecting ranges within a tree. Ie: patch 3, 6 and 9.
> 
> Patch 10: deletes the interval_tree_generic.h header; there are no longer any users.
> 
> Patch 11: finally simplifies x86 pat tree to use the new interval tree machinery.
> 
> This has been lightly tested, and certainly not on driver paths that do non
> trivial conversions. Also needs more eyeballs as conversions can be easily
> missed (even when I've tried mitigating this by renaming the endpoint from 'last'
> to 'end' in each corresponding structure).
> 
> Because this touches a lot of drivers, I'm Cc'ing the whole thing to a couple of
> relevant lists (mm, dri, rdma); sorry if you consider this spam.
> 
> Applies on top of today's linux-next tree. Please consider for v5.5.
> 
> Thanks!
> 
> [1] https://lore.kernel.org/lkml/CANN689HVDJXKEwB80yPAVwvRwnV4HfiucQVAho=dupkm_ik...@mail.gmail.com/

Hurm, this is not entirely accurate. Most users do actually want
overlapping and multiple ranges. I just studied this extensively:

radeon_mn actually wants overlapping but seems to misunderstand the
interval_tree API and actively tries hard to prevent overlapping at
great cost and complexity. I have a patch to delete all of this and
just be overlapping.

amdgpu_mn copied the wrongness from radeon_mn

All the DRM drivers are basically the same here, tracking userspace
controlled VAs, so overlapping is essential

hfi1/mmu_rb definitely needs overlapping as it is dealing with
userspace VA ranges under control of userspace. As do the other
infiniband users.

vhost probably doesn't overlap in the normal case, but again userspace
could trigger overlap in some pathological case.

The [start,last] allows the interval to cover up to ULONG_MAX. I don't
know if this is needed however. Many users are using userspace VAs
here. Is there any kernel configuration where ULONG_MAX is a valid
userspace pointer? Ie 32 bit 4G userspace? I don't know. 

Many users seemed to have bugs where they were taking a userspace
controlled start + length and converting them into a start/end for the
interval tree without overflow protection (whoops)
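
For illustration, a query against such a closed-interval tree looks roughly
like this (hypothetical caller code; handle_overlap() is a made-up helper,
not a real API):

	/* 'last' is inclusive here, which is what allows a range ending at
	 * ULONG_MAX to be represented at all. Overlapping nodes are fine;
	 * the iteration simply visits each one. */
	struct interval_tree_node *node;

	for (node = interval_tree_iter_first(&root, start, last); node;
	     node = interval_tree_iter_next(node, start, last))
		handle_overlap(node);	/* hypothetical per-range handler */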

Also I have a series already cooking to delete several of these
interval tree users, which will terribly conflict with this :\

Is it really necessary to make such churn for such a tiny API change?

Jason


Re: [1/3] drm/tinydrm/Kconfig: Remove menuconfig DRM_TINYDRM

2019-10-01 Thread Jason Gunthorpe
On Tue, Oct 01, 2019 at 03:28:46PM +0200, Noralf Trønnes wrote:
> 
> 
> Den 01.10.2019 14.36, skrev Jason Gunthorpe:
> > On Thu, Jul 25, 2019 at 12:51:30PM +0200, Noralf Trønnes wrote:
> >> This makes the tiny drivers visible by default without having to enable a
> >> knob.
> >>
> >> Signed-off-by: Noralf Trønnes 
> >> Reviewed-by: Hans de Goede  to it once
> >>  drivers/gpu/drm/Makefile|  2 +-
> >>  drivers/gpu/drm/tinydrm/Kconfig | 37 +++--
> >>  2 files changed, 22 insertions(+), 17 deletions(-)
> > 
> > Bisection says this patch (28c47e16ea2a19adb47fe2c182cbd61cb854237c)
> > breaks kconfig stuff in v5.4-rc by creating circular
> > dependencies. Could someone send a -rc patch to fix this please?
> > 
> > THINKPAD_ACPI (defined at drivers/platform/x86/Kconfig:484), with definition...
> > ...depends on FB_SSD1307 (defined at drivers/video/fbdev/Kconfig:2259), with definition...
> > ...depends on FB (defined at drivers/video/fbdev/Kconfig:12), with definition...
> > ...depends on DRM_KMS_FB_HELPER (defined at drivers/gpu/drm/Kconfig:79), with definition...
> > ...depends on DRM_KMS_HELPER (defined at drivers/gpu/drm/Kconfig:73), with definition...
> > ...depends on TINYDRM_REPAPER (defined at drivers/gpu/drm/tinydrm/Kconfig:51), with definition...
> > ...depends on THERMAL (defined at drivers/thermal/Kconfig:6), with definition...
> > ...depends on SENSORS_NPCM7XX (defined at drivers/hwmon/Kconfig:1285), with definition...
> > ...depends on HWMON (defined at drivers/hwmon/Kconfig:6), with definition...
> > ...depends on THINKPAD_ACPI (defined at drivers/platform/x86/Kconfig:484), with definition...
> > ...depends on ACPI_VIDEO (defined at drivers/acpi/Kconfig:193), with definition...
> > ...depends on ACER_WMI (defined at drivers/platform/x86/Kconfig:19), with definition...
> > ...depends on BACKLIGHT_CLASS_DEVICE (defined at drivers/video/backlight/Kconfig:144), with definition...
> > ...depends again on THINKPAD_ACPI (defined at drivers/platform/x86/Kconfig:484)
> > 
> 
> Would this commit fix this by any chance:
> 
> drm/tiny: Kconfig: Remove always-y THERMAL dep. from TINYDRM_REPAPER
> https://cgit.freedesktop.org/drm/drm-misc/commit/?id=dfef959803c728c616ad29b008cd91b3446a993a

Yes, thank you, can someone send this to -rc to unbreak 5.4?

Jason


Re: [1/3] drm/tinydrm/Kconfig: Remove menuconfig DRM_TINYDRM

2019-10-01 Thread Jason Gunthorpe
On Thu, Jul 25, 2019 at 12:51:30PM +0200, Noralf Trønnes wrote:
> This makes the tiny drivers visible by default without having to enable a
> knob.
> 
> Signed-off-by: Noralf Trønnes 
> Reviewed-by: Hans de Goede  to it once
> ---
>  drivers/gpu/drm/Makefile|  2 +-
>  drivers/gpu/drm/tinydrm/Kconfig | 37 +++--
>  2 files changed, 22 insertions(+), 17 deletions(-)

Bisection says this patch (28c47e16ea2a19adb47fe2c182cbd61cb854237c)
breaks kconfig stuff in v5.4-rc by creating circular
dependencies. Could someone send a -rc patch to fix this please?

THINKPAD_ACPI (defined at drivers/platform/x86/Kconfig:484), with definition...
...depends on FB_SSD1307 (defined at drivers/video/fbdev/Kconfig:2259), with definition...
...depends on FB (defined at drivers/video/fbdev/Kconfig:12), with definition...
...depends on DRM_KMS_FB_HELPER (defined at drivers/gpu/drm/Kconfig:79), with definition...
...depends on DRM_KMS_HELPER (defined at drivers/gpu/drm/Kconfig:73), with definition...
...depends on TINYDRM_REPAPER (defined at drivers/gpu/drm/tinydrm/Kconfig:51), with definition...
...depends on THERMAL (defined at drivers/thermal/Kconfig:6), with definition...
...depends on SENSORS_NPCM7XX (defined at drivers/hwmon/Kconfig:1285), with definition...
...depends on HWMON (defined at drivers/hwmon/Kconfig:6), with definition...
...depends on THINKPAD_ACPI (defined at drivers/platform/x86/Kconfig:484), with definition...
...depends on ACPI_VIDEO (defined at drivers/acpi/Kconfig:193), with definition...
...depends on ACER_WMI (defined at drivers/platform/x86/Kconfig:19), with definition...
...depends on BACKLIGHT_CLASS_DEVICE (defined at drivers/video/backlight/Kconfig:144), with definition...
...depends again on THINKPAD_ACPI (defined at drivers/platform/x86/Kconfig:484)

Full output:

kconfiglib.KconfigError: 
Dependency loop
===

THINKPAD_ACPI (defined at drivers/platform/x86/Kconfig:484), with definition...

config THINKPAD_ACPI
tristate "ThinkPad ACPI Laptop Extras"
select HWMON
select NVRAM
select NEW_LEDS
select LEDS_CLASS
select LEDS_TRIGGERS
select LEDS_TRIGGER_AUDIO
depends on ACPI && ACPI_BATTERY && INPUT && (RFKILL || RFKILL = n) && (ACPI_VIDEO || ACPI_VIDEO = n) && BACKLIGHT_CLASS_DEVICE && X86_PLATFORM_DEVICES && X86
help
  This is a driver for the IBM and Lenovo ThinkPad laptops. It adds
  support for Fn-Fx key combinations, Bluetooth control, video
  output switching, ThinkLight control, UltraBay eject and more.
  For more information about this driver see
   and
   .
  
  This driver was formerly known as ibm-acpi.
  
  Extra functionality will be available if the rfkill (CONFIG_RFKILL)
  and/or ALSA (CONFIG_SND) subsystems are available in the kernel.
  Note that if you want ThinkPad-ACPI to be built-in instead of
  modular, ALSA and rfkill will also have to be built-in.
  
  If you have an IBM or Lenovo ThinkPad laptop, say Y or M here.

...depends on ACPI_VIDEO (defined at drivers/acpi/Kconfig:193), with definition...

config ACPI_VIDEO
tristate "Video"
select THERMAL
depends on X86 && BACKLIGHT_CLASS_DEVICE && INPUT && ACPI
help
  This driver implements the ACPI Extensions For Display Adapters
  for integrated graphics devices on motherboard, as specified in
  ACPI 2.0 Specification, Appendix B.  This supports basic operations
  such as defining the video POST device, retrieving EDID information,
  and setting up a video output.
  
  To compile this driver as a module, choose M here:
  the module will be called video.

(select-related dependencies: (DRM_NOUVEAU && ACPI && X86 && BACKLIGHT_CLASS_DEVICE && INPUT && DRM && PCI && MMU && HAS_IOMEM) || (DRM_NOUVEAU && ACPI && X86 && DRM && PCI && MMU && HAS_IOMEM) || (DRM_I915 && ACPI && DRM && X86 && PCI && HAS_IOMEM) || (DRM_GMA500 && ACPI && DRM && PCI && X86 && MMU && HAS_IOMEM) || (ACER_WMI && ACPI && ACPI && BACKLIGHT_CLASS_DEVICE && SERIO_I8042 && INPUT && (RFKILL || RFKILL = n) && ACPI_WMI && X86_PLATFORM_DEVICES && X86))

...depends on ACER_WMI (defined at drivers/platform/x86/Kconfig:19), with definition...

config ACER_WMI
tristate "Acer WMI Laptop Extras"
select LEDS_CLASS
select NEW_LEDS
select INPUT_SPARSEKMAP
select ACPI_VIDEO if ACPI
depends on ACPI && BACKLIGHT_CLASS_DEVICE && SERIO_I8042 && INPUT && (RFKILL || RFKILL = n) && ACPI_WMI && X86_PLATFORM_DEVICES && X86
help
  This is a driver for newer Acer (and Wistron) laptops. It adds
  wireless radio and bluetooth control, and on some laptops,
  exposes the mail LED and LCD backlight.
  
  If you have an ACPI-WMI 

[GIT PULL] Please pull hmm related changes

2019-09-17 Thread Jason Gunthorpe
  mm: split out a new pagewalk.h header from mm.h
  pagewalk: separate function pointers from iterator data
  pagewalk: use lockdep_assert_held for locking validation

Dan Williams (1):
  libnvdimm: Enable unit test infrastructure compile checks

Daniel Vetter (6):
  mm/mmu_notifiers: check if mmu notifier callbacks are allowed to fail
  mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
  mm/mmu_notifiers: prime lockdep
  mm/mmu_notifiers: annotate with might_sleep()
  kernel.h: Add non_block_start/end()
  mm, notifier: Catch sleeping/blocking for !blockable

Jason Gunthorpe (27):
  mm/hmm: comment on VM_FAULT_RETRY semantics in handle_mm_fault
  mm/mmu_notifiers: hoist do_mmu_notifier_register down_write to the caller
  mm/mmu_notifiers: do not speculatively allocate a mmu_notifier_mm
  mm/mmu_notifiers: add a get/put scheme for the registration
  misc/sgi-gru: use mmu_notifier_get/put for struct gru_mm_struct
  hmm: use mmu_notifier_get/put for 'struct hmm'
  drm/radeon: use mmu_notifier_get/put for struct radeon_mn
  drm/amdkfd: fix a use after free race with mmu_notifer unregister
  drm/amdkfd: use mmu_notifier_put
  Merge 'notifier_get_put' into hmm.git
  RDMA/odp: Use the common interval tree library instead of generic
  RDMA/odp: Iterate over the whole rbtree directly
  RDMA/odp: Make it clearer when a umem is an implicit ODP umem
  RMDA/odp: Consolidate umem_odp initialization
  RDMA/odp: Make the three ways to create a umem_odp clear
  RDMA/odp: Split creating a umem_odp from ib_umem_get
  RDMA/odp: Provide ib_umem_odp_release() to undo the allocs
  RDMA/odp: Check for overflow when computing the umem_odp end
  RDMA/odp: Use kvcalloc for the dma_list and page_list
  RDMA/mlx5: Use ib_umem_start instead of umem.address
  RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
  Merge branch 'odp_fixes' into hmm.git
  RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
  RDMA/odp: remove ib_ucontext from ib_umem
  mm/mmu_notifiers: remove unregister_no_release
  csky: add missing brackets in a macro for tlb.h
  drm/radeon: guard against calling an unpaired radeon_mn_unregister()

Moni Shoua (1):
  RDMA/core: Make invalidate_range a device operation

Ralph Campbell (6):
  mm/hmm: replace hmm_update with mmu_notifier_range
  mm/hmm: a few more C style and comment clean ups
  mm/hmm: remove hugetlbfs check in hmm_vma_walk_pmd
  mm/hmm: remove hmm_range vma
  mm/hmm: hmm_range_fault() NULL pointer bug
  mm/hmm: hmm_range_fault() infinite loop

Yang, Philip (1):
  mm/hmm: fix hmm_range_fault()'s handling of swapped out pages

 Documentation/vm/hmm.rst |  73 +
 arch/csky/include/asm/tlb.h  |   8 +-
 arch/openrisc/kernel/dma.c   |  23 +-
 arch/powerpc/mm/book3s64/subpage_prot.c  |  12 +-
 arch/s390/mm/gmap.c  |  35 +--
 drivers/gpu/drm/amd/amdgpu/Kconfig   |   4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c  |   2 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c   |  15 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c  |  31 +-
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h|   3 -
 drivers/gpu/drm/amd/amdkfd/kfd_process.c |  88 +++---
 drivers/gpu/drm/nouveau/Kconfig  |   5 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c   | 456 +--
 drivers/gpu/drm/nouveau/nouveau_dmem.h   |  11 -
 drivers/gpu/drm/nouveau/nouveau_drm.c|   3 +
 drivers/gpu/drm/nouveau/nouveau_svm.c|  23 +-
 drivers/gpu/drm/radeon/radeon.h  |   3 -
 drivers/gpu/drm/radeon/radeon_device.c   |   2 -
 drivers/gpu/drm/radeon/radeon_drv.c  |   2 +
 drivers/gpu/drm/radeon/radeon_mn.c   | 156 +++--
 drivers/infiniband/Kconfig   |   1 +
 drivers/infiniband/core/device.c |   1 +
 drivers/infiniband/core/umem.c   |  54 +---
 drivers/infiniband/core/umem_odp.c   | 524 +++
 drivers/infiniband/core/uverbs_cmd.c |   5 -
 drivers/infiniband/core/uverbs_main.c|   1 +
 drivers/infiniband/hw/mlx5/main.c|   9 -
 drivers/infiniband/hw/mlx5/mem.c |  13 -
 drivers/infiniband/hw/mlx5/mr.c  |  38 ++-
 drivers/infiniband/hw/mlx5/odp.c |  88 +++---
 drivers/misc/sgi-gru/grufile.c   |   1 +
 drivers/misc/sgi-gru/grutables.h |   2 -
 drivers/misc/sgi-gru/grutlbpurge.c   |  84 ++---
 drivers/nvdimm/Kconfig   |  12 +
 drivers/nvdimm/Makefile  |   4 +
 fs/proc/task_mmu.c   |  80 ++---
 include/linux/hmm.h  | 125 ++--
 include/linux/ioport.h   |   2 +
 include/linux/kernel.h   |  23 +-
 include/linux/memremap.h |   3 +-
 include/linux/migrate.h  | 120 ++-
 include/linux/mm.h   |  46 ---
 include/li

Re: [PATCH] mm, notifier: Fix early return case for new lockdep annotations

2019-09-07 Thread Jason Gunthorpe
On Fri, Sep 06, 2019 at 07:47:30PM +0200, Daniel Vetter wrote:

> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 5a03417e5bf7..4edd98b06834 100644
> +++ b/include/linux/mmu_notifier.h
> @@ -356,13 +356,14 @@ mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
>  static inline int
>  mmu_notifier_invalidate_range_start_nonblock(struct mmu_notifier_range *range)
>  {
> + int ret = 0;
>   lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
>   if (mm_has_notifiers(range->mm)) {
>   range->flags &= ~MMU_NOTIFIER_RANGE_BLOCKABLE;
> - return __mmu_notifier_invalidate_range_start(range);
> + ret = __mmu_notifier_invalidate_range_start(range);
>   }
>   lock_map_release(&__mmu_notifier_invalidate_range_start_map);
> - return 0;
> + return ret;

Gar, yes. Since nobody has grabbed hmm.git I've squashed this into the
original patch and fixed the checkpatch warning about the missing blank
line after the ret

Everything should be in linux-next the next time it builds

Thanks,
Jason


Re: [PATCH 0/5] mmu notifer debug annotations

2019-09-05 Thread Jason Gunthorpe
On Mon, Aug 26, 2019 at 10:14:20PM +0200, Daniel Vetter wrote:
> Hi all,
> 
> Next round. Changes:
> 
> - I kept the two lockdep annotations patches since when I rebased this
>   before retesting linux-next didn't yet have them. Otherwise unchanged
>   except for a trivial conflict.
> 
> - Ack from Peter Z. on the kernel.h patch.
> 
> - Added annotations for non_block to invalidate_range_end. I can't test
>   that readily since i915 doesn't use it.
> 
> - Added might_sleep annotations to also make sure the mm side keeps up
>   its side of the contract here around what's allowed and what's not.
> 
> Comments, feedback, review as usual very much appreciated.
> 
> 
> Daniel Vetter (5):
>   kernel.h: Add non_block_start/end()
>   mm, notifier: Catch sleeping/blocking for !blockable

These two applied to hmm.git, with the small check patch edit, thanks!

Jason


Re: [PATCH 3/5] kernel.h: Add non_block_start/end()

2019-09-03 Thread Jason Gunthorpe
On Tue, Sep 03, 2019 at 09:28:23AM +0200, Daniel Vetter wrote:

> > Cleanest would be a new header I guess, together with might_sleep().
> > But moving that is a bit much I think, there's almost 500 callers of
> > that one from a quick git grep
> >
> > > If dropping do while is the only change then I can edit it in..
> > > I think we have the acks now
> >
> > Yeah sounds simplest, thanks.
> 
> Hi Jason,
> 
> Do you expect me to resend now, or do you plan to do the patchwork
> appeasement when applying? I've seen you merged the other patches
> (thanks!), but not these two here.

Sorry, I didn't get to this before I started travelling, and deferred
it since we were having linux-next related problems with hmm.git. I
hope to do it today.

I will fix it up as promised

Thanks,
Jason


Re: [PATCH 3/5] kernel.h: Add non_block_start/end()

2019-08-28 Thread Jason Gunthorpe
On Wed, Aug 28, 2019 at 08:33:13PM +0200, Daniel Vetter wrote:
> On Wed, Aug 28, 2019 at 12:50 AM Jason Gunthorpe  wrote:
> >
> > > diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> > > index 4fa360a13c1e..82f84cfe372f 100644
> > > +++ b/include/linux/kernel.h
> > > @@ -217,7 +217,9 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset);
> > >   * might_sleep - annotation for functions that can sleep
> > >   *
> > >   * this macro will print a stack trace if it is executed in an atomic
> > > - * context (spinlock, irq-handler, ...).
> > > + * context (spinlock, irq-handler, ...). Additional sections where blocking is
> > > + * not allowed can be annotated with non_block_start() and non_block_end()
> > > + * pairs.
> > >   *
> > >   * This is a useful debugging help to be able to catch problems early and not
> > >   * be bitten later when the calling function happens to sleep when it is not
> > > @@ -233,6 +235,25 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset);
> > >  # define cant_sleep() \
> > >   do { __cant_sleep(__FILE__, __LINE__, 0); } while (0)
> > >  # define sched_annotate_sleep()  (current->task_state_change = 0)
> > > +/**
> > > + * non_block_start - annotate the start of section where sleeping is prohibited
> > > + *
> > > + * This is on behalf of the oom reaper, specifically when it is calling the mmu
> > > + * notifiers. The problem is that if the notifier were to block on, for example,
> > > + * mutex_lock() and if the process which holds that mutex were to perform a
> > > + * sleeping memory allocation, the oom reaper is now blocked on completion of
> > > + * that memory allocation. Other blocking calls like wait_event() pose similar
> > > + * issues.
> > > + */
> > > +# define non_block_start() \
> > > + do { current->non_block_count++; } while (0)
> > > +/**
> > > + * non_block_end - annotate the end of section where sleeping is prohibited
> > > + *
> > > + * Closes a section opened by non_block_start().
> > > + */
> > > +# define non_block_end() \
> > > + do { WARN_ON(current->non_block_count-- == 0); } while (0)
> >
> > check-patch does not like these, and I agree
> >
> > #101: FILE: include/linux/kernel.h:248:
> > +# define non_block_start() \
> > +   do { current->non_block_count++; } while (0)
> >
> > /tmp/tmp1spfxufy/0006-kernel-h-Add-non_block_start-end-.patch:108: WARNING: Single statement macros should not use a do {} while (0) loop
> > #108: FILE: include/linux/kernel.h:255:
> > +# define non_block_end() \
> > +   do { WARN_ON(current->non_block_count-- == 0); } while (0)
> >
> > Please use a static inline?
> 
> We need get_current() plus the task_struct, so this gets real messy
> real fast. Not even sure which header this would fit in, or whether
> I'd need to create a new one. You're insisting on this or respinning
> with the do { } while (0) dropped ok.

My preference is always a static inline, but if the headers are so
twisty that we need to use a #define to solve a missing include, then I
wouldn't insist on it.

If dropping the do/while is the only change then I can edit it in..
I think we have the acks now
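
(For reference, the static-inline form under discussion would look roughly
like the sketch below. This is only an illustration of the shape, assuming
the header tangle could be resolved and that task_struct has the
non_block_count field added by this series.)

static inline void non_block_start(void)
{
	/* sketch: same semantics as the #define above */
	current->non_block_count++;
}

static inline void non_block_end(void)
{
	/* sketch: warn on an unbalanced end */
	WARN_ON(current->non_block_count-- == 0);
}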

Jason


Re: [PATCH 2/2] mm/hmm: hmm_range_fault() infinite loop

2019-08-28 Thread Jason Gunthorpe
On Tue, Aug 27, 2019 at 01:16:13PM -0700, Ralph Campbell wrote:
> 
> On 8/27/19 11:41 AM, Jason Gunthorpe wrote:
> > On Fri, Aug 23, 2019 at 03:17:53PM -0700, Ralph Campbell wrote:
> > 
> > > Signed-off-by: Ralph Campbell 
> > >   mm/hmm.c | 3 +++
> > >   1 file changed, 3 insertions(+)
> > > 
> > > diff --git a/mm/hmm.c b/mm/hmm.c
> > > index 29371485fe94..4882b83aeccb 100644
> > > +++ b/mm/hmm.c
> > > @@ -292,6 +292,9 @@ static int hmm_vma_walk_hole_(unsigned long addr, unsigned long end,
> > >   hmm_vma_walk->last = addr;
> > >   i = (addr - range->start) >> PAGE_SHIFT;
> > > + if (write_fault && walk->vma && !(walk->vma->vm_flags & VM_WRITE))
> > > + return -EPERM;
> > 
> > Can walk->vma be NULL here? hmm_vma_do_fault() touches it
> > unconditionally.
> > 
> > Jason
> > 
> walk->vma can be NULL. hmm_vma_do_fault() no longer touches it
> unconditionally, that is what the preceding patch fixes.
> I suppose I could change hmm_vma_walk_hole_() to check for NULL
> and fill in the pfns[] array, I just chose to handle it in
> hmm_vma_do_fault().

Okay, yes maybe long term it would be clearer to do the vma null check
closer to the start of the callback, but this is a good enough bug fix

Reviewed-by: Jason Gunthorpe 

Jason

Re: [PATCH 0/2] mm/hmm: two bug fixes for hmm_range_fault()

2019-08-28 Thread Jason Gunthorpe
On Fri, Aug 23, 2019 at 03:17:51PM -0700, Ralph Campbell wrote:
> I have been working on converting Jerome's hmm_dummy driver and self
> tests into a stand-alone set of tests to be included in
> tools/testing/selftests/vm and came across these two bug fixes in the
> process. The tests aren't quite ready to be posted as a patch.
> I'm posting the fixes now since I thought they shouldn't wait.
> They should probably have a fixes line but with all the HMM changes,
> I wasn't sure exactly which commit to use.
> 
> These are based on top of Jason's latest hmm branch.
> 
> Ralph Campbell (2):
>   mm/hmm: hmm_range_fault() NULL pointer bug
>   mm/hmm: hmm_range_fault() infinite loop

Applied to hmm.git

Thanks,
Jason

Re: [PATCH 2/2] mm/hmm: hmm_range_fault() infinite loop

2019-08-28 Thread Jason Gunthorpe
On Fri, Aug 23, 2019 at 03:17:53PM -0700, Ralph Campbell wrote:

> Signed-off-by: Ralph Campbell 
>  mm/hmm.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 29371485fe94..4882b83aeccb 100644
> +++ b/mm/hmm.c
> @@ -292,6 +292,9 @@ static int hmm_vma_walk_hole_(unsigned long addr, unsigned long end,
>   hmm_vma_walk->last = addr;
>   i = (addr - range->start) >> PAGE_SHIFT;
>  
> + if (write_fault && walk->vma && !(walk->vma->vm_flags & VM_WRITE))
> + return -EPERM;

Can walk->vma be NULL here? hmm_vma_do_fault() touches it
unconditionally.

Jason

Re: [PATCH 0/5] mmu notifer debug annotations

2019-08-27 Thread Jason Gunthorpe
On Mon, Aug 26, 2019 at 10:14:20PM +0200, Daniel Vetter wrote:
> Hi all,
> 
> Next round. Changes:
> 
> - I kept the two lockdep annotations patches since when I rebased this
>   before retesting linux-next didn't yet have them. Otherwise unchanged
>   except for a trivial conflict.
> 
> - Ack from Peter Z. on the kernel.h patch.
> 
> - Added annotations for non_block to invalidate_range_end. I can't test
>   that readily since i915 doesn't use it.
> 
> - Added might_sleep annotations to also make sure the mm side keeps up
>   its side of the contract here around what's allowed and what's not.
> 
> Comments, feedback, review as usual very much appreciated.
> 
> 
> Daniel Vetter (5):
>   mm, notifier: Add a lockdep map for invalidate_range_start/end
>   mm, notifier: Prime lockdep
>   mm, notifier: annotate with might_sleep()

I took these ones to hmm.git as they have a small conflict with hmm's
changes.

>   kernel.h: Add non_block_start/end()
>   mm, notifier: Catch sleeping/blocking for !blockable

Lets see about the checkpatch warning and review on these two please

Thanks,
Jason


Re: [PATCH 3/5] kernel.h: Add non_block_start/end()

2019-08-27 Thread Jason Gunthorpe
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 4fa360a13c1e..82f84cfe372f 100644
> +++ b/include/linux/kernel.h
> @@ -217,7 +217,9 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset);
>   * might_sleep - annotation for functions that can sleep
>   *
>   * this macro will print a stack trace if it is executed in an atomic
> - * context (spinlock, irq-handler, ...).
> + * context (spinlock, irq-handler, ...). Additional sections where blocking is
> + * not allowed can be annotated with non_block_start() and non_block_end()
> + * pairs.
>   *
>   * This is a useful debugging help to be able to catch problems early and not
>   * be bitten later when the calling function happens to sleep when it is not
> @@ -233,6 +235,25 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset);
>  # define cant_sleep() \
>   do { __cant_sleep(__FILE__, __LINE__, 0); } while (0)
>  # define sched_annotate_sleep()  (current->task_state_change = 0)
> +/**
> + * non_block_start - annotate the start of section where sleeping is prohibited
> + *
> + * This is on behalf of the oom reaper, specifically when it is calling the mmu
> + * notifiers. The problem is that if the notifier were to block on, for example,
> + * mutex_lock() and if the process which holds that mutex were to perform a
> + * sleeping memory allocation, the oom reaper is now blocked on completion of
> + * that memory allocation. Other blocking calls like wait_event() pose similar
> + * issues.
> + */
> +# define non_block_start() \
> + do { current->non_block_count++; } while (0)
> +/**
> + * non_block_end - annotate the end of section where sleeping is prohibited
> + *
> + * Closes a section opened by non_block_start().
> + */
> +# define non_block_end() \
> + do { WARN_ON(current->non_block_count-- == 0); } while (0)

check-patch does not like these, and I agree

#101: FILE: include/linux/kernel.h:248:
+# define non_block_start() \
+   do { current->non_block_count++; } while (0)

/tmp/tmp1spfxufy/0006-kernel-h-Add-non_block_start-end-.patch:108: WARNING: Single statement macros should not use a do {} while (0) loop
#108: FILE: include/linux/kernel.h:255:
+# define non_block_end() \
+   do { WARN_ON(current->non_block_count-- == 0); } while (0)

Please use a static inline?

Also, can we get one more ack on this patch?

Jason


Re: [PATCH 1/2] mm/hmm: hmm_range_fault() NULL pointer bug

2019-08-27 Thread Jason Gunthorpe
On Mon, Aug 26, 2019 at 11:02:12AM -0700, Ralph Campbell wrote:
> 
> On 8/24/19 3:37 PM, Christoph Hellwig wrote:
> > On Fri, Aug 23, 2019 at 03:17:52PM -0700, Ralph Campbell wrote:
> > > Although hmm_range_fault() calls find_vma() to make sure that a vma exists
> > > before calling walk_page_range(), hmm_vma_walk_hole() can still be called
> > > with walk->vma == NULL if the start and end address are not contained
> > > within the vma range.
> > 
> > Should we convert to walk_vma_range instead?  Or keep walk_page_range
> > but drop searching the vma ourselves?
> > 
> > Except for that the patch looks good to me:
> > 
> > Reviewed-by: Christoph Hellwig 
> > 
> 
> I think keeping the call to walk_page_range() makes sense.
> Jason is hoping to be able to snapshot a range with & without vmas
> and have the pfns[] filled with empty/valid entries as appropriate.
> 
> I plan to repost my patch changing hmm_range_fault() to use
> walk.test_walk which will remove the call to find_vma().
> Jason had some concerns about testing it so that's why I have
> been working on some HMM self tests before resending it.

I'm really excited to see tests for hmm_range_fault()!

Did you find this bug with the tests??

Jason

Re: [PATCH] mm/hmm: hmm_range_fault handle pages swapped out

2019-08-25 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 08:52:56PM +, Yang, Philip wrote:
> hmm_range_fault may return NULL pages because some of pfns are equal to
> HMM_PFN_NONE. This happens randomly under memory pressure. The reason is
> for swapped out page pte path, hmm_vma_handle_pte doesn't update fault
> variable from cpu_flags, so it failed to call hmm_vma_do_fault to swap
> the page in.
> 
> The fix is to call hmm_pte_need_fault to update fault variable.
> 
> Change-Id: I2e8611485563d11d938881c18b7935fa1e7c91ee
> Signed-off-by: Philip Yang 
>  mm/hmm.c | 3 +++
>  1 file changed, 3 insertions(+)

Applied to hmm.git, thanks

I fixed the commit message:

Author: Yang, Philip 
Date:   Thu Aug 15 20:52:56 2019 +

mm/hmm: fix hmm_range_fault()'s handling of swapped out pages

hmm_range_fault() may return NULL pages because some of the pfns are equal
to HMM_PFN_NONE. This happens randomly under memory pressure. The reason
is during the swapped out page pte path, hmm_vma_handle_pte() doesn't
update the fault variable from cpu_flags, so it failed to call
hmm_vma_do_fault() to swap the page in.

The fix is to call hmm_pte_need_fault() to update fault variable.

Fixes: 74eee180b935 ("mm/hmm/mirror: device page fault handler")
Link: https://lore.kernel.org/r/20190815205227.7949-1-philip.y...@amd.com
Signed-off-by: Philip Yang 
Reviewed-by: "Jérôme Glisse" 
Signed-off-by: Jason Gunthorpe 
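
(For readers following along, the shape of the fix inside hmm_vma_handle_pte()
is roughly the following. This is a sketch based on the description above, not
the verbatim patch; the surrounding control flow is elided.)

	if (!pte_present(pte) && !non_swap_entry(pte_to_swp_entry(pte))) {
		/* ordinary swapped-out page: recompute whether a fault is
		 * needed instead of leaving fault/write_fault at defaults */
		hmm_pte_need_fault(hmm_vma_walk, orig_pfn, 0,
				   &fault, &write_fault);
		if (fault || write_fault)
			goto fault;	/* leads to hmm_vma_do_fault() */
	}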

Re: [PATCH 4/4] mm, notifier: Catch sleeping/blocking for !blockable

2019-08-22 Thread Jason Gunthorpe
On Thu, Aug 22, 2019 at 10:42:39AM +0200, Daniel Vetter wrote:

> > RDMA has a mutex:
> >
> > ib_umem_notifier_invalidate_range_end
> >   rbt_ib_umem_for_each_in_range
> >invalidate_range_start_trampoline
> > ib_umem_notifier_end_account
> >   mutex_lock(&umem_odp->umem_mutex);
> >
> > I'm working to delete this path though!
> >
> > nonblocking or not follows the start, the same flag gets placed into
> > the mmu_notifier_range struct passed to end.
> 
> Ok, makes sense.
> 
> I guess that also means the might_sleep (I started on that) in
> invalidate_range_end also needs to be conditional? Or not bother with
> a might_sleep in invalidate_range_end since you're working on removing
> the last sleep in there?

I might suggest the same pattern as used for the lockdep map: the
might_sleep() unconditionally on the start, and a 2nd might_sleep()
after the IF in __mmu_notifier_invalidate_range_end()

Observing that by audit all the callers already have the same locking
context for start/end
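
(Spelled out, the suggested placement would be roughly the following sketch,
using the 5.3-era wrapper signatures; illustrative only, not a tested patch.)

static inline void
mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
{
	might_sleep();	/* unconditional, so every caller is checked */
	if (mm_has_notifiers(range->mm))
		__mmu_notifier_invalidate_range_start(range);
}

static inline void
mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range)
{
	if (mm_has_notifiers(range->mm)) {
		might_sleep();	/* 2nd annotation, only paid when notifiers exist */
		__mmu_notifier_invalidate_range_end(range, false);
	}
}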

Jason


Re: [PATCH v3 hmm 00/11] Add mmu_notifier_get/put for managing mmu notifier registrations

2019-08-22 Thread Jason Gunthorpe
On Tue, Aug 06, 2019 at 08:15:37PM -0300, Jason Gunthorpe wrote:
 
> This series is already entangled with patches in the hmm & RDMA tree and
> will require some git topic branches for the RDMA ODP stuff. I intend for
> it to go through the hmm tree.

The RDMA related patches have been applied to the RDMA tree on a
shared topic branch, so I've merged that into hmm.git and applied the
last patches from this series on top:

>   RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
>   RDMA/odp: remove ib_ucontext from ib_umem
>   mm/mmu_notifiers: remove unregister_no_release

There was some conflict churn in the RDMA ODP patches vs what was used
to the patches from this series, I fixed it up. Now I'm waiting for
some testing feedback before pushing it to linux-next

Thanks,
Jason

Re: [PATCH 4/4] mm, notifier: Catch sleeping/blocking for !blockable

2019-08-22 Thread Jason Gunthorpe
On Wed, Aug 21, 2019 at 05:41:51PM +0200, Daniel Vetter wrote:

> > Hm, I thought the page table locks we're holding there already prevent any
> > sleeping, so would be redundant? But reading through code I think that's
> > not guaranteed, so yeah makes sense to add it for invalidate_range_end
> > too. I'll respin once I have the ack/nack from scheduler people.
> 
> So I started to look into this, and I'm a bit confused. There's no
> _nonblock version of this, so does this means blocking is never allowed,
> or always allowed?

RDMA has a mutex:

ib_umem_notifier_invalidate_range_end
  rbt_ib_umem_for_each_in_range
   invalidate_range_start_trampoline
ib_umem_notifier_end_account
  mutex_lock(&umem_odp->umem_mutex);

I'm working to delete this path though!

nonblocking or not follows the start, the same flag gets placed into
the mmu_notifier_range struct passed to end.

> From a quick look through implementations I've only seen spinlocks, and
> one up_read. So I guess I should wrap this callback in some unconditional
> non_block_start/end, but I'm not sure.

For now, we should keep it the same as start, conditionally blocking.

Hopefully before LPC I can send a RFC series that eliminates most
invalidate_range_end users in favor of common locking..

Jason

Re: linux-next: Tree for Aug 19 (amdgpu)

2019-08-22 Thread Jason Gunthorpe
On Wed, Aug 21, 2019 at 03:34:12PM +, Kuehling, Felix wrote:
> 
> On 2019-08-20 8:36 a.m., Jason Gunthorpe wrote:
> > On Tue, Aug 20, 2019 at 11:45:54AM +1000, Stephen Rothwell wrote:
> >> Hi all,
> >>
> >> On Mon, 19 Aug 2019 18:34:41 -0700 Randy Dunlap  
> >> wrote:
> >>> On 8/19/19 2:18 AM, Stephen Rothwell wrote:
> >>>> Hi all,
> >>>>
> >>>> Changes since 20190816:
> >>>>
> >>> on x86_64:
> >>>
> >>> ../drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c: In function ‘amdgpu_exit’:
> >>> ../drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:1471:2: error: implicit declaration of function ‘mmu_notifier_synchronize’; did you mean ‘__sync_synchronize’? [-Werror=implicit-function-declaration]
> >>>mmu_notifier_synchronize();
> >>>^~~~
> >>>__sync_synchronize
> >>>
> >>>
> >>> Full randconfig file is attached.
> >> Caused by commit
> >>
> >>6832c9dc8358 ("hmm: use mmu_notifier_get/put for 'struct hmm'")
> >>
> >> from the hmm tree.
> >>
> >> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c need to include linux/mmu_notifier.h
> > Ah yes, thanks, it is because of !CONFIG_HMM_MIRROR in this
> > randconfig. I've fixed it up.
> 
> Thanks Jason. I'm trying to follow what's going on here, but I can't 
> find the commit hash quoted above in any of the public repositories I'm 
> tracking

I was able to squash the one line fix, so the commit ID is gone, it is
now:

https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=hmm=c7d8b7824ff9de866a356e1892dbe9f191aa5d06

From

https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=hmm

Jason

Re: linux-next: Tree for Aug 19 (amdgpu)

2019-08-21 Thread Jason Gunthorpe
On Tue, Aug 20, 2019 at 11:45:54AM +1000, Stephen Rothwell wrote:
> Hi all,
> 
> On Mon, 19 Aug 2019 18:34:41 -0700 Randy Dunlap  wrote:
> >
> > On 8/19/19 2:18 AM, Stephen Rothwell wrote:
> > > Hi all,
> > > 
> > > Changes since 20190816:
> > >   
> > 
> > on x86_64:
> > 
> > ../drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c: In function ‘amdgpu_exit’:
> > ../drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c:1471:2: error: implicit declaration of function ‘mmu_notifier_synchronize’; did you mean ‘__sync_synchronize’? [-Werror=implicit-function-declaration]
> >   mmu_notifier_synchronize();
> >   ^~~~
> >   __sync_synchronize
> > 
> > 
> > Full randconfig file is attached.
> 
> Caused by commit
> 
>   6832c9dc8358 ("hmm: use mmu_notifier_get/put for 'struct hmm'")
> 
> from the hmm tree.
> 
> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c need to include linux/mmu_notifier.h

Ah yes, thanks, it is because of !CONFIG_HMM_MIRROR in this
randconfig. I've fixed it up.

Regards,
Jason


Re: [PATCH 4/4] mm, notifier: Catch sleeping/blocking for !blockable

2019-08-21 Thread Jason Gunthorpe
On Tue, Aug 20, 2019 at 10:19:02AM +0200, Daniel Vetter wrote:
> We need to make sure implementations don't cheat and don't have a
> possible schedule/blocking point deeply buried where review can't
> catch it.
> 
> I'm not sure whether this is the best way to make sure all the
> might_sleep() callsites trigger, and it's a bit ugly in the code flow.
> But it gets the job done.
> 
> Inspired by an i915 patch series which did exactly that, because the
> rules haven't been entirely clear to us.
> 
> v2: Use the shiny new non_block_start/end annotations instead of
> abusing preempt_disable/enable.
> 
> v3: Rebase on top of Glisse's arg rework.
> 
> v4: Rebase on top of more Glisse rework.
> 
> Cc: Jason Gunthorpe 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: David Rientjes 
> Cc: "Christian König" 
> Cc: Daniel Vetter 
> Cc: "Jérôme Glisse" 
> Cc: linux...@kvack.org
> Reviewed-by: Christian König 
> Reviewed-by: Jérôme Glisse 
> Signed-off-by: Daniel Vetter 
>  mm/mmu_notifier.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 538d3bb87f9b..856636d06ee0 100644
> +++ b/mm/mmu_notifier.c
> @@ -181,7 +181,13 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
>   id = srcu_read_lock(&srcu);
>   hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
>   if (mn->ops->invalidate_range_start) {
> - int _ret = mn->ops->invalidate_range_start(mn, range);
> + int _ret;
> +
> + if (!mmu_notifier_range_blockable(range))
> + non_block_start();
> + _ret = mn->ops->invalidate_range_start(mn, range);
> + if (!mmu_notifier_range_blockable(range))
> + non_block_end();

If someone Acks all the sched changes then I can pick this for
hmm.git, but I still think the existing pre-emption debugging is fine
for this use case.

Also, same comment as for the lockdep map, this needs to apply to the
non-blocking range_end also.

Anyhow, since this series has conflicts with hmm.git it would be best
to flow through the whole thing through that tree. If there are no
remarks on the first two patches I'll grab them in a few days.

Regards,
Jason

Re: [PATCH 4/4] mm, notifier: Catch sleeping/blocking for !blockable

2019-08-20 Thread Jason Gunthorpe
On Tue, Aug 20, 2019 at 05:18:10PM +0200, Daniel Vetter wrote:
> > > diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> > > index 538d3bb87f9b..856636d06ee0 100644
> > > +++ b/mm/mmu_notifier.c
> > > @@ -181,7 +181,13 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
> > >   id = srcu_read_lock(&srcu);
> > >   hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
> > >   if (mn->ops->invalidate_range_start) {
> > > - int _ret = mn->ops->invalidate_range_start(mn, range);
> > > + int _ret;
> > > +
> > > + if (!mmu_notifier_range_blockable(range))
> > > + non_block_start();
> > > + _ret = mn->ops->invalidate_range_start(mn, range);
> > > + if (!mmu_notifier_range_blockable(range))
> > > + non_block_end();
> > 
> > If someone Acks all the sched changes then I can pick this for
> > hmm.git, but I still think the existing pre-emption debugging is fine
> > for this use case.
> 
> Ok, I'll ping Peter Z. for an ack, iirc he was involved.
> 
> > Also, same comment as for the lockdep map, this needs to apply to the
> > non-blocking range_end also.
> 
> Hm, I thought the page table locks we're holding there already prevent any
> sleeping, so would be redundant?

AFAIK no. All callers of invalidate_range_start/end pairs do so a few
lines apart and don't change their locking in between - thus since
start can block so can end.

Would love to know if that is not true??

Similarly I've also been idly wondering if we should add a
'might_sleep()' to invalidate_range_start/end() to make this constraint
clear & tested to the mm side?

Jason


Re: [PATCH 2/4] mm, notifier: Prime lockdep

2019-08-20 Thread Jason Gunthorpe
On Tue, Aug 20, 2019 at 10:19:00AM +0200, Daniel Vetter wrote:
> We want to teach lockdep that mmu notifiers can be called from direct
> reclaim paths, since on many CI systems load might never reach that
> level (e.g. when just running fuzzer or small functional tests).
> 
> Motivated by a discussion with Jason.
> 
> I've put the annotation into mmu_notifier_register since only when we
> have mmu notifiers registered is there any point in teaching lockdep
> about them. Also, we already have a kmalloc(, GFP_KERNEL), so this is
> safe.
> 
> Cc: Jason Gunthorpe 
> Cc: Chris Wilson 
> Cc: Andrew Morton 
> Cc: David Rientjes 
> Cc: "Jérôme Glisse" 
> Cc: Michal Hocko 
> Cc: "Christian König" 
> Cc: Greg Kroah-Hartman 
> Cc: Daniel Vetter 
> Cc: Mike Rapoport 
> Cc: linux...@kvack.org
> Signed-off-by: Daniel Vetter 
>  mm/mmu_notifier.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index d12e3079e7a4..538d3bb87f9b 100644
> +++ b/mm/mmu_notifier.c
> @@ -256,6 +256,13 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,
>  
>   BUG_ON(atomic_read(&mm->mm_users) <= 0);
>  
> + if (IS_ENABLED(CONFIG_LOCKDEP)) {
> + fs_reclaim_acquire(GFP_KERNEL);
> + lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
> + lock_map_release(&__mmu_notifier_invalidate_range_start_map);
> + fs_reclaim_release(GFP_KERNEL);
> + }

Lets try it out at least

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH 1/4] mm, notifier: Add a lockdep map for invalidate_range_start/end

2019-08-20 Thread Jason Gunthorpe
On Tue, Aug 20, 2019 at 10:18:59AM +0200, Daniel Vetter wrote:
> This is a similar idea to the fs_reclaim fake lockdep lock. It's
> fairly easy to provoke a specific notifier to be run on a specific
> range: Just prep it, and then munmap() it.
> 
> A bit harder, but still doable, is to provoke the mmu notifiers for
> all the various callchains that might lead to them. But both at the
> same time is really hard to reliable hit, especially when you want to
> exercise paths like direct reclaim or compaction, where it's not
> easy to control what exactly will be unmapped.
> 
> By introducing a lockdep map to tie them all together we allow lockdep
> to see a lot more dependencies, without having to actually hit them
> in a single callchain while testing.
> 
> On Jason's suggestion this is is rolled out for both
> invalidate_range_start and invalidate_range_end. They both have the
> same calling context, hence we can share the same lockdep map. Note
> that the annotation for invalidate_range_start is outside of the
> mm_has_notifiers(), to make sure lockdep is informed about all paths
> leading to this context irrespective of whether mmu notifiers are
> present for a given context. We don't do that on the
> invalidate_range_end side to avoid paying the overhead twice, there
> the lockdep annotation is pushed down behind the mm_has_notifiers()
> check.
> 
> v2: Use lock_map_acquire/release() like fs_reclaim, to avoid confusion
> with this being a real mutex (Chris Wilson).
> 
> v3: Rebase on top of Glisse's arg rework.
> 
> v4: Also annotate invalidate_range_end (Jason Gunthorpe)
> Also annotate invalidate_range_start_nonblock, I somehow missed that
> one in the first version.
> 
> Cc: Jason Gunthorpe 
> Cc: Chris Wilson 
> Cc: Andrew Morton 
> Cc: David Rientjes 
> Cc: "Jérôme Glisse" 
> Cc: Michal Hocko 
> Cc: "Christian König" 
> Cc: Greg Kroah-Hartman 
> Cc: Daniel Vetter 
> Cc: Mike Rapoport 
> Cc: linux...@kvack.org
> Signed-off-by: Daniel Vetter 
> ---
>  include/linux/mmu_notifier.h | 8 
>  mm/mmu_notifier.c| 9 +
>  2 files changed, 17 insertions(+)

Reviewed-by: Jason Gunthorpe 

Jason


Re: [PATCH v3 hmm 00/11] Add mmu_notifier_get/put for managing mmu notifier registrations

2019-08-17 Thread Jason Gunthorpe
On Tue, Aug 06, 2019 at 08:15:37PM -0300, Jason Gunthorpe wrote:
> This series is already entangled with patches in the hmm & RDMA tree and
> will require some git topic branches for the RDMA ODP stuff. I intend for
> it to go through the hmm tree.

> Jason Gunthorpe (11):
>   mm/mmu_notifiers: hoist do_mmu_notifier_register down_write to the
> caller
>   mm/mmu_notifiers: do not speculatively allocate a mmu_notifier_mm
>   mm/mmu_notifiers: add a get/put scheme for the registration
>   misc/sgi-gru: use mmu_notifier_get/put for struct gru_mm_struct
>   hmm: use mmu_notifier_get/put for 'struct hmm'
>   drm/radeon: use mmu_notifier_get/put for struct radeon_mn
>   drm/amdkfd: fix a use after free race with mmu_notifer unregister
>   drm/amdkfd: use mmu_notifier_put

Other than these patches:

>   RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
>   RDMA/odp: remove ib_ucontext from ib_umem
>   mm/mmu_notifiers: remove unregister_no_release

This series has been applied.

I will apply the ODP patches when the series they depend on is merged
to the RDMA tree

Any further acks/remarks I will annotate, thanks in advance

Thanks to all reviewers,
Jason

Re: turn hmm migrate_vma upside down v3

2019-08-17 Thread Jason Gunthorpe
On Fri, Aug 16, 2019 at 08:51:41AM +0200, Christoph Hellwig wrote:
> Jason,
> 
> are you going to look into picking this up?  Unfortunately there is
> a whole pile in this area still pending, including the kvmppc secure
> memory driver from Bharata that depends on the work.

Done,

Lets see if Dan will comment on the pagemap part (looks
straightforward to me), and then after we grab that I will declare
hmm.git non-rebasing and Bharata can build his series upon it.

As a reminder, please do not send hmm.git inside another pull request
to Linus without making it very clear in that cover letter and Cc'ing
me. Thanks.

Regards,
Jason

Re: [PATCH 01/10] mm: turn migrate_vma upside down

2019-08-17 Thread Jason Gunthorpe
On Wed, Aug 14, 2019 at 09:59:19AM +0200, Christoph Hellwig wrote:
> There isn't any good reason to pass callbacks to migrate_vma.  Instead
> we can just export the three steps done by this function to drivers and
> let them sequence the operation without callbacks.  This removes a lot
> of boilerplate code as-is, and will allow the drivers to drastically
> improve code flow and error handling further on.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Ralph Campbell 
> ---
>  Documentation/vm/hmm.rst   |  55 +-
>  drivers/gpu/drm/nouveau/nouveau_dmem.c | 122 +++--
>  include/linux/migrate.h| 118 ++--
>  mm/migrate.c   | 244 +++--
>  4 files changed, 195 insertions(+), 344 deletions(-)
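
(As described above, a driver now sequences the exported steps itself; the
caller side ends up looking roughly like this sketch, with allocation, DMA,
and error handling elided:)

	struct migrate_vma args = {
		.vma	= vma,
		.start	= start,
		.end	= end,
		.src	= src_pfns,
		.dst	= dst_pfns,
	};

	if (migrate_vma_setup(&args))
		return -EINVAL;

	/* driver allocates device pages and issues its copies here,
	 * filling args.dst for each migrating entry in args.src */

	migrate_vma_pages(&args);
	/* driver may update its own page tables while pages are still locked */
	migrate_vma_finalize(&args);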

The mechanical transformation looks OK, I squashed the below nits in,
let me know if I got it wrong

Reviewed-by: Jason Gunthorpe 

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 4f81c77059e305..0a5960beccf76d 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -339,9 +339,8 @@ Migration to and from device memory
 ===
 
 Because the CPU cannot access device memory, migration must use the device DMA
-engine to perform copy from and to device memory. For this we need a new to
-use migrate_vma_setup(), migrate_vma_pages(), and migrate_vma_finalize()
-helpers.
+engine to perform copy from and to device memory. For this we need to use
+migrate_vma_setup(), migrate_vma_pages(), and migrate_vma_finalize() helpers.
 
 
 Memory cgroup (memcg) and rss accounting
diff --git a/mm/migrate.c b/mm/migrate.c
index 993386cb53358d..0e78ebd720c0e4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2622,10 +2622,9 @@ static void migrate_vma_unmap(struct migrate_vma *migrate)
  * successfully migrated, and which were not.  Successfully migrated pages will
  * have the MIGRATE_PFN_MIGRATE flag set for their src array entry.
  *
- * It is safe to update device page table from within the finalize_and_map()
- * callback because both destination and source page are still locked, and the
- * mmap_sem is held in read mode (hence no one can unmap the range being
- * migrated).
+ * It is safe to update device page table after migrate_vma_pages() because
+ * both destination and source page are still locked, and the mmap_sem is held
+ * in read mode (hence no one can unmap the range being migrated).
  *
  * Once the caller is done cleaning up things and updating its page table (if it
  * chose to do so, this is not an obligation) it finally calls
@@ -2657,10 +2656,11 @@ int migrate_vma_setup(struct migrate_vma *args)
args->npages = 0;
 
migrate_vma_collect(args);
-   if (args->cpages)
-   migrate_vma_prepare(args);
-   if (args->cpages)
-   migrate_vma_unmap(args);
+   if (!args->cpages)
+   return 0;
+
+   migrate_vma_prepare(args);
+   migrate_vma_unmap(args);
 
/*
 * At this point pages are locked and unmapped, and thus they have

Re: [PATCH 10/10] mm: remove CONFIG_MIGRATE_VMA_HELPER

2019-08-17 Thread Jason Gunthorpe
On Wed, Aug 14, 2019 at 09:59:28AM +0200, Christoph Hellwig wrote:
> CONFIG_MIGRATE_VMA_HELPER guards helpers that are required for proper
> device private memory support.  Remove the option and just check for
> CONFIG_DEVICE_PRIVATE instead.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/gpu/drm/nouveau/Kconfig | 1 -
>  mm/Kconfig  | 3 ---
>  mm/migrate.c| 4 ++--
>  3 files changed, 2 insertions(+), 6 deletions(-)

Reviewed-by: Jason Gunthorpe 

Aside from it making sense, I checked no other driver enables
DEVICE_PRIVATE, so no change in kernel build.

Jason

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-17 Thread Jason Gunthorpe
On Fri, Aug 16, 2019 at 10:21:41AM -0700, Dan Williams wrote:

> > We can do a get_dev_pagemap inside the page_walk and touch the pgmap,
> > or we can do the 'device mutex && retry' pattern and touch the pgmap
> > in the driver, under that lock.
> >
> > However in all cases the current get_dev_pagemap()'s in the page walk
> > are not necessary, and we can delete them.
> 
> Yes, as long as 'struct page' instances resulting from that lookup are
> not passed outside of that lock.

Indeed.

Also, I was reflecting over lunch that the hmm_range_fault should only
return DEVICE_PRIVATE pages for the caller's device (see other thread
with HCH), and in this case, the caller should also be responsible to
ensure that the driver is not calling hmm_range_fault at the same time
it is deleting it's own DEVICE_PRIVATE mapping - ie by fencing its
page fault handler.

This does not apply to PCI_P2PDMA, but, lets see how that looks when
we get there.

So the whole thing seems pretty safe.

Jason

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-17 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 08:54:46PM -0700, Dan Williams wrote:

> > However, this means we cannot do any processing of ZONE_DEVICE pages
> > outside the driver lock, so eg, doing any DMA map that might rely on
> > MEMORY_DEVICE_PCI_P2PDMA has to be done in the driver lock, which is
> > a bit unfortunate.
> 
> Wouldn't P2PDMA use page pins? Not needing to hold a lock over
> ZONE_DEVICE page operations was one of the motivations for plumbing
> get_dev_pagemap() with a percpu-ref.

hmm_range_fault() doesn't use page pins at all, so if a ZONE_DEVICE
page comes out of it then it needs to use another locking pattern.

If I follow it all right:

We can do a get_dev_pagemap inside the page_walk and touch the pgmap,
or we can do the 'device mutex && retry' pattern and touch the pgmap
in the driver, under that lock.

However in all cases the current get_dev_pagemap()'s in the page walk
are not necessary, and we can delete them.

?

Jason

Re: turn hmm migrate_vma upside down v3

2019-08-17 Thread Jason Gunthorpe
On Fri, Aug 16, 2019 at 08:51:41AM +0200, Christoph Hellwig wrote:
> Jason,
> 
> are you going to look into picking this up?  Unfortunately there is
> a whole pile in this area still pending, including the kvmppc secure
> memory driver from Bharata that depends on the work.
> 
> mm folks:  migrate.c is mostly a classic MM file except for the hmm
> additions.  Do you want to also look over this or just let it pass?

Yes, after you explained the functions were hmm ones, it seems OK to
go to hmm.git.

I was waiting for the dust to settle, I see Ralph tested-by, are we
good now?

Jason

Re: turn hmm migrate_vma upside down v3

2019-08-17 Thread Jason Gunthorpe
On Wed, Aug 14, 2019 at 09:59:18AM +0200, Christoph Hellwig wrote:
> Hi Jérôme, Ben and Jason,
> 
> below is a series against the hmm tree which starts revamping the
> migrate_vma functionality.  The prime idea is to export three slightly
> lower level functions and thus avoid the need for migrate_vma_ops
> callbacks.
> 
> Diffstat:
> 
> 7 files changed, 282 insertions(+), 614 deletions(-)

Yay, another big whack of code gone
 
Applied to hmm.git

Thanks,
Jason

Re: [Intel-gfx] [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-17 Thread Jason Gunthorpe
On Fri, Aug 16, 2019 at 06:36:52PM +0200, Daniel Vetter wrote:
> On Fri, Aug 16, 2019 at 4:38 PM Jason Gunthorpe  wrote:
> >
> > On Fri, Aug 16, 2019 at 04:11:34PM +0200, Daniel Vetter wrote:
> > > Also, aside from this patch (which is prep for the next) and some
> > > simple reordering conflicts they're all independent. So if there's no
> > > way to paint this bikeshed here (technicolor perhaps?) then I'd like
> > > to get at least the others considered.
> >
> > Sure, I think for conflict avoidance reasons I'm probably taking
> > mmu_notifier stuff via hmm.git, so:
> >
> > - Andrew had a minor remark on #1, I am ambivalent and would take it
> >   as-is. Your decision if you want to respin.
> 
> I like mine better, see also the reply from Ralph Campbell.

Sure

> > - #2/#3 is this issue, I would stand by the preempt_disable/etc path
> >   Our situation matches yours, debug tests run lockdep/etc.
>
> Since Michal requested the current flavour I think we need spin a bit
> more on these here. I guess I'll just rebase them to the end so
> they're not holding up the others.
> 
> > - #4 I like a lot, except the map should enclose range_end too,
> >   this can be done after the mm_has_notifiers inside the
> >   __mmu_notifier function
> 
> To make sure I get this right: The same lockdep context, but also
> wrapped around invalidate_range_end? 

Yes, the locking context of _range_start and _range_end should be
identical, last time I checked callers this was the case.

So, just add it to __mmu_notifier_invalidate_range_end() outside the
SRCU as there is no reason to burden debug kernel callers twice when
mmu notifiers are not enabled
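
Concretely, something like the below - naming the map
__mmu_notifier_lock_map purely for this sketch, whatever your patch #4
ends up calling it:

  void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range,
                                           bool only_end)
  {
          /* same (non-)locking context assertion as _range_start */
          lock_map_acquire(&__mmu_notifier_lock_map);
          lock_map_release(&__mmu_notifier_lock_map);

          /* ... existing srcu_read_lock()/notifier loop unchanged ... */
  }

That way callers with no notifiers registered never see it, since the
inline wrapper already checks mm_has_notifiers() before calling here.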

Jason

Re: [PATCH] mm/hmm: hmm_range_fault handle pages swapped out

2019-08-17 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 08:52:56PM +, Yang, Philip wrote:
> hmm_range_fault may return NULL pages because some of pfns are equal to
> HMM_PFN_NONE. This happens randomly under memory pressure. The reason is
> for swapped out page pte path, hmm_vma_handle_pte doesn't update fault
> variable from cpu_flags, so it failed to call hmm_vam_do_fault to swap
> the page in.
>
> The fix is to call hmm_pte_need_fault to update fault variable.

> Change-Id: I2e8611485563d11d938881c18b7935fa1e7c91ee

I'll fix it for you but please be careful not to send Change-id's to
the public lists.

Also what is the Fixes line for this?

> Signed-off-by: Philip Yang 
>  mm/hmm.c | 3 +++
>  1 file changed, 3 insertions(+)

Ralph has also been looking at this area, so I'll give him a bit to
chime in; otherwise, with Jerome's review this looks OK to go to
linux-next

Thanks,
Jason

Re: [PATCH v3 hmm 08/11] drm/radeon: use mmu_notifier_get/put for struct radeon_mn

2019-08-17 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 10:28:21AM +0200, Christian König wrote:
> Am 07.08.19 um 01:15 schrieb Jason Gunthorpe:
> > From: Jason Gunthorpe 
> > 
> > radeon is using a device global hash table to track what mmu_notifiers
> > have been registered on struct mm. This is better served with the new
> > get/put scheme instead.
> > 
> > radeon has a bug where it was not blocking notifier release() until all
> > the BO's had been invalidated. This could result in a use after free of
> > pages the BOs. This is tied into a second bug where radeon left the
> > notifiers running endlessly even once the interval tree became
> > empty. This could result in a use after free with module unload.
> > 
> > Both are fixed by changing the lifetime model, the BOs exist in the
> > interval tree with their natural lifetimes independent of the mm_struct
> > lifetime using the get/put scheme. The release runs synchronously and just
> > does invalidate_start across the entire interval tree to create the
> > required DMA fence.
> > 
> > Additions to the interval tree after release are already impossible as
> > only current->mm is used during the add.
> > 
> > Signed-off-by: Jason Gunthorpe 
> 
> Acked-by: Christian König 

Thanks!

> But I'm wondering if we shouldn't completely drop radeon userptr support.
> It's just to buggy,

I would not object :)

Jason

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-17 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 04:51:33PM -0400, Jerome Glisse wrote:

> struct page. In this case any way we can update the
> nouveau_dmem_page() to check that page page->pgmap == the
> expected pgmap.

I was also wondering if that is a problem.. just blindly doing a
container_of on the page->pgmap does seem like it assumes that only
this driver is using DEVICE_PRIVATE.

It seems like something missing in hmm_range_fault, it should be told
what DEVICE_PRIVATE is acceptable to trigger HMM_PFN_DEVICE_PRIVATE
and fault all others?

Jason 

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-17 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 01:47:12PM -0700, Dan Williams wrote:
> On Thu, Aug 15, 2019 at 1:41 PM Jason Gunthorpe  wrote:
> >
> > On Thu, Aug 15, 2019 at 04:33:06PM -0400, Jerome Glisse wrote:
> >
> > > So nor HMM nor driver should dereference the struct page (i do not
> > > think any iommu driver would either),
> >
> > Er, they do technically deref the struct page:
> >
> > nouveau_dmem_convert_pfn(struct nouveau_drm *drm,
> >  struct hmm_range *range)
> > struct page *page;
> > page = hmm_pfn_to_page(range, range->pfns[i]);
> > if (!nouveau_dmem_page(drm, page)) {
> >
> >
> > nouveau_dmem_page(struct nouveau_drm *drm, struct page *page)
> > {
> > return is_device_private_page(page) && drm->dmem == 
> > page_to_dmem(page)
> >
> >
> > Which does touch 'page->pgmap'
> >
> > Is this OK without having a get_dev_pagemap() ?
> >
> > Noting that the collision-retry scheme doesn't protect anything here
> > as we can have a concurrent invalidation while doing the above deref.
> 
> As long take_driver_page_table_lock() in Jerome's flow can replace
> percpu_ref_tryget_live() on the pagemap reference. It seems
> nouveau_dmem_convert_pfn() happens after:
>
> mutex_lock(&svmm->mutex);
> if (!nouveau_range_done()) {
> 
> ...so I would expect that to be functionally equivalent to validating
> the reference count.

Yes, OK, that makes sense, I was mostly surprised by the statement the
driver doesn't touch the struct page.. 

I suppose "doesn't touch the struct page out of the driver lock" is
the case.

However, this means we cannot do any processing of ZONE_DEVICE pages
outside the driver lock, so eg, doing any DMA map that might rely on
MEMORY_DEVICE_PCI_P2PDMA has to be done in the driver lock, which is
a bit unfortunate.

Jason

Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-17 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 07:42:07PM +0200, Michal Hocko wrote:
> On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> > 
> > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > > __GFP_DIRECT_RECLAIM..
> > > >
> > > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > > allocations during OOM, then I think fs_reclaim already matches what
> > > > you described?
> > > 
> > > No GFP_NOFS is equally bad. Please read my other email explaining what
> > > the oom_reaper actually requires. In short no blocking on direct or
> > > indirect dependecy on memory allocation that might sleep.
> > 
> > It is much easier to follow with some hints on code, so the true
> > requirement is that the OOM repear not block on GFP_FS and GFP_IO
> > allocations, great, that constraint is now clear.
> 
> I still do not get why do you put FS/IO into the picture. This is really
> about __GFP_DIRECT_RECLAIM.

Like I said this is complicated, translating "no blocking on direct or
indirect dependecy on memory allocation that might sleep" into GFP
flags is hard for us outside the mm community.

So the contraint here is no __GFP_DIRECT_RECLAIM?

I bring up FS/IO because that is what Tejun mentioned when I asked him
about reclaim restrictions, and is what fs_reclaim_acquire() is
already sensitive too. It is pretty confusing if we have places using
the word 'reclaim' with different restrictions. :(

> >CPU0 CPU1
> > mutex_lock()
> > kmalloc(GFP_KERNEL)
> 
> no I mean __GFP_DIRECT_RECLAIM here.
> 
> > mutex_unlock()
> >   fs_reclaim_acquire()
> >   mutex_lock() <- illegal: lock dep assertion
> 
> I cannot really comment on how that is achieveable by lockdep.

??? I am trying to explain this is already done and working today. The
above example will already fault with lockdep enabled.

This is existing debugging we can use and improve upon rather than
inventing new debugging.

Jason

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-17 Thread Jason Gunthorpe
On Fri, Aug 16, 2019 at 06:44:48AM +0200, Christoph Hellwig wrote:
> On Fri, Aug 16, 2019 at 12:43:07AM +0000, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 04:51:33PM -0400, Jerome Glisse wrote:
> > 
> > > struct page. In this case any way we can update the
> > > nouveau_dmem_page() to check that page page->pgmap == the
> > > expected pgmap.
> > 
> > I was also wondering if that is a problem.. just blindly doing a
> > container_of on the page->pgmap does seem like it assumes that only
> > this driver is using DEVICE_PRIVATE.
> > 
> > It seems like something missing in hmm_range_fault, it should be told
> > what DEVICE_PRIVATE is acceptable to trigger HMM_PFN_DEVICE_PRIVATE
> > and fault all others?
> 
> The whole device private handling in hmm and migrate_vma seems pretty
> broken as far as I can tell, and I have some WIP patches.  Basically we
> should not touch (or possibly eventually call migrate to ram eventually
> in the future) device private pages not owned by the caller, where I
> try to defined the caller by the dev_pagemap_ops instance.  

I think it needs to be more elaborate.

For instance, a system may have multiple DEVICE_PRIVATE maps owned by
the same driver - but multiple physical devices using that driver.

Each physical device's driver should only ever get DEVICE_PRIVATE
pages for its own on-device memory. Never a DEVICE_PRIVATE for
another device's memory.

The dev_pagemap_ops would not be unique enough, right?

Probably also clusters of same-driver struct device can share a
DEVICE_PRIVATE, at least high end GPU's now have private memory
coherency busses between their devices.

Since we want to trigger migration to CPU on incompatible
DEVICE_PRIVATE pages, it seems best to sort this out in the
hmm_range_fault?

Maybe some sort of unique ID inside the page->pgmap and passed as
input?
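
Strawman of what I mean - neither field below exists today, the names
are made up for this sketch:

  /* the pgmap creator stamps an opaque owner cookie */
  struct dev_pagemap {
          /* ... existing fields ... */
          void *owner;                    /* hypothetical */
  };

  /* hmm_range_fault callers say which owner they can handle */
  struct hmm_range {
          /* ... existing fields ... */
          void *dev_private_owner;        /* hypothetical */
  };

  static bool hmm_is_our_device_private(struct hmm_range *range,
                                        struct page *page)
  {
          return is_device_private_page(page) &&
                 page->pgmap->owner == range->dev_private_owner;
  }

Anything failing that test would get faulted/migrated back to system
memory instead of being returned as HMM_PFN_DEVICE_PRIVATE.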

Jason

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-17 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 02:03:25PM -0400, Jerome Glisse wrote:
> On Wed, Aug 14, 2019 at 07:48:28AM -0700, Dan Williams wrote:
> > On Wed, Aug 14, 2019 at 6:28 AM Jason Gunthorpe  wrote:
> > >
> > > On Wed, Aug 14, 2019 at 09:38:54AM +0200, Christoph Hellwig wrote:
> > > > On Tue, Aug 13, 2019 at 06:36:33PM -0700, Dan Williams wrote:
> > > > > Section alignment constraints somewhat save us here. The only example
> > > > > I can think of a PMD not containing a uniform pgmap association for
> > > > > each pte is the case when the pgmap overlaps normal dram, i.e. shares
> > > > > the same 'struct memory_section' for a given span. Otherwise, distinct
> > > > > pgmaps arrange to manage their own exclusive sections (and now
> > > > > subsections as of v5.3). Otherwise the implementation could not
> > > > > guarantee different mapping lifetimes.
> > > > >
> > > > > That said, this seems to want a better mechanism to determine "pfn is
> > > > > ZONE_DEVICE".
> > > >
> > > > So I guess this patch is fine for now, and once you provide a better
> > > > mechanism we can switch over to it?
> > >
> > > What about the version I sent to just get rid of all the strange
> > > put_dev_pagemaps while scanning? Odds are good we will work with only
> > > a single pagemap, so it makes some sense to cache it once we find it?
> > 
> > Yes, if the scan is over a single pmd then caching it makes sense.
> 
> Quite frankly an easier an better solution is to remove the pagemap
> lookup as HMM user abide by mmu notifier it means we will not make
> use or dereference the struct page so that we are safe from any
> racing hotunplug of dax memory (as long as device driver using hmm
> do not have a bug).

Yes, I also would prefer to drop the confusing checks entirely -
Christoph can you resend this patch?

Thanks,
Jason 

Re: [PATCH 01/10] mm: turn migrate_vma upside down

2019-08-17 Thread Jason Gunthorpe
On Sat, Aug 17, 2019 at 01:31:28PM +0200, Christoph Hellwig wrote:
> On Fri, Aug 16, 2019 at 05:11:07PM +0000, Jason Gunthorpe wrote:
> > -   if (args->cpages)
> > -   migrate_vma_prepare(args);
> > -   if (args->cpages)
> > -   migrate_vma_unmap(args);
> > +   if (!args->cpages)
> > +   return 0;
> > +
> > +   migrate_vma_prepare(args);
> > +   migrate_vma_unmap(args);
> 
> I don't think this is ok.  Both migrate_vma_prepare and migrate_vma_unmap
> can reduce args->cpages, including possibly to 0.

Oh, yes, that was far too hasty on my part, I had assumed collect set
the cpages. Thank you for checking

Jason

Re: [PATCH 09/10] mm: remove the unused MIGRATE_PFN_DEVICE flag

2019-08-17 Thread Jason Gunthorpe
On Wed, Aug 14, 2019 at 09:59:27AM +0200, Christoph Hellwig wrote:
> No one ever checks this flag, and we could easily get that information
> from the page if needed.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Ralph Campbell 
> ---
>  drivers/gpu/drm/nouveau/nouveau_dmem.c | 3 +--
>  include/linux/migrate.h| 1 -
>  mm/migrate.c   | 4 ++--
>  3 files changed, 3 insertions(+), 5 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-17 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 04:33:06PM -0400, Jerome Glisse wrote:

> So nor HMM nor driver should dereference the struct page (i do not
> think any iommu driver would either),

Er, they do technically deref the struct page:

nouveau_dmem_convert_pfn(struct nouveau_drm *drm,
 struct hmm_range *range)
struct page *page;
page = hmm_pfn_to_page(range, range->pfns[i]);
if (!nouveau_dmem_page(drm, page)) {


nouveau_dmem_page(struct nouveau_drm *drm, struct page *page)
{
return is_device_private_page(page) && drm->dmem == page_to_dmem(page)


Which does touch 'page->pgmap'

Is this OK without having a get_dev_pagemap() ?

Noting that the collision-retry scheme doesn't protect anything here
as we can have a concurrent invalidation while doing the above deref.

Jason

Re: [PATCH 1/5] mm: Check if mmu notifier callbacks are allowed to fail

2019-08-16 Thread Jason Gunthorpe
On Wed, Aug 14, 2019 at 10:20:23PM +0200, Daniel Vetter wrote:
> Just a bit of paranoia, since if we start pushing this deep into
> callchains it's hard to spot all places where an mmu notifier
> implementation might fail when it's not allowed to.
> 
> Inspired by some confusion we had discussing i915 mmu notifiers and
> whether we could use the newly-introduced return value to handle some
> corner cases. Until we realized that these are only for when a task
> has been killed by the oom reaper.
> 
> An alternative approach would be to split the callback into two
> versions, one with the int return value, and the other with void
> return value like in older kernels. But that's a lot more churn for
> fairly little gain I think.
> 
> Summary from the m-l discussion on why we want something at warning
> level: This allows automated tooling in CI to catch bugs without
> humans having to look at everything. If we just upgrade the existing
> pr_info to a pr_warn, then we'll have false positives. And as-is, no
> one will ever spot the problem since it's lost in the massive amounts
> of overall dmesg noise.
> 
> v2: Drop the full WARN_ON backtrace in favour of just a pr_warn for
> the problematic case (Michal Hocko).
> 
> v3: Rebase on top of Glisse's arg rework.
> 
> v4: More rebase on top of Glisse reworking everything.
> 
> v5: Fixup rebase damage and also catch failures != EAGAIN for
> !blockable (Jason). Also go back to WARN_ON as requested by Jason, so
> automatic checkers can easily catch bugs by setting panic_on_warn.
> 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: "Christian König" 
> Cc: David Rientjes 
> Cc: Daniel Vetter 
> Cc: "Jérôme Glisse" 
> Cc: linux...@kvack.org
> Cc: Paolo Bonzini 
> Cc: Jason Gunthorpe 
> Signed-off-by: Daniel Vetter 
> ---
>  mm/mmu_notifier.c | 2 ++
>  1 file changed, 2 insertions(+)

Applied to hmm.git, thanks

Jason


Re: [Intel-gfx] [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-16 Thread Jason Gunthorpe
On Fri, Aug 16, 2019 at 04:11:34PM +0200, Daniel Vetter wrote:
> Also, aside from this patch (which is prep for the next) and some
> simple reordering conflicts they're all independent. So if there's no
> way to paint this bikeshed here (technicolor perhaps?) then I'd like
> to get at least the others considered.

Sure, I think for conflict avoidance reasons I'm probably taking
mmu_notifier stuff via hmm.git, so:

- Andrew had a minor remark on #1, I am ambivalent and would take it
  as-is. Your decision if you want to respin.
- #2/#3 is this issue, I would stand by the preempt_disable/etc path
  Our situation matches yours, debug tests run lockdep/etc.
- #4 I like a lot, except the map should enclose range_end too,
  this can be done after the mm_has_notifiers inside the
  __mmu_notifier function
  Can you respin?
  I will propose preloading the map in another patch
- #5 is already applied in -rc

Jason


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-16 Thread Jason Gunthorpe
On Fri, Aug 16, 2019 at 02:26:25PM +0200, Michal Hocko wrote:
> On Fri 16-08-19 09:19:06, Jason Gunthorpe wrote:
> > On Fri, Aug 16, 2019 at 10:10:29AM +0200, Michal Hocko wrote:
> > > On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote:
> > > > On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:
> > > > 
> > > > > > The last detail is I'm still unclear what a GFP flags a blockable
> > > > > > invalidate_range_start() should use. Is GFP_KERNEL OK?
> > > > > 
> > > > > I hope I will not make this muddy again ;)
> > > > > invalidate_range_start in the blockable mode can use/depend on any 
> > > > > sleepable
> > > > > allocation allowed in the context it is called from. 
> > > > 
> > > > 'in the context is is called from' is the magic phrase, as
> > > > invalidate_range_start is called while holding several different mm
> > > > related locks. I know at least write mmap_sem and i_mmap_rwsem
> > > > (write?)
> > > > 
> > > > Can GFP_KERNEL be called while holding those locks?
> > > 
> > > i_mmap_rwsem would be problematic because it is taken during the
> > > reclaim.
> > 
> > Okay.. So the fs_reclaim debugging does catch errors.
> 
> I do not think fs_reclaim is the udnerlying mechanism to catch this
> deadlock. 

Indeed lockdep would catch it directly as an AA deadlock, but only if
we happen to take the reclaim path under the kmalloc in the notifier
and lucked into it also choosing to lock the same lock we are holding.

The trouble is this is very difficult to hit.

lockdep allows making this less difficult. For instance with a
'might_reclaim()' annotation in the allocator we could check that the
various reclaim related locks are obtainable. Testing doesn't need to
get lucky and go down the very unlikely path.

It turns out this is already happening, it is actually a side effect
of the way fs_reclaim works now.
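
To be concrete, a 'might_reclaim()' could be little more than a
momentary acquire/release of a fake lockdep map keyed to the reclaim
locks - basically what fs_reclaim_acquire()/release() already are. The
function below does not exist, it only shows the shape:

  static struct lockdep_map reclaim_dep_map =
          STATIC_LOCKDEP_MAP_INIT("reclaim", &reclaim_dep_map);

  static inline void might_reclaim(gfp_t gfp_mask)
  {
          if (gfp_mask & __GFP_DIRECT_RECLAIM) {
                  lock_map_acquire(&reclaim_dep_map);
                  lock_map_release(&reclaim_dep_map);
          }
  }

The same map would also be held around real reclaim (as fs_reclaim is
today), so lockdep learns the reclaim -> i_mmap_rwsem style
dependencies once and then flags any caller that allocates while
holding one of those locks, without the test ever entering reclaim for
real.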

> > Do you have any
> > reference for what a false positive looks like? 
> 
> I believe I have given some examples when introducing __GFP_NOLOCKDEP.

Okay, I think that is 7e7844226f10 ("lockdep: allow to disable reclaim
lockup detection") Hmm, sadly the lkml link in the commit is broken.

Hum. There are no users of __GFP_NOLOCKDEP today?? Could all the false
positives have been fixed??

Keep in mind that this fs_reclaim was reworked away from the hacky
interrupt context flag to a fairly elegant real lock map.

Based on my read of the commit message, my first reaction would be to
suggest lockdep_set_class() to solve the problem described, ie
different classes for 'inside transaction' and 'outside transaction'
on xfs_refcountbt_init_cursor()

I understood that generally that is the way to handle lockdep false
positives.
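
Ie something in this direction - purely illustrative, not actual xfs
code, all names invented:

  static struct lock_class_key in_transaction_key;
  static struct lock_class_key outside_transaction_key;

  static void init_cursor_lock(struct mutex *lock, bool in_transaction)
  {
          mutex_init(lock);
          /* same lock, two lockdep classes, one per calling context */
          lockdep_set_class(lock, in_transaction ? &in_transaction_key
                                                 : &outside_transaction_key);
  }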

Anyhow, if you are willing to consider that lockdep isn't broken, I
have some ideas on how to make this clearer and increase
coverage. Would you be willing to look at patches on this topic? (not
soon, I have to finish my mmu notifier stuff)

> > I would like to inject it into the notifier path as this is very
> > difficult for driver authors to discover and know about, but I'm
> > worried about your false positive remark.
> > 
> > I think I understand we can use only GFP_ATOMIC in the notifiers, but
> > we need a strategy to handle OOM to guarentee forward progress.
> 
> Your example is from the notifier registration IIUC. 

Yes, that is where this commit hit it.. Triggering this under an
actual notifier to get a lockdep report is hard.

> Can you pre-allocate before taking locks? Could you point me to some
> examples when the allocation is necessary in the range notifier
> callback?

Hmm. I took a careful look, I only found mlx5 as obviously allocating
memory:

 mlx5_ib_invalidate_range()
  mlx5_ib_update_xlt()
   __get_free_pages(gfp, get_order(size));

However, I think this could be changed to fall back to some small
buffer if allocation fails. The existing scheme looks sketchy
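
Ie something like the below instead of failing the invalidation
outright - entirely made up, not the actual mlx5 structures:

  /* try the large allocation; fall back to a buffer preallocated at
   * registration time and do the xlt update in more, smaller chunks */
  static void *get_xlt_buf(struct mr_ctx *ctx, size_t want, size_t *got)
  {
          void *buf = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN,
                                               get_order(want));
          if (buf) {
                  *got = want;
                  return buf;
          }
          *got = ctx->emergency_size;
          return ctx->emergency_buf;
  }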

nouveau does:

 nouveau_svmm_invalidate
  nvif_object_mthd
   kmalloc(GFP_KERNEL)

But I think it reliably uses a stack buffer here

i915 I think Daniel said he audited.

amd_mn.. The actual invalidate_range_start does not allocate memory,
but it is entangled with so many locks it would need careful analysis
to be sure.

The others look generally OK, which is good, better than I hoped :)

Jason


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-16 Thread Jason Gunthorpe
On Fri, Aug 16, 2019 at 10:10:29AM +0200, Michal Hocko wrote:
> On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:
> > 
> > > > The last detail is I'm still unclear what a GFP flags a blockable
> > > > invalidate_range_start() should use. Is GFP_KERNEL OK?
> > > 
> > > I hope I will not make this muddy again ;)
> > > invalidate_range_start in the blockable mode can use/depend on any 
> > > sleepable
> > > allocation allowed in the context it is called from. 
> > 
> > 'in the context is is called from' is the magic phrase, as
> > invalidate_range_start is called while holding several different mm
> > related locks. I know at least write mmap_sem and i_mmap_rwsem
> > (write?)
> > 
> > Can GFP_KERNEL be called while holding those locks?
> 
> i_mmap_rwsem would be problematic because it is taken during the
> reclaim.

Okay.. So the fs_reclaim debugging does catch errors. Do you have any
reference for what a false positive looks like? 

I would like to inject it into the notifier path as this is very
difficult for driver authors to discover and know about, but I'm
worried about your false positive remark.

I think I understand we can use only GFP_ATOMIC in the notifiers, but
we need a strategy to handle OOM to guarantee forward progress.

This is just more bugs to fix :(

Jason


Re: [Intel-gfx] [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-16 Thread Jason Gunthorpe
On Fri, Aug 16, 2019 at 08:20:55AM +0200, Daniel Vetter wrote:
> On Fri, Aug 16, 2019 at 3:00 AM Jason Gunthorpe  wrote:
> > On Thu, Aug 15, 2019 at 10:49:31PM +0200, Daniel Vetter wrote:
> > > On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe  wrote:
> > > > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:
> > > > > So if someone can explain to me how that works with lockdep I can of
> > > > > course implement it. But afaics that doesn't exist (I tried to explain
> > > > > that somewhere else already), and I'm no really looking forward to
> > > > > hacking also on lockdep for this little series.
> > > >
> > > > Hmm, kind of looks like it is done by calling preempt_disable()
> > >
> > > Yup. That was v1, then came the suggestion that disabling preemption
> > > is maybe not the best thing (the oom reaper could still run for a long
> > > time comparatively, if it's cleaning out gigabytes of process memory
> > > or what not, hence this dedicated debug infrastructure).
> >
> > Oh, I'm coming in late, sorry
> >
> > Anyhow, I was thinking since we agreed this can trigger on some
> > CONFIG_DEBUG flag, something like
> >
> > /* This is a sleepable region, but use preempt_disable to get debugging
> >  * for calls that are not allowed to block for OOM [.. insert
> >  * Michal's explanation.. ] */
> > if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && 
> > !mmu_notifier_range_blockable(range))
> > preempt_disable();
> > ops->invalidate_range_start();
> 
> I think we also discussed that, and some expressed concerns it would
> change behaviour/timing too much for testing. Since this does does
> disable preemption for real, not just for might_sleep.

I don't follow, this is a debug kernel, it will have wildly different
timing. 

Further the point of this debugging on atomic_sleep is to be as
timing-independent as possible since functions with rare sleeps should
be guarded by might_sleep() in their common paths.

I guess I don't get the push to have some low overhead debugging for
this? Is there something special you are looking for?

Jason


Re: [Intel-gfx] [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 10:49:31PM +0200, Daniel Vetter wrote:
> On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe  wrote:
> > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:
> > > So if someone can explain to me how that works with lockdep I can of
> > > course implement it. But afaics that doesn't exist (I tried to explain
> > > that somewhere else already), and I'm no really looking forward to
> > > hacking also on lockdep for this little series.
> >
> > Hmm, kind of looks like it is done by calling preempt_disable()
> 
> Yup. That was v1, then came the suggestion that disabling preemption
> is maybe not the best thing (the oom reaper could still run for a long
> time comparatively, if it's cleaning out gigabytes of process memory
> or what not, hence this dedicated debug infrastructure).

Oh, I'm coming in late, sorry

Anyhow, I was thinking since we agreed this can trigger on some
CONFIG_DEBUG flag, something like

/* This is a sleepable region, but use preempt_disable to get debugging
 * for calls that are not allowed to block for OOM [.. insert
 * Michal's explanation.. ] */
if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && 
!mmu_notifier_range_blockable(range))
preempt_disable();
ops->invalidate_range_start();

And I have also been idly mulling doing something like

   if (IS_ENABLED(CONFIG_DEBUG_NOTIFIERS) && 
   rand &&
   mmu_notifier_range_blockable(range)) {
 range->flags = 0
 if (!ops->invalidate_range_start(range))
continue

 // Failed, try again as blockable
 range->flags = MMU_NOTIFIER_RANGE_BLOCKABLE
   }
   ops->invalidate_range_start(range);

Which would give coverage for this corner case without forcing OOM.

Jason


Re: [Intel-gfx] [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:

> So if someone can explain to me how that works with lockdep I can of
> course implement it. But afaics that doesn't exist (I tried to explain
> that somewhere else already), and I'm no really looking forward to
> hacking also on lockdep for this little series.

Hmm, kind of looks like it is done by calling preempt_disable()

Probably the debug option is CONFIG_DEBUG_PREEMPT, not lockdep?

Jason


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:

> > The last detail is I'm still unclear what a GFP flags a blockable
> > invalidate_range_start() should use. Is GFP_KERNEL OK?
> 
> I hope I will not make this muddy again ;)
> invalidate_range_start in the blockable mode can use/depend on any sleepable
> allocation allowed in the context it is called from. 

'in the context it is called from' is the magic phrase, as
invalidate_range_start is called while holding several different mm
related locks. I know at least write mmap_sem and i_mmap_rwsem
(write?)

Can GFP_KERNEL be called while holding those locks?

This is the question of indirect dependency on reclaim via locks you
raised earlier.

> So in other words it is no different from any other function in the
> kernel that calls into allocator. As the API is missing gfp context
> then I hope it is not called from any restricted contexts (except
> from the oom which we have !blockable for).

Yes, the callers are exactly my concern.
 
> > Lockdep has
> > complained on that in past due to fs_reclaim - how do you know if it
> > is a false positive?
> 
> I would have to see the specific lockdep splat.

See below. I found it when trying to understand why the registration
of the mmu notifier was so oddly coded.

The situation was:

  down_write(&mm->mmap_sem);
  mm_take_all_locks(mm);
  kmalloc(GFP_KERNEL);  <--- lockdep warning

I understood Daniel said he saw this directly on a recent kernel when
working with his lockdep patch?

Checking myself, on todays kernel I see a call chain:

shrink_all_memory
  fs_reclaim_acquire(sc.gfp_mask);
  [..]
  do_try_to_free_pages
   shrink_zones
shrink_node
 shrink_node_memcg
  shrink_list
   shrink_active_list
page_referenced
 rmap_walk
  rmap_walk_file
   i_mmap_lock_read
down_read(i_mmap_rwsem)

So it is possible that the down_read() above will block on
i_mmap_rwsem being held in the caller of invalidate_range_start which
is doing kmalloc(GFP_KERNEL).

Is this OK? The lockdep annotation says no..

Jason

commit 35cfa2b0b491c37e23527822bf365610dbb188e5
Author: Gavin Shan 
Date:   Thu Oct 25 13:38:01 2012 -0700

mm/mmu_notifier: allocate mmu_notifier in advance

While allocating mmu_notifier with parameter GFP_KERNEL, swap would start
to work in case of tight available memory.  Eventually, that would lead to
a deadlock while the swap deamon swaps anonymous pages.  It was caused by
commit e0f3c3f78da29b ("mm/mmu_notifier: init notifier if necessary").

  =
  [ INFO: inconsistent lock state ]
  3.7.0-rc1+ #518 Not tainted
  -
  inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
  kswapd0/35 [HC0[0]:SC0[0]:HE1:SE1] takes:
   (&mapping->i_mmap_mutex){+.+.?.}, at: page_referenced+0x9c/0x2e0
  {RECLAIM_FS-ON-W} state was registered at:
 mark_held_locks+0x86/0x150
 lockdep_trace_alloc+0x67/0xc0
 kmem_cache_alloc_trace+0x33/0x230
 do_mmu_notifier_register+0x87/0x180
 mmu_notifier_register+0x13/0x20
 kvm_dev_ioctl+0x428/0x510
 do_vfs_ioctl+0x98/0x570
 sys_ioctl+0x91/0xb0
 system_call_fastpath+0x16/0x1b


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 09:05:25PM +0200, Michal Hocko wrote:

> This is what you claim and I am saying that fs_reclaim is about a
> restricted reclaim context and it is an ugly hack. It has proven to
> report false positives. Maybe it can be extended to a generic reclaim.
> I haven't tried that. Do not aim to try it.

Okay, great, I think this has been very helpful, at least for me,
thanks. I did not know fs_reclaim was so problematic, or the special
cases about OOM 'reclaim'. 

On this patch, I have no general objection to enforcing drivers to be
non-blocking, I'd just like to see it done with the existing lockdep
can't sleep detection rather than inventing some new debugging for it.

I understand this means the debugging requires lockdep enabled and
will not run in production, but I'm of the view that is OK and in line
with general kernel practice.

The last detail is I'm still unclear what a GFP flags a blockable
invalidate_range_start() should use. Is GFP_KERNEL OK? Lockdep has
complained about that in the past due to fs_reclaim - how do you know if it
is a false positive?

Thanks again,
Jason


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 02:27:19PM -0400, Jerome Glisse wrote:
> > How exactly? This is holding the page pin, so the only way the VA
> > mapping can be changed is via explicit user action.
> > 
> > ie:
> > 
> >gpu_write_something(va, size)
> >mmap(.., va, size, MMAP_FIXED);
> >gpu_wait_done()
> > 
> > This is racy and indeterminate with both models.
> > 
> > Based on the comment in i915 it appears to be going on the model that
> > changes to the mmap by userspace when the GPU is working on it is a
> > programming bug. This is reasonable, lots of systems use this kind of
> > consistency model.
> 
> Well userspace process doing munmap(), mremap(), fork() and things like
> that are a bug from the i915 kernel and userspace contract point of view.
> 
> But things like migration or reclaim are not cover under that contract
> and for those the expectation is that CPU access to the same virtual address
> should allow to get what was last written to it either by the GPU or the
> CPU.

Okay, this is a more reasonable point - I agree the i915 registration
cache model precludes using migration and thus DEVICE_PRIVATE. This is
a strong motivation to use the hmm approach

But we started out this conversation asking if i915 is correct, and I
still say a registration cache model is a functionally correct way to
use notifiers.

> Because of the reference on the page the i915 driver can forego the mmu
> notifier end callback. The thing here is that taking a page reference
> is pointless if we have better synchronization and tracking of mmu
> notifier. Hence converting to hmm mirror allows to avoid taking a ref
> on the page while still keeping the same functionality as of today.

However, there is a huge trade off here. Drivers like this are going
to have a very complicated locking inside invalidate_range_start as
they must sleep waiting for dma buffer references to go to zero.

> GPU driver have complex usage pattern the tlb shootdown is implicit
> once the GEM object associated with the uptr is invalidated it means
> next time userspace submit command against that GEM object it will
> have to re-validate it which means re-program the GPU page table to
> point to the proper address (and re-call GUP).

I think it is a mistake to try and cram the very different approaches
to notifiers into the same API. SW ref counting of DMA buffers vs HW
async page faulting have totally different requirements and locking
schemes.

This explains why AMDGPU gets away with not using the hmm API
properly, it is probably relying on its DMA refcount, not the hmm
valid, to keep things in order?

I think the API approach in hmm_mirror is reasonable for page faulting
HW, but does not serve refcounting HW well at all.

Jason


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 01:39:22PM -0400, Jerome Glisse wrote:
> On Thu, Aug 15, 2019 at 02:35:57PM -0300, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote:
> > 
> > > I'm not really well versed in the details of our userptr, but both
> > > amdgpu and i915 wait for the gpu to complete from
> > > invalidate_range_start. Jerome has at least looked a lot at the amdgpu
> > > one, so maybe he can explain what exactly it is we're doing ...
> > 
> > amdgpu is (wrongly) using hmm for something, I can't really tell what
> > it is trying to do. The calls to dma_fence under the
> > invalidate_range_start do not give me a good feeling.
> > 
> > However, i915 shows all the signs of trying to follow the registration
> > cache model, it even has a nice comment in
> > i915_gem_userptr_get_pages() explaining that the races it has don't
> > matter because it is a user space bug to change the VA mapping in the
> > first place. That just screams registration cache to me.
> > 
> > So it is fine to run HW that way, but if you do, there is no reason to
> > fence inside the invalidate_range end. Just orphan the DMA buffer and
> > clean it up & release the page pins when all DMA buffer refs go to
> > zero. The next access to that VA should get a new DMA buffer with the
> > right mapping.
> > 
> > In other words the invalidation should be very simple without
> > complicated locking, or wait_event's. Look at hfi1 for example.
> 
> This would break the today usage model of uptr and it will
> break userspace expectation ie if GPU is writting to that
> memory and that memory then the userspace want to make sure
> that it will see what the GPU write.

How exactly? This is holding the page pin, so the only way the VA
mapping can be changed is via explicit user action.

ie:

   gpu_write_something(va, size)
   mmap(.., va, size, MMAP_FIXED);
   gpu_wait_done()

This is racy and indeterminate with both models.

Based on the comment in i915 it appears to be going on the model that
changes to the mmap by userspace when the GPU is working on it is a
programming bug. This is reasonable, lots of systems use this kind of
consistency model.

Since the driver seems to rely on a dma_fence to block DMA access, it
looks to me like the kernel has full visibility to the
'gpu_write_something()' and 'gpu_wait_done()' markers.

I think trying to use hmm_range_fault on HW that can't do HW page
faulting and HW 'TLB shootdown' is a very, very bad idea. I fear that
is what amd gpu is trying to do.

I haven't yet seen anything that looks like 'TLB shootdown' in i915??

Jason


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote:

> I'm not really well versed in the details of our userptr, but both
> amdgpu and i915 wait for the gpu to complete from
> invalidate_range_start. Jerome has at least looked a lot at the amdgpu
> one, so maybe he can explain what exactly it is we're doing ...

amdgpu is (wrongly) using hmm for something, I can't really tell what
it is trying to do. The calls to dma_fence under the
invalidate_range_start do not give me a good feeling.

However, i915 shows all the signs of trying to follow the registration
cache model, it even has a nice comment in
i915_gem_userptr_get_pages() explaining that the races it has don't
matter because it is a user space bug to change the VA mapping in the
first place. That just screams registration cache to me.

So it is fine to run HW that way, but if you do, there is no reason to
fence inside the invalidate_range end. Just orphan the DMA buffer and
clean it up & release the page pins when all DMA buffer refs go to
zero. The next access to that VA should get a new DMA buffer with the
right mapping.

In other words the invalidation should be very simple without
complicated locking, or wait_event's. Look at hfi1 for example.
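
The invalidate side of such a registration cache can be as small as
this sketch - all names invented, this is neither hfi1 nor i915 code:

  static int cache_invalidate_range_start(struct mmu_notifier *mn,
                          const struct mmu_notifier_range *range)
  {
          struct reg_cache *cache = container_of(mn, struct reg_cache, mn);
          struct reg_entry *e;

          spin_lock(&cache->lock);
          list_for_each_entry(e, &cache->entries, node) {
                  if (e->start >= range->end || e->end <= range->start)
                          continue;
                  /* orphan: future lookups miss, in-flight DMA keeps
                   * its page pins until its refcount drops to zero */
                  e->invalid = true;
          }
          spin_unlock(&cache->lock);
          return 0;
  }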

Jason


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 01:11:56PM -0400, Jerome Glisse wrote:
> On Thu, Aug 15, 2019 at 01:56:31PM -0300, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> > 
> > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > > __GFP_DIRECT_RECLAIM..
> > > >
> > > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > > allocations during OOM, then I think fs_reclaim already matches what
> > > > you described?
> > > 
> > > No GFP_NOFS is equally bad. Please read my other email explaining what
> > > the oom_reaper actually requires. In short no blocking on direct or
> > > indirect dependecy on memory allocation that might sleep.
> > 
> > It is much easier to follow with some hints on code, so the true
> > requirement is that the OOM repear not block on GFP_FS and GFP_IO
> > allocations, great, that constraint is now clear.
> > 
> > > If you can express that in the existing lockdep machinery. All
> > > fine. But then consider deployments where lockdep is no-no because
> > > of the overhead.
> > 
> > This is all for driver debugging. The point of lockdep is to find all
> > these paths without have to hit them as actual races, using debug
> > kernels.
> > 
> > I don't think we need this kind of debugging on production kernels?
> > 
> > > > The best we got was drivers tested the VA range and returned success
> > > > if they had no interest. Which is a big win to be sure, but it looks
> > > > like getting any more is not really posssible.
> > > 
> > > And that is already a great win! Because many notifiers only do care
> > > about particular mappings. Please note that backing off unconditioanlly
> > > will simply cause that the oom reaper will have to back off not doing
> > > any tear down anything.
> > 
> > Well, I'm working to propose that we do the VA range test under core
> > mmu notifier code that cannot block and then we simply remove the idea
> > of blockable from drivers using this new 'range notifier'. 
> > 
> > I think this pretty much solves the concern?
> 
> I am not sure i follow what you propose here ? Like i pointed out in
> another email for GPU we do need to be able to sleep (we might get
> lucky and not need too but this is runtime thing) within notifier
> range_start callback. This has been something allow by notifier since
> it has been introduced in the kernel.

Sorry, I mean remove the idea of the blockable flag from the
drivers. Drivers will always be able to block, within the existing
limitation of fs_reclaim

Jason


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote:
> On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> > 
> > > You have to wait for the gpu to finnish current processing in
> > > invalidate_range_start. Otherwise there's no point to any of this
> > > really. So the wait_event/dma_fence_wait are unavoidable really.
> > 
> > I don't envy your task :|
> > 
> > But, what you describe sure sounds like a 'registration cache' model,
> > not the 'shadow pte' model of coherency.
> > 
> > The key difference is that a regirstationcache is allowed to become
> > incoherent with the VMA's because it holds page pins. It is a
> > programming bug in userspace to change VA mappings via mmap/munmap/etc
> > while the device is working on that VA, but it does not harm system
> > integrity because of the page pin.
> > 
> > The cache ensures that each initiated operation sees a DMA setup that
> > matches the current VA map when the operation is initiated and allows
> > expensive device DMA setups to be re-used.
> > 
> > A 'shadow pte' model (ie hmm) *really* needs device support to
> > directly block DMA access - ie trigger 'device page fault'. ie the
> > invalidate_start should inform the device to enter a fault mode and
> > that is it.  If the device can't do that, then the driver probably
> > shouldn't persue this level of coherency. The driver would quickly get
> > into the messy locking problems like dma_fence_wait from a notifier.
> 
> I think here we do not agree on the hardware requirement. For GPU
> we will always need to be able to wait for some GPU fence from inside
> the notifier callback, there is just no way around that for many of
> the GPUs today (i do not see any indication of that changing).

I didn't say you couldn't wait, I was trying to say that the wait
should only be contingent on the HW itself. Ie you can wait on a GPU
page table lock, and you can wait on a GPU page table flush completion
via IRQ.
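
Ie the only sleep in the notifier is bounded by the HW, roughly
(invented names again):

  static void gpu_invalidate_ptes(struct gpu_vm *vm,
                                  unsigned long start, unsigned long end)
  {
          mutex_lock(&vm->pt_lock);             /* GPU page table lock */
          gpu_clear_ptes(vm, start, end);
          reinit_completion(&vm->flush_done);
          gpu_ring_tlb_flush(vm);               /* kick the HW flush */
          mutex_unlock(&vm->pt_lock);

          /* completed from the flush-done interrupt, nothing else */
          wait_for_completion(&vm->flush_done);
  }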

What is troubling is to wait till some other thread gets a GPU command
completion and decr's a kref on the DMA buffer - which kinda looks
like what this dma_fence() stuff is all about. A driver like that
would have to be super careful to ensure consistent forward progress
toward dma ref == 0 when the system is under reclaim. 

ie by running its entire IRQ flow under fs_reclaim locking.

> associated with the mm_struct. In all GPU driver so far it is a short
> lived lock and nothing blocking is done while holding it (it is just
> about updating page table directory really wether it is filling it or
> clearing it).

The main blocking I expect in a shadow PTE flow is waiting for the HW
to complete invalidations of its PTE cache.

> > It is important to identify what model you are going for as defining a
> > 'registration cache' coherence expectation allows the driver to skip
> > blocking in invalidate_range_start. All it does is invalidate the
> > cache so that future operations pick up the new VA mapping.
> > 
> > Intel's HFI RDMA driver uses this model extensively, and I think it is
> > well proven, within some limitations of course.
> > 
> > At least, 'registration cache' is the only use model I know of where
> > it is acceptable to skip invalidate_range_end.
> 
> Here GPU are not in the registration cache model, i know it might looks
> like it because of GUP but GUP was use just because hmm did not exist
> at the time.

It is not because of GUP, it is because of the lack of
invalidate_range_end. A driver cannot correctly implement the SPTE
model without invalidate_range_end, even if it holds the page pins via
GUP.

So, I've been assuming the few drivers without invalidate_range_end
are trying to do registration caching, rather than assuming they are
broken.

Jason


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:

> > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > __GFP_DIRECT_RECLAIM..
> >
> > This matches the existing test in __need_fs_reclaim() - so if you are
> > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > allocations during OOM, then I think fs_reclaim already matches what
> > you described?
> 
> No GFP_NOFS is equally bad. Please read my other email explaining what
> the oom_reaper actually requires. In short no blocking on direct or
> indirect dependecy on memory allocation that might sleep.

It is much easier to follow with some hints on code, so the true
requirement is that the OOM reaper not block on GFP_FS and GFP_IO
allocations, great, that constraint is now clear.

> If you can express that in the existing lockdep machinery. All
> fine. But then consider deployments where lockdep is no-no because
> of the overhead.

This is all for driver debugging. The point of lockdep is to find all
these paths without having to hit them as actual races, using debug
kernels.

I don't think we need this kind of debugging on production kernels?

> > The best we got was drivers tested the VA range and returned success
> > if they had no interest. Which is a big win to be sure, but it looks
> > like getting any more is not really posssible.
> 
> And that is already a great win! Because many notifiers only do care
> about particular mappings. Please note that backing off unconditioanlly
> will simply cause that the oom reaper will have to back off not doing
> any tear down anything.

Well, I'm working to propose that we do the VA range test under core
mmu notifier code that cannot block and then we simply remove the idea
of blockable from drivers using this new 'range notifier'. 

I think this pretty much solves the concern?

> > However, we could (probably even should) make the drivers fs_reclaim
> > safe.
> > 
> > If that is enough to guarantee progress of OOM, then lets consider
> > something like using current_gfp_context() to force PF_MEMALLOC_NOFS
> > allocation behavior on the driver callback and lockdep to try and keep
> > pushing on the the debugging, and dropping !blocking.
> 
> How are you going to enforce indirect dependency? E.g. a lock that is
> also used in other context which depend on sleepable memory allocation
> to move forward.

You mean like this:

   CPU0 CPU1
mutex_lock()
kmalloc(GFP_KERNEL)
mutex_unlock()
  fs_reclaim_acquire()
  mutex_lock() <- illegal: lock dep assertion

?

lockdep handles this - that is what it does, it builds a graph of all
lock dependencies and requires the graph to be acyclic. So mutex_lock
depends on fs_reclaim_lock on CPU1 while on CPU0, fs_reclaim_lock
depends on mutex_lock. This is an ABBA locking cycle and lockdep will
reliably trigger.

So, if we wanted to do this, I'd probably suggest we retool
fs_reclaim's interface to be more like

  prevent_gfp_flags(__GFP_IO | __GFP_FS);
  [..]
  restore_gfp_flags(__GFP_IO | __GFP_FS);

Which is lockdep based and follows the indirect lock dependencies you
talked about.
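
Under the covers it could just be one fake lockdep map per forbidden
flag, exactly like fs_reclaim - nothing below exists, it only shows the
shape:

  static struct lockdep_map gfp_fs_map =
          STATIC_LOCKDEP_MAP_INIT("gfp_fs", &gfp_fs_map);
  static struct lockdep_map gfp_io_map =
          STATIC_LOCKDEP_MAP_INIT("gfp_io", &gfp_io_map);

  static inline void prevent_gfp_flags(gfp_t flags)
  {
          if (flags & __GFP_FS)
                  lock_map_acquire(&gfp_fs_map);
          if (flags & __GFP_IO)
                  lock_map_acquire(&gfp_io_map);
  }

  static inline void restore_gfp_flags(gfp_t flags)
  {
          if (flags & __GFP_IO)
                  lock_map_release(&gfp_io_map);
          if (flags & __GFP_FS)
                  lock_map_release(&gfp_fs_map);
  }

The allocator side then does the momentary acquire/release of whichever
maps match the gfp_mask it was handed, same as fs_reclaim_acquire()
does today.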

Then OOM and reclaim can specify the different flags they want
blocked.  We could probably use the same API with WQ_MEM_RECLAIM as I
was chatting with Tejun about:

https://www.spinics.net/lists/linux-rdma/msg77362.html

IMHO this stuff is *super complicated* for those of us outside the mm
subsystem, so having some really clear annotations like the above
would go a long way to help understand these special constraints.

I'm pretty sure there are lots of driver bugs related to using the
wrong GFP flags in the kernel.

> Really, this was aimed to give a very simple debugging aid. If it is
> considered to be so controversial, even though I really do not see why,
> then let's just drop it on the floor.

My concern is that the requirement was very unclear. I'm trying to
understand all the bits of how these notifiers work, and the exact
semantics of this OOM path have been vague as to whether it is really
some GFP flag restriction or truly related to not sleeping.

Jason


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:

> You have to wait for the gpu to finnish current processing in
> invalidate_range_start. Otherwise there's no point to any of this
> really. So the wait_event/dma_fence_wait are unavoidable really.

I don't envy your task :|

But, what you describe sure sounds like a 'registration cache' model,
not the 'shadow pte' model of coherency.

The key difference is that a registration cache is allowed to become
incoherent with the VMAs because it holds page pins. It is a
programming bug in userspace to change VA mappings via mmap/munmap/etc
while the device is working on that VA, but it does not harm system
integrity because of the page pin.

The cache ensures that each initiated operation sees a DMA setup that
matches the current VA map when the operation is initiated and allows
expensive device DMA setups to be re-used.

A 'shadow pte' model (ie hmm) *really* needs device support to
directly block DMA access - ie trigger 'device page fault'. ie the
invalidate_start should inform the device to enter a fault mode and
that is it.  If the device can't do that, then the driver probably
shouldn't pursue this level of coherency. The driver would quickly get
into the messy locking problems like dma_fence_wait from a notifier.

It is important to identify what model you are going for as defining a
'registration cache' coherence expectation allows the driver to skip
blocking in invalidate_range_start. All it does is invalidate the
cache so that future operations pick up the new VA mapping.

Intel's HFI RDMA driver uses this model extensively, and I think it is
well proven, within some limitations of course.

At least, 'registration cache' is the only use model I know of where
it is acceptable to skip invalidate_range_end.

Jason


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 03:12:11PM +0200, Daniel Vetter wrote:
> On Thu, Aug 15, 2019 at 3:04 PM Jason Gunthorpe  wrote:
> >
> > On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote:
> >
> > > As the oom reaper is the primary guarantee of the oom handling forward
> > > progress it cannot be blocked on anything that might depend on blockable
> > > memory allocations. These are not really easy to track because they
> > > might be indirect - e.g. notifier blocks on a lock which other context
> > > holds while allocating memory or waiting for a flusher that needs memory
> > > to perform its work.
> >
> > But lockdep *does* track all this and fs_reclaim_acquire() was created
> > to solve exactly this problem.
> >
> > fs_reclaim is a lock and it flows through all the usual lockdep
> > schemes like any other lock. Any time the page allocator wants to do
> > something the would deadlock with reclaim it takes the lock.
> >
> > Failure is expressed by a deadlock cycle in the lockdep map, and
> > lockdep can handle arbitary complexity through layers of locks, work
> > queues, threads, etc.
> >
> > What is missing?
> 
> Lockdep doens't seen everything by far. E.g. a wait_event will be
> caught by the annotations here, but not by lockdep. 

Sure, but the wait_event might be OK if its progress isn't contingent
on fs_reclaim, ie triggered from interrupt, so why ban it?

> And since we're talking about mmu notifiers here and gpus/dma engines.
> We have dma_fence_wait, which can wait for any hw/driver in the system
> that takes part in shared/zero-copy buffer processing. Which at least
> on the graphics side is everything. This pulls in enormous amounts of
> deadlock potential that lockdep simply is blind about and will never
> see.

It seems very risky to entangle a notifier widely like that.

It seems pretty clear that notifiers run inside fs_reclaim, so at a minimum
that wait_event can't be contingent on anything that is doing
GFP_KERNEL or it will deadlock - and blockable doesn't make that sleep
safe.

Avoiding an uncertain wait_event under notifiers would seem to be the
only reasonable design here..

Jason


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 03:21:27PM +0200, Michal Hocko wrote:
> On Thu 15-08-19 09:23:44, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote:
> > > On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> > > > > In some special cases we must not block, but there's not a
> > > > > spinlock, preempt-off, irqs-off or similar critical section already
> > > > > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > > > > pair to annotate these.
> > > > > 
> > > > > This will be used in the oom paths of mmu-notifiers, where blocking is
> > > > > not allowed to make sure there's forward progress. Quoting Michal:
> > > > > 
> > > > > "The notifier is called from quite a restricted context - oom_reaper -
> > > > > which shouldn't depend on any locks or sleepable conditionals. The 
> > > > > code
> > > > > should be swift as well but we mostly do care about it to make a 
> > > > > forward
> > > > > progress. Checking for sleepable context is the best thing we could 
> > > > > come
> > > > > up with that would describe these demands at least partially."
> > > > 
> > > > But this describes fs_reclaim_acquire() - is there some reason we are
> > > > conflating fs_reclaim with non-sleeping?
> > > 
> > > No idea why you tie this into fs_reclaim. We can definitely sleep in there,
> > > and for e.g. kswapd (which also wraps everything in fs_reclaim) we're
> > > even supposed to, I thought. To make sure we can get at the last bit of
> > > memory by flushing all the queues and waiting for everything to be cleaned
> > > out.
> > 
> > AFAIK the point of fs_reclaim is to prevent "indirect dependency upon
> > the page allocator", ie a justification that was given for this !blockable
> > stuff.
> > 
> > For instance:
> > 
> >   fs_reclaim_acquire()
> >   kmalloc(GFP_KERNEL) <- lock dep assertion
> > 
> > And further, Michal's concern about indirectness through locks is also
> > handled by lockdep:
> > 
> >    CPU0                          CPU1
> >                                  mutex_lock()
> >                                  kmalloc(GFP_KERNEL)
> >                                  mutex_unlock()
> >    fs_reclaim_acquire()
> >    mutex_lock() <- lock dep assertion
> > 
> > In other words, to prevent recursion into the page allocator you use
> > fs_reclaim_acquire(), and lockdep verifies it in its usual robust way.
> 
> fs_reclaim_acquire is about FS/IO recursions IIUC. We are talking about
> any !GFP_NOWAIT allocation context here and any {in}direct dependency on
> it. 

AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
__GFP_DIRECT_RECLAIM..

This matches the existing test in __need_fs_reclaim() - so if you are
OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
allocations during OOM, then I think fs_reclaim already matches what
you described?

> Whether fs_reclaim_acquire can be reused for that I do not know
> because I am not familiar with the lockdep machinery enough

Well, if fs_reclaim is not already testing the flags you want, then we
could add another lockdep map that does. The basic principle is the
same, if you want to detect and prevent recursion into the allocator
under certain GFP flags, then AFAIK lockdep is the best tool we
have.
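
To make that concrete, here is a rough sketch (all names invented, this
is not an existing kernel API) of such a dedicated lockdep map, modelled
on fs_reclaim: the OOM/notifier side "holds" the map while the callback
runs, and the allocator does an acquire/release for every allocation
whose GFP flags we care about, so lockdep builds the dependency chain
for us:

#include <linux/lockdep.h>
#include <linux/gfp.h>

static struct lockdep_map oom_notifier_map =
    STATIC_LOCKDEP_MAP_INIT("oom_notifier", &oom_notifier_map);

/* allocator side: record the allocation as a dependency */
static void oom_notifier_note_alloc(gfp_t gfp_mask)
{
    if (gfp_mask & __GFP_DIRECT_RECLAIM) { /* ie any sleeping allocation */
        lock_map_acquire(&oom_notifier_map);
        lock_map_release(&oom_notifier_map);
    }
}

/* OOM/notifier side: hold the map while the driver callback runs */
static void oom_notifier_enter(void)
{
    lock_map_acquire(&oom_notifier_map);
}

static void oom_notifier_exit(void)
{
    lock_map_release(&oom_notifier_map);
}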

> No, non-blocking is a very coarse approximation of what we really need.
> But it should give us even a stronger condition. Essentially any sleep
> other than a preemption shouldn't be allowed in that context.

But it is a nonsense API to give the driver invalidate_range_start,
the blocking alternative to the non-blocking invalidate_range, and then
demand that it be non-blocking.

Inspecting the code, no drivers are actually able to progress their
side in non-blocking mode.

The best we got was drivers testing the VA range and returning success
if they had no interest. That is a big win to be sure, but it looks
like getting any more is not really possible.

However, we could (probably even should) make the drivers fs_reclaim
safe.

If that is enough to guarantee progress of OOM, then let's consider
something like using current_gfp_context() to force PF_MEMALLOC_NOFS
allocation behavior on the driver callback and lockdep to try and keep
pushing on the debugging, and dropping !blocking.
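
As a rough sketch of that idea (the wrapper name and hook point are
made up, the real change would live in the mmu_notifier core), forcing
NOFS behavior around the callback would look something like:

#include <linux/sched/mm.h>
#include <linux/mmu_notifier.h>

static int notifier_call_range_start(struct mmu_notifier *mn,
                                     const struct mmu_notifier_range *range)
{
    unsigned int flags;
    int ret;

    /* Any allocation the driver does below is treated as GFP_NOFS
     * via current_gfp_context(), so it cannot recurse into FS
     * reclaim. */
    flags = memalloc_nofs_save();
    ret = mn->ops->invalidate_range_start(mn, range);
    memalloc_nofs_restore(flags);

    return ret;
}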

Jason


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote:

> As the oom reaper is the primary guarantee of the oom handling forward
> progress it cannot be blocked on anything that might depend on blockable
> memory allocations. These are not really easy to track because they
> might be indirect - e.g. notifier blocks on a lock which other context
> holds while allocating memory or waiting for a flusher that needs memory
> to perform its work.

But lockdep *does* track all this and fs_reclaim_acquire() was created
to solve exactly this problem.

fs_reclaim is a lock and it flows through all the usual lockdep
schemes like any other lock. Any time the page allocator wants to do
something that would deadlock with reclaim it takes the lock.

Failure is expressed by a deadlock cycle in the lockdep map, and
lockdep can handle arbitrary complexity through layers of locks, work
queues, threads, etc.

What is missing?

Jason


Re: [PATCH 4/5] mm, notifier: Add a lockdep map for invalidate_range_start

2019-08-15 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 09:10:14AM +0200, Daniel Vetter wrote:
> On Wed, Aug 14, 2019 at 09:09:59PM -0300, Jason Gunthorpe wrote:
> > On Wed, Aug 14, 2019 at 10:20:26PM +0200, Daniel Vetter wrote:
> > > This is a similar idea to the fs_reclaim fake lockdep lock. It's
> > > fairly easy to provoke a specific notifier to be run on a specific
> > > range: Just prep it, and then munmap() it.
> > > 
> > > A bit harder, but still doable, is to provoke the mmu notifiers for
> > > all the various callchains that might lead to them. But both at the
> > > same time is really hard to reliably hit, especially when you want to
> > > exercise paths like direct reclaim or compaction, where it's not
> > > easy to control what exactly will be unmapped.
> > > 
> > > By introducing a lockdep map to tie them all together we allow lockdep
> > > to see a lot more dependencies, without having to actually hit them
> > > in a single callchain while testing.
> > > 
> > > Aside: Since I typed this to test i915 mmu notifiers I've only rolled
> > > this out for the invalidate_range_start callback. If there's
> > > interest, we should probably roll this out to all of them. But my
> > > understanding of core mm is seriously lacking, and I'm not clear on
> > > whether we need a lockdep map for each callback, or whether some can
> > > be shared.
> > 
> > I was thinking about doing something like this..
> > 
> > IMHO only range_end needs annotation, the other ops are either already
> > non-sleeping or only used by KVM.
>
> This isn't about sleeping, this is about locking loops. And the biggest
> risk for that is from driver code, and at least hmm_mirror only has the
> driver code callback on invalidate_range_start. One thing I discovered
> using this (and it would be really hard to spot, it's deeply nested) is
> that i915 userptr.

Sorry, that came out wrong, what I meant is that only range_end and
range_start really need annotation.

The other places are only used by KVM and are called in non-sleeping
contexts, so they already can't recurse back onto the popular sleeping
locks like mmap_sem or what not, can't do allocations, etc.  I don't
see a lot of return in investing in them.

> > BTW, I have found it strange that i915 only uses
> > invalidate_range_start. Not really sure how it is able to do
> > that. Would love to know the answer :)
> 
> I suspect it's broken :-/ Our userptr is ... not the best. Part of the
> motivation here.

I was wondering if it is what we call in RDMA a 'registration cache'
ie you assume that userspace is well behaved while DMA is ongoing and
just use the notifier to invalidate cached dma mappings.

The hallmark of this pattern is that it holds the page pin the entire
time DMA is active, which is why it isn't a bug, it is just best
described as loosely coherent.
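
A minimal sketch of that pattern (structure and function names invented
for illustration, this is not the hfi1 or i915 code): the pages are
pinned when the region is registered and stay pinned while DMA may use
them, and the notifier only marks the cached translation stale so the
next command submission re-validates it:

#include <linux/mmu_notifier.h>
#include <linux/spinlock.h>
#include <linux/list.h>

struct reg_cache_entry {
    struct list_head list;
    unsigned long start, last;   /* VA range, inclusive */
    struct page **pages;         /* pinned for the whole DMA lifetime */
    unsigned long npages;
    bool valid;                  /* cached DMA mapping still usable? */
};

struct reg_cache {
    struct mmu_notifier mn;
    spinlock_t lock;
    struct list_head entries;
};

static int reg_cache_invalidate_range_start(struct mmu_notifier *mn,
                const struct mmu_notifier_range *range)
{
    struct reg_cache *cache = container_of(mn, struct reg_cache, mn);
    struct reg_cache_entry *e;

    /* The pages stay pinned, so in-flight DMA remains safe; we only
     * mark the cached translation stale, which is what makes this
     * model "loosely coherent". */
    spin_lock(&cache->lock);
    list_for_each_entry(e, &cache->entries, list)
        if (e->start < range->end && range->start <= e->last)
            e->valid = false;
    spin_unlock(&cache->lock);

    return 0;
}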

But, in RDMA the best-practice is to do this in userspace with
userfaultfd as it is very expensive to take a syscall on command
submission to have the kernel figure it out.

> > And if we do decide on the reclaim thing in my other email then the
> > reclaim dependency can be reliably injected by doing:
> > 
> >  fs_reclaim_acquire();
> >  lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
> >  lock_map_release(&__mmu_notifier_invalidate_range_start_map);
> >  fs_reclaim_release();
> > 
> > If I understand lockdep properly..
> 
> Ime fs_reclaim injects the mmu_notifier map here reliably as soon as
> you've thrown out the first pagecache mmap on any process. That "make sure
> we inject it quickly" is why the lockdep is _outside_ of the
> mm_has_notifiers() check. So no further injection needed imo.

I suspect a lot of our automated testing, like syzkaller in restricted
kvms, probably does not reliably trigger a fs_reclaim, so I would very
much prefer to inject it 100% of the time directly if we are sure this
is a reclaim context because of the i_mmap_rwsem I mentioned before.

Jason


Re: [PATCH 3/5] mm, notifier: Catch sleeping/blocking for !blockable

2019-08-15 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 09:02:49AM +0200, Daniel Vetter wrote:
> On Wed, Aug 14, 2019 at 09:00:29PM -0300, Jason Gunthorpe wrote:
> > On Wed, Aug 14, 2019 at 10:20:25PM +0200, Daniel Vetter wrote:
> > > We need to make sure implementations don't cheat and don't have a
> > > possible schedule/blocking point deeply buried where review can't
> > > catch it.
> > > 
> > > I'm not sure whether this is the best way to make sure all the
> > > might_sleep() callsites trigger, and it's a bit ugly in the code flow.
> > > But it gets the job done.
> > > 
> > > Inspired by an i915 patch series which did exactly that, because the
> > > rules haven't been entirely clear to us.
> > 
> > I thought lockdep already was able to detect:
> > 
> >  spin_lock()
> >  might_sleep();
> >  spin_unlock()
> > 
> > Am I mistaken? If yes, couldn't this patch just inject a dummy lockdep
> > spinlock?
> 
> Hm ... assuming I didn't get lost in the maze I think might_sleep (well
> ___might_sleep) doesn't do any lockdep checking at all. And we want
> might_sleep, since that catches a lot more than lockdep.

Don't know how it works, but it sure looks like it does:

This:
spin_lock(>uobjects_lock);
down_read(>hw_destroy_rwsem);
up_read(>hw_destroy_rwsem);
spin_unlock(>uobjects_lock);

Causes:

[   33.324729] BUG: sleeping function called from invalid context at 
kernel/locking/rwsem.c:1444
[   33.325599] in_atomic(): 1, irqs_disabled(): 0, pid: 247, name: ibv_devinfo
[   33.326115] 3 locks held by ibv_devinfo/247:
[   33.326556]  #0: 9edf8379 (_dev->disassociate_srcu){}, 
at: ib_uverbs_open+0xff/0x5f0 [ib_uverbs]
[   33.327657]  #1: 5e0eddf1 (_dev->lists_mutex){+.+.}, at: 
ib_uverbs_open+0x16c/0x5f0 [ib_uverbs]
[   33.328682]  #2: 505f509e (&(>uobjects_lock)->rlock){+.+.}, 
at: ib_uverbs_open+0x31a/0x5f0 [ib_uverbs]

And this:

spin_lock(>uobjects_lock);
might_sleep();
spin_unlock(>uobjects_lock);

Causes:

[   16.867211] BUG: sleeping function called from invalid context at 
drivers/infiniband/core/uverbs_main.c:1095
[   16.867776] in_atomic(): 1, irqs_disabled(): 0, pid: 245, name: ibv_devinfo
[   16.868098] 3 locks held by ibv_devinfo/245:
[   16.868383]  #0: 4c5954ff (_dev->disassociate_srcu){}, 
at: ib_uverbs_open+0xf8/0x600 [ib_uverbs]
[   16.868938]  #1: 20a6fae2 (_dev->lists_mutex){+.+.}, at: 
ib_uverbs_open+0x16c/0x600 [ib_uverbs]
[   16.869568]  #2: 036e6a97 (&(>uobjects_lock)->rlock){+.+.}, 
at: ib_uverbs_open+0x317/0x600 [ib_uverbs]

I think this is done in some very expensive way, so it probably only
works when lockdep is enabled..

Jason


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-15 Thread Jason Gunthorpe
On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote:
> On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote:
> > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> > > In some special cases we must not block, but there's not a
> > > spinlock, preempt-off, irqs-off or similar critical section already
> > > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > > pair to annotate these.
> > > 
> > > This will be used in the oom paths of mmu-notifiers, where blocking is
> > > not allowed to make sure there's forward progress. Quoting Michal:
> > > 
> > > "The notifier is called from quite a restricted context - oom_reaper -
> > > which shouldn't depend on any locks or sleepable conditionals. The code
> > > should be swift as well but we mostly do care about it to make a forward
> > > progress. Checking for sleepable context is the best thing we could come
> > > up with that would describe these demands at least partially."
> > 
> > But this describes fs_reclaim_acquire() - is there some reason we are
> > conflating fs_reclaim with non-sleeping?
> 
> No idea why you tie this into fs_reclaim. We can definitely sleep in there,
> and for e.g. kswapd (which also wraps everything in fs_reclaim) we're
> even supposed to, I thought. To make sure we can get at the last bit of
> memory by flushing all the queues and waiting for everything to be cleaned
> out.

AFAIK the point of fs_reclaim is to prevent "indirect dependency upon
the page allocator", ie a justification that was given for this !blockable
stuff.

For instance:

  fs_reclaim_acquire()
  kmalloc(GFP_KERNEL) <- lock dep assertion

And further, Michal's concern about indirectness through locks is also
handled by lockdep:

   CPU0                          CPU1
                                 mutex_lock()
                                 kmalloc(GFP_KERNEL)
                                 mutex_unlock()
   fs_reclaim_acquire()
   mutex_lock() <- lock dep assertion

In other words, to prevent recursion into the page allocator you use
fs_reclaim_acquire(), and lockdep verifies it in its usual robust way.

I asked Tejun about this once in regards to WQ_MEM_RECLAIM and he
explained that it means you can't call the allocator functions in a
way that would recurse into reclaim (ie instead use GFP_ATOMIC, or
tolerate allocation failure, or various other things).
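
For illustration (my example, not Tejun's), a work item running on a
WQ_MEM_RECLAIM workqueue would follow that rule roughly like this:

#include <linux/workqueue.h>
#include <linux/slab.h>
#include <linux/sizes.h>

static void reclaim_safe_work(struct work_struct *work)
{
    void *buf;

    /* No GFP_KERNEL here: that could recurse into reclaim, which this
     * workqueue exists to make progress for. Either stay atomic or
     * tolerate failure. */
    buf = kmalloc(SZ_4K, GFP_NOWAIT | __GFP_NOWARN);
    if (!buf)
        return;

    /* ... do the reclaim-related work with buf ... */
    kfree(buf);
}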

So, the reason I bring it up is half the justifications you posted for
blockable had to do with not recursing into reclaim and deadlocking,
and didn't seem to have much to do with blocking.

I'm asking if *non-blocking* is really the requirement or if this is
just the usual 'do not deadlock on the allocator' thing reclaim paths
already have?

Jason


Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-15 Thread Jason Gunthorpe
On Wed, Aug 14, 2019 at 09:38:54AM +0200, Christoph Hellwig wrote:
> On Tue, Aug 13, 2019 at 06:36:33PM -0700, Dan Williams wrote:
> > Section alignment constraints somewhat save us here. The only example
> > I can think of a PMD not containing a uniform pgmap association for
> > each pte is the case when the pgmap overlaps normal dram, i.e. shares
> > the same 'struct memory_section' for a given span. Otherwise, distinct
> > pgmaps arrange to manage their own exclusive sections (and now
> > subsections as of v5.3). Otherwise the implementation could not
> > guarantee different mapping lifetimes.
> > 
> > That said, this seems to want a better mechanism to determine "pfn is
> > ZONE_DEVICE".
> 
> So I guess this patch is fine for now, and once you provide a better
> mechanism we can switch over to it?

What about the version I sent to just get rid of all the strange
put_dev_pagemaps while scanning? Odds are good we will work with only
a single pagemap, so it makes some sense to cache it once we find it?

diff --git a/mm/hmm.c b/mm/hmm.c
index 9a908902e4cc38..4e30128c23a505 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -497,10 +497,6 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
}
pfns[i] = hmm_device_entry_from_pfn(range, pfn) | cpu_flags;
}
-   if (hmm_vma_walk->pgmap) {
-   put_dev_pagemap(hmm_vma_walk->pgmap);
-   hmm_vma_walk->pgmap = NULL;
-   }
hmm_vma_walk->last = end;
return 0;
 #else
@@ -604,10 +600,6 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, 
unsigned long addr,
return 0;
 
 fault:
-   if (hmm_vma_walk->pgmap) {
-   put_dev_pagemap(hmm_vma_walk->pgmap);
-   hmm_vma_walk->pgmap = NULL;
-   }
pte_unmap(ptep);
/* Fault any virtual address we were asked to fault */
return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
@@ -690,16 +682,6 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
return r;
}
}
-   if (hmm_vma_walk->pgmap) {
-   /*
-* We do put_dev_pagemap() here and not in hmm_vma_handle_pte()
-* so that we can leverage get_dev_pagemap() optimization which
-* will not re-take a reference on a pgmap if we already have
-* one.
-*/
-   put_dev_pagemap(hmm_vma_walk->pgmap);
-   hmm_vma_walk->pgmap = NULL;
-   }
pte_unmap(ptep - 1);
 
hmm_vma_walk->last = addr;
@@ -751,10 +733,6 @@ static int hmm_vma_walk_pud(pud_t *pudp,
pfns[i] = hmm_device_entry_from_pfn(range, pfn) |
  cpu_flags;
}
-   if (hmm_vma_walk->pgmap) {
-   put_dev_pagemap(hmm_vma_walk->pgmap);
-   hmm_vma_walk->pgmap = NULL;
-   }
hmm_vma_walk->last = end;
return 0;
}
@@ -1026,6 +1004,14 @@ long hmm_range_fault(struct hmm_range *range, unsigned 
int flags)
/* Keep trying while the range is valid. */
} while (ret == -EBUSY && range->valid);
 
+   /*
+* We do put_dev_pagemap() here so that we can leverage
+* get_dev_pagemap() optimization which will not re-take a
+* reference on a pgmap if we already have one.
+*/
+   if (hmm_vma_walk->pgmap)
+   put_dev_pagemap(hmm_vma_walk->pgmap);
+
if (ret) {
unsigned long i;
 


Re: [PATCH v3 hmm 03/11] mm/mmu_notifiers: add a get/put scheme for the registration

2019-08-15 Thread Jason Gunthorpe
On Wed, Aug 14, 2019 at 02:20:31PM -0700, Ralph Campbell wrote:
> 
> On 8/6/19 4:15 PM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe 
> > 
> > Many places in the kernel have a flow where userspace will create some
> > object and that object will need to connect to the subsystem's
> > mmu_notifier subscription for the duration of its lifetime.
> > 
> > In this case the subsystem is usually tracking multiple mm_structs and it
> > is difficult to keep track of what struct mmu_notifier's have been
> > allocated for what mm's.
> > 
> > Since this has been open coded in a variety of exciting ways, provide core
> > functionality to do this safely.
> > 
> > This approach uses the strct mmu_notifier_ops * as a key to determine if
> 
> s/strct/struct

Yes, thanks for all of this, I like having comments, but I'm a
terrible proofreader :(

Jason

Re: [PATCH v3 hmm 04/11] misc/sgi-gru: use mmu_notifier_get/put for struct gru_mm_struct

2019-08-15 Thread Jason Gunthorpe
On Thu, Aug 08, 2019 at 12:25:56PM +0200, Christoph Hellwig wrote:
> Looks good,
> 
> Reviewed-by: Christoph Hellwig 

Dimitri, are you OK with this patch?

Thanks,
Jason

Re: [PATCH v3 hmm 08/11] drm/radeon: use mmu_notifier_get/put for struct radeon_mn

2019-08-15 Thread Jason Gunthorpe
On Tue, Aug 06, 2019 at 08:15:45PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe 
> 
> radeon is using a device global hash table to track what mmu_notifiers
> have been registered on struct mm. This is better served with the new
> get/put scheme instead.
> 
> radeon has a bug where it was not blocking notifier release() until all
> the BO's had been invalidated. This could result in a use after free of
> pages in the BOs. This is tied into a second bug where radeon left the
> notifiers running endlessly even once the interval tree became
> empty. This could result in a use after free with module unload.
> 
> Both are fixed by changing the lifetime model, the BOs exist in the
> interval tree with their natural lifetimes independent of the mm_struct
> lifetime using the get/put scheme. The release runs synchronously and just
> does invalidate_start across the entire interval tree to create the
> required DMA fence.
> 
> Additions to the interval tree after release are already impossible as
> only current->mm is used during the add.
> 
> Signed-off-by: Jason Gunthorpe 
>  drivers/gpu/drm/radeon/radeon.h|   3 -
>  drivers/gpu/drm/radeon/radeon_device.c |   2 -
>  drivers/gpu/drm/radeon/radeon_drv.c|   2 +
>  drivers/gpu/drm/radeon/radeon_mn.c | 157 ++---
>  4 files changed, 38 insertions(+), 126 deletions(-)

AMD team: Are you OK with this patch?

Jason

Re: [PATCH 5/5] mm/hmm: WARN on illegal ->sync_cpu_device_pagetables errors

2019-08-14 Thread Jason Gunthorpe
On Wed, Aug 14, 2019 at 10:20:27PM +0200, Daniel Vetter wrote:
> Similar to the warning in the mmu notifier, warning if an hmm mirror
> callback gets its blocking vs. nonblocking handling wrong, or if it
> fails with anything other than -EAGAIN.
> 
> Cc: Jason Gunthorpe 
> Cc: Ralph Campbell 
> Cc: John Hubbard 
> Cc: Dan Williams 
> Cc: Dan Carpenter 
> Cc: Matthew Wilcox 
> Cc: Arnd Bergmann 
> Cc: Balbir Singh 
> Cc: Ira Weiny 
> Cc: Souptick Joarder 
> Cc: Andrew Morton 
> Cc: "Jérôme Glisse" 
> Cc: linux...@kvack.org
> Signed-off-by: Daniel Vetter 
>  mm/hmm.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 16b6731a34db..52ac59384268 100644
> +++ b/mm/hmm.c
> @@ -205,6 +205,9 @@ static int hmm_invalidate_range_start(struct mmu_notifier 
> *mn,
>   ret = -EAGAIN;
>   break;
>   }
> + WARN(ret, "%pS callback failed with %d in %sblockable 
> context\n",
> +  mirror->ops->sync_cpu_device_pagetables, ret,
> +  update.blockable ? "" : "non-");
>   }
>   up_read(&hmm->mirrors_sem);

Didn't I beat you to this?

	list_for_each_entry(mirror, &hmm->mirrors, list) {
		int rc;

		rc = mirror->ops->sync_cpu_device_pagetables(mirror, &update);
		if (rc) {
			if (WARN_ON(update.blockable || rc != -EAGAIN))
				continue;
			ret = -EAGAIN;
			break;
		}
	}

Thanks,
Jason


Re: [PATCH 4/5] mm, notifier: Add a lockdep map for invalidate_range_start

2019-08-14 Thread Jason Gunthorpe
On Wed, Aug 14, 2019 at 10:20:26PM +0200, Daniel Vetter wrote:
> This is a similar idea to the fs_reclaim fake lockdep lock. It's
> fairly easy to provoke a specific notifier to be run on a specific
> range: Just prep it, and then munmap() it.
> 
> A bit harder, but still doable, is to provoke the mmu notifiers for
> all the various callchains that might lead to them. But both at the
> same time is really hard to reliably hit, especially when you want to
> exercise paths like direct reclaim or compaction, where it's not
> easy to control what exactly will be unmapped.
> 
> By introducing a lockdep map to tie them all together we allow lockdep
> to see a lot more dependencies, without having to actually hit them
> in a single callchain while testing.
> 
> Aside: Since I typed this to test i915 mmu notifiers I've only rolled
> this out for the invalidate_range_start callback. If there's
> interest, we should probably roll this out to all of them. But my
> understanding of core mm is seriously lacking, and I'm not clear on
> whether we need a lockdep map for each callback, or whether some can
> be shared.

I was thinking about doing something like this..

IMHO only range_end needs annotation, the other ops are either already
non-sleeping or only used by KVM.

BTW, I have found it strange that i915 only uses
invalidate_range_start. Not really sure how it is able to do
that. Would love to know the answer :)

> Reviewed-by: Jérôme Glisse 
> Signed-off-by: Daniel Vetter 
>  include/linux/mmu_notifier.h | 6 ++
>  mm/mmu_notifier.c| 7 +++
>  2 files changed, 13 insertions(+)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index b6c004bd9f6a..9dd38c32fc53 100644
> +++ b/include/linux/mmu_notifier.h
> @@ -42,6 +42,10 @@ enum mmu_notifier_event {
>  
>  #ifdef CONFIG_MMU_NOTIFIER
>  
> +#ifdef CONFIG_LOCKDEP
> +extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;
> +#endif

I wonder what the trade off is between having a global map vs a map in
each mmu_notifier_mm?

>  /*
>   * The mmu notifier_mm structure is allocated and installed in
>   * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
> @@ -310,10 +314,12 @@ static inline void mmu_notifier_change_pte(struct 
> mm_struct *mm,
>  static inline void
>  mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
>  {
> + lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
>   if (mm_has_notifiers(range->mm)) {
>   range->flags |= MMU_NOTIFIER_RANGE_BLOCKABLE;
>   __mmu_notifier_invalidate_range_start(range);
>   }
> + lock_map_release(&__mmu_notifier_invalidate_range_start_map);
>  }

Also range_end should have this too - it has all the same
constraints. I think it can share the map. So 'range_start_map' is
probably not the right name.

It may also make some sense to do a dummy acquire/release under the
mm_take_all_locks() to forcibly increase map coverage and reduce the
scenario complexity required to hit bugs.

And if we do decide on the reclaim thing in my other email then the
reclaim dependency can be reliably injected by doing:

 fs_reclaim_acquire();
 lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
 lock_map_release(&__mmu_notifier_invalidate_range_start_map);
 fs_reclaim_release();

If I understand lockdep properly..

Jason


Re: [PATCH 3/5] mm, notifier: Catch sleeping/blocking for !blockable

2019-08-14 Thread Jason Gunthorpe
On Wed, Aug 14, 2019 at 10:20:25PM +0200, Daniel Vetter wrote:
> We need to make sure implementations don't cheat and don't have a
> possible schedule/blocking point deeply burried where review can't
> catch it.
> 
> I'm not sure whether this is the best way to make sure all the
> might_sleep() callsites trigger, and it's a bit ugly in the code flow.
> But it gets the job done.
> 
> Inspired by an i915 patch series which did exactly that, because the
> rules haven't been entirely clear to us.

I thought lockdep already was able to detect:

 spin_lock()
 might_sleep();
 spin_unlock()

Am I mistaken? If yes, couldn't this patch just inject a dummy lockdep
spinlock?
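
For what it's worth, a minimal sketch of what "inject a dummy lockdep
spinlock" could look like (illustrative only - a real spinlock also
disables preemption, so this would have to be debug-only, which is part
of why a dedicated annotation is nicer):

static DEFINE_SPINLOCK(non_block_dummy_lock);

/* Wrapping the notifier callback in a never-contended spinlock makes
 * any might_sleep() reached inside it fire, because the spinlock marks
 * the context atomic. */
static void non_block_check_start(void)
{
    spin_lock(&non_block_dummy_lock);
}

static void non_block_check_end(void)
{
    spin_unlock(&non_block_dummy_lock);
}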

Jason


Re: [PATCH 2/5] kernel.h: Add non_block_start/end()

2019-08-14 Thread Jason Gunthorpe
On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> In some special cases we must not block, but there's not a
> spinlock, preempt-off, irqs-off or similar critical section already
> that arms the might_sleep() debug checks. Add a non_block_start/end()
> pair to annotate these.
> 
> This will be used in the oom paths of mmu-notifiers, where blocking is
> not allowed to make sure there's forward progress. Quoting Michal:
> 
> "The notifier is called from quite a restricted context - oom_reaper -
> which shouldn't depend on any locks or sleepable conditionals. The code
> should be swift as well but we mostly do care about it to make a forward
> progress. Checking for sleepable context is the best thing we could come
> up with that would describe these demands at least partially."

But this describes fs_reclaim_acquire() - is there some reason we are
conflating fs_reclaim with non-sleeping?

ie is there some fundamental difference between the block stack
sleeping during reclaim while it waits for a driver to write out a
page and a GPU driver sleeping during OOM while it waits for it's HW
to fence DMA on a page?

Fundamentally we have invalidate_range_start() vs invalidate_range()
as the start() version is able to sleep. If drivers can do their work
without sleeping then they should be using invalidate_range() instead.

Thus, it doesn't seem to make any sense to ask a driver that requires a
sleeping API not to sleep.

AFAICT what is really going on here is that drivers care about only a
subset of the VA space, and we want to query the driver if it cares
about the range proposed to be OOM'd, so we can OOM ranges that do
not have SPTEs.

ie if you look, pretty much all drivers do exactly as
userptr_mn_invalidate_range_start() does, and bail once they detect
the VA range is not of interest.

So, I'm working on a patch to lift the interval tree into the notifier
core and then do the VA test OOM needs without bothering the
driver. Drivers can retain the blocking API they require and OOM can
work on VA's that don't have SPTEs.
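
For reference, the open-coded pattern being described looks roughly
like this (a sketch with invented driver names, not the actual i915
code):

#include <linux/interval_tree.h>
#include <linux/mmu_notifier.h>

struct drv_userptr_mn {
    struct mmu_notifier mn;
    spinlock_t lock;
    struct rb_root_cached objects;  /* interval tree of tracked VAs */
};

static int drv_invalidate_range_start(struct mmu_notifier *mn,
                const struct mmu_notifier_range *range)
{
    struct drv_userptr_mn *dmn =
        container_of(mn, struct drv_userptr_mn, mn);
    struct interval_tree_node *it;

    spin_lock(&dmn->lock);
    it = interval_tree_iter_first(&dmn->objects, range->start,
                                  range->end - 1);
    spin_unlock(&dmn->lock);
    if (!it)
        return 0;       /* VA range is not of interest, bail early */

    if (!mmu_notifier_range_blockable(range))
        return -EAGAIN; /* the real work below may sleep */

    /* ... wait on fences and invalidate the overlapping objects ... */
    return 0;
}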

This approach also solves the critical bug in this path:
  https://lore.kernel.org/linux-mm/20190807191627.ga3...@ziepe.ca/

And solves a bunch of other bugs in the drivers.

> Peter also asked whether we want to catch spinlocks on top, but Michal
> said those are less of a problem because spinlocks can't have an
> indirect dependency upon the page allocator and hence close the loop
> with the oom reaper.

Again, this entirely sounds like fs_reclaim - isn't that exactly what
it is for?

I have had on my list a second and very related possible bug. I ran
into commit 35cfa2b0b491 ("mm/mmu_notifier: allocate mmu_notifier in
advance") which says that mapping->i_mmap_mutex is under fs_reclaim().

We do hold i_mmap_rwsem while calling invalidate_range_start():

 unmap_mapping_pages
  i_mmap_lock_write(mapping); // ie i_mmap_rwsem
  unmap_mapping_range_tree
   unmap_mapping_range_vma
    zap_page_range_single
     mmu_notifier_invalidate_range_start

So, if it is still true that i_mmap_rwsem is under fs_reclaim then
invalidate_range_start is *always* under fs_reclaim anyhow! (this I do
not know)

Thus we should use lockdep to force this and fix all the drivers.

.. and if we force fs_reclaim always, do we care about blockable
anymore??

Jason


Re: [PATCH 1/5] mm: Check if mmu notifier callbacks are allowed to fail

2019-08-14 Thread Jason Gunthorpe
On Wed, Aug 14, 2019 at 03:14:47PM -0700, Andrew Morton wrote:
> On Wed, 14 Aug 2019 22:20:23 +0200 Daniel Vetter  
> wrote:
> 
> > Just a bit of paranoia, since if we start pushing this deep into
> > callchains it's hard to spot all places where an mmu notifier
> > implementation might fail when it's not allowed to.
> > 
> > Inspired by some confusion we had discussing i915 mmu notifiers and
> > whether we could use the newly-introduced return value to handle some
> > corner cases. Until we realized that these are only for when a task
> > has been killed by the oom reaper.
> > 
> > An alternative approach would be to split the callback into two
> > versions, one with the int return value, and the other with void
> > return value like in older kernels. But that's a lot more churn for
> > fairly little gain I think.
> > 
> > Summary from the m-l discussion on why we want something at warning
> > level: This allows automated tooling in CI to catch bugs without
> > humans having to look at everything. If we just upgrade the existing
> > pr_info to a pr_warn, then we'll have false positives. And as-is, no
> > one will ever spot the problem since it's lost in the massive amounts
> > of overall dmesg noise.
> > 
> > ...
> >
> > +++ b/mm/mmu_notifier.c
> > @@ -179,6 +179,8 @@ int __mmu_notifier_invalidate_range_start(struct 
> > mmu_notifier_range *range)
> > pr_info("%pS callback failed with %d in 
> > %sblockable context.\n",
> > mn->ops->invalidate_range_start, _ret,
> > !mmu_notifier_range_blockable(range) ? 
> > "non-" : "");
> > +   WARN_ON(mmu_notifier_range_blockable(range) ||
> > +   ret != -EAGAIN);
> > ret = _ret;
> > }
> > }
> 
> A problem with WARN_ON(a || b) is that if it triggers, we don't know
> whether it was because of a or because of b.  Or both.  So I'd suggest
> 
>   WARN_ON(a);
>   WARN_ON(b);
> 

Well, we did just make a pr_info right above with the value of
blockable, that seems enough to tell the cases apart?

But you are generally right, the full logic:

if (_ret) {
  if (WARN_ON(mmu_notifier_range_blockable(range)))
continue;
  WARN_ON(_ret != -EAGAIN);
  ret = -EAGAIN;
  break;
}

would force correct API contract up the call chain once we detect a
broken driver..

But at some point it does feel like a bit much debugging logic to have
in a production code path, as this should never happen and is just to
discourage wrong driver behaviors during driver development.

If we like this version then:

Reviewed-by: Jason Gunthorpe 

Also - I have a bunch of other patches to mmu notifiers for hmm.git,
so when everyone agrees I can grab this to avoid conflicts.

Thanks,
Jason


Re: [PATCH v3 hmm 10/11] drm/amdkfd: use mmu_notifier_put

2019-08-08 Thread Jason Gunthorpe
On Tue, Aug 06, 2019 at 11:47:44PM +, Kuehling, Felix wrote:
> On 2019-08-06 19:15, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe 
> >
> > The sequence of mmu_notifier_unregister_no_release(),
> > mmu_notifier_call_srcu() is identical to mmu_notifier_put() with the
> > free_notifier callback.
> >
> > As this is the last user of those APIs, converting it means we can drop
> > them.
> >
> > Signed-off-by: Jason Gunthorpe 
> 
> Reviewed-by: Felix Kuehling 
> 
> >   drivers/gpu/drm/amd/amdkfd/kfd_priv.h|  3 ---
> >   drivers/gpu/drm/amd/amdkfd/kfd_process.c | 10 --
> >   2 files changed, 4 insertions(+), 9 deletions(-)
> >
> > I'm really not sure what this is doing, but it is very strange to have a
> > release with no other callback. It would be good if this would change to use
> > get as well.
> KFD uses the MMU notifier to detect process termination and free all the 
> resources associated with the process. This was first added for APUs 
> where the IOMMUv2 is set up to perform address translations using the 
> CPU page table for device memory access. That's where the association of 
> KFD process resources with the lifetime of the mm_struct comes from.

When all the HW objects that could do DMA to this process are
destroyed then the mmu notifier should be torn down. The module
should remain locked until the DMA objects are destroyed.

I'm still unclear why this is needed, the IOMMU for PASID already has
notifiers, and already blocks access when the mm_struct goes away,
why add a second layer of tracking?

Jason

Re: [PATCH 04/15] mm: remove the pgmap field from struct hmm_vma_walk

2019-08-08 Thread Jason Gunthorpe
On Tue, Aug 06, 2019 at 07:05:42PM +0300, Christoph Hellwig wrote:
> There is only a single place where the pgmap is passed over a function
> call, so replace it with local variables in the places where we deal
> with the pgmap.
> 
> Signed-off-by: Christoph Hellwig 
>  mm/hmm.c | 62 
>  1 file changed, 27 insertions(+), 35 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 9a908902e4cc..d66fa29b42e0 100644
> +++ b/mm/hmm.c
> @@ -278,7 +278,6 @@ EXPORT_SYMBOL(hmm_mirror_unregister);
>  
>  struct hmm_vma_walk {
>   struct hmm_range*range;
> - struct dev_pagemap  *pgmap;
>   unsigned long   last;
>   unsigned intflags;
>  };
> @@ -475,6 +474,7 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>   struct hmm_vma_walk *hmm_vma_walk = walk->private;
>   struct hmm_range *range = hmm_vma_walk->range;
> + struct dev_pagemap *pgmap = NULL;
>   unsigned long pfn, npages, i;
>   bool fault, write_fault;
>   uint64_t cpu_flags;
> @@ -490,17 +490,14 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
>   pfn = pmd_pfn(pmd) + pte_index(addr);
>   for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
>   if (pmd_devmap(pmd)) {
> - hmm_vma_walk->pgmap = get_dev_pagemap(pfn,
> -   hmm_vma_walk->pgmap);
> - if (unlikely(!hmm_vma_walk->pgmap))
> + pgmap = get_dev_pagemap(pfn, pgmap);
> + if (unlikely(!pgmap))
>   return -EBUSY;

Unrelated to this patch, but what is the point of getting and checking
that the pgmap exists for the page and then immediately releasing it?
This code has this pattern in several places.

It feels racy

>   }
>   pfns[i] = hmm_device_entry_from_pfn(range, pfn) | cpu_flags;
>   }
> - if (hmm_vma_walk->pgmap) {
> - put_dev_pagemap(hmm_vma_walk->pgmap);
> - hmm_vma_walk->pgmap = NULL;

Putting the value in the hmm_vma_walk would have made some sense to me
if the pgmap was not set to NULL all over the place. Then most of the
xa_loads would be eliminated, as I would expect the pgmap tends to be
mostly uniform for these use cases.

Is there some reason the pgmap ref can't be held across
faulting/sleeping? ie like below.

Anyhow, I looked over this pretty carefully and the change looks
functionally OK, I just don't know why the code is like this in the
first place.

Reviewed-by: Jason Gunthorpe 

diff --git a/mm/hmm.c b/mm/hmm.c
index 9a908902e4cc38..4e30128c23a505 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -497,10 +497,6 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
}
pfns[i] = hmm_device_entry_from_pfn(range, pfn) | cpu_flags;
}
-   if (hmm_vma_walk->pgmap) {
-   put_dev_pagemap(hmm_vma_walk->pgmap);
-   hmm_vma_walk->pgmap = NULL;
-   }
hmm_vma_walk->last = end;
return 0;
 #else
@@ -604,10 +600,6 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, 
unsigned long addr,
return 0;
 
 fault:
-   if (hmm_vma_walk->pgmap) {
-   put_dev_pagemap(hmm_vma_walk->pgmap);
-   hmm_vma_walk->pgmap = NULL;
-   }
pte_unmap(ptep);
/* Fault any virtual address we were asked to fault */
return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
@@ -690,16 +682,6 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
return r;
}
}
-   if (hmm_vma_walk->pgmap) {
-   /*
-* We do put_dev_pagemap() here and not in hmm_vma_handle_pte()
-* so that we can leverage get_dev_pagemap() optimization which
-* will not re-take a reference on a pgmap if we already have
-* one.
-*/
-   put_dev_pagemap(hmm_vma_walk->pgmap);
-   hmm_vma_walk->pgmap = NULL;
-   }
pte_unmap(ptep - 1);
 
hmm_vma_walk->last = addr;
@@ -751,10 +733,6 @@ static int hmm_vma_walk_pud(pud_t *pudp,
pfns[i] = hmm_device_entry_from_pfn(range, pfn) |
  cpu_flags;
}
-   if (hmm_vma_walk->pgmap) {
-   put_dev_pagemap(hmm_vma_walk->pgmap);
-   hmm_vma_walk->pgmap = NULL;
-   }
hmm_vma_walk->last = end;
return 0;
}
@@ -1026,6 +1004,14 @@ long hmm_range_fault(struct hmm_range *range, unsigned 
int flags)
/* Keep trying while the range is valid. *

Re: hmm cleanups, v2

2019-08-08 Thread Jason Gunthorpe
On Tue, Aug 06, 2019 at 07:05:38PM +0300, Christoph Hellwig wrote:
> 
> Hi Jérôme, Ben, Felix and Jason,
> 
> below is a series against the hmm tree which cleans up various minor
> bits and allows HMM_MIRROR to be built on all architectures.
> 
> Diffstat:
> 
> 11 files changed, 94 insertions(+), 210 deletions(-)
> 
> A git tree is also available at:
> 
> git://git.infradead.org/users/hch/misc.git hmm-cleanups.2
> 
> Gitweb:
> 
> 
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/hmm-cleanups.2
> 
> Changes since v1:
>  - fix the cover letter subject
>  - improve various patch descriptions
>  - use svmm->mm in nouveau_range_fault
>  - inverse the hmask field when using it
>  - select HMM_MIRROR instead of making it a user visible option
 
I think it is straightforward enough to move into -next, so applied to
the hmm.git lets get some more reviewed-bys/tested-by though.

For now I dropped 'remove the pgmap field from struct hmm_vma_walk'
just to hear the followup and 'amdgpu: remove
CONFIG_DRM_AMDGPU_USERPTR' until the AMD team Acks

Thanks,
Jason

Re: [PATCH 07/15] mm: remove the page_shift member from struct hmm_range

2019-08-08 Thread Jason Gunthorpe
  if (fault || write_fault) {
>   int ret;
> @@ -772,7 +771,7 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, 
> unsigned long hmask,
> struct mm_walk *walk)
>  {
>  #ifdef CONFIG_HUGETLB_PAGE
> - unsigned long addr = start, i, pfn, mask, size, pfn_inc;
> + unsigned long addr = start, i, pfn, mask;
>   struct hmm_vma_walk *hmm_vma_walk = walk->private;
>   struct hmm_range *range = hmm_vma_walk->range;
>   struct vm_area_struct *vma = walk->vma;
> @@ -783,24 +782,12 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, 
> unsigned long hmask,
>   pte_t entry;
>   int ret = 0;
>  
> - size = huge_page_size(h);
> - mask = size - 1;
> - if (range->page_shift != PAGE_SHIFT) {
> - /* Make sure we are looking at a full page. */
> - if (start & mask)
> - return -EINVAL;
> - if (end < (start + size))
> - return -EINVAL;
> - pfn_inc = size >> PAGE_SHIFT;
> - } else {
> - pfn_inc = 1;
> - size = PAGE_SIZE;
> - }
> + mask = huge_page_size(h) - 1;
>  
>   ptl = huge_pte_lock(hstate_vma(vma), walk->mm, pte);
>   entry = huge_ptep_get(pte);
>  
> - i = (start - range->start) >> range->page_shift;
> + i = (start - range->start) >> PAGE_SHIFT;
>   orig_pfn = range->pfns[i];
>   range->pfns[i] = range->values[HMM_PFN_NONE];
>   cpu_flags = pte_to_hmm_pfn_flags(range, entry);
> @@ -812,8 +799,8 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, 
> unsigned long hmask,
>   goto unlock;
>   }
>  
> - pfn = pte_pfn(entry) + ((start & mask) >> range->page_shift);
> - for (; addr < end; addr += size, i++, pfn += pfn_inc)
> + pfn = pte_pfn(entry) + ((start & mask) >> PAGE_SHIFT);
> + for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
>   range->pfns[i] = hmm_device_entry_from_pfn(range, pfn) |
>    cpu_flags;
>   hmm_vma_walk->last = end;
> @@ -850,14 +837,13 @@ static void hmm_pfns_clear(struct hmm_range *range,
>   */
>  int hmm_range_register(struct hmm_range *range, struct hmm_mirror *mirror)
>  {
> - unsigned long mask = ((1UL << range->page_shift) - 1UL);
>   struct hmm *hmm = mirror->hmm;
>   unsigned long flags;
>  
>   range->valid = false;
>   range->hmm = NULL;
>  
> - if ((range->start & mask) || (range->end & mask))
> + if ((range->start & (PAGE_SIZE - 1)) || (range->end & (PAGE_SIZE - 1)))
>   return -EINVAL;

PAGE_SIZE-1 == ~PAGE_MASK ? If yes I can fix it

Reviewed-by: Jason Gunthorpe 

Jason

Re: [PATCH 15/15] amdgpu: remove CONFIG_DRM_AMDGPU_USERPTR

2019-08-08 Thread Jason Gunthorpe
On Wed, Aug 07, 2019 at 06:57:24AM +, Koenig, Christian wrote:
> Am 06.08.19 um 22:03 schrieb Jason Gunthorpe:
> > On Tue, Aug 06, 2019 at 02:58:58PM -0400, Alex Deucher wrote:
> >> On Tue, Aug 6, 2019 at 1:51 PM Kuehling, Felix  
> >> wrote:
> >>> On 2019-08-06 13:44, Jason Gunthorpe wrote:
> >>>> On Tue, Aug 06, 2019 at 07:05:53PM +0300, Christoph Hellwig wrote:
> >>>>> The option is just used to select HMM mirror support and has a very
> >>>>> confusing help text.  Just pull in the HMM mirror code by default
> >>>>> instead.
> >>>>>
> >>>>> Signed-off-by: Christoph Hellwig 
> >>>>>drivers/gpu/drm/Kconfig |  2 ++
> >>>>>drivers/gpu/drm/amd/amdgpu/Kconfig  | 10 --
> >>>>>drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c |  6 --
> >>>>>drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 12 
> >>>>>4 files changed, 2 insertions(+), 28 deletions(-)
> >>>> Felix, was this an effort to avoid the arch restriction on hmm or
> >>>> something? Also can't see why this was like this.
> >>> This option predates KFD's support of userptrs, which in turn predates
> >>> HMM. Radeon has the same kind of option, though it doesn't affect HMM in
> >>> that case.
> >>>
> >>> Alex, Christian, can you think of a good reason to maintain userptr
> >>> support as an option in amdgpu? I suspect it was originally meant as a
> >>> way to allow kernels with amdgpu without MMU notifiers. Now it would
> >>> allow a kernel with amdgpu without HMM or MMU notifiers. I don't know if
> >>> this is a useful thing to have.
> >> Right.  There were people that didn't have MMU notifiers that wanted
> >> support for the GPU.
> > ?? Is that even a real thing? mmu_notifier does not have much kconfig
> > dependency.
> 
> Yes, that used to be a very real thing.
> 
> Initially a lot of users didn't want mmu notifiers to be enabled
> because of the performance overhead they cost.

Seems strange to hear these days, every distro ships with it on, it is
needed for kvm.

> Then we had the problem that HMM mirror wasn't available on a lot of 
> architectures.

Some patches for hmm are ready now that will fix this

Jason

Re: [PATCH 09/15] mm: don't abuse pte_index() in hmm_vma_handle_pmd

2019-08-08 Thread Jason Gunthorpe
On Tue, Aug 06, 2019 at 07:05:47PM +0300, Christoph Hellwig wrote:
> pte_index is an internal arch helper in various architectures,
> without consistent semantics.  Open code that calculation of a PMD
> index based on the virtual address instead.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  mm/hmm.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

There sure are a lot of different ways to express this, but this one
looks OK to me, at least the switch from the PTRS_PER_PTE expression
in the x86 implementation to PMD_MASK looks equivalent
 
Reviewed-by: Jason Gunthorpe 

Jason

[PATCH v3 hmm 03/11] mm/mmu_notifiers: add a get/put scheme for the registration

2019-08-07 Thread Jason Gunthorpe
From: Jason Gunthorpe 

Many places in the kernel have a flow where userspace will create some
object and that object will need to connect to the subsystem's
mmu_notifier subscription for the duration of its lifetime.

In this case the subsystem is usually tracking multiple mm_structs and it
is difficult to keep track of what struct mmu_notifier's have been
allocated for what mm's.

Since this has been open coded in a variety of exciting ways, provide core
functionality to do this safely.

This approach uses the strct mmu_notifier_ops * as a key to determine if
the subsystem has a notifier registered on the mm or not. If there is a
registration then the existing notifier struct is returned, otherwise the
ops->alloc_notifiers() is used to create a new per-subsystem notifier for
the mm.

The destroy side incorporates an async call_srcu based destruction which
will avoid bugs in the callers such as commit 6d7c3cde93c1 ("mm/hmm: fix
use after free with struct hmm in the mmu notifiers").

Since we are inside the mmu notifier core locking is fairly simple, the
allocation uses the same approach as for mmu_notifier_mm, the write side
of the mmap_sem makes everything deterministic and we only need to do
hlist_add_head_rcu() under the mm_take_all_locks(). The new users count
and the discoverability in the hlist is fully serialized by the
mmu_notifier_mm->lock.

Co-developed-by: Christoph Hellwig 
Signed-off-by: Christoph Hellwig 
Signed-off-by: Jason Gunthorpe 
---
 include/linux/mmu_notifier.h |  35 
 mm/mmu_notifier.c| 156 +--
 2 files changed, 185 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b6c004bd9f6ad9..31aa971315a142 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -211,6 +211,19 @@ struct mmu_notifier_ops {
 */
void (*invalidate_range)(struct mmu_notifier *mn, struct mm_struct *mm,
 unsigned long start, unsigned long end);
+
+   /*
+* These callbacks are used with the get/put interface to manage the
+* lifetime of the mmu_notifier memory. alloc_notifier() returns a new
+* notifier for use with the mm.
+*
+* free_notifier() is only called after the mmu_notifier has been
+* fully put, calls to any ops callback are prevented and no ops
+* callbacks are currently running. It is called from a SRCU callback
+* and cannot sleep.
+*/
+   struct mmu_notifier *(*alloc_notifier)(struct mm_struct *mm);
+   void (*free_notifier)(struct mmu_notifier *mn);
 };
 
 /*
@@ -227,6 +240,9 @@ struct mmu_notifier_ops {
 struct mmu_notifier {
struct hlist_node hlist;
const struct mmu_notifier_ops *ops;
+   struct mm_struct *mm;
+   struct rcu_head rcu;
+   unsigned int users;
 };
 
 static inline int mm_has_notifiers(struct mm_struct *mm)
@@ -234,6 +250,21 @@ static inline int mm_has_notifiers(struct mm_struct *mm)
return unlikely(mm->mmu_notifier_mm);
 }
 
+struct mmu_notifier *mmu_notifier_get_locked(const struct mmu_notifier_ops 
*ops,
+struct mm_struct *mm);
+static inline struct mmu_notifier *
+mmu_notifier_get(const struct mmu_notifier_ops *ops, struct mm_struct *mm)
+{
+   struct mmu_notifier *ret;
+
+   down_write(&mm->mmap_sem);
+   ret = mmu_notifier_get_locked(ops, mm);
+   up_write(&mm->mmap_sem);
+   return ret;
+}
+void mmu_notifier_put(struct mmu_notifier *mn);
+void mmu_notifier_synchronize(void);
+
 extern int mmu_notifier_register(struct mmu_notifier *mn,
 struct mm_struct *mm);
 extern int __mmu_notifier_register(struct mmu_notifier *mn,
@@ -581,6 +612,10 @@ static inline void mmu_notifier_mm_destroy(struct 
mm_struct *mm)
 #define pudp_huge_clear_flush_notify pudp_huge_clear_flush
 #define set_pte_at_notify set_pte_at
 
+static inline void mmu_notifier_synchronize(void)
+{
+}
+
 #endif /* CONFIG_MMU_NOTIFIER */
 
 #endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 696810f632ade1..4a770b5211b71d 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -248,6 +248,9 @@ int __mmu_notifier_register(struct mmu_notifier *mn, struct 
mm_struct *mm)
lockdep_assert_held_write(&mm->mmap_sem);
BUG_ON(atomic_read(&mm->mm_users) <= 0);
 
+   mn->mm = mm;
+   mn->users = 1;
+
if (!mm->mmu_notifier_mm) {
/*
 * kmalloc cannot be called under mm_take_all_locks(), but we
@@ -295,18 +298,24 @@ int __mmu_notifier_register(struct mmu_notifier *mn, 
struct mm_struct *mm)
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_register);
 
-/*
+/**
+ * mmu_notifier_register - Register a notifier on a mm
+ * @mn: The notifier to attach
+ * @mm : The mm to attach the notifier to
+ *
  * Must not hold mmap_sem

[PATCH v3 hmm 08/11] drm/radeon: use mmu_notifier_get/put for struct radeon_mn

2019-08-07 Thread Jason Gunthorpe
From: Jason Gunthorpe 

radeon is using a device global hash table to track what mmu_notifiers
have been registered on struct mm. This is better served with the new
get/put scheme instead.

radeon has a bug where it was not blocking notifier release() until all
the BO's had been invalidated. This could result in a use after free of
pages in the BOs. This is tied into a second bug where radeon left the
notifiers running endlessly even once the interval tree became
empty. This could result in a use after free with module unload.

Both are fixed by changing the lifetime model, the BOs exist in the
interval tree with their natural lifetimes independent of the mm_struct
lifetime using the get/put scheme. The release runs synchronously and just
does invalidate_start across the entire interval tree to create the
required DMA fence.

Additions to the interval tree after release are already impossible as
only current->mm is used during the add.

Signed-off-by: Jason Gunthorpe 
---
 drivers/gpu/drm/radeon/radeon.h|   3 -
 drivers/gpu/drm/radeon/radeon_device.c |   2 -
 drivers/gpu/drm/radeon/radeon_drv.c|   2 +
 drivers/gpu/drm/radeon/radeon_mn.c | 157 ++---
 4 files changed, 38 insertions(+), 126 deletions(-)

AMD team: I wonder if kfd has similar lifetime issues?

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 32808e50be12f8..918164f90b114a 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -2451,9 +2451,6 @@ struct radeon_device {
/* tracking pinned memory */
u64 vram_pin_size;
u64 gart_pin_size;
-
-   struct mutex            mn_lock;
-   DECLARE_HASHTABLE(mn_hash, 7);
 };
 
 bool radeon_is_px(struct drm_device *dev);
diff --git a/drivers/gpu/drm/radeon/radeon_device.c 
b/drivers/gpu/drm/radeon/radeon_device.c
index dceb554e567446..788b1d8a80e660 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -1325,8 +1325,6 @@ int radeon_device_init(struct radeon_device *rdev,
init_rwsem(&rdev->pm.mclk_lock);
init_rwsem(&rdev->exclusive_lock);
init_waitqueue_head(&rdev->irq.vblank_queue);
-   mutex_init(&rdev->mn_lock);
-   hash_init(rdev->mn_hash);
r = radeon_gem_init(rdev);
if (r)
return r;
diff --git a/drivers/gpu/drm/radeon/radeon_drv.c 
b/drivers/gpu/drm/radeon/radeon_drv.c
index a6cbe11f79c611..b6535ac91fdb74 100644
--- a/drivers/gpu/drm/radeon/radeon_drv.c
+++ b/drivers/gpu/drm/radeon/radeon_drv.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -624,6 +625,7 @@ static void __exit radeon_exit(void)
 {
pci_unregister_driver(pdriver);
radeon_unregister_atpx_handler();
+   mmu_notifier_synchronize();
 }
 
 module_init(radeon_init);
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c 
b/drivers/gpu/drm/radeon/radeon_mn.c
index 8c3871ed23a9f0..fc8254273a800b 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -37,17 +37,8 @@
 #include "radeon.h"
 
 struct radeon_mn {
-   /* constant after initialisation */
-   struct radeon_device*rdev;
-   struct mm_struct*mm;
struct mmu_notifier mn;
 
-   /* only used on destruction */
-   struct work_struct  work;
-
-   /* protected by rdev->mn_lock */
-   struct hlist_node   node;
-
/* objects protected by lock */
struct mutexlock;
struct rb_root_cached   objects;
@@ -58,55 +49,6 @@ struct radeon_mn_node {
struct list_head        bos;
 };
 
-/**
- * radeon_mn_destroy - destroy the rmn
- *
- * @work: previously sheduled work item
- *
- * Lazy destroys the notifier from a work item
- */
-static void radeon_mn_destroy(struct work_struct *work)
-{
-   struct radeon_mn *rmn = container_of(work, struct radeon_mn, work);
-   struct radeon_device *rdev = rmn->rdev;
-   struct radeon_mn_node *node, *next_node;
-   struct radeon_bo *bo, *next_bo;
-
-   mutex_lock(&rdev->mn_lock);
-   mutex_lock(&rmn->lock);
-   hash_del(&rmn->node);
-   rbtree_postorder_for_each_entry_safe(node, next_node,
-&rmn->objects.rb_root, it.rb) {
-
-   interval_tree_remove(&node->it, &rmn->objects);
-   list_for_each_entry_safe(bo, next_bo, &node->bos, mn_list) {
-   bo->mn = NULL;
-   list_del_init(&bo->mn_list);
-   }
-   kfree(node);
-   }
-   mutex_unlock(&rmn->lock);
-   mutex_unlock(&rdev->mn_lock);
-   mmu_notifier_unregister(&rmn->mn, rmn->mm);
-   kfree(rmn);
-}
-
-/**
- * radeon_mn_release - callback to notify about mm destruction
- *
- * @mn: our notifier
- * @mn: the mm this callback is about
- *
- * Shedule a work item to lazy destroy our notifier.
- */
-static void radeon_mn_release(st

[PATCH v3 hmm 01/11] mm/mmu_notifiers: hoist do_mmu_notifier_register down_write to the caller

2019-08-07 Thread Jason Gunthorpe
From: Jason Gunthorpe 

This simplifies the code to not have so many one line functions and extra
logic. __mmu_notifier_register() simply becomes the entry point to
register the notifier, and the other one calls it under lock.

Also add a lockdep_assert to check that the callers are holding the lock
as expected.

Suggested-by: Christoph Hellwig 
Signed-off-by: Jason Gunthorpe 
---
 mm/mmu_notifier.c | 35 ++-
 1 file changed, 14 insertions(+), 21 deletions(-)

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index b5670620aea0fc..218a6f108bc2d0 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -236,22 +236,22 @@ void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range);
 
-static int do_mmu_notifier_register(struct mmu_notifier *mn,
-   struct mm_struct *mm,
-   int take_mmap_sem)
+/*
+ * Same as mmu_notifier_register but here the caller must hold the
+ * mmap_sem in write mode.
+ */
+int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
 {
struct mmu_notifier_mm *mmu_notifier_mm;
int ret;
 
+   lockdep_assert_held_write(&mm->mmap_sem);
BUG_ON(atomic_read(&mm->mm_users) <= 0);
 
-   ret = -ENOMEM;
mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL);
if (unlikely(!mmu_notifier_mm))
-   goto out;
+   return -ENOMEM;
 
-   if (take_mmap_sem)
-   down_write(&mm->mmap_sem);
ret = mm_take_all_locks(mm);
if (unlikely(ret))
goto out_clean;
@@ -279,13 +279,11 @@ static int do_mmu_notifier_register(struct mmu_notifier 
*mn,
 
mm_drop_all_locks(mm);
 out_clean:
-   if (take_mmap_sem)
-   up_write(&mm->mmap_sem);
kfree(mmu_notifier_mm);
-out:
BUG_ON(atomic_read(&mm->mm_users) <= 0);
return ret;
 }
+EXPORT_SYMBOL_GPL(__mmu_notifier_register);
 
 /*
  * Must not hold mmap_sem nor any other VM related lock when calling
@@ -302,19 +300,14 @@ static int do_mmu_notifier_register(struct mmu_notifier 
*mn,
  */
 int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
 {
-   return do_mmu_notifier_register(mn, mm, 1);
-}
-EXPORT_SYMBOL_GPL(mmu_notifier_register);
+   int ret;
 
-/*
- * Same as mmu_notifier_register but here the caller must hold the
- * mmap_sem in write mode.
- */
-int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
-{
-   return do_mmu_notifier_register(mn, mm, 0);
+   down_write(&mm->mmap_sem);
+   ret = __mmu_notifier_register(mn, mm);
+   up_write(&mm->mmap_sem);
+   return ret;
 }
-EXPORT_SYMBOL_GPL(__mmu_notifier_register);
+EXPORT_SYMBOL_GPL(mmu_notifier_register);
 
 /* this is called after the last mmu_notifier_unregister() returned */
 void __mmu_notifier_mm_destroy(struct mm_struct *mm)
-- 
2.22.0


Re: [PATCH 15/15] amdgpu: remove CONFIG_DRM_AMDGPU_USERPTR

2019-08-07 Thread Jason Gunthorpe
On Tue, Aug 06, 2019 at 07:05:53PM +0300, Christoph Hellwig wrote:
> The option is just used to select HMM mirror support and has a very
> confusing help text.  Just pull in the HMM mirror code by default
> instead.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/gpu/drm/Kconfig |  2 ++
>  drivers/gpu/drm/amd/amdgpu/Kconfig  | 10 --
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c |  6 --
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 12 
>  4 files changed, 2 insertions(+), 28 deletions(-)

Felix, was this an effort to avoid the arch restriction on hmm or
something? Also can't see why this was like this.

Reviewed-by: Jason Gunthorpe 

Jason

[PATCH v3 hmm 02/11] mm/mmu_notifiers: do not speculatively allocate a mmu_notifier_mm

2019-08-07 Thread Jason Gunthorpe
From: Jason Gunthorpe 

A prior commit e0f3c3f78da2 ("mm/mmu_notifier: init notifier if necessary")
made an attempt at doing this, but had to be reverted as calling
the GFP_KERNEL allocator under the i_mmap_mutex causes deadlock, see
commit 35cfa2b0b491 ("mm/mmu_notifier: allocate mmu_notifier in advance").

However, we can avoid that problem by doing the allocation only under
the mmap_sem, which is already happening.

Since all writers to mm->mmu_notifier_mm hold the write side of the
mmap_sem, reading it under that sem is deterministic and we can use that to
decide if the allocation path is required, without speculation.

The actual update to mmu_notifier_mm must still be done under the
mm_take_all_locks() to ensure read-side coherency.

Signed-off-by: Jason Gunthorpe 
---
 mm/mmu_notifier.c | 34 ++
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 218a6f108bc2d0..696810f632ade1 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -242,27 +242,32 @@ EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range);
  */
 int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
 {
-   struct mmu_notifier_mm *mmu_notifier_mm;
+   struct mmu_notifier_mm *mmu_notifier_mm = NULL;
int ret;
 
lockdep_assert_held_write(&mm->mmap_sem);
BUG_ON(atomic_read(&mm->mm_users) <= 0);
 
-   mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL);
-   if (unlikely(!mmu_notifier_mm))
-   return -ENOMEM;
+   if (!mm->mmu_notifier_mm) {
+   /*
+* kmalloc cannot be called under mm_take_all_locks(), but we
+* know that mm->mmu_notifier_mm can't change while we hold
+* the write side of the mmap_sem.
+*/
+   mmu_notifier_mm =
+   kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL);
+   if (!mmu_notifier_mm)
+   return -ENOMEM;
+
+   INIT_HLIST_HEAD(&mmu_notifier_mm->list);
+   spin_lock_init(&mmu_notifier_mm->lock);
+   }
 
ret = mm_take_all_locks(mm);
if (unlikely(ret))
goto out_clean;
 
-   if (!mm_has_notifiers(mm)) {
-   INIT_HLIST_HEAD(&mmu_notifier_mm->list);
-   spin_lock_init(&mmu_notifier_mm->lock);
-
-   mm->mmu_notifier_mm = mmu_notifier_mm;
-   mmu_notifier_mm = NULL;
-   }
+   /* Pairs with the mmdrop in mmu_notifier_unregister_* */
mmgrab(mm);
 
/*
@@ -273,14 +278,19 @@ int __mmu_notifier_register(struct mmu_notifier *mn, 
struct mm_struct *mm)
 * We can't race against any other mmu notifier method either
 * thanks to mm_take_all_locks().
 */
+   if (mmu_notifier_mm)
+   mm->mmu_notifier_mm = mmu_notifier_mm;
+
spin_lock(&mm->mmu_notifier_mm->lock);
hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier_mm->list);
spin_unlock(&mm->mmu_notifier_mm->lock);
 
mm_drop_all_locks(mm);
+   BUG_ON(atomic_read(&mm->mm_users) <= 0);
+   return 0;
+
 out_clean:
kfree(mmu_notifier_mm);
-   BUG_ON(atomic_read(&mm->mm_users) <= 0);
return ret;
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_register);
-- 
2.22.0


[PATCH v3 hmm 11/11] mm/mmu_notifiers: remove unregister_no_release

2019-08-07 Thread Jason Gunthorpe
From: Jason Gunthorpe 

mmu_notifier_unregister_no_release() and mmu_notifier_call_srcu() no
longer have any users, they have all been converted to use
mmu_notifier_put().

So delete this difficult to use interface.

Signed-off-by: Jason Gunthorpe 
---
 include/linux/mmu_notifier.h |  5 -
 mm/mmu_notifier.c| 31 ---
 2 files changed, 36 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 31aa971315a142..52929e5ef70826 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -271,8 +271,6 @@ extern int __mmu_notifier_register(struct mmu_notifier *mn,
   struct mm_struct *mm);
 extern void mmu_notifier_unregister(struct mmu_notifier *mn,
struct mm_struct *mm);
-extern void mmu_notifier_unregister_no_release(struct mmu_notifier *mn,
-  struct mm_struct *mm);
 extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
 extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
@@ -513,9 +511,6 @@ static inline void mmu_notifier_range_init(struct 
mmu_notifier_range *range,
set_pte_at(___mm, ___address, __ptep, ___pte);  \
 })
 
-extern void mmu_notifier_call_srcu(struct rcu_head *rcu,
-  void (*func)(struct rcu_head *rcu));
-
 #else /* CONFIG_MMU_NOTIFIER */
 
 struct mmu_notifier_range {
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 4a770b5211b71d..2ec48f8ba9e288 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -21,18 +21,6 @@
 /* global SRCU for all MMs */
 DEFINE_STATIC_SRCU(srcu);
 
-/*
- * This function allows mmu_notifier::release callback to delay a call to
- * a function that will free appropriate resources. The function must be
- * quick and must not block.
- */
-void mmu_notifier_call_srcu(struct rcu_head *rcu,
-   void (*func)(struct rcu_head *rcu))
-{
-   call_srcu(&srcu, rcu, func);
-}
-EXPORT_SYMBOL_GPL(mmu_notifier_call_srcu);
-
 /*
  * This function can't run concurrently against mmu_notifier_register
  * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
@@ -453,25 +441,6 @@ void mmu_notifier_unregister(struct mmu_notifier *mn, 
struct mm_struct *mm)
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_unregister);
 
-/*
- * Same as mmu_notifier_unregister but no callback and no srcu synchronization.
- */
-void mmu_notifier_unregister_no_release(struct mmu_notifier *mn,
-   struct mm_struct *mm)
-{
-   spin_lock(&mm->mmu_notifier_mm->lock);
-   /*
-* Can not use list_del_rcu() since __mmu_notifier_release
-* can delete it before we hold the lock.
-*/
-   hlist_del_init_rcu(&mn->hlist);
-   spin_unlock(&mm->mmu_notifier_mm->lock);
-
-   BUG_ON(atomic_read(&mm->mm_count) <= 0);
-   mmdrop(mm);
-}
-EXPORT_SYMBOL_GPL(mmu_notifier_unregister_no_release);
-
 static void mmu_notifier_free_rcu(struct rcu_head *rcu)
 {
struct mmu_notifier *mn = container_of(rcu, struct mmu_notifier, rcu);
-- 
2.22.0


Re: [PATCH 03/15] nouveau: pass struct nouveau_svmm to nouveau_range_fault

2019-08-07 Thread Jason Gunthorpe
On Tue, Aug 06, 2019 at 07:05:41PM +0300, Christoph Hellwig wrote:
> We'll need the nouveau_svmm structure to improve the function soon.
> For now this allows using the svmm->mm reference to unlock the
> mmap_sem, and thus the same dereference chain that the caller uses
> to lock and unlock it.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/gpu/drm/nouveau/nouveau_svm.c | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason

[PATCH v3 hmm 04/11] misc/sgi-gru: use mmu_notifier_get/put for struct gru_mm_struct

2019-08-07 Thread Jason Gunthorpe
From: Jason Gunthorpe 

GRU is already using almost the same algorithm as get/put; it even
helpfully has a 10 year old comment asking to make this algorithm common,
which is finally happening.

There are a few differences and fixes from this conversion:
- GRU used rcu not srcu to read the hlist
- Unclear how the locking worked to prevent gru_register_mmu_notifier()
  from running concurrently with gru_drop_mmu_notifier() - this version is
  safe
- GRU had a release function which only set a variable without any locking,
  and it skipped the synchronize_srcu during unregister, which looks racy;
  this conversion makes it reliable via the integrated call_srcu().
- It is unclear if the mmap_sem is actually held when
  __mmu_notifier_register() was called, lockdep will now warn if this is
  wrong

Signed-off-by: Jason Gunthorpe 
---
 drivers/misc/sgi-gru/grufile.c |  1 +
 drivers/misc/sgi-gru/grutables.h   |  2 -
 drivers/misc/sgi-gru/grutlbpurge.c | 84 +-
 3 files changed, 25 insertions(+), 62 deletions(-)

diff --git a/drivers/misc/sgi-gru/grufile.c b/drivers/misc/sgi-gru/grufile.c
index a2a142ae087bfa..9d042310214ff9 100644
--- a/drivers/misc/sgi-gru/grufile.c
+++ b/drivers/misc/sgi-gru/grufile.c
@@ -573,6 +573,7 @@ static void __exit gru_exit(void)
gru_free_tables();
misc_deregister(&gru_miscdev);
gru_proc_exit();
+   mmu_notifier_synchronize();
 }
 
 static const struct file_operations gru_fops = {
diff --git a/drivers/misc/sgi-gru/grutables.h b/drivers/misc/sgi-gru/grutables.h
index 438191c220570c..a7e44b2eb413f6 100644
--- a/drivers/misc/sgi-gru/grutables.h
+++ b/drivers/misc/sgi-gru/grutables.h
@@ -307,10 +307,8 @@ struct gru_mm_tracker {/* pack 
to reduce size */
 
 struct gru_mm_struct {
struct mmu_notifier ms_notifier;
-   atomic_tms_refcnt;
spinlock_t  ms_asid_lock;   /* protects ASID assignment */
atomic_tms_range_active;/* num range_invals active */
-   charms_released;
wait_queue_head_t   ms_wait_queue;
DECLARE_BITMAP(ms_asidmap, GRU_MAX_GRUS);
struct gru_mm_tracker   ms_asids[GRU_MAX_GRUS];
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c 
b/drivers/misc/sgi-gru/grutlbpurge.c
index 59ba0adf23cee4..10921cd2608dfa 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -235,83 +235,47 @@ static void gru_invalidate_range_end(struct mmu_notifier 
*mn,
gms, range->start, range->end);
 }
 
-static void gru_release(struct mmu_notifier *mn, struct mm_struct *mm)
+static struct mmu_notifier *gru_alloc_notifier(struct mm_struct *mm)
 {
-   struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
-ms_notifier);
+   struct gru_mm_struct *gms;
+
+   gms = kzalloc(sizeof(*gms), GFP_KERNEL);
+   if (!gms)
+   return ERR_PTR(-ENOMEM);
+   STAT(gms_alloc);
+   spin_lock_init(&gms->ms_asid_lock);
+   init_waitqueue_head(&gms->ms_wait_queue);
 
-   gms->ms_released = 1;
-   gru_dbg(grudev, "gms %p\n", gms);
+   return &gms->ms_notifier;
 }
 
+static void gru_free_notifier(struct mmu_notifier *mn)
+{
+   kfree(container_of(mn, struct gru_mm_struct, ms_notifier));
+   STAT(gms_free);
+}
 
 static const struct mmu_notifier_ops gru_mmuops = {
.invalidate_range_start = gru_invalidate_range_start,
.invalidate_range_end   = gru_invalidate_range_end,
-   .release= gru_release,
+   .alloc_notifier = gru_alloc_notifier,
+   .free_notifier  = gru_free_notifier,
 };
 
-/* Move this to the basic mmu_notifier file. But for now... */
-static struct mmu_notifier *mmu_find_ops(struct mm_struct *mm,
-   const struct mmu_notifier_ops *ops)
-{
-   struct mmu_notifier *mn, *gru_mn = NULL;
-
-   if (mm->mmu_notifier_mm) {
-   rcu_read_lock();
-   hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list,
-hlist)
-   if (mn->ops == ops) {
-   gru_mn = mn;
-   break;
-   }
-   rcu_read_unlock();
-   }
-   return gru_mn;
-}
-
 struct gru_mm_struct *gru_register_mmu_notifier(void)
 {
-   struct gru_mm_struct *gms;
struct mmu_notifier *mn;
-   int err;
-
-   mn = mmu_find_ops(current->mm, &gru_mmuops);
-   if (mn) {
-   gms = container_of(mn, struct gru_mm_struct, ms_notifier);
-   atomic_inc(&gms->ms_refcnt);
-   } else {
-   gms = kzalloc(sizeof(*gms), GFP_KERNEL);
-   if (!gms)
-   return ERR_PTR(-ENOMEM);
-   STAT(gms_alloc);
-   spin_lock_init(&gms->ms_asid_lock);
-   gms->ms_notifier.ops

Re: [PATCH 14/15] mm: make HMM_MIRROR an implicit option

2019-08-07 Thread Jason Gunthorpe
On Tue, Aug 06, 2019 at 07:05:52PM +0300, Christoph Hellwig wrote:
> Make HMM_MIRROR an option that is selected by drivers wanting to use it
> instead of a user visible option as it is just a low-level
> implementation detail.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/gpu/drm/amd/amdgpu/Kconfig |  4 +++-
>  drivers/gpu/drm/nouveau/Kconfig|  4 +++-
>  mm/Kconfig | 14 ++
>  3 files changed, 12 insertions(+), 10 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason

Re: [PATCH 11/15] mm: cleanup the hmm_vma_handle_pmd stub

2019-08-07 Thread Jason Gunthorpe
On Tue, Aug 06, 2019 at 07:05:49PM +0300, Christoph Hellwig wrote:
> Stub out the whole function when CONFIG_TRANSPARENT_HUGEPAGE is not set
> to make the function easier to read.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  mm/hmm.c | 18 --
>  1 file changed, 8 insertions(+), 10 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason

[PATCH v3 hmm 10/11] drm/amdkfd: use mmu_notifier_put

2019-08-07 Thread Jason Gunthorpe
From: Jason Gunthorpe 

The sequence of mmu_notifier_unregister_no_release(),
mmu_notifier_call_srcu() is identical to mmu_notifier_put() with the
free_notifier callback.

As this is the last user of those APIs, converting it means we can drop
them.

Signed-off-by: Jason Gunthorpe 
---
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h|  3 ---
 drivers/gpu/drm/amd/amdkfd/kfd_process.c | 10 --
 2 files changed, 4 insertions(+), 9 deletions(-)

I'm really not sure what this is doing, but it is very strange to have a
release with no other callback. It would be good if this would change to use
get as well.
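
For review purposes, the lifecycle this conversion relies on looks roughly
like the sketch below (the my_* names are placeholders, not amdkfd's real
symbols; only the mmu_notifier_* calls and ops members are from this
series):

struct my_proc {
	struct mmu_notifier mn;
	/* ... per process state ... */
};

static void my_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	/* runs from exit_mmap() while the mm still exists; amdkfd does its
	 * per process teardown here and then drops its reference */
	mmu_notifier_put(mn);
}

static void my_free_notifier(struct mmu_notifier *mn)
{
	/* called only after the last put and an SRCU grace period, so no
	 * notifier callback can still be running on mn when it is freed */
	kfree(container_of(mn, struct my_proc, mn));
}

static const struct mmu_notifier_ops my_ops = {
	.release       = my_release,
	.free_notifier = my_free_notifier,
};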

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h 
b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index 3933fb6a371efb..9450e20d17093b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -686,9 +686,6 @@ struct kfd_process {
/* We want to receive a notification when the mm_struct is destroyed */
struct mmu_notifier mmu_notifier;
 
-   /* Use for delayed freeing of kfd_process structure */
-   struct rcu_head rcu;
-
unsigned int pasid;
unsigned int doorbell_index;
 
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index c06e6190f21ffa..e5e326f2f2675e 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -486,11 +486,9 @@ static void kfd_process_ref_release(struct kref *ref)
queue_work(kfd_process_wq, &p->release_work);
 }
 
-static void kfd_process_destroy_delayed(struct rcu_head *rcu)
+static void kfd_process_free_notifier(struct mmu_notifier *mn)
 {
-   struct kfd_process *p = container_of(rcu, struct kfd_process, rcu);
-
-   kfd_unref_process(p);
+   kfd_unref_process(container_of(mn, struct kfd_process, mmu_notifier));
 }
 
 static void kfd_process_notifier_release(struct mmu_notifier *mn,
@@ -542,12 +540,12 @@ static void kfd_process_notifier_release(struct 
mmu_notifier *mn,
 
mutex_unlock(&p->mutex);
 
-   mmu_notifier_unregister_no_release(&p->mmu_notifier, mm);
-   mmu_notifier_call_srcu(&p->rcu, &kfd_process_destroy_delayed);
+   mmu_notifier_put(&p->mmu_notifier);
 }
 
 static const struct mmu_notifier_ops kfd_process_mmu_notifier_ops = {
.release = kfd_process_notifier_release,
+   .free_notifier = kfd_process_free_notifier,
 };
 
 static int kfd_process_init_cwsr_apu(struct kfd_process *p, struct file *filep)
-- 
2.22.0


[PATCH v3 hmm 00/11] Add mmu_notifier_get/put for managing mmu notifier registrations

2019-08-07 Thread Jason Gunthorpe
From: Jason Gunthorpe 

This series introduces a new registration flow for mmu_notifiers based on
the idea that the user would like to get a single refcounted piece of
memory for a mm, keyed to its use.

For instance many users of mmu_notifiers use an interval tree or similar
to dispatch notifications to some object. There are many objects but only
one notifier subscription per mm holding the tree.

Of the 12 places that call mmu_notifier_register:
 - 7 are maintaining some kind of obvious mapping of mm_struct to
   mmu_notifier registration, ie in some linked list or hash table. Of
   the 7 this series converts 4 (gru, hmm, RDMA, radeon)

 - 3 (hfi1, gntdev, vhost) are registering multiple notifiers, but each
   one immediately does some VA range filtering, ie with an interval tree.
   These would be better with a global subsystem-wide range filter and
   could convert to this API.

 - 2 (kvm, amd_iommu) are deliberately using a single mm at a time, and
   really can't use this API. One of the intel-svm's modes is also in this
   list

The 3/7 unconverted drivers are:
 - intel-svm
   This driver tracks mm's in a global linked list 'global_svm_list'
   and would benefit from this API.

   Its flow is a bit complex, since it also wants a set of non-shared
   notifiers.

 - i915_gem_usrptr
   This driver tracks mm's in a per-device hash
   table (dev_priv->mm_structs), but only has an optional use of
   mmu_notifiers.  Since it still seems to need the hash table it is
   difficult to convert.

 - amdkfd/kfd_process
   This driver is using a global SRCU hash table to track mm's

   The control flow here is very complicated and the driver is relying on
   this hash table to be fast on the ioctl syscall path.

   It would definitely benefit, but only if the ioctl path didn't need to
   do the search so often.
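
The teardown side of the same sketch is just a put; because the final free
is SRCU deferred, the converted modules also flush it at unload (as the gru,
amdgpu and nouveau hunks in this series do):

static void my_mirror_put(struct my_mirror *m)
{
	/* free_notifier() runs after the last put and an SRCU grace period */
	mmu_notifier_put(&m->mn);
}

static void __exit my_driver_exit(void)
{
	mmu_notifier_synchronize();
}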

This series is already entangled with patches in the hmm & RDMA tree and
will require some git topic branches for the RDMA ODP stuff. I intend for
it to go through the hmm tree.

There is a git version here:

https://github.com/jgunthorpe/linux/commits/mmu_notifier

Which has the required pre-patches for the RDMA ODP conversion that are
still being reviewed.

Jason Gunthorpe (11):
  mm/mmu_notifiers: hoist do_mmu_notifier_register down_write to the
caller
  mm/mmu_notifiers: do not speculatively allocate a mmu_notifier_mm
  mm/mmu_notifiers: add a get/put scheme for the registration
  misc/sgi-gru: use mmu_notifier_get/put for struct gru_mm_struct
  hmm: use mmu_notifier_get/put for 'struct hmm'
  RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
  RDMA/odp: remove ib_ucontext from ib_umem
  drm/radeon: use mmu_notifier_get/put for struct radeon_mn
  drm/amdkfd: fix a use after free race with mmu_notifier unregister
  drm/amdkfd: use mmu_notifier_put
  mm/mmu_notifiers: remove unregister_no_release

 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c  |   1 +
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h|   3 -
 drivers/gpu/drm/amd/amdkfd/kfd_process.c |  88 -
 drivers/gpu/drm/nouveau/nouveau_drm.c|   3 +
 drivers/gpu/drm/radeon/radeon.h  |   3 -
 drivers/gpu/drm/radeon/radeon_device.c   |   2 -
 drivers/gpu/drm/radeon/radeon_drv.c  |   2 +
 drivers/gpu/drm/radeon/radeon_mn.c   | 157 
 drivers/infiniband/core/umem.c   |   4 +-
 drivers/infiniband/core/umem_odp.c   | 183 ++
 drivers/infiniband/core/uverbs_cmd.c |   3 -
 drivers/infiniband/core/uverbs_main.c|   1 +
 drivers/infiniband/hw/mlx5/main.c|   5 -
 drivers/misc/sgi-gru/grufile.c   |   1 +
 drivers/misc/sgi-gru/grutables.h |   2 -
 drivers/misc/sgi-gru/grutlbpurge.c   |  84 +++--
 include/linux/hmm.h  |  12 +-
 include/linux/mm_types.h |   6 -
 include/linux/mmu_notifier.h |  40 +++-
 include/rdma/ib_umem.h   |   2 +-
 include/rdma/ib_umem_odp.h   |  10 +-
 include/rdma/ib_verbs.h  |   3 -
 kernel/fork.c|   1 -
 mm/hmm.c | 121 +++-
 mm/mmu_notifier.c| 230 +--
 25 files changed, 408 insertions(+), 559 deletions(-)

-- 
2.22.0


Re: [PATCH 08/15] mm: remove the mask variable in hmm_vma_walk_hugetlb_entry

2019-08-07 Thread Jason Gunthorpe
On Tue, Aug 06, 2019 at 07:05:46PM +0300, Christoph Hellwig wrote:
> The pagewalk code already passes the value as the hmask parameter.
> 
> Signed-off-by: Christoph Hellwig 
>  mm/hmm.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)

Reviewed-by: Jason Gunthorpe 

Jason

[PATCH v3 hmm 06/11] RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'

2019-08-07 Thread Jason Gunthorpe
From: Jason Gunthorpe 

This is a significant simplification: no extra list is kept per FD, and
the interval tree is now shared between all the ucontexts, reducing
overhead if there are multiple ucontexts active.

Signed-off-by: Jason Gunthorpe 
---
 drivers/infiniband/core/umem_odp.c| 170 --
 drivers/infiniband/core/uverbs_cmd.c  |   3 -
 drivers/infiniband/core/uverbs_main.c |   1 +
 drivers/infiniband/hw/mlx5/main.c |   5 -
 include/rdma/ib_umem_odp.h|  10 +-
 include/rdma/ib_verbs.h   |   3 -
 6 files changed, 54 insertions(+), 138 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c 
b/drivers/infiniband/core/umem_odp.c
index c6a992392ee2b8..a02e6e3d7b72fb 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -82,7 +82,7 @@ static void ib_umem_notifier_release(struct mmu_notifier *mn,
struct rb_node *node;
 
down_read(&per_mm->umem_rwsem);
-   if (!per_mm->active)
+   if (!per_mm->mn.users)
goto out;
 
for (node = rb_first_cached(&per_mm->umem_tree); node;
@@ -132,7 +132,7 @@ static int ib_umem_notifier_invalidate_range_start(struct 
mmu_notifier *mn,
else if (!down_read_trylock(&per_mm->umem_rwsem))
return -EAGAIN;
 
-   if (!per_mm->active) {
+   if (!per_mm->mn.users) {
up_read(&per_mm->umem_rwsem);
/*
 * At this point active is permanently set and visible to this
@@ -165,7 +165,7 @@ static void ib_umem_notifier_invalidate_range_end(struct 
mmu_notifier *mn,
struct ib_ucontext_per_mm *per_mm =
container_of(mn, struct ib_ucontext_per_mm, mn);
 
-   if (unlikely(!per_mm->active))
+   if (unlikely(!per_mm->mn.users))
return;
 
rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, range->start,
@@ -174,125 +174,47 @@ static void ib_umem_notifier_invalidate_range_end(struct 
mmu_notifier *mn,
up_read(&per_mm->umem_rwsem);
 }
 
-static const struct mmu_notifier_ops ib_umem_notifiers = {
-   .release= ib_umem_notifier_release,
-   .invalidate_range_start = ib_umem_notifier_invalidate_range_start,
-   .invalidate_range_end   = ib_umem_notifier_invalidate_range_end,
-};
-
-static void remove_umem_from_per_mm(struct ib_umem_odp *umem_odp)
-{
-   struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
-
-   if (umem_odp->is_implicit_odp)
-   return;
-
-   down_write(&per_mm->umem_rwsem);
-   interval_tree_remove(&umem_odp->interval_tree, &per_mm->umem_tree);
-   complete_all(&umem_odp->notifier_completion);
-   up_write(&per_mm->umem_rwsem);
-}
-
-static struct ib_ucontext_per_mm *alloc_per_mm(struct ib_ucontext *ctx,
-  struct mm_struct *mm)
+static struct mmu_notifier *ib_umem_alloc_notifier(struct mm_struct *mm)
 {
struct ib_ucontext_per_mm *per_mm;
-   int ret;
 
per_mm = kzalloc(sizeof(*per_mm), GFP_KERNEL);
if (!per_mm)
return ERR_PTR(-ENOMEM);
 
-   per_mm->context = ctx;
-   per_mm->mm = mm;
per_mm->umem_tree = RB_ROOT_CACHED;
init_rwsem(&per_mm->umem_rwsem);
-   per_mm->active = true;
 
+   WARN_ON(mm != current->mm);
rcu_read_lock();
per_mm->tgid = get_task_pid(current->group_leader, PIDTYPE_PID);
rcu_read_unlock();
-
-   WARN_ON(mm != current->mm);
-
-   per_mm->mn.ops = &ib_umem_notifiers;
-   ret = mmu_notifier_register(&per_mm->mn, per_mm->mm);
-   if (ret) {
-   dev_err(&ctx->device->dev,
-   "Failed to register mmu_notifier %d\n", ret);
-   goto out_pid;
-   }
-
-   list_add(&per_mm->ucontext_list, &ctx->per_mm_list);
-   return per_mm;
-
-out_pid:
-   put_pid(per_mm->tgid);
-   kfree(per_mm);
-   return ERR_PTR(ret);
+   return &per_mm->mn;
 }
 
-static struct ib_ucontext_per_mm *get_per_mm(struct ib_umem_odp *umem_odp)
+static void ib_umem_free_notifier(struct mmu_notifier *mn)
 {
-   struct ib_ucontext *ctx = umem_odp->umem.context;
-   struct ib_ucontext_per_mm *per_mm;
-
-   lockdep_assert_held(&ctx->per_mm_list_lock);
-
-   /*
-* Generally speaking we expect only one or two per_mm in this list,
-* so no reason to optimize this search today.
-*/
-   list_for_each_entry(per_mm, &ctx->per_mm_list, ucontext_list) {
-   if (per_mm->mm == umem_odp->umem.owning_mm)
-   return per_mm;
-   }
-
-   return alloc_per_mm(ctx, umem_odp->umem.owning_mm);
-}
-
-static void free_per_mm(struct rcu_head *rcu)
-{
-   kfree(container_of(rcu, struct ib_ucontext_per_mm, rcu));
-}
-
-static void put_per_mm(struct ib_umem_odp *umem_odp)
-{
-   struct ib_ucontext_per_mm *per_m

[PATCH v3 hmm 07/11] RDMA/odp: remove ib_ucontext from ib_umem

2019-08-07 Thread Jason Gunthorpe
From: Jason Gunthorpe 

At this point the ucontext is only being stored to access the ib_device,
so just store the ib_device directly instead. This is more natural and
logical as the umem has nothing to do with the ucontext.

Signed-off-by: Jason Gunthorpe 
---
 drivers/infiniband/core/umem.c |  4 ++--
 drivers/infiniband/core/umem_odp.c | 13 ++---
 include/rdma/ib_umem.h |  2 +-
 3 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index c59aa57d36510f..5ab9165ffbef0a 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -242,7 +242,7 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
unsigned long addr,
return ERR_PTR(-ENOMEM);
}
 
-   umem->context= context;
+   umem->ibdev = context->device;
umem->length = size;
umem->address= addr;
umem->writable   = ib_access_writable(access);
@@ -370,7 +370,7 @@ void ib_umem_release(struct ib_umem *umem)
return;
}
 
-   __ib_umem_release(umem->context->device, umem, 1);
+   __ib_umem_release(umem->ibdev, umem, 1);
 
atomic64_sub(ib_umem_num_pages(umem), &umem->owning_mm->pinned_vm);
__ib_umem_release_tail(umem);
diff --git a/drivers/infiniband/core/umem_odp.c 
b/drivers/infiniband/core/umem_odp.c
index a02e6e3d7b72fb..da72318e17592f 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -103,7 +103,7 @@ static void ib_umem_notifier_release(struct mmu_notifier 
*mn,
 */
smp_wmb();
complete_all(&umem_odp->notifier_completion);
-   umem_odp->umem.context->device->ops.invalidate_range(
+   umem_odp->umem.ibdev->ops.invalidate_range(
umem_odp, ib_umem_start(umem_odp),
ib_umem_end(umem_odp));
}
@@ -116,7 +116,7 @@ static int invalidate_range_start_trampoline(struct 
ib_umem_odp *item,
 u64 start, u64 end, void *cookie)
 {
ib_umem_notifier_start_account(item);
-   item->umem.context->device->ops.invalidate_range(item, start, end);
+   item->umem.ibdev->ops.invalidate_range(item, start, end);
return 0;
 }
 
@@ -319,7 +319,7 @@ struct ib_umem_odp *ib_umem_odp_alloc_implicit(struct 
ib_udata *udata,
if (!umem_odp)
return ERR_PTR(-ENOMEM);
umem = &umem_odp->umem;
-   umem->context = context;
+   umem->ibdev = context->device;
umem->writable = ib_access_writable(access);
umem->owning_mm = current->mm;
umem_odp->is_implicit_odp = 1;
@@ -364,7 +364,7 @@ struct ib_umem_odp *ib_umem_odp_alloc_child(struct 
ib_umem_odp *root,
if (!odp_data)
return ERR_PTR(-ENOMEM);
umem = &odp_data->umem;
-   umem->context= root->umem.context;
+   umem->ibdev = root->umem.ibdev;
umem->length = size;
umem->address= addr;
umem->writable   = root->umem.writable;
@@ -477,8 +477,7 @@ static int ib_umem_odp_map_dma_single_page(
u64 access_mask,
unsigned long current_seq)
 {
-   struct ib_ucontext *context = umem_odp->umem.context;
-   struct ib_device *dev = context->device;
+   struct ib_device *dev = umem_odp->umem.ibdev;
dma_addr_t dma_addr;
int remove_existing_mapping = 0;
int ret = 0;
@@ -691,7 +690,7 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp 
*umem_odp, u64 virt,
 {
int idx;
u64 addr;
-   struct ib_device *dev = umem_odp->umem.context->device;
+   struct ib_device *dev = umem_odp->umem.ibdev;
 
virt = max_t(u64, virt, ib_umem_start(umem_odp));
bound = min_t(u64, bound, ib_umem_end(umem_odp));
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 1052d0d62be7d2..a91b2af64ec47b 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -42,7 +42,7 @@ struct ib_ucontext;
 struct ib_umem_odp;
 
 struct ib_umem {
-   struct ib_ucontext *context;
+   struct ib_device   *ibdev;
struct mm_struct   *owning_mm;
size_t  length;
unsigned long   address;
-- 
2.22.0


[PATCH v3 hmm 09/11] drm/amdkfd: fix a use after free race with mmu_notifier unregister

2019-08-07 Thread Jason Gunthorpe
From: Jason Gunthorpe 

When using mmu_notifier_unregister_no_release() the caller must ensure
there is a SRCU synchronize before the mn memory is freed, otherwise use
after free races are possible, for instance:

 CPU0  CPU1
  invalidate_range_start
 hlist_for_each_entry_rcu(..)
 mmu_notifier_unregister_no_release(>mn)
 kfree(mn)
  if (mn->ops->invalidate_range_end)

The error unwind in amdkfd misses the SRCU synchronization.

amdkfd keeps the kfd_process around until the mm is released, so split the
flow to fully initialize the kfd_process and register it for find_process,
and with the notifier. Past this point the kfd_process does not need to be
cleaned up as it is fully ready.

The final failable step does a vm_mmap() and does not seem to impact the
kfd_process global state. Since it also cannot be undone (and already has
problems with undo if it internally fails), it has to be last.

This way we don't have to try to unwind the mmu_notifier_register() and
avoid the problem with the SRCU.

Along the way this also fixes various other error unwind bugs in the flow.

Fixes: 45102048f77e ("amdkfd: Add process queue manager module")
Reviewed-by: Felix Kuehling 
Signed-off-by: Jason Gunthorpe 
---
 drivers/gpu/drm/amd/amdkfd/kfd_process.c | 78 +++-
 1 file changed, 37 insertions(+), 41 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 8f1076c0c88a25..c06e6190f21ffa 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -62,8 +62,8 @@ static struct workqueue_struct *kfd_restore_wq;
 
 static struct kfd_process *find_process(const struct task_struct *thread);
 static void kfd_process_ref_release(struct kref *ref);
-static struct kfd_process *create_process(const struct task_struct *thread,
-   struct file *filep);
+static struct kfd_process *create_process(const struct task_struct *thread);
+static int kfd_process_init_cwsr_apu(struct kfd_process *p, struct file 
*filep);
 
 static void evict_process_worker(struct work_struct *work);
 static void restore_process_worker(struct work_struct *work);
@@ -289,7 +289,15 @@ struct kfd_process *kfd_create_process(struct file *filep)
if (process) {
pr_debug("Process already found\n");
} else {
-   process = create_process(thread, filep);
+   process = create_process(thread);
+   if (IS_ERR(process))
+   goto out;
+
+   ret = kfd_process_init_cwsr_apu(process, filep);
+   if (ret) {
+   process = ERR_PTR(ret);
+   goto out;
+   }
 
if (!procfs.kobj)
goto out;
@@ -609,81 +617,69 @@ static int kfd_process_device_init_cwsr_dgpu(struct 
kfd_process_device *pdd)
return 0;
 }
 
-static struct kfd_process *create_process(const struct task_struct *thread,
-   struct file *filep)
+/*
+ * On return the kfd_process is fully operational and will be freed when the
+ * mm is released
+ */
+static struct kfd_process *create_process(const struct task_struct *thread)
 {
struct kfd_process *process;
int err = -ENOMEM;
 
process = kzalloc(sizeof(*process), GFP_KERNEL);
-
if (!process)
goto err_alloc_process;
 
-   process->pasid = kfd_pasid_alloc();
-   if (process->pasid == 0)
-   goto err_alloc_pasid;
-
-   if (kfd_alloc_process_doorbells(process) < 0)
-   goto err_alloc_doorbells;
-
kref_init(&process->ref);
-
mutex_init(&process->mutex);
-
process->mm = thread->mm;
-
-   /* register notifier */
-   process->mmu_notifier.ops = &kfd_process_mmu_notifier_ops;
-   err = mmu_notifier_register(&process->mmu_notifier, process->mm);
-   if (err)
-   goto err_mmu_notifier;
-
-   hash_add_rcu(kfd_processes_table, &process->kfd_processes,
-   (uintptr_t)process->mm);
-
process->lead_thread = thread->group_leader;
-   get_task_struct(process->lead_thread);
-
INIT_LIST_HEAD(&process->per_device_data);
-
+   INIT_DELAYED_WORK(&process->eviction_work, evict_process_worker);
+   INIT_DELAYED_WORK(&process->restore_work, restore_process_worker);
+   process->last_restore_timestamp = get_jiffies_64();
kfd_event_init_process(process);
+   process->is_32bit_user_mode = in_compat_syscall();
+
+   process->pasid = kfd_pasid_alloc();
+   if (process->pasid == 0)
+   goto err_alloc_pasid;
+
+   if (kfd_alloc_process_doorbells(process) < 0)
+   goto err_alloc_doorbells;
 

[PATCH v3 hmm 05/11] hmm: use mmu_notifier_get/put for 'struct hmm'

2019-08-07 Thread Jason Gunthorpe
From: Jason Gunthorpe 

This is a significant simplification: it eliminates all the remaining
'hmm' stuff in mm_struct, eliminates krefing along the critical notifier
paths, and takes away all the ugly locking and abuse of page_table_lock.

mmu_notifier_get() provides the single struct hmm per struct mm which
eliminates mm->hmm.

It also directly guarantees that no mmu_notifier op callback is callable
while concurrent free is possible, this eliminates all the krefs inside
the mmu_notifier callbacks.

The remaining krefs in the range code were overly cautious; drivers are
already not permitted to free the mirror while a range exists.

Signed-off-by: Jason Gunthorpe 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c |   1 +
 drivers/gpu/drm/nouveau/nouveau_drm.c   |   3 +
 include/linux/hmm.h |  12 +--
 include/linux/mm_types.h|   6 --
 kernel/fork.c   |   1 -
 mm/hmm.c| 121 ++--
 6 files changed, 33 insertions(+), 111 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index f2e8b4238efd49..d50774a5f98ef7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1464,6 +1464,7 @@ static void __exit amdgpu_exit(void)
amdgpu_unregister_atpx_handler();
amdgpu_sync_fini();
amdgpu_fence_slab_fini();
+   mmu_notifier_synchronize();
 }
 
 module_init(amdgpu_init);
diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c 
b/drivers/gpu/drm/nouveau/nouveau_drm.c
index 7c2fcaba42d6c3..a0e48a482452d7 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include <linux/mmu_notifier.h>
 
 #include 
 #include 
@@ -1292,6 +1293,8 @@ nouveau_drm_exit(void)
 #ifdef CONFIG_NOUVEAU_PLATFORM_DRIVER
platform_driver_unregister(_platform_driver);
 #endif
+   if (IS_ENABLED(CONFIG_DRM_NOUVEAU_SVM))
+   mmu_notifier_synchronize();
 }
 
 module_init(nouveau_drm_init);
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 82265118d94abd..c3902449db6412 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -84,15 +84,12 @@
  * @notifiers: count of active mmu notifiers
  */
 struct hmm {
-   struct mm_struct*mm;
-   struct kref kref;
+   struct mmu_notifier mmu_notifier;
spinlock_t  ranges_lock;
struct list_headranges;
struct list_headmirrors;
-   struct mmu_notifier mmu_notifier;
struct rw_semaphore mirrors_sem;
wait_queue_head_t   wq;
-   struct rcu_head rcu;
longnotifiers;
 };
 
@@ -436,13 +433,6 @@ long hmm_range_dma_unmap(struct hmm_range *range,
  */
 #define HMM_RANGE_DEFAULT_TIMEOUT 1000
 
-/* Below are for HMM internal use only! Not to be used by device driver! */
-static inline void hmm_mm_init(struct mm_struct *mm)
-{
-   mm->hmm = NULL;
-}
-#else /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-static inline void hmm_mm_init(struct mm_struct *mm) {}
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 #endif /* LINUX_HMM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3a37a89eb7a7c3..525d25d93330f2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -25,7 +25,6 @@
 
 struct address_space;
 struct mem_cgroup;
-struct hmm;
 
 /*
  * Each physical page in the system has a struct page associated with
@@ -502,11 +501,6 @@ struct mm_struct {
atomic_long_t hugetlb_usage;
 #endif
struct work_struct async_put_work;
-
-#ifdef CONFIG_HMM_MIRROR
-   /* HMM needs to track a few things per mm */
-   struct hmm *hmm;
-#endif
} __randomize_layout;
 
/*
diff --git a/kernel/fork.c b/kernel/fork.c
index d8ae0f1b414802..bd4a0762f12f3e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1007,7 +1007,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, 
struct task_struct *p,
mm_init_owner(mm, p);
RCU_INIT_POINTER(mm->exe_file, NULL);
mmu_notifier_mm_init(mm);
-   hmm_mm_init(mm);
init_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
mm->pmd_huge_pte = NULL;
diff --git a/mm/hmm.c b/mm/hmm.c
index 9a908902e4cc38..00f94f94906afc 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -26,101 +26,37 @@
 #include 
 #include 
 
-static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
-
-/**
- * hmm_get_or_create - register HMM against an mm (HMM internal)
- *
- * @mm: mm struct to attach to
- * Return: an HMM object, either by referencing the existing
- *  (per-process) object, or by creating a new one.
- *
- * This is not intended to be used directly by device drivers. If mm already
- * has an HMM struct then it get a reference on it and r

Re: [PATCH 15/15] amdgpu: remove CONFIG_DRM_AMDGPU_USERPTR

2019-08-07 Thread Jason Gunthorpe
On Tue, Aug 06, 2019 at 02:58:58PM -0400, Alex Deucher wrote:
> On Tue, Aug 6, 2019 at 1:51 PM Kuehling, Felix  wrote:
> >
> > On 2019-08-06 13:44, Jason Gunthorpe wrote:
> > > On Tue, Aug 06, 2019 at 07:05:53PM +0300, Christoph Hellwig wrote:
> > >> The option is just used to select HMM mirror support and has a very
> > >> confusing help text.  Just pull in the HMM mirror code by default
> > >> instead.
> > >>
> > >> Signed-off-by: Christoph Hellwig 
> > >>   drivers/gpu/drm/Kconfig |  2 ++
> > >>   drivers/gpu/drm/amd/amdgpu/Kconfig  | 10 --
> > >>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c |  6 --
> > >>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h | 12 
> > >>   4 files changed, 2 insertions(+), 28 deletions(-)
> > > Felix, was this an effort to avoid the arch restriction on hmm or
> > > something? Also can't see why this was like this.
> >
> > This option predates KFD's support of userptrs, which in turn predates
> > HMM. Radeon has the same kind of option, though it doesn't affect HMM in
> > that case.
> >
> > Alex, Christian, can you think of a good reason to maintain userptr
> > support as an option in amdgpu? I suspect it was originally meant as a
> > way to allow kernels with amdgpu without MMU notifiers. Now it would
> > allow a kernel with amdgpu without HMM or MMU notifiers. I don't know if
> > this is a useful thing to have.
> 
> Right.  There were people that didn't have MMU notifiers that wanted
> support for the GPU.

?? Is that even a real thing? mmu_notifier does not have much kconfig
dependency.

Jason

[PATCH hmm] drm/amdkfd: fix a use after free race with mmu_notifier unregister

2019-08-05 Thread Jason Gunthorpe
When using mmu_notifier_unregister_no_release() the caller must ensure
there is a SRCU synchronize before the mn memory is freed, otherwise use
after free races are possible, for instance:

 CPU0  CPU1
  invalidate_range_start
 hlist_for_each_entry_rcu(..)
 mmu_notifier_unregister_no_release(>mn)
 kfree(mn)
  if (mn->ops->invalidate_range_end)

The error unwind in amdkfd misses the SRCU synchronization.

amdkfd keeps the kfd_process around until the mm is released, so split the
flow to fully initialize the kfd_process and register it for find_process,
and with the notifier. Past this point the kfd_process does not need to be
cleaned up as it is fully ready.

The final failable step does a vm_mmap() and does not seem to impact the
kfd_process global state. Since it also cannot be undone (and already has
problems with undo if it internally fails), it has to be last.

This way we don't have to try to unwind the mmu_notifier_register() and
avoid the problem with the SRCU.

Along the way this also fixes various other error unwind bugs in the flow.

Fixes: 45102048f77e ("amdkfd: Add process queue manager module")
Signed-off-by: Jason Gunthorpe 
---
 drivers/gpu/drm/amd/amdkfd/kfd_process.c | 74 +++-
 1 file changed, 35 insertions(+), 39 deletions(-)

amdkfd folks, this little bug is blocking some rework I have for the
mmu notifiers (ie mm/mmu_notifiers: remove unregister_no_release)

Can I get your help to review and if needed polish this change? I'd
like to send this patch through the hmm tree along with the rework,
thanks

You can see the larger series here:

https://github.com/jgunthorpe/linux/commits/mmu_notifier

Jason
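
For reviewers, the reordered create_process() boils down to the following
shape (a simplified sketch only, not the actual hunk below; the real error
unwind for pasid/doorbells/pdd is kept in the patch):

static struct kfd_process *create_process_sketch(const struct task_struct *thread)
{
	struct kfd_process *p = kzalloc(sizeof(*p), GFP_KERNEL);
	int err;

	if (!p)
		return ERR_PTR(-ENOMEM);

	/* 1. plain initialization that never needs to be unwound */
	kref_init(&p->ref);
	mutex_init(&p->mutex);
	p->mm = thread->mm;

	/* 2. failable allocations, each with a matching error label */

	/* 3. last: register the notifier and publish the process; nothing
	 *    after this point can fail, so the registration never has to be
	 *    unwound and no SRCU synchronization is needed on error */
	p->mmu_notifier.ops = &kfd_process_mmu_notifier_ops;
	err = mmu_notifier_register(&p->mmu_notifier, p->mm);
	if (err)
		goto err_unwind;
	hash_add_rcu(kfd_processes_table, &p->kfd_processes, (uintptr_t)p->mm);
	return p;

err_unwind:
	/* undo steps 1-2 only; the notifier was never registered */
	kfree(p);
	return ERR_PTR(err);
}

The vm_mmap() based kfd_process_init_cwsr_apu() step moves out to
kfd_create_process() since it cannot be undone anyway.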

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c 
b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 8f1076c0c88a25..81e3ee3f1813bf 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -62,8 +62,8 @@ static struct workqueue_struct *kfd_restore_wq;
 
 static struct kfd_process *find_process(const struct task_struct *thread);
 static void kfd_process_ref_release(struct kref *ref);
-static struct kfd_process *create_process(const struct task_struct *thread,
-   struct file *filep);
+static struct kfd_process *create_process(const struct task_struct *thread);
+static int kfd_process_init_cwsr_apu(struct kfd_process *p, struct file 
*filep);
 
 static void evict_process_worker(struct work_struct *work);
 static void restore_process_worker(struct work_struct *work);
@@ -289,7 +289,15 @@ struct kfd_process *kfd_create_process(struct file *filep)
if (process) {
pr_debug("Process already found\n");
} else {
-   process = create_process(thread, filep);
+   process = create_process(thread);
+   if (IS_ERR(process))
+   goto out;
+
+   ret = kfd_process_init_cwsr_apu(process, filep);
+   if (ret) {
+   process = ERR_PTR(ret);
+   goto out;
+   }
 
if (!procfs.kobj)
goto out;
@@ -609,64 +617,56 @@ static int kfd_process_device_init_cwsr_dgpu(struct 
kfd_process_device *pdd)
return 0;
 }
 
-static struct kfd_process *create_process(const struct task_struct *thread,
-   struct file *filep)
+/*
+ * On return the kfd_process is fully operational and will be freed when the
+ * mm is released
+ */
+static struct kfd_process *create_process(const struct task_struct *thread)
 {
struct kfd_process *process;
int err = -ENOMEM;
 
process = kzalloc(sizeof(*process), GFP_KERNEL);
-
if (!process)
goto err_alloc_process;
 
-   process->pasid = kfd_pasid_alloc();
-   if (process->pasid == 0)
-   goto err_alloc_pasid;
-
-   if (kfd_alloc_process_doorbells(process) < 0)
-   goto err_alloc_doorbells;
-
kref_init(&process->ref);
-
mutex_init(&process->mutex);
-
process->mm = thread->mm;
-
-   /* register notifier */
-   process->mmu_notifier.ops = &kfd_process_mmu_notifier_ops;
-   err = mmu_notifier_register(&process->mmu_notifier, process->mm);
-   if (err)
-   goto err_mmu_notifier;
-
-   hash_add_rcu(kfd_processes_table, &process->kfd_processes,
-   (uintptr_t)process->mm);
-
process->lead_thread = thread->group_leader;
-   get_task_struct(process->lead_thread);
-
INIT_LIST_HEAD(&process->per_device_data);
-
+   INIT_DELAYED_WORK(&process->eviction_work, evict_process_worker);
+   INIT_DELAYED_WORK(&process->restore_work, restore_process_worker);
+   process->last_restore_timestamp = get_jiff
