Re: [PATCH v3 8/8] nvdimm: Fix firmware activation deadlock scenarios

2022-04-22 Thread Ira Weiny
On Thu, Apr 21, 2022 at 08:33:51AM -0700, Dan Williams wrote:
> Lockdep reports the following deadlock scenarios for CXL root device
> power-management, device_prepare(), operations, and device_shutdown()
> operations for 'nd_region' devices:
> 
> ---
>  Chain exists of:
>&nvdimm_region_key --> &nvdimm_bus->reconfig_mutex --> 
> system_transition_mutex
> 
>   Possible unsafe locking scenario:
> 
> CPU0CPU1
> 
>lock(system_transition_mutex);
> lock(&nvdimm_bus->reconfig_mutex);
> lock(system_transition_mutex);
>lock(&nvdimm_region_key);
> 
> --
> 
>  Chain exists of:
>&cxl_nvdimm_bridge_key --> acpi_scan_lock --> &cxl_root_key
> 
>   Possible unsafe locking scenario:
> 
> CPU0CPU1
> 
>lock(&cxl_root_key);
> lock(acpi_scan_lock);
> lock(&cxl_root_key);
>lock(&cxl_nvdimm_bridge_key);
> 
> ---
> 
> These stem from holding nvdimm_bus_lock() over hibernate_quiet_exec()
> which walks the entire system device topology taking device_lock() along
> the way. The nvdimm_bus_lock() is protecting against unregistration,
> multiple simultaneous ops callers, and preventing activate_show() from
> racing activate_store(). For the first 2, the lock is redundant.
> Unregistration already flushes all ops users, and sysfs already prevents
> multiple threads to be active in an ops handler at the same time. For
> the last userspace should already be waiting for its last
> activate_store() to complete, and does not need activate_show() to flush
> the write side, so this lock usage can be deleted in these attributes.
>

I'm sorry if this is obvious but why can't the locking be removed from
capability_show() and nvdimm_bus_firmware_visible() as well?

Effectively it sounds like we don't care if the cap read is racing any state
change?  And we know the device can't go away while sysfs is calling those
functions.

Ira

> 
> Fixes: 48001ea50d17 ("PM, libnvdimm: Add runtime firmware activation support")
> Signed-off-by: Dan Williams 
> ---
>  drivers/nvdimm/core.c |4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/drivers/nvdimm/core.c b/drivers/nvdimm/core.c
> index 144926b7451c..7c7f4a43fd4f 100644
> --- a/drivers/nvdimm/core.c
> +++ b/drivers/nvdimm/core.c
> @@ -395,10 +395,8 @@ static ssize_t activate_show(struct device *dev,
>   if (!nd_desc->fw_ops)
>   return -EOPNOTSUPP;
>  
> - nvdimm_bus_lock(dev);
>   cap = nd_desc->fw_ops->capability(nd_desc);
>   state = nd_desc->fw_ops->activate_state(nd_desc);
> - nvdimm_bus_unlock(dev);
>  
>   if (cap < NVDIMM_FWA_CAP_QUIESCE)
>   return -EOPNOTSUPP;
> @@ -443,7 +441,6 @@ static ssize_t activate_store(struct device *dev,
>   else
>   return -EINVAL;
>  
> - nvdimm_bus_lock(dev);
>   state = nd_desc->fw_ops->activate_state(nd_desc);
>  
>   switch (state) {
> @@ -461,7 +458,6 @@ static ssize_t activate_store(struct device *dev,
>   default:
>   rc = -ENXIO;
>   }
> - nvdimm_bus_unlock(dev);
>  
>   if (rc == 0)
>   rc = len;
> 
> 



Re: [PATCH v3 7/8] device-core: Kill the lockdep_mutex

2022-04-22 Thread Ira Weiny
On Thu, Apr 21, 2022 at 08:33:45AM -0700, Dan Williams wrote:
> Per Peter [1], the lockdep API has native support for all the use cases
> lockdep_mutex was attempting to enable. Now that all lockdep_mutex users
> have been converted to those APIs, drop this lock.
> 
> Link: 
> https://lore.kernel.org/r/ylf0dewci8myl...@hirez.programming.kicks-ass.net [1]
> Cc: Greg Kroah-Hartman 
> Cc: "Rafael J. Wysocki" 
> Suggested-by: Peter Zijlstra 
> Signed-off-by: Dan Williams 

Reviewed-by: Ira Weiny 

> ---
>  drivers/base/core.c|3 ---
>  include/linux/device.h |5 -
>  2 files changed, 8 deletions(-)
> 
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 3d6430eb0c6a..2eede2ec3d64 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -2864,9 +2864,6 @@ void device_initialize(struct device *dev)
>   kobject_init(&dev->kobj, &device_ktype);
>   INIT_LIST_HEAD(&dev->dma_pools);
>   mutex_init(&dev->mutex);
> -#ifdef CONFIG_PROVE_LOCKING
> - mutex_init(&dev->lockdep_mutex);
> -#endif
>   lockdep_set_novalidate_class(&dev->mutex);
>   spin_lock_init(&dev->devres_lock);
>   INIT_LIST_HEAD(&dev->devres_head);
> diff --git a/include/linux/device.h b/include/linux/device.h
> index 82c9d307e7bd..c00ab223da50 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -400,8 +400,6 @@ struct dev_msi_info {
>   *   This identifies the device type and carries type-specific
>   *   information.
>   * @mutex:   Mutex to synchronize calls to its driver.
> - * @lockdep_mutex: An optional debug lock that a subsystem can use as a
> - *   peer lock to gain localized lockdep coverage of the device_lock.
>   * @bus: Type of bus device is on.
>   * @driver:  Which driver has allocated this
>   * @platform_data: Platform data specific to the device.
> @@ -499,9 +497,6 @@ struct device {
>  core doesn't touch it */
>   void*driver_data;   /* Driver data, set and get with
>  dev_set_drvdata/dev_get_drvdata */
> -#ifdef CONFIG_PROVE_LOCKING
> - struct mutexlockdep_mutex;
> -#endif
>   struct mutexmutex;  /* mutex to synchronize calls to
>* its driver.
>*/
> 
> 



Re: [PATCH v3 6/8] nvdimm: Drop nd_device_lock()

2022-04-22 Thread Ira Weiny
On Thu, Apr 21, 2022 at 08:33:39AM -0700, Dan Williams wrote:
> Now that all NVDIMM subsystem locking is validated with custom lock
> classes, there is no need for the custom usage of the lockdep_mutex.
> 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Ira Weiny 
> Signed-off-by: Dan Williams 

Reviewed-by: Ira Weiny 

> ---
>  drivers/nvdimm/btt_devs.c   |   16 +
>  drivers/nvdimm/bus.c|   24 +-
>  drivers/nvdimm/core.c   |   10 +++---
>  drivers/nvdimm/dimm_devs.c  |8 ++---
>  drivers/nvdimm/namespace_devs.c |   36 +++--
>  drivers/nvdimm/nd-core.h|   66 
> ---
>  drivers/nvdimm/pfn_devs.c   |   24 +++---
>  drivers/nvdimm/pmem.c   |2 +
>  drivers/nvdimm/region.c |2 +
>  drivers/nvdimm/region_devs.c|   16 +
>  lib/Kconfig.debug   |   17 --
>  11 files changed, 66 insertions(+), 155 deletions(-)
> 
> diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
> index 120821796668..fabbb31f2c35 100644
> --- a/drivers/nvdimm/btt_devs.c
> +++ b/drivers/nvdimm/btt_devs.c
> @@ -50,14 +50,14 @@ static ssize_t sector_size_store(struct device *dev,
>   struct nd_btt *nd_btt = to_nd_btt(dev);
>   ssize_t rc;
>  
> - nd_device_lock(dev);
> + device_lock(dev);
>   nvdimm_bus_lock(dev);
>   rc = nd_size_select_store(dev, buf, &nd_btt->lbasize,
>   btt_lbasize_supported);
>   dev_dbg(dev, "result: %zd wrote: %s%s", rc, buf,
>   buf[len - 1] == '\n' ? "" : "\n");
>   nvdimm_bus_unlock(dev);
> - nd_device_unlock(dev);
> + device_unlock(dev);
>  
>   return rc ? rc : len;
>  }
> @@ -79,11 +79,11 @@ static ssize_t uuid_store(struct device *dev,
>   struct nd_btt *nd_btt = to_nd_btt(dev);
>   ssize_t rc;
>  
> - nd_device_lock(dev);
> + device_lock(dev);
>   rc = nd_uuid_store(dev, &nd_btt->uuid, buf, len);
>   dev_dbg(dev, "result: %zd wrote: %s%s", rc, buf,
>   buf[len - 1] == '\n' ? "" : "\n");
> - nd_device_unlock(dev);
> + device_unlock(dev);
>  
>   return rc ? rc : len;
>  }
> @@ -108,13 +108,13 @@ static ssize_t namespace_store(struct device *dev,
>   struct nd_btt *nd_btt = to_nd_btt(dev);
>   ssize_t rc;
>  
> - nd_device_lock(dev);
> + device_lock(dev);
>   nvdimm_bus_lock(dev);
>   rc = nd_namespace_store(dev, &nd_btt->ndns, buf, len);
>   dev_dbg(dev, "result: %zd wrote: %s%s", rc, buf,
>   buf[len - 1] == '\n' ? "" : "\n");
>   nvdimm_bus_unlock(dev);
> - nd_device_unlock(dev);
> + device_unlock(dev);
>  
>   return rc;
>  }
> @@ -126,14 +126,14 @@ static ssize_t size_show(struct device *dev,
>   struct nd_btt *nd_btt = to_nd_btt(dev);
>   ssize_t rc;
>  
> - nd_device_lock(dev);
> + device_lock(dev);
>   if (dev->driver)
>   rc = sprintf(buf, "%llu\n", nd_btt->size);
>   else {
>   /* no size to convey if the btt instance is disabled */
>   rc = -ENXIO;
>   }
> - nd_device_unlock(dev);
> + device_unlock(dev);
>  
>   return rc;
>  }
> diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
> index 85ffa04e93c2..a4fc17db707c 100644
> --- a/drivers/nvdimm/bus.c
> +++ b/drivers/nvdimm/bus.c
> @@ -88,10 +88,7 @@ static int nvdimm_bus_probe(struct device *dev)
>   dev->driver->name, dev_name(dev));
>  
>   nvdimm_bus_probe_start(nvdimm_bus);
> - debug_nvdimm_lock(dev);
>   rc = nd_drv->probe(dev);
> - debug_nvdimm_unlock(dev);
> -
>   if ((rc == 0 || rc == -EOPNOTSUPP) &&
>   dev->parent && is_nd_region(dev->parent))
>   nd_region_advance_seeds(to_nd_region(dev->parent), dev);
> @@ -111,11 +108,8 @@ static void nvdimm_bus_remove(struct device *dev)
>   struct module *provider = to_bus_provider(dev);
>   struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
>  
> - if (nd_drv->remove) {
> - debug_nvdimm_lock(dev);
> + if (nd_drv->remove)
>   nd_drv->remove(dev);
> - debug_nvdimm_unlock(dev);
> - }
>  
>   dev_dbg(&nvdimm_bus->dev, "%s.remove(%s)\n", dev->driver->name,
>   dev_name(dev));
> @@ -139,7 +133,7 @@ static void nvdimm_bus_shutdown(struct device *dev)
>  
>  void nd_device_notify(struct device *dev, enum nvdimm_event event)
>  {
> - nd_device_lock(dev);
> + device_lock(dev);
>   if (dev->driver) {
>   struct nd_device_driver *nd_drv;
>  
> @@ -147,7 +141,7 @@ void nd_device_notify(struct device *dev, enum 
> nvdimm_event event)
>   if (nd_drv->notify)
>   nd_drv->notify(dev, event);
>   }
> - nd_device_unlock(dev);
> + device_unlock(dev);
>  }
>  EXPORT_SYMBOL(nd_device_notify);
>  
> @@ -569,9 +563,9 @@ void nd_device_unregister(

Re: [PATCH v3 5/8] ACPI: NFIT: Drop nfit_device_lock()

2022-04-22 Thread Ira Weiny
On Thu, Apr 21, 2022 at 08:33:34AM -0700, Dan Williams wrote:
> The nfit_device_lock() helper was added to provide lockdep coverage for
> the NFIT driver's usage of device_lock() on the nvdimm_bus object. Now
> that nvdimm_bus objects have their own lock class this wrapper can be
> dropped.
> 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Ira Weiny 
> Signed-off-by: Dan Williams 

Reviewed-by: Ira Weiny 

> ---
>  drivers/acpi/nfit/core.c |   30 +++---
>  drivers/acpi/nfit/nfit.h |   24 
>  2 files changed, 15 insertions(+), 39 deletions(-)
> 
> diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
> index fe61f617a943..ae5f4acf2675 100644
> --- a/drivers/acpi/nfit/core.c
> +++ b/drivers/acpi/nfit/core.c
> @@ -1230,7 +1230,7 @@ static ssize_t hw_error_scrub_store(struct device *dev,
>   if (rc)
>   return rc;
>  
> - nfit_device_lock(dev);
> + device_lock(dev);
>   nd_desc = dev_get_drvdata(dev);
>   if (nd_desc) {
>   struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
> @@ -1247,7 +1247,7 @@ static ssize_t hw_error_scrub_store(struct device *dev,
>   break;
>   }
>   }
> - nfit_device_unlock(dev);
> + device_unlock(dev);
>   if (rc)
>   return rc;
>   return size;
> @@ -1267,10 +1267,10 @@ static ssize_t scrub_show(struct device *dev,
>   ssize_t rc = -ENXIO;
>   bool busy;
>  
> - nfit_device_lock(dev);
> + device_lock(dev);
>   nd_desc = dev_get_drvdata(dev);
>   if (!nd_desc) {
> - nfit_device_unlock(dev);
> + device_unlock(dev);
>   return rc;
>   }
>   acpi_desc = to_acpi_desc(nd_desc);
> @@ -1287,7 +1287,7 @@ static ssize_t scrub_show(struct device *dev,
>   }
>  
>   mutex_unlock(&acpi_desc->init_mutex);
> - nfit_device_unlock(dev);
> + device_unlock(dev);
>   return rc;
>  }
>  
> @@ -1304,14 +1304,14 @@ static ssize_t scrub_store(struct device *dev,
>   if (val != 1)
>   return -EINVAL;
>  
> - nfit_device_lock(dev);
> + device_lock(dev);
>   nd_desc = dev_get_drvdata(dev);
>   if (nd_desc) {
>   struct acpi_nfit_desc *acpi_desc = to_acpi_desc(nd_desc);
>  
>   rc = acpi_nfit_ars_rescan(acpi_desc, ARS_REQ_LONG);
>   }
> - nfit_device_unlock(dev);
> + device_unlock(dev);
>   if (rc)
>   return rc;
>   return size;
> @@ -1697,9 +1697,9 @@ static void acpi_nvdimm_notify(acpi_handle handle, u32 
> event, void *data)
>   struct acpi_device *adev = data;
>   struct device *dev = &adev->dev;
>  
> - nfit_device_lock(dev->parent);
> + device_lock(dev->parent);
>   __acpi_nvdimm_notify(dev, event);
> - nfit_device_unlock(dev->parent);
> + device_unlock(dev->parent);
>  }
>  
>  static bool acpi_nvdimm_has_method(struct acpi_device *adev, char *method)
> @@ -3152,8 +3152,8 @@ static int acpi_nfit_flush_probe(struct 
> nvdimm_bus_descriptor *nd_desc)
>   struct device *dev = acpi_desc->dev;
>  
>   /* Bounce the device lock to flush acpi_nfit_add / acpi_nfit_notify */
> - nfit_device_lock(dev);
> - nfit_device_unlock(dev);
> + device_lock(dev);
> + device_unlock(dev);
>  
>   /* Bounce the init_mutex to complete initial registration */
>   mutex_lock(&acpi_desc->init_mutex);
> @@ -3305,8 +3305,8 @@ void acpi_nfit_shutdown(void *data)
>* acpi_nfit_ars_rescan() submissions have had a chance to
>* either submit or see ->cancel set.
>*/
> - nfit_device_lock(bus_dev);
> - nfit_device_unlock(bus_dev);
> + device_lock(bus_dev);
> + device_unlock(bus_dev);
>  
>   flush_workqueue(nfit_wq);
>  }
> @@ -3449,9 +3449,9 @@ EXPORT_SYMBOL_GPL(__acpi_nfit_notify);
>  
>  static void acpi_nfit_notify(struct acpi_device *adev, u32 event)
>  {
> - nfit_device_lock(&adev->dev);
> + device_lock(&adev->dev);
>   __acpi_nfit_notify(&adev->dev, adev->handle, event);
> - nfit_device_unlock(&adev->dev);
> + device_unlock(&adev->dev);
>  }
>  
>  static const struct acpi_device_id acpi_nfit_ids[] = {
> diff --git a/drivers/acpi/nfit/nfit.h b/drivers/acpi/nfit/nfit.h
> index 50882bdbeb96..6023ad61831a 100644
> --- a/drivers/acpi/nfit/nfit.h
> +++ b/drivers/acpi/nfit/nfit.h
> @@ -337,30 +337,6 @@ static inline struct acpi_nfit_desc *to_acpi_desc(
>   return container_of(nd_desc, struct acpi_nfit_desc, nd_desc);
>  }
>  
> -#ifdef CONFIG_PROVE_LOCKING
> -static inline void nfit_device_lock(struct device *dev)
> -{
> - device_lock(dev);
> - mutex_lock(&dev->lockdep_mutex);
> -}
> -
> -static inline void nfit_device_unlock(struct device *dev)
> -{
> - mutex_unlock(&dev->lockdep_mutex);
> - device_unlock(dev);
> -}
> -#else
> -static inline void nfit_device_lock(struct device *dev)
> -{
> - device_lock(dev);
> -}
> -
> -static inline void nfit_device_un

Re: [PATCH v3 4/8] nvdimm: Replace lockdep_mutex with local lock classes

2022-04-22 Thread Ira Weiny
On Thu, Apr 21, 2022 at 08:33:29AM -0700, Dan Williams wrote:
> In response to an attempt to expand dev->lockdep_mutex for device_lock()
> validation [1], Peter points out [2] that the lockdep API already has
> the ability to assign a dedicated lock class per subsystem device-type.
> 
> Use lockdep_set_class() to override the default device_lock()
> '__lockdep_no_validate__' class for each NVDIMM subsystem device-type. This
> enables lockdep to detect deadlocks and recursive locking within the
> device-driver core and the subsystem.
> 
> Link: 
> https://lore.kernel.org/r/164982968798.684294.15817853329823976469.st...@dwillia2-desk3.amr.corp.intel.com
>  [1]
> Link: 
> https://lore.kernel.org/r/ylf0dewci8myl...@hirez.programming.kicks-ass.net [2]
> Suggested-by: Peter Zijlstra 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Ira Weiny 
> Signed-off-by: Dan Williams 

Reviewed-by: Ira Weiny 

> ---
>  drivers/nvdimm/btt_devs.c   |7 +--
>  drivers/nvdimm/bus.c|   14 +++---
>  drivers/nvdimm/dax_devs.c   |4 ++--
>  drivers/nvdimm/dimm_devs.c  |4 
>  drivers/nvdimm/namespace_devs.c |   10 +-
>  drivers/nvdimm/nd-core.h|2 +-
>  drivers/nvdimm/pfn_devs.c   |7 +--
>  drivers/nvdimm/region_devs.c|4 
>  8 files changed, 37 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
> index e5a58520d398..120821796668 100644
> --- a/drivers/nvdimm/btt_devs.c
> +++ b/drivers/nvdimm/btt_devs.c
> @@ -178,6 +178,8 @@ bool is_nd_btt(struct device *dev)
>  }
>  EXPORT_SYMBOL(is_nd_btt);
>  
> +static struct lock_class_key nvdimm_btt_key;
> +
>  static struct device *__nd_btt_create(struct nd_region *nd_region,
> unsigned long lbasize, uuid_t *uuid,
> struct nd_namespace_common *ndns)
> @@ -205,6 +207,7 @@ static struct device *__nd_btt_create(struct nd_region 
> *nd_region,
>   dev->parent = &nd_region->dev;
>   dev->type = &nd_btt_device_type;
>   device_initialize(&nd_btt->dev);
> + lockdep_set_class(&nd_btt->dev.mutex, &nvdimm_btt_key);
>   if (ndns && !__nd_attach_ndns(&nd_btt->dev, ndns, &nd_btt->ndns)) {
>   dev_dbg(&ndns->dev, "failed, already claimed by %s\n",
>   dev_name(ndns->claim));
> @@ -225,7 +228,7 @@ struct device *nd_btt_create(struct nd_region *nd_region)
>  {
>   struct device *dev = __nd_btt_create(nd_region, 0, NULL, NULL);
>  
> - __nd_device_register(dev);
> + nd_device_register(dev);
>   return dev;
>  }
>  
> @@ -324,7 +327,7 @@ static int __nd_btt_probe(struct nd_btt *nd_btt,
>   if (!nd_btt->uuid)
>   return -ENOMEM;
>  
> - __nd_device_register(&nd_btt->dev);
> + nd_device_register(&nd_btt->dev);
>  
>   return 0;
>  }
> diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
> index 7b0d1443217a..85ffa04e93c2 100644
> --- a/drivers/nvdimm/bus.c
> +++ b/drivers/nvdimm/bus.c
> @@ -334,6 +334,8 @@ struct nvdimm_bus *nvdimm_to_bus(struct nvdimm *nvdimm)
>  }
>  EXPORT_SYMBOL_GPL(nvdimm_to_bus);
>  
> +static struct lock_class_key nvdimm_bus_key;
> +
>  struct nvdimm_bus *nvdimm_bus_register(struct device *parent,
>   struct nvdimm_bus_descriptor *nd_desc)
>  {
> @@ -360,6 +362,7 @@ struct nvdimm_bus *nvdimm_bus_register(struct device 
> *parent,
>   nvdimm_bus->dev.bus = &nvdimm_bus_type;
>   nvdimm_bus->dev.of_node = nd_desc->of_node;
>   device_initialize(&nvdimm_bus->dev);
> + lockdep_set_class(&nvdimm_bus->dev.mutex, &nvdimm_bus_key);
>   device_set_pm_not_required(&nvdimm_bus->dev);
>   rc = dev_set_name(&nvdimm_bus->dev, "ndbus%d", nvdimm_bus->id);
>   if (rc)
> @@ -511,7 +514,7 @@ static void nd_async_device_unregister(void *d, 
> async_cookie_t cookie)
>   put_device(dev);
>  }
>  
> -void __nd_device_register(struct device *dev)
> +void nd_device_register(struct device *dev)
>  {
>   if (!dev)
>   return;
> @@ -537,12 +540,6 @@ void __nd_device_register(struct device *dev)
>   async_schedule_dev_domain(nd_async_device_register, dev,
> &nd_async_domain);
>  }
> -
> -void nd_device_register(struct device *dev)
> -{
> - device_initialize(dev);
> - __nd_device_register(dev);
> -}
>  EXPORT_SYMBOL(nd_device_register);
>  
>  void nd_device_unregister(struct device *dev, enum nd_async_mode mode)
> @@ -724,6 +721,8 @@ static void ndctl_release(struct device *dev)
>   kfree(dev);
>  }
>  
> +static struct lock_class_key nvdimm_ndctl_key;
> +
>  int nvdimm_bus_create_ndctl(struct nvdimm_bus *nvdimm_bus)
>  {
>   dev_t devt = MKDEV(nvdimm_bus_major, nvdimm_bus->id);
> @@ -734,6 +733,7 @@ int nvdimm_bus_create_ndctl(struct nvdimm_bus *nvdimm_bus)
>   if (!dev)
>   return -ENOMEM;
>   device_initialize(dev);
> + lockdep_set_class(&dev->mutex, &nvdimm_ndctl_key)

Re: [PATCH v3 2/8] cxl/acpi: Add root device lockdep validation

2022-04-22 Thread Dan Williams
On Fri, Apr 22, 2022 at 4:58 PM Ira Weiny  wrote:
>
> On Thu, Apr 21, 2022 at 08:33:18AM -0700, Dan Williams wrote:
> > The CXL "root" device, ACPI0017, is an attach point for coordinating
> > platform level CXL resources and is the parent device for a CXL port
> > topology tree. As such it has distinct locking rules relative to other
> > CXL subsystem objects, but because it is an ACPI device the lock class
> > is established well before it is given to the cxl_acpi driver.
>
> This final sentence gave me pause because it implied that the device lock 
> class
> was set to something other than no validate.  But I don't see that anywhere in
> the acpi code.  So given that it looks to me like ACPI is just using the
> default no validate class...

Oh, good observation. *If* ACPI had set a custom lock class then
cxl_acpi would need to be careful to restore that ACPI-specific class
and not reset it to "no validate" on exit, or skip setting its own
custom class. However, I think for generic buses like ACPI that feed
devices into other subsystems it likely has little reason to set its
own class. For safety, since device_lock_set_class() is general
purpose, I'll have it emit a debug message and fail if the class is
not "no validate" on entry.

Thanks Ira!



Re: [PATCH v3 3/8] cxl: Drop cxl_device_lock()

2022-04-22 Thread Ira Weiny
On Thu, Apr 21, 2022 at 08:33:23AM -0700, Dan Williams wrote:
> Now that all CXL subsystem locking is validated with custom lock
> classes, there is no need for the custom usage of the lockdep_mutex.
> 
> Cc: Alison Schofield 
> Cc: Vishal Verma 
> Cc: Ira Weiny 
> Cc: Ben Widawsky 
> Cc: Jonathan Cameron 
> Signed-off-by: Dan Williams 

Reviewed-by: Ira Weiny 

> ---
>  drivers/cxl/core/pmem.c |4 +-
>  drivers/cxl/core/port.c |   55 ++---
>  drivers/cxl/cxl.h   |   78 
> ---
>  drivers/cxl/mem.c   |4 +-
>  drivers/cxl/pmem.c  |   12 ---
>  lib/Kconfig.debug   |6 
>  6 files changed, 33 insertions(+), 126 deletions(-)
> 
> diff --git a/drivers/cxl/core/pmem.c b/drivers/cxl/core/pmem.c
> index e825e261278d..bec7cfb54ebf 100644
> --- a/drivers/cxl/core/pmem.c
> +++ b/drivers/cxl/core/pmem.c
> @@ -124,10 +124,10 @@ static void unregister_nvb(void *_cxl_nvb)
>* work to flush. Once the state has been changed to 'dead' then no new
>* work can be queued by user-triggered bind.
>*/
> - cxl_device_lock(&cxl_nvb->dev);
> + device_lock(&cxl_nvb->dev);
>   flush = cxl_nvb->state != CXL_NVB_NEW;
>   cxl_nvb->state = CXL_NVB_DEAD;
> - cxl_device_unlock(&cxl_nvb->dev);
> + device_unlock(&cxl_nvb->dev);
>  
>   /*
>* Even though the device core will trigger device_release_driver()
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 750aac95ed5f..ea60abda6500 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -312,10 +312,10 @@ static void cxl_port_release(struct device *dev)
>   struct cxl_port *port = to_cxl_port(dev);
>   struct cxl_ep *ep, *_e;
>  
> - cxl_device_lock(dev);
> + device_lock(dev);
>   list_for_each_entry_safe(ep, _e, &port->endpoints, list)
>   cxl_ep_release(ep);
> - cxl_device_unlock(dev);
> + device_unlock(dev);
>   ida_free(&cxl_port_ida, port->id);
>   kfree(port);
>  }
> @@ -556,7 +556,7 @@ static int match_root_child(struct device *dev, const 
> void *match)
>   return 0;
>  
>   port = to_cxl_port(dev);
> - cxl_device_lock(dev);
> + device_lock(dev);
>   list_for_each_entry(dport, &port->dports, list) {
>   iter = match;
>   while (iter) {
> @@ -566,7 +566,7 @@ static int match_root_child(struct device *dev, const 
> void *match)
>   }
>   }
>  out:
> - cxl_device_unlock(dev);
> + device_unlock(dev);
>  
>   return !!iter;
>  }
> @@ -625,13 +625,13 @@ static int add_dport(struct cxl_port *port, struct 
> cxl_dport *new)
>  static void cond_cxl_root_lock(struct cxl_port *port)
>  {
>   if (is_cxl_root(port))
> - cxl_device_lock(&port->dev);
> + device_lock(&port->dev);
>  }
>  
>  static void cond_cxl_root_unlock(struct cxl_port *port)
>  {
>   if (is_cxl_root(port))
> - cxl_device_unlock(&port->dev);
> + device_unlock(&port->dev);
>  }
>  
>  static void cxl_dport_remove(void *data)
> @@ -738,15 +738,15 @@ static int add_ep(struct cxl_port *port, struct cxl_ep 
> *new)
>  {
>   struct cxl_ep *dup;
>  
> - cxl_device_lock(&port->dev);
> + device_lock(&port->dev);
>   if (port->dead) {
> - cxl_device_unlock(&port->dev);
> + device_unlock(&port->dev);
>   return -ENXIO;
>   }
>   dup = find_ep(port, new->ep);
>   if (!dup)
>   list_add_tail(&new->list, &port->endpoints);
> - cxl_device_unlock(&port->dev);
> + device_unlock(&port->dev);
>  
>   return dup ? -EEXIST : 0;
>  }
> @@ -856,12 +856,12 @@ static void delete_endpoint(void *data)
>   goto out;
>   parent = &parent_port->dev;
>  
> - cxl_device_lock(parent);
> + device_lock(parent);
>   if (parent->driver && endpoint->uport) {
>   devm_release_action(parent, cxl_unlink_uport, endpoint);
>   devm_release_action(parent, unregister_port, endpoint);
>   }
> - cxl_device_unlock(parent);
> + device_unlock(parent);
>   put_device(parent);
>  out:
>   put_device(&endpoint->dev);
> @@ -922,7 +922,7 @@ static void cxl_detach_ep(void *data)
>   }
>  
>   parent_port = to_cxl_port(port->dev.parent);
> - cxl_device_lock(&parent_port->dev);
> + device_lock(&parent_port->dev);
>   if (!parent_port->dev.driver) {
>   /*
>* The bottom-up race to delete the port lost to a
> @@ -930,12 +930,12 @@ static void cxl_detach_ep(void *data)
>* parent_port ->remove() will have cleaned up all
>* descendants.
>*/
> - cxl_device_unlock(&parent_port->dev);
> + device_unlock(&parent_port->dev);
>

Re: [PATCH v13 0/7] fsdax: introduce fs query to support reflink

2022-04-22 Thread Dave Chinner
On Fri, Apr 22, 2022 at 02:27:32PM -0700, Dan Williams wrote:
> On Thu, Apr 21, 2022 at 12:47 AM Dave Chinner  wrote:
> >
> > On Wed, Apr 20, 2022 at 10:54:59PM -0700, Christoph Hellwig wrote:
> > > On Thu, Apr 21, 2022 at 02:35:02PM +1000, Dave Chinner wrote:
> > > > Sure, I'm not a maintainer and just the stand-in patch shepherd for
> > > > a single release. However, being unable to cleanly merge code we
> > > > need integrated into our local subsystem tree for integration
> > > > testing because a patch dependency with another subsystem won't gain
> > > > a stable commit ID until the next merge window is  distinctly
> > > > suboptimal.
> > >
> > > Yes.  Which is why we've taken a lot of mm patchs through other trees,
> > > sometimes specilly crafted for that.  So I guess in this case we'll
> > > just need to take non-trivial dependencies into the XFS tree, and just
> > > deal with small merge conflicts for the trivial ones.
> >
> > OK. As Naoyo has pointed out, the first dependency/conflict Ruan has
> > listed looks trivial to resolve.
> >
> > The second dependency, OTOH, is on a new function added in the patch
> > pointed to. That said, at first glance it looks to be independent of
> > the first two patches in that series so I might just be able to pull
> > that one patch in and have that leave us with a working
> > fsdax+reflink tree.
> >
> > Regardless, I'll wait to see how much work the updated XFS/DAX
> > reflink enablement patchset still requires when Ruan posts it before
> > deciding what to do here.  If it isn't going to be a merge
> > candidate, what to do with this patchset is moot because there's
> > little to test without reflink enabled...
> 
> I do have a use case for this work absent the reflink work.  Recall we
> had a conversation about how to communicate "dax-device has been
> ripped away from the fs" events and we ended up on the idea of reusing
> ->notify_failure(), but with the device's entire logical address range
> as the notification span. That will let me unwind and delete the
> PTE_DEVMAP infrastructure for taking extra device references to hold
> off device-removal. Instead ->notify_failure() arranges for all active
> DAX mappings to be invalidated and allow the removal to proceed
> especially since physical removal does not care about software pins.

Sure. My point is that if the reflink enablement isn't ready to go,
then from an XFS POV none of this matters in this cycle and we can
just leave the dependencies to commit via Andrew's tree. Hence by
the time we get to the reflink enablement all the prior dependencies
will have been merged and have stable commit IDs, and we can just
stage this series and the reflink enablement as we normally would in
the next cycle.

However, if we don't get the XFS reflink dax enablement sorted out
in the next week or two, then we don't need this patchset in this
cycle. Hence if you still need this patchset for other code you need
to merge in this cycle, then you're the poor schmuck that has to run
the mm-tree conflict guantlet to get a stable commit ID for the
dependent patches in this cycle, not me

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [PATCH v3 2/8] cxl/acpi: Add root device lockdep validation

2022-04-22 Thread Ira Weiny
On Thu, Apr 21, 2022 at 08:33:18AM -0700, Dan Williams wrote:
> The CXL "root" device, ACPI0017, is an attach point for coordinating
> platform level CXL resources and is the parent device for a CXL port
> topology tree. As such it has distinct locking rules relative to other
> CXL subsystem objects, but because it is an ACPI device the lock class
> is established well before it is given to the cxl_acpi driver.
 
This final sentence gave me pause because it implied that the device lock class
was set to something other than no validate.  But I don't see that anywhere in
the acpi code.  So given that it looks to me like ACPI is just using the
default no validate class...

Reviewed-by: Ira Weiny 

> However, the lockdep API does support changing the lock class "live" for
> situations like this. Add a device_lock_set_class() helper that a driver
> can use in ->probe() to set a custom lock class, and
> device_lock_reset_class() to return to the default "no validate" class
> before the custom lock class key goes out of scope after ->remove().
> 
> Note the helpers are all macros to support dead code elimination in the
> CONFIG_PROVE_LOCKING=n case.
> 
> Suggested-by: Peter Zijlstra 
> Cc: Greg Kroah-Hartman 
> Cc: "Rafael J. Wysocki" 
> Cc: Ingo Molnar 
> Cc: Will Deacon 
> Cc: Waiman Long 
> Cc: Boqun Feng 
> Cc: Alison Schofield 
> Cc: Vishal Verma 
> Cc: Ira Weiny 
> Cc: Ben Widawsky 
> Cc: Jonathan Cameron 
> Signed-off-by: Dan Williams 
> ---
>  drivers/cxl/acpi.c |   15 +++
>  include/linux/device.h |   25 +
>  2 files changed, 40 insertions(+)
> 
> diff --git a/drivers/cxl/acpi.c b/drivers/cxl/acpi.c
> index d15a6aec0331..e19cea27387e 100644
> --- a/drivers/cxl/acpi.c
> +++ b/drivers/cxl/acpi.c
> @@ -275,6 +275,15 @@ static int add_root_nvdimm_bridge(struct device *match, 
> void *data)
>   return 1;
>  }
>  
> +static struct lock_class_key cxl_root_key;
> +
> +static void cxl_acpi_lock_reset_class(void *_dev)
> +{
> + struct device *dev = _dev;
> +
> + device_lock_reset_class(dev);
> +}
> +
>  static int cxl_acpi_probe(struct platform_device *pdev)
>  {
>   int rc;
> @@ -283,6 +292,12 @@ static int cxl_acpi_probe(struct platform_device *pdev)
>   struct acpi_device *adev = ACPI_COMPANION(host);
>   struct cxl_cfmws_context ctx;
>  
> + device_lock_set_class(&pdev->dev, &cxl_root_key);
> + rc = devm_add_action_or_reset(&pdev->dev, cxl_acpi_lock_reset_class,
> +   &pdev->dev);
> + if (rc)
> + return rc;
> +
>   root_port = devm_cxl_add_port(host, host, CXL_RESOURCE_NONE, NULL);
>   if (IS_ERR(root_port))
>   return PTR_ERR(root_port);
> diff --git a/include/linux/device.h b/include/linux/device.h
> index 93459724dcde..82c9d307e7bd 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -850,6 +850,31 @@ static inline bool device_supports_offline(struct device 
> *dev)
>   return dev->bus && dev->bus->offline && dev->bus->online;
>  }
>  
> +#define __device_lock_set_class(dev, name, key) \
> + lock_set_class(&(dev)->mutex.dep_map, name, key, 0, _THIS_IP_)
> +
> +/**
> + * device_lock_set_class - Specify a temporary lock class while a device
> + *  is attached to a driver
> + * @dev: device to modify
> + * @key: lock class key data
> + *
> + * This must be called with the device_lock() already held, for example
> + * from driver ->probe().
> + */
> +#define device_lock_set_class(dev, key)  \
> + __device_lock_set_class(dev, #key, key)
> +
> +/**
> + * device_lock_reset_class - Return a device to the default lockdep 
> novalidate state
> + * @dev: device to modify
> + *
> + * This must be called with the device_lock() already held, for example
> + * from driver ->remove().
> + */
> +#define device_lock_reset_class(dev) \
> + device_lock_set_class(dev, &__lockdep_no_validate__)
> +
>  void lock_device_hotplug(void);
>  void unlock_device_hotplug(void);
>  int lock_device_hotplug_sysfs(void);
> 



Re: [PATCH v3 1/8] cxl: Replace lockdep_mutex with local lock classes

2022-04-22 Thread Ira Weiny
On Thu, Apr 21, 2022 at 08:33:13AM -0700, Dan Williams wrote:
> In response to an attempt to expand dev->lockdep_mutex for device_lock()
> validation [1], Peter points out [2] that the lockdep API already has
> the ability to assign a dedicated lock class per subsystem device-type.
> 
> Use lockdep_set_class() to override the default device_lock()
> '__lockdep_no_validate__' class for each CXL subsystem device-type. This
> enables lockdep to detect deadlocks and recursive locking within the
> device-driver core and the subsystem. The
> lockdep_set_class_and_subclass() API is used for port objects that
> recursively lock the 'cxl_port_key' class by hierarchical topology
> depth.
> 
> Link: 
> https://lore.kernel.org/r/164982968798.684294.15817853329823976469.st...@dwillia2-desk3.amr.corp.intel.com
>  [1]
> Link: 
> https://lore.kernel.org/r/ylf0dewci8myl...@hirez.programming.kicks-ass.net [2]
> Suggested-by: Peter Zijlstra 
> Cc: Ingo Molnar 
> Cc: Will Deacon 
> Cc: Waiman Long 
> Cc: Boqun Feng 
> Cc: Alison Schofield 
> Cc: Vishal Verma 
> Cc: Ira Weiny 
> Cc: Ben Widawsky 
> Cc: Jonathan Cameron 
> Signed-off-by: Dan Williams 

Reviewed-by: Ira Weiny 

> ---
>  drivers/cxl/core/memdev.c |3 +++
>  drivers/cxl/core/pmem.c   |6 ++
>  drivers/cxl/core/port.c   |   13 +
>  3 files changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
> index 1f76b28f9826..f7cdcd33504a 100644
> --- a/drivers/cxl/core/memdev.c
> +++ b/drivers/cxl/core/memdev.c
> @@ -228,6 +228,8 @@ static void detach_memdev(struct work_struct *work)
>   put_device(&cxlmd->dev);
>  }
>  
> +static struct lock_class_key cxl_memdev_key;
> +
>  static struct cxl_memdev *cxl_memdev_alloc(struct cxl_dev_state *cxlds,
>  const struct file_operations *fops)
>  {
> @@ -247,6 +249,7 @@ static struct cxl_memdev *cxl_memdev_alloc(struct 
> cxl_dev_state *cxlds,
>  
>   dev = &cxlmd->dev;
>   device_initialize(dev);
> + lockdep_set_class(&dev->mutex, &cxl_memdev_key);
>   dev->parent = cxlds->dev;
>   dev->bus = &cxl_bus_type;
>   dev->devt = MKDEV(cxl_mem_major, cxlmd->id);
> diff --git a/drivers/cxl/core/pmem.c b/drivers/cxl/core/pmem.c
> index 8de240c4d96b..e825e261278d 100644
> --- a/drivers/cxl/core/pmem.c
> +++ b/drivers/cxl/core/pmem.c
> @@ -80,6 +80,8 @@ struct cxl_nvdimm_bridge *cxl_find_nvdimm_bridge(struct 
> cxl_nvdimm *cxl_nvd)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_find_nvdimm_bridge, CXL);
>  
> +static struct lock_class_key cxl_nvdimm_bridge_key;
> +
>  static struct cxl_nvdimm_bridge *cxl_nvdimm_bridge_alloc(struct cxl_port 
> *port)
>  {
>   struct cxl_nvdimm_bridge *cxl_nvb;
> @@ -99,6 +101,7 @@ static struct cxl_nvdimm_bridge 
> *cxl_nvdimm_bridge_alloc(struct cxl_port *port)
>   cxl_nvb->port = port;
>   cxl_nvb->state = CXL_NVB_NEW;
>   device_initialize(dev);
> + lockdep_set_class(&dev->mutex, &cxl_nvdimm_bridge_key);
>   device_set_pm_not_required(dev);
>   dev->parent = &port->dev;
>   dev->bus = &cxl_bus_type;
> @@ -214,6 +217,8 @@ struct cxl_nvdimm *to_cxl_nvdimm(struct device *dev)
>  }
>  EXPORT_SYMBOL_NS_GPL(to_cxl_nvdimm, CXL);
>  
> +static struct lock_class_key cxl_nvdimm_key;
> +
>  static struct cxl_nvdimm *cxl_nvdimm_alloc(struct cxl_memdev *cxlmd)
>  {
>   struct cxl_nvdimm *cxl_nvd;
> @@ -226,6 +231,7 @@ static struct cxl_nvdimm *cxl_nvdimm_alloc(struct 
> cxl_memdev *cxlmd)
>   dev = &cxl_nvd->dev;
>   cxl_nvd->cxlmd = cxlmd;
>   device_initialize(dev);
> + lockdep_set_class(&dev->mutex, &cxl_nvdimm_key);
>   device_set_pm_not_required(dev);
>   dev->parent = &cxlmd->dev;
>   dev->bus = &cxl_bus_type;
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 2ab1ba4499b3..750aac95ed5f 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -391,6 +391,8 @@ static int devm_cxl_link_uport(struct device *host, 
> struct cxl_port *port)
>   return devm_add_action_or_reset(host, cxl_unlink_uport, port);
>  }
>  
> +static struct lock_class_key cxl_port_key;
> +
>  static struct cxl_port *cxl_port_alloc(struct device *uport,
>  resource_size_t component_reg_phys,
>  struct cxl_port *parent_port)
> @@ -415,9 +417,10 @@ static struct cxl_port *cxl_port_alloc(struct device 
> *uport,
>* description.
>*/
>   dev = &port->dev;
> - if (parent_port)
> + if (parent_port) {
>   dev->parent = &parent_port->dev;
> - else
> + port->depth = parent_port->depth + 1;
> + } else
>   dev->parent = uport;
>  
>   port->uport = uport;
> @@ -427,6 +430,7 @@ static struct cxl_port *cxl_port_alloc(struct device 
> *uport,
>   INIT_LIST_HEAD(&port->endpoints);
>  
>   device_initialize(dev);
> + lockdep_set_class_and_subclass(&dev->mutex, &cxl_po

Re: [PATCH v13 0/7] fsdax: introduce fs query to support reflink

2022-04-22 Thread Dan Williams
On Thu, Apr 21, 2022 at 12:47 AM Dave Chinner  wrote:
>
> On Wed, Apr 20, 2022 at 10:54:59PM -0700, Christoph Hellwig wrote:
> > On Thu, Apr 21, 2022 at 02:35:02PM +1000, Dave Chinner wrote:
> > > Sure, I'm not a maintainer and just the stand-in patch shepherd for
> > > a single release. However, being unable to cleanly merge code we
> > > need integrated into our local subsystem tree for integration
> > > testing because a patch dependency with another subsystem won't gain
> > > a stable commit ID until the next merge window is  distinctly
> > > suboptimal.
> >
> > Yes.  Which is why we've taken a lot of mm patchs through other trees,
> > sometimes specilly crafted for that.  So I guess in this case we'll
> > just need to take non-trivial dependencies into the XFS tree, and just
> > deal with small merge conflicts for the trivial ones.
>
> OK. As Naoyo has pointed out, the first dependency/conflict Ruan has
> listed looks trivial to resolve.
>
> The second dependency, OTOH, is on a new function added in the patch
> pointed to. That said, at first glance it looks to be independent of
> the first two patches in that series so I might just be able to pull
> that one patch in and have that leave us with a working
> fsdax+reflink tree.
>
> Regardless, I'll wait to see how much work the updated XFS/DAX
> reflink enablement patchset still requires when Ruan posts it before
> deciding what to do here.  If it isn't going to be a merge
> candidate, what to do with this patchset is moot because there's
> little to test without reflink enabled...

I do have a use case for this work absent the reflink work. Recall we
had a conversation about how to communicate "dax-device has been
ripped away from the fs" events and we ended up on the idea of reusing
->notify_failure(), but with the device's entire logical address range
as the notification span. That will let me unwind and delete the
PTE_DEVMAP infrastructure for taking extra device references to hold
off device-removal. Instead ->notify_failure() arranges for all active
DAX mappings to be invalidated and allow the removal to proceed
especially since physical removal does not care about software pins.



Re: [PATCH v13 3/7] pagemap,pmem: Introduce ->memory_failure()

2022-04-22 Thread Shiyang Ruan




在 2022/4/21 16:24, Miaohe Lin 写道:

On 2022/4/19 12:50, Shiyang Ruan wrote:

When memory-failure occurs, we call this function which is implemented
by each kind of devices.  For the fsdax case, pmem device driver
implements it.  Pmem device driver will find out the filesystem in which
the corrupted page located in.

With dax_holder notify support, we are able to notify the memory failure
from pmem driver to upper layers.  If there is something not support in
the notify routine, memory_failure will fall back to the generic hanlder.

Signed-off-by: Shiyang Ruan 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Dan Williams 
---
  drivers/nvdimm/pmem.c| 17 +
  include/linux/memremap.h | 12 
  mm/memory-failure.c  | 14 ++
  3 files changed, 43 insertions(+)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 58d95242a836..bd502957cfdf 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -366,6 +366,21 @@ static void pmem_release_disk(void *__pmem)
blk_cleanup_disk(pmem->disk);
  }
  
+static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,

+   unsigned long pfn, unsigned long nr_pages, int mf_flags)
+{
+   struct pmem_device *pmem =
+   container_of(pgmap, struct pmem_device, pgmap);
+   u64 offset = PFN_PHYS(pfn) - pmem->phys_addr - pmem->data_offset;
+   u64 len = nr_pages << PAGE_SHIFT;
+
+   return dax_holder_notify_failure(pmem->dax_dev, offset, len, mf_flags);
+}
+
+static const struct dev_pagemap_ops fsdax_pagemap_ops = {
+   .memory_failure = pmem_pagemap_memory_failure,
+};
+
  static int pmem_attach_disk(struct device *dev,
struct nd_namespace_common *ndns)
  {
@@ -427,6 +442,7 @@ static int pmem_attach_disk(struct device *dev,
pmem->pfn_flags = PFN_DEV;
if (is_nd_pfn(dev)) {
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
+   pmem->pgmap.ops = &fsdax_pagemap_ops;
addr = devm_memremap_pages(dev, &pmem->pgmap);
pfn_sb = nd_pfn->pfn_sb;
pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
@@ -440,6 +456,7 @@ static int pmem_attach_disk(struct device *dev,
pmem->pgmap.range.end = res->end;
pmem->pgmap.nr_range = 1;
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
+   pmem->pgmap.ops = &fsdax_pagemap_ops;
addr = devm_memremap_pages(dev, &pmem->pgmap);
pmem->pfn_flags |= PFN_MAP;
bb_range = pmem->pgmap.range;
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index ad6062d736cd..bcfb6bf4ce5a 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -79,6 +79,18 @@ struct dev_pagemap_ops {
 * the page back to a CPU accessible page.
 */
vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
+
+   /*
+* Handle the memory failure happens on a range of pfns.  Notify the
+* processes who are using these pfns, and try to recover the data on
+* them if necessary.  The mf_flags is finally passed to the recover
+* function through the whole notify routine.
+*
+* When this is not implemented, or it returns -EOPNOTSUPP, the caller
+* will fall back to a common handler called mf_generic_kill_procs().
+*/
+   int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
+ unsigned long nr_pages, int mf_flags);
  };
  
  #define PGMAP_ALTMAP_VALID	(1 << 0)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7c8c047bfdc8..a40e79e634a4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1741,6 +1741,20 @@ static int memory_failure_dev_pagemap(unsigned long pfn, 
int flags,
if (!pgmap_pfn_valid(pgmap, pfn))
goto out;
  
+	/*

+* Call driver's implementation to handle the memory failure, otherwise
+* fall back to generic handler.
+*/
+   if (pgmap->ops->memory_failure) {
+   rc = pgmap->ops->memory_failure(pgmap, pfn, 1, flags);
+   /*
+* Fall back to generic handler too if operation is not
+* supported inside the driver/device/filesystem.
+*/
+   if (rc != -EOPNOTSUPP)
+   goto out;
+   }
+


Thanks for your patch. There are two questions:

1.Is dax_lock_page + dax_unlock_page pair needed here?


They are moved into mf_generic_kill_procs() in Patch2.  Callback will 
implement its own dax lock/unlock method.  For example, for 
mf_dax_kill_procs() in Patch4, we implemented 
dax_lock_mapping_entry()/dax_unlock_mapping_entry() for it.



2.hwpoison_filter and SetPageHWPoison will be handled by the callback or 
they're just ignored deliberately?


SetPageHWPoison() will be handled by callback or by mf_generic_kill_procs().

hwpoison_filter() is moved int