Re: [PATCH v2 03/12] cxl/core: Refactor a cxl_lock_class() out of cxl_nested_lock()

2022-04-13 Thread Dan Williams
On Wed, Apr 13, 2022 at 2:40 AM Peter Zijlstra  wrote:
>
>
> *Completely* untested (I wouldn't know where to begin and probably odn't
> have the hardware anyway), and known incomplete.
>
> What's wrong with something like this then?
>

I'll give it a shot, I think it solves all the cringeworthy aspects of
what I proposed.



Re: [PATCH v2 02/12] device-core: Add dev->lock_class to enable device_lock() lockdep validation

2022-04-13 Thread Dan Williams
On Wed, Apr 13, 2022 at 1:43 AM Peter Zijlstra  wrote:
>
> On Tue, Apr 12, 2022 at 11:01:38PM -0700, Dan Williams wrote:
> > The device_lock() is hidden from lockdep by default because, for
> > example, a device subsystem may do something like:
> >
> > ---
> > device_add(dev1);
> > ...in driver core...
> > device_lock(dev1);
> > bus->probe(dev1); /* where bus->probe() calls driver1_probe() */
> >
> > driver1_probe(struct device *dev)
> > {
> >   ...do some enumeration...
> >   dev2->parent = dev;
> >   /* this triggers probe under device_lock(dev2); */
> >   device_add(dev2);
> > }
> > ---
> >
> > To lockdep, that device_lock(dev2) looks like a deadlock because lockdep
>
> Recursion, you're meaning to say it looks like same lock recursion.

Yes, wrong terminology on my part.

>
> > only sees lock classes, not individual lock instances. All device_lock()
> > instances across the entire kernel are the same class. However, this is
> > not a deadlock in practice because the locking is strictly hierarchical.
> > I.e. device_lock(dev1) is held over device_lock(dev2), but never the
> > reverse.
>
> I have some very vague memories from a conversation with Alan Stern,
> some maybe 10 years ago, where I think he was explaining to me this was
> not in fact a simple hierarchy.
>
> > In order for lockdep to be satisfied and see that it is
> > hierarchical in practice the mutex_lock() call in device_lock() needs to
> > be moved to mutex_lock_nested() where the @subclass argument to
> > mutex_lock_nested() represents the nesting level, i.e.:
>
> That's not an obvious conclusion; lockdep has lots of funny annotations,
> subclasses is just one.
>
> I think the big new development in lockdep since that time with Alan
> Stern is that lockdep now has support for dynamic keys; that is lock
> keys in heap memory (as opposed to static storage).

Ah, I was not aware of that, that should allow for deep cleanups of
this proposal.

>
> > s/device_lock(dev1)/mutex_lock_nested(&dev1->mutex, 1)/
> >
> > s/device_lock(dev2)/mutex_lock_nested(&dev2->mutex, 2)/
> >
> > Now, what if the internals of the device_lock() could be annotated with
> > the right @subclass argument to call mutex_lock_nested()?
> >
> > With device_set_lock_class() a subsystem can optionally add that
> > metadata. The device_lock() still takes dev->mutex, but when
> > dev->lock_class is >= 0 it additionally takes dev->lockdep_mutex with
> > the proper nesting. Unlike dev->mutex, dev->lockdep_mutex is not marked
> > lockdep_set_novalidate_class() and lockdep will become useful... at
> > least for one subsystem at a time.
> >
> > It is still the case that only one subsystem can be using lockdep with
> > lockdep_mutex at a time because different subsystems will collide class
> > numbers. You might say "well, how about subsystem1 gets class ids 0 to 9
> > and subsystem2 gets class ids 10 to 20?". MAX_LOCKDEP_SUBCLASSES is 8,
> > and 8 is just enough class ids for one subsystem of moderate complexity.
>
> Again, that doesn't seem like an obvious suggestion at all. Why not give
> each subsystem a different lock key?
>

Yes, that would also save a source of merge conflicts if every
subsystem needed to add conditional extensions to 'struct device' for
an array of lock metadata.

>
> > diff --git a/include/linux/device.h b/include/linux/device.h
> > index af2576ace130..6083e757e804 100644
> > --- a/include/linux/device.h
> > +++ b/include/linux/device.h
> > @@ -402,6 +402,7 @@ struct dev_msi_info {
> >   * @mutex:   Mutex to synchronize calls to its driver.
> >   * @lockdep_mutex: An optional debug lock that a subsystem can use as a
> >   *   peer lock to gain localized lockdep coverage of the 
> > device_lock.
> > + * @lock_class: per-subsystem annotated device lock class
> >   * @bus: Type of bus device is on.
> >   * @driver:  Which driver has allocated this
> >   * @platform_data: Platform data specific to the device.
> > @@ -501,6 +502,7 @@ struct device {
> >  dev_set_drvdata/dev_get_drvdata */
> >  #ifdef CONFIG_PROVE_LOCKING
> >   struct mutexlockdep_mutex;
> > + int lock_class;
> >  #endif
> >   struct mutexmutex;  /* mutex to synchronize calls to
> >* its driver.
> > @@ -762,18 +764,100 @@ static inline bool dev_pm_test_driver_flags(struct 
> > device *dev, u32 flags)
> >   return !!(dev->power.driver_flags & flags);
> >  }
> >
> > +static inline void device_lock_assert(struct device *dev)
> > +{
> > + lockdep_assert_held(&dev->mutex);
> > +}
> > +
> >  #ifdef CONFIG_PROVE_LOCKING
> >  static inline void device_lockdep_init(struct device *dev)
> >  {
> >   mutex_init(&dev->lockdep_mutex);
> > + dev->lock_class = -1;
> >   lockdep_set_novalidate_class(&dev->mutex);
> >  }
> > -#else
> > +
> > +static inline void device_lock(struct device *dev)
> > +{
> > + /*
> > +  * For double

Re: [PATCH v12 6/7] xfs: Implement ->notify_failure() for XFS

2022-04-13 Thread Christoph Hellwig
On Wed, Apr 13, 2022 at 10:09:40AM -0700, Dan Williams wrote:
> Yes, sounds like we're on the same page. I had mistakenly interpreted
> "Hence these notifications need to be delayed until after the
> filesystem is mounted" as something the producer would need to handle,
> but yes, consumer is free to drop if the notification arrives at an
> inopportune time.

A SB_BORN check might be all that we need.



Re: [PATCH v12 6/7] xfs: Implement ->notify_failure() for XFS

2022-04-13 Thread Dan Williams
On Tue, Apr 12, 2022 at 11:10 PM Dave Chinner  wrote:
>
> On Tue, Apr 12, 2022 at 07:06:40PM -0700, Dan Williams wrote:
> > On Tue, Apr 12, 2022 at 5:04 PM Dave Chinner  wrote:
> > > On Mon, Apr 11, 2022 at 12:09:03AM +0800, Shiyang Ruan wrote:
> > > > Introduce xfs_notify_failure.c to handle failure related works, such as
> > > > implement ->notify_failure(), register/unregister dax holder in xfs, and
> > > > so on.
> > > >
> > > > If the rmap feature of XFS enabled, we can query it to find files and
> > > > metadata which are associated with the corrupt data.  For now all we do
> > > > is kill processes with that file mapped into their address spaces, but
> > > > future patches could actually do something about corrupt metadata.
> > > >
> > > > After that, the memory failure needs to notify the processes who are
> > > > using those files.
> ...
> > > > @@ -1964,8 +1965,8 @@ xfs_alloc_buftarg(
> > > >   btp->bt_mount = mp;
> > > >   btp->bt_dev =  bdev->bd_dev;
> > > >   btp->bt_bdev = bdev;
> > > > - btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off, 
> > > > NULL,
> > > > - NULL);
> > > > + btp->bt_daxdev = fs_dax_get_by_bdev(bdev, &btp->bt_dax_part_off, 
> > > > mp,
> > > > + &xfs_dax_holder_operations);
> > >
> > > I see a problem with this: we are setting up notify callbacks before
> > > we've even read in the superblock during mount. i.e. we don't even
> > > kow yet if we've got an XFS filesystem on this block device.
> > > Hence these notifications need to be delayed until after the
> > > filesystem is mounted, all the internal structures have been set up
> > > and log recovery has completed.
> >
> > So I think this gets back to the fact that there will eventually be 2
> > paths into this notifier.
>
> I'm not really concerned by how the notifications are generated;
> my concern is purely that notifications can be handled safely.
>
> > All that to say, I think it is ok / expected for the filesystem to
> > drop notifications on the floor when it is not ready to handle them.
>
> Well, yes. The whole point of notifications is the consumer makes
> the decision on what to do with the notification it receives - the
> producer of the notification does not (and can not) dictate what
> policy the consumer(s) implement...
>
> > For example there are no processes to send SIGBUS to if the filesystem
> > has not even finished mount.
>
> There may be not processes to send SIGBUS to even if the filesystem
> has finished mount. But we still want the notifications to be
> delivered and we still need to handle them safely.
>
> IOWs, while we might start by avoiding notifications during mount,
> this doesn't mean we will never have reason to process events during
> mount. What we do with this notification is going to evolve over
> time as we add new and adapt existing functionality

Yes, sounds like we're on the same page. I had mistakenly interpreted
"Hence these notifications need to be delayed until after the
filesystem is mounted" as something the producer would need to handle,
but yes, consumer is free to drop if the notification arrives at an
inopportune time.



Re: [PATCH v2 00/12] device-core: Enable device_lock() lockdep validation

2022-04-13 Thread Waiman Long

On 4/13/22 02:01, Dan Williams wrote:

Changes since v1 [1]:
- Improve the clarity of the cover letter and changelogs of the
   major patches (Patch2 and Patch12) (Pierre, Kevin, and Dave)
- Fix device_lock_interruptible() false negative deadlock detection
   (Kevin)
- Fix off-by-one error in the device_set_lock_class() enable case (Kevin)
- Spelling fixes in Patch2 changelog (Pierre)
- Compilation fixes when both CONFIG_CXL_BUS=n and
   CONFIG_LIBNVDIMM=n. (0day robot)

[1]: 
https://lore.kernel.org/all/164610292916.2682974.12924748003366352335.st...@dwillia2-desk3.amr.corp.intel.com/

---

The device_lock() is why the lockdep_set_novalidate_class() API exists.
The lock is taken in too many disparate contexts, and lockdep by design
assumes that all device_lock() acquisitions are identical. The lack of
lockdep coverage leads to deadlock scenarios landing upstream. To
mitigate that problem the lockdep_mutex was added [2].

The lockdep_mutex lets a subsystem mirror device_lock() acquisitions
without lockdep_set_novalidate_class() to gain some limited lockdep
coverage. The mirroring approach is limited to taking the device_lock()
after-the-fact in a subsystem's 'struct bus_type' operations and fails
to cover device_lock() acquisition in the driver-core. It also can only
track the needs of one subsystem at a time so, for example the kernel
needs to be recompiled between CONFIG_PROVE_NVDIMM_LOCKING and
CONFIG_PROVE_CXL_LOCKING depending on which subsystem is being
regression tested. Obviously that also means that intra-subsystem
locking dependencies can not be validated.


Instead of using a fake lockdep_mutex, maybe you can just use a unique 
lockdep key for each subsystem and call lockdep_set_class() in the 
device_initialize() if such key is present or 
lockdep_set_novalidate_class() otherwise. The unique key can be passed 
either as a parameter to device_initialize() or as part of the device 
structure. It is certainly less cumbersome that having a fake 
lockdep_mutex array in the device structure.


Cheers,
Longman




Re: [PATCH v2 03/12] cxl/core: Refactor a cxl_lock_class() out of cxl_nested_lock()

2022-04-13 Thread Peter Zijlstra


*Completely* untested (I wouldn't know where to begin and probably odn't
have the hardware anyway), and known incomplete.

What's wrong with something like this then?

---
 drivers/cxl/core/pmem.c |  8 +
 drivers/cxl/core/port.c | 58 +---
 drivers/cxl/cxl.h   | 78 -
 3 files changed, 42 insertions(+), 102 deletions(-)

diff --git a/drivers/cxl/core/pmem.c b/drivers/cxl/core/pmem.c
index 8de240c4d96b..aca0dd5eefeb 100644
--- a/drivers/cxl/core/pmem.c
+++ b/drivers/cxl/core/pmem.c
@@ -80,6 +80,8 @@ struct cxl_nvdimm_bridge *cxl_find_nvdimm_bridge(struct 
cxl_nvdimm *cxl_nvd)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_find_nvdimm_bridge, CXL);
 
+static lock_class_key cxl_nvdimm_bridge_key;
+
 static struct cxl_nvdimm_bridge *cxl_nvdimm_bridge_alloc(struct cxl_port *port)
 {
struct cxl_nvdimm_bridge *cxl_nvb;
@@ -104,6 +106,8 @@ static struct cxl_nvdimm_bridge 
*cxl_nvdimm_bridge_alloc(struct cxl_port *port)
dev->bus = &cxl_bus_type;
dev->type = &cxl_nvdimm_bridge_type;
 
+   lockdep_set_class(&dev->mutex, &cxl_nvdimm_bridge_key);
+
return cxl_nvb;
 
 err:
@@ -214,6 +218,8 @@ struct cxl_nvdimm *to_cxl_nvdimm(struct device *dev)
 }
 EXPORT_SYMBOL_NS_GPL(to_cxl_nvdimm, CXL);
 
+static struct lock_class_key cxl_nvdimm_key;
+
 static struct cxl_nvdimm *cxl_nvdimm_alloc(struct cxl_memdev *cxlmd)
 {
struct cxl_nvdimm *cxl_nvd;
@@ -231,6 +237,8 @@ static struct cxl_nvdimm *cxl_nvdimm_alloc(struct 
cxl_memdev *cxlmd)
dev->bus = &cxl_bus_type;
dev->type = &cxl_nvdimm_type;
 
+   lockdep_set_class(&dev->mutex, &cxl_nvdimm_key);
+
return cxl_nvd;
 }
 
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 2ab1ba4499b3..cae88404612f 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -312,10 +312,10 @@ static void cxl_port_release(struct device *dev)
struct cxl_port *port = to_cxl_port(dev);
struct cxl_ep *ep, *_e;
 
-   cxl_device_lock(dev);
+   device_lock(dev);
list_for_each_entry_safe(ep, _e, &port->endpoints, list)
cxl_ep_release(ep);
-   cxl_device_unlock(dev);
+   device_unlock(dev);
ida_free(&cxl_port_ida, port->id);
kfree(port);
 }
@@ -391,6 +391,8 @@ static int devm_cxl_link_uport(struct device *host, struct 
cxl_port *port)
return devm_add_action_or_reset(host, cxl_unlink_uport, port);
 }
 
+static struct lock_class_key cxl_port_key;
+
 static struct cxl_port *cxl_port_alloc(struct device *uport,
   resource_size_t component_reg_phys,
   struct cxl_port *parent_port)
@@ -431,6 +433,8 @@ static struct cxl_port *cxl_port_alloc(struct device *uport,
dev->bus = &cxl_bus_type;
dev->type = &cxl_port_type;
 
+   lockdep_set_class(&dev->mutex, &cxl_port_key);
+
return port;
 
 err:
@@ -457,8 +461,10 @@ struct cxl_port *devm_cxl_add_port(struct device *host, 
struct device *uport,
if (IS_ERR(port))
return port;
 
-   if (parent_port)
+   if (parent_port) {
port->depth = parent_port->depth + 1;
+   lockdep_set_class_and_subclass(&port->dev->mutex, &cxl_port, 
port->depth);
+   }
dev = &port->dev;
if (is_cxl_memdev(uport))
rc = dev_set_name(dev, "endpoint%d", port->id);
@@ -554,7 +560,7 @@ static int match_root_child(struct device *dev, const void 
*match)
return 0;
 
port = to_cxl_port(dev);
-   cxl_device_lock(dev);
+   device_lock(dev);
list_for_each_entry(dport, &port->dports, list) {
iter = match;
while (iter) {
@@ -564,7 +570,7 @@ static int match_root_child(struct device *dev, const void 
*match)
}
}
 out:
-   cxl_device_unlock(dev);
+   device_unlock(dev);
 
return !!iter;
 }
@@ -623,13 +629,13 @@ static int add_dport(struct cxl_port *port, struct 
cxl_dport *new)
 static void cond_cxl_root_lock(struct cxl_port *port)
 {
if (is_cxl_root(port))
-   cxl_device_lock(&port->dev);
+   device_lock(&port->dev);
 }
 
 static void cond_cxl_root_unlock(struct cxl_port *port)
 {
if (is_cxl_root(port))
-   cxl_device_unlock(&port->dev);
+   device_unlock(&port->dev);
 }
 
 static void cxl_dport_remove(void *data)
@@ -736,15 +742,15 @@ static int add_ep(struct cxl_port *port, struct cxl_ep 
*new)
 {
struct cxl_ep *dup;
 
-   cxl_device_lock(&port->dev);
+   device_lock(&port->dev);
if (port->dead) {
-   cxl_device_unlock(&port->dev);
+   device_unlock(&port->dev);
return -ENXIO;
}
dup = find_ep(port, new->ep);
if (!dup)
list_add_tail(&new->list, &port->endpoints);
-   cxl_device_unlock(&port->dev);
+

Re: [PATCH v2 02/12] device-core: Add dev->lock_class to enable device_lock() lockdep validation

2022-04-13 Thread Peter Zijlstra
On Tue, Apr 12, 2022 at 11:01:38PM -0700, Dan Williams wrote:
> The device_lock() is hidden from lockdep by default because, for
> example, a device subsystem may do something like:
> 
> ---
> device_add(dev1);
> ...in driver core...
> device_lock(dev1);
> bus->probe(dev1); /* where bus->probe() calls driver1_probe() */
> 
> driver1_probe(struct device *dev)
> {
>   ...do some enumeration...
>   dev2->parent = dev;
>   /* this triggers probe under device_lock(dev2); */
>   device_add(dev2);
> }
> ---
> 
> To lockdep, that device_lock(dev2) looks like a deadlock because lockdep

Recursion, you're meaning to say it looks like same lock recursion.

> only sees lock classes, not individual lock instances. All device_lock()
> instances across the entire kernel are the same class. However, this is
> not a deadlock in practice because the locking is strictly hierarchical.
> I.e. device_lock(dev1) is held over device_lock(dev2), but never the
> reverse.

I have some very vague memories from a conversation with Alan Stern,
some maybe 10 years ago, where I think he was explaining to me this was
not in fact a simple hierarchy.

> In order for lockdep to be satisfied and see that it is
> hierarchical in practice the mutex_lock() call in device_lock() needs to
> be moved to mutex_lock_nested() where the @subclass argument to
> mutex_lock_nested() represents the nesting level, i.e.:

That's not an obvious conclusion; lockdep has lots of funny annotations,
subclasses is just one.

I think the big new development in lockdep since that time with Alan
Stern is that lockdep now has support for dynamic keys; that is lock
keys in heap memory (as opposed to static storage).

> s/device_lock(dev1)/mutex_lock_nested(&dev1->mutex, 1)/
> 
> s/device_lock(dev2)/mutex_lock_nested(&dev2->mutex, 2)/
> 
> Now, what if the internals of the device_lock() could be annotated with
> the right @subclass argument to call mutex_lock_nested()?
> 
> With device_set_lock_class() a subsystem can optionally add that
> metadata. The device_lock() still takes dev->mutex, but when
> dev->lock_class is >= 0 it additionally takes dev->lockdep_mutex with
> the proper nesting. Unlike dev->mutex, dev->lockdep_mutex is not marked
> lockdep_set_novalidate_class() and lockdep will become useful... at
> least for one subsystem at a time.
> 
> It is still the case that only one subsystem can be using lockdep with
> lockdep_mutex at a time because different subsystems will collide class
> numbers. You might say "well, how about subsystem1 gets class ids 0 to 9
> and subsystem2 gets class ids 10 to 20?". MAX_LOCKDEP_SUBCLASSES is 8,
> and 8 is just enough class ids for one subsystem of moderate complexity.

Again, that doesn't seem like an obvious suggestion at all. Why not give
each subsystem a different lock key?


> diff --git a/include/linux/device.h b/include/linux/device.h
> index af2576ace130..6083e757e804 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -402,6 +402,7 @@ struct dev_msi_info {
>   * @mutex:   Mutex to synchronize calls to its driver.
>   * @lockdep_mutex: An optional debug lock that a subsystem can use as a
>   *   peer lock to gain localized lockdep coverage of the device_lock.
> + * @lock_class: per-subsystem annotated device lock class
>   * @bus: Type of bus device is on.
>   * @driver:  Which driver has allocated this
>   * @platform_data: Platform data specific to the device.
> @@ -501,6 +502,7 @@ struct device {
>  dev_set_drvdata/dev_get_drvdata */
>  #ifdef CONFIG_PROVE_LOCKING
>   struct mutexlockdep_mutex;
> + int lock_class;
>  #endif
>   struct mutexmutex;  /* mutex to synchronize calls to
>* its driver.
> @@ -762,18 +764,100 @@ static inline bool dev_pm_test_driver_flags(struct 
> device *dev, u32 flags)
>   return !!(dev->power.driver_flags & flags);
>  }
>  
> +static inline void device_lock_assert(struct device *dev)
> +{
> + lockdep_assert_held(&dev->mutex);
> +}
> +
>  #ifdef CONFIG_PROVE_LOCKING
>  static inline void device_lockdep_init(struct device *dev)
>  {
>   mutex_init(&dev->lockdep_mutex);
> + dev->lock_class = -1;
>   lockdep_set_novalidate_class(&dev->mutex);
>  }
> -#else
> +
> +static inline void device_lock(struct device *dev)
> +{
> + /*
> +  * For double-lock programming errors the kernel will hang
> +  * trying to acquire @dev->mutex before lockdep can report the
> +  * problem acquiring @dev->lockdep_mutex, so manually assert
> +  * before that hang.
> +  */
> + lockdep_assert_not_held(&dev->lockdep_mutex);
> +
> + mutex_lock(&dev->mutex);
> + if (dev->lock_class >= 0)
> + mutex_lock_nested(&dev->lockdep_mutex, dev->lock_class);
> +}
> +
> +static inline int device_lock_interruptible(struct device *dev)
> +{
> + int rc;
> +
> + lockdep_assert_no