Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

2021-03-19 Thread Dan Williams
On Fri, Mar 19, 2021 at 6:47 PM Dave Chinner  wrote:
[..]
> > Now I'm trying to reconcile the fact that platform
> > poison handling will hit memory_failure() first and may not
> > immediately reach the driver, if ever (see the perennially awkward
> > firmware-first-mode error handling: ghes_handle_memory_failure()) . So
> > even if the ->memory_failure(dev...) up call exists there is no
> > guarantee it can get called for all poison before the memory_failure()
> > down call happens. Which means regardless of whether
> > ->memory_failure(dev...) exists memory_failure() needs to be able to
> > do the right thing.
>
> I don't see how a poor implementation of memory_failure in a driver
> or hardware is even remotely relevant to the interface used to
> notify the filesystem of a media or device failure. It sounds like
> you are trying to use memory_failure() for something it was never
> intended to support and that there's a bunch of driver and
> infrastructure work needed to make it work how you want it to work.
> And even then it may not work the way we want it to work
>
> > Combine that with the fact that new buses like CXL might be configured
> > in "poison on decode error" mode which means that a memory_failure()
> > storm can happen regardless of whether the driver initiates it
> > programatically.
>
> Straw man argument.
>
> "We can't make this interface a ranged notification because the
> hardware might only be able to do page-by-page notification."

No, it's "we can't make this interface notify the filesytem that
sectors have failed before the memory_failure() (ranged or not) has
communicated that pfns have failed."

memory_failure() today is the first and sometimes only interface that
learns of pfn failures.

>
> You can do page-by-page notification with a range based interface.
> We are talking about how to efficiently and reliably inform the
> filesystem that a range of a device is no longer accessible and so
> it needs to revoke all mappings over that range of it's address
> space. That does not need to be a single page at a time interface.
>
> If your hardware is configured to do stupid things, then that is not
> the fault of the software interface used to communicate faults up
> the stack, nor is it something that the notfication interface should
> try to fix or mitigate.
>
> > How about a mechanism to optionally let a filesystem take over memory
> > failure handling for a range of pfns that the memory_failure() can
> > consult to fail ranges at a time rather than one by one? So a new
> > 'struct dax_operations' op (void) (*memory_failure_register(struct
> > dax_device *, void *data). Where any agent that claims a dax_dev can
> > register to take over memory_failure() handling for any event that
> > happens in that range. This would be routed through device-mapper like
> > any other 'struct dax_operations' op. I think that meets your
> > requirement to get notifications of all the events you want to handle,
> > but still allows memory_failure() to be the last resort for everything
> > that has not opted into this error handling.
>
> Which is basically the same as the proposed ->corrupted_range stack,
> except it doesn't map the pfns back to LBA addresses the filesystem
> needs to make sense of the failure.
>
> fs-dax filesystems have no clue what pfns are, or how to translate
> them to LBAs in their block device address space that the map
> everything to. The fs-dax infrastructure asks the filesystem for
> bdev/sector based mappings, and internally converts them to pfns by
> a combination of bdev and daxdev callouts. Hence fs-dax filesystems
> never see nor interpret pfns at all.  Nor do they have the
> capability to convert a PFN to a LBA address. And neither the
> underlying block device nor the associated DAX device provide a
> method for doing this reverse mapping translation.

True.

>
> So if we have an interface that hands a {daxdev,PFN,len} tuple to
> the filesystem, exactly what is the filesystem supposed to do with
> it? How do we turn that back into a {bdev,sector,len} tuple so we
> can do reverse mapping lookups to find the filesystem objects that
> allocated within the notified range?
>
> I'll point out again that these address space translations were
> something that the ->corrupted_range callbacks handled directly - no
> layer in the stack was handed a range that it didn't know how to map
> to it's own internal structures. By the time it got to the
> filesystem, it was a {bdev,sector,len} tuple, and the filesystem
> could feed that directly to it's reverse mapping lookups
>
> Maybe I'm missing something magical about ->memory_failure that does
> all this translation for us, but I don't see it in this patchset. I
> just don't see how this proposed interface is a usable at the
> filesystem level as it stands.

So then it's not the filesystem that needs to register for
memory_failure() it's the driver in order to translate the failed LBAs
up the stack. However, 

Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

2021-03-19 Thread Dave Chinner
On Thu, Mar 18, 2021 at 12:20:35PM -0700, Dan Williams wrote:
> On Wed, Mar 17, 2021 at 9:58 PM Dave Chinner  wrote:
> >
> > On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote:
> > > Jason wondered why the get_user_pages_fast() path takes references on a
> > > @pgmap object. The rationale was to protect against accessing a 'struct
> > > page' that might be in the process of being removed by the driver, but
> > > he rightly points out that should be solved the same way all gup-fast
> > > synchronization is solved which is invalidate the mapping and let the
> > > gup slow path do @pgmap synchronization [1].
> > >
> > > To achieve that it means that new user mappings need to stop being
> > > created and all existing user mappings need to be invalidated.
> > >
> > > For device-dax this is already the case as kill_dax() prevents future
> > > faults from installing a pte, and the single device-dax inode
> > > address_space can be trivially unmapped.
> > >
> > > The situation is different for filesystem-dax where device pages could
> > > be mapped by any number of inode address_space instances. An initial
> > > thought was to treat the device removal event like a drop_pagecache_sb()
> > > event that walks superblocks and unmaps all inodes. However, Dave points
> > > out that it is not just the filesystem user-mappings that need to react
> > > to global DAX page-unmap events, it is also filesystem metadata
> > > (proposed DAX metadata access), and other drivers (upstream
> > > DM-writecache) that need to react to this event [2].
> > >
> > > The only kernel facility that is meant to globally broadcast the loss of
> > > a page (via corruption or surprise remove) is memory_failure(). The
> > > downside of memory_failure() is that it is a pfn-at-a-time interface.
> > > However, the events that would trigger the need to call memory_failure()
> > > over a full PMEM device should be rare.
> >
> > This is a highly suboptimal design. Filesystems only need a single
> > callout to trigger a shutdown that unmaps every active mapping in
> > the filesystem - we do not need a page-by-page error notification
> > which results in 250 million hwposion callouts per TB of pmem to do
> > this.
> >
> > Indeed, the moment we get the first hwpoison from this patch, we'll
> > map it to the primary XFS superblock and we'd almost certainly
> > consider losing the storage behind that block to be a shut down
> > trigger. During the shutdown, the filesystem should unmap all the
> > active mappings (we already need to add this to shutdown on DAX
> > regardless of this device remove issue) and so we really don't need
> > a page-by-page notification of badness.
> 
> XFS doesn't, but what about device-mapper and other agents? Even if
> the driver had a callback up the stack memory_failure() still needs to
> be able to trigger failures down the stack for CPU consumed poison.

If the device is gone, then they don't need page by page
notifucation, either. Tell them the entire device is gone so they
can do what they need (like pass it up to the filesystem as ranges
of badness!).

> > AFAICT, it's going to take minutes, maybe hours for do the page-by-page
> > iteration to hwposion every page. It's going to take a few seconds
> > for the filesystem shutdown to run a device wide invalidation.
> >
> > SO, yeah, I think this should simply be a single ranged call to the
> > filesystem like:
> >
> > ->memory_failure(dev, 0, -1ULL)
> >
> > to tell the filesystem that the entire backing device has gone away,
> > and leave the filesystem to handle failure entirely at the
> > filesystem level.
> 
> So I went with memory_failure() after our discussion of all the other
> agents in the system that might care about these pfns going offline
> and relying on memory_failure() to route down to each of those. I.e.
> the "reuse the drop_pagecache_sb() model" idea was indeed
> insufficient.

Using drop_pagecache_sb is insufficient because filesystems have
more than just inode indexed caches that pmem failure may affect.
This is not an argument against a "knock everything down at once"
notification model, just that drop_pagecache_sb() is ...
insufficient to do what we need...

> Now I'm trying to reconcile the fact that platform
> poison handling will hit memory_failure() first and may not
> immediately reach the driver, if ever (see the perennially awkward
> firmware-first-mode error handling: ghes_handle_memory_failure()) . So
> even if the ->memory_failure(dev...) up call exists there is no
> guarantee it can get called for all poison before the memory_failure()
> down call happens. Which means regardless of whether
> ->memory_failure(dev...) exists memory_failure() needs to be able to
> do the right thing.

I don't see how a poor implementation of memory_failure in a driver
or hardware is even remotely relevant to the interface used to
notify the filesystem of a media or device failure. It sounds like
you are trying to use memory_failure() for 

Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

2021-03-18 Thread Dan Williams
On Wed, Mar 17, 2021 at 9:45 PM Darrick J. Wong  wrote:
>
> On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote:
> > Jason wondered why the get_user_pages_fast() path takes references on a
> > @pgmap object. The rationale was to protect against accessing a 'struct
> > page' that might be in the process of being removed by the driver, but
> > he rightly points out that should be solved the same way all gup-fast
> > synchronization is solved which is invalidate the mapping and let the
> > gup slow path do @pgmap synchronization [1].
> >
> > To achieve that it means that new user mappings need to stop being
> > created and all existing user mappings need to be invalidated.
> >
> > For device-dax this is already the case as kill_dax() prevents future
> > faults from installing a pte, and the single device-dax inode
> > address_space can be trivially unmapped.
> >
> > The situation is different for filesystem-dax where device pages could
> > be mapped by any number of inode address_space instances. An initial
> > thought was to treat the device removal event like a drop_pagecache_sb()
> > event that walks superblocks and unmaps all inodes. However, Dave points
> > out that it is not just the filesystem user-mappings that need to react
> > to global DAX page-unmap events, it is also filesystem metadata
> > (proposed DAX metadata access), and other drivers (upstream
> > DM-writecache) that need to react to this event [2].
> >
> > The only kernel facility that is meant to globally broadcast the loss of
> > a page (via corruption or surprise remove) is memory_failure(). The
> > downside of memory_failure() is that it is a pfn-at-a-time interface.
> > However, the events that would trigger the need to call memory_failure()
> > over a full PMEM device should be rare. Remove should always be
> > coordinated by the administrator with the filesystem. If someone force
> > removes a device from underneath a mounted filesystem the driver assumes
> > they have a good reason, or otherwise get to keep the pieces. Since
> > ->remove() callbacks can not fail the only option is to trigger the mass
> > memory_failure().
> >
> > The mechanism to determine whether memory_failure() triggers at
> > pmem->remove() time is whether the associated dax_device has an elevated
> > reference at @pgmap ->kill() time.
> >
> > With this in place the get_user_pages_fast() path can drop its
> > half-measure synchronization with an @pgmap reference.
> >
> > Link: http://lore.kernel.org/r/20210224010017.gq2643...@ziepe.ca [1]
> > Link: http://lore.kernel.org/r/20210302075736.gj4...@dread.disaster.area [2]
> > Reported-by: Jason Gunthorpe 
> > Cc: Dave Chinner 
> > Cc: Christoph Hellwig 
> > Cc: Shiyang Ruan 
> > Cc: Vishal Verma 
> > Cc: Dave Jiang 
> > Cc: Ira Weiny 
> > Cc: Matthew Wilcox 
> > Cc: Jan Kara 
> > Cc: Andrew Morton 
> > Cc: Naoya Horiguchi 
> > Cc: "Darrick J. Wong" 
> > Signed-off-by: Dan Williams 
> > ---
> >  drivers/dax/super.c  |   15 +++
> >  drivers/nvdimm/pmem.c|   10 +-
> >  drivers/nvdimm/pmem.h|1 +
> >  include/linux/dax.h  |5 +
> >  include/linux/memremap.h |5 +
> >  include/linux/mm.h   |3 +++
> >  mm/memory-failure.c  |   11 +--
> >  mm/memremap.c|   11 +++
> >  8 files changed, 58 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index 5fa6ae9dbc8b..5ebcedf4a68c 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -624,6 +624,21 @@ void put_dax(struct dax_device *dax_dev)
> >  }
> >  EXPORT_SYMBOL_GPL(put_dax);
> >
> > +bool dax_is_idle(struct dax_device *dax_dev)
> > +{
> > + struct inode *inode;
> > +
> > + if (!dax_dev)
> > + return true;
> > +
> > + WARN_ONCE(test_bit(DAXDEV_ALIVE, _dev->flags),
> > +   "dax idle check on live device.\n");
> > +
> > + inode = _dev->inode;
> > + return atomic_read(>i_count) < 2;
> > +}
> > +EXPORT_SYMBOL_GPL(dax_is_idle);
> > +
> >  /**
> >   * dax_get_by_host() - temporary lookup mechanism for filesystem-dax
> >   * @host: alternate name for the device registered by a dax driver
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index b8a85bfb2e95..e8822c9262ee 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -348,15 +348,21 @@ static void pmem_pagemap_kill(struct dev_pagemap 
> > *pgmap)
> >  {
> >   struct request_queue *q =
> >   container_of(pgmap->ref, struct request_queue, 
> > q_usage_counter);
> > + struct pmem_device *pmem = q->queuedata;
> >
> >   blk_freeze_queue_start(q);
> > + kill_dax(pmem->dax_dev);
> > + if (!dax_is_idle(pmem->dax_dev)) {
> > + dev_warn(pmem->dev,
> > +  "DAX active at remove, trigger mass memory 
> > failure\n");
> > + dev_pagemap_failure(pgmap);
> > + }
> >  }
> >
> >  static void pmem_release_disk(void 

Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

2021-03-18 Thread Dan Williams
On Wed, Mar 17, 2021 at 9:58 PM Dave Chinner  wrote:
>
> On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote:
> > Jason wondered why the get_user_pages_fast() path takes references on a
> > @pgmap object. The rationale was to protect against accessing a 'struct
> > page' that might be in the process of being removed by the driver, but
> > he rightly points out that should be solved the same way all gup-fast
> > synchronization is solved which is invalidate the mapping and let the
> > gup slow path do @pgmap synchronization [1].
> >
> > To achieve that it means that new user mappings need to stop being
> > created and all existing user mappings need to be invalidated.
> >
> > For device-dax this is already the case as kill_dax() prevents future
> > faults from installing a pte, and the single device-dax inode
> > address_space can be trivially unmapped.
> >
> > The situation is different for filesystem-dax where device pages could
> > be mapped by any number of inode address_space instances. An initial
> > thought was to treat the device removal event like a drop_pagecache_sb()
> > event that walks superblocks and unmaps all inodes. However, Dave points
> > out that it is not just the filesystem user-mappings that need to react
> > to global DAX page-unmap events, it is also filesystem metadata
> > (proposed DAX metadata access), and other drivers (upstream
> > DM-writecache) that need to react to this event [2].
> >
> > The only kernel facility that is meant to globally broadcast the loss of
> > a page (via corruption or surprise remove) is memory_failure(). The
> > downside of memory_failure() is that it is a pfn-at-a-time interface.
> > However, the events that would trigger the need to call memory_failure()
> > over a full PMEM device should be rare.
>
> This is a highly suboptimal design. Filesystems only need a single
> callout to trigger a shutdown that unmaps every active mapping in
> the filesystem - we do not need a page-by-page error notification
> which results in 250 million hwposion callouts per TB of pmem to do
> this.
>
> Indeed, the moment we get the first hwpoison from this patch, we'll
> map it to the primary XFS superblock and we'd almost certainly
> consider losing the storage behind that block to be a shut down
> trigger. During the shutdown, the filesystem should unmap all the
> active mappings (we already need to add this to shutdown on DAX
> regardless of this device remove issue) and so we really don't need
> a page-by-page notification of badness.

XFS doesn't, but what about device-mapper and other agents? Even if
the driver had a callback up the stack memory_failure() still needs to
be able to trigger failures down the stack for CPU consumed poison.

>
> AFAICT, it's going to take minutes, maybe hours for do the page-by-page
> iteration to hwposion every page. It's going to take a few seconds
> for the filesystem shutdown to run a device wide invalidation.
>
> SO, yeah, I think this should simply be a single ranged call to the
> filesystem like:
>
> ->memory_failure(dev, 0, -1ULL)
>
> to tell the filesystem that the entire backing device has gone away,
> and leave the filesystem to handle failure entirely at the
> filesystem level.

So I went with memory_failure() after our discussion of all the other
agents in the system that might care about these pfns going offline
and relying on memory_failure() to route down to each of those. I.e.
the "reuse the drop_pagecache_sb() model" idea was indeed
insufficient. Now I'm trying to reconcile the fact that platform
poison handling will hit memory_failure() first and may not
immediately reach the driver, if ever (see the perennially awkward
firmware-first-mode error handling: ghes_handle_memory_failure()) . So
even if the ->memory_failure(dev...) up call exists there is no
guarantee it can get called for all poison before the memory_failure()
down call happens. Which means regardless of whether
->memory_failure(dev...) exists memory_failure() needs to be able to
do the right thing.

Combine that with the fact that new buses like CXL might be configured
in "poison on decode error" mode which means that a memory_failure()
storm can happen regardless of whether the driver initiates it
programatically.

How about a mechanism to optionally let a filesystem take over memory
failure handling for a range of pfns that the memory_failure() can
consult to fail ranges at a time rather than one by one? So a new
'struct dax_operations' op (void) (*memory_failure_register(struct
dax_device *, void *data). Where any agent that claims a dax_dev can
register to take over memory_failure() handling for any event that
happens in that range. This would be routed through device-mapper like
any other 'struct dax_operations' op. I think that meets your
requirement to get notifications of all the events you want to handle,
but still allows memory_failure() to be the last resort for everything
that has not opted into this error handling.

Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

2021-03-17 Thread Dave Chinner
On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote:
> Jason wondered why the get_user_pages_fast() path takes references on a
> @pgmap object. The rationale was to protect against accessing a 'struct
> page' that might be in the process of being removed by the driver, but
> he rightly points out that should be solved the same way all gup-fast
> synchronization is solved which is invalidate the mapping and let the
> gup slow path do @pgmap synchronization [1].
> 
> To achieve that it means that new user mappings need to stop being
> created and all existing user mappings need to be invalidated.
> 
> For device-dax this is already the case as kill_dax() prevents future
> faults from installing a pte, and the single device-dax inode
> address_space can be trivially unmapped.
> 
> The situation is different for filesystem-dax where device pages could
> be mapped by any number of inode address_space instances. An initial
> thought was to treat the device removal event like a drop_pagecache_sb()
> event that walks superblocks and unmaps all inodes. However, Dave points
> out that it is not just the filesystem user-mappings that need to react
> to global DAX page-unmap events, it is also filesystem metadata
> (proposed DAX metadata access), and other drivers (upstream
> DM-writecache) that need to react to this event [2].
> 
> The only kernel facility that is meant to globally broadcast the loss of
> a page (via corruption or surprise remove) is memory_failure(). The
> downside of memory_failure() is that it is a pfn-at-a-time interface.
> However, the events that would trigger the need to call memory_failure()
> over a full PMEM device should be rare.

This is a highly suboptimal design. Filesystems only need a single
callout to trigger a shutdown that unmaps every active mapping in
the filesystem - we do not need a page-by-page error notification
which results in 250 million hwposion callouts per TB of pmem to do
this.

Indeed, the moment we get the first hwpoison from this patch, we'll
map it to the primary XFS superblock and we'd almost certainly
consider losing the storage behind that block to be a shut down
trigger. During the shutdown, the filesystem should unmap all the
active mappings (we already need to add this to shutdown on DAX
regardless of this device remove issue) and so we really don't need
a page-by-page notification of badness.

AFAICT, it's going to take minutes, maybe hours for do the page-by-page
iteration to hwposion every page. It's going to take a few seconds
for the filesystem shutdown to run a device wide invalidation.

SO, yeah, I think this should simply be a single ranged call to the
filesystem like:

->memory_failure(dev, 0, -1ULL)

to tell the filesystem that the entire backing device has gone away,
and leave the filesystem to handle failure entirely at the
filesystem level.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

2021-03-17 Thread Darrick J. Wong
On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote:
> Jason wondered why the get_user_pages_fast() path takes references on a
> @pgmap object. The rationale was to protect against accessing a 'struct
> page' that might be in the process of being removed by the driver, but
> he rightly points out that should be solved the same way all gup-fast
> synchronization is solved which is invalidate the mapping and let the
> gup slow path do @pgmap synchronization [1].
> 
> To achieve that it means that new user mappings need to stop being
> created and all existing user mappings need to be invalidated.
> 
> For device-dax this is already the case as kill_dax() prevents future
> faults from installing a pte, and the single device-dax inode
> address_space can be trivially unmapped.
> 
> The situation is different for filesystem-dax where device pages could
> be mapped by any number of inode address_space instances. An initial
> thought was to treat the device removal event like a drop_pagecache_sb()
> event that walks superblocks and unmaps all inodes. However, Dave points
> out that it is not just the filesystem user-mappings that need to react
> to global DAX page-unmap events, it is also filesystem metadata
> (proposed DAX metadata access), and other drivers (upstream
> DM-writecache) that need to react to this event [2].
> 
> The only kernel facility that is meant to globally broadcast the loss of
> a page (via corruption or surprise remove) is memory_failure(). The
> downside of memory_failure() is that it is a pfn-at-a-time interface.
> However, the events that would trigger the need to call memory_failure()
> over a full PMEM device should be rare. Remove should always be
> coordinated by the administrator with the filesystem. If someone force
> removes a device from underneath a mounted filesystem the driver assumes
> they have a good reason, or otherwise get to keep the pieces. Since
> ->remove() callbacks can not fail the only option is to trigger the mass
> memory_failure().
> 
> The mechanism to determine whether memory_failure() triggers at
> pmem->remove() time is whether the associated dax_device has an elevated
> reference at @pgmap ->kill() time.
> 
> With this in place the get_user_pages_fast() path can drop its
> half-measure synchronization with an @pgmap reference.
> 
> Link: http://lore.kernel.org/r/20210224010017.gq2643...@ziepe.ca [1]
> Link: http://lore.kernel.org/r/20210302075736.gj4...@dread.disaster.area [2]
> Reported-by: Jason Gunthorpe 
> Cc: Dave Chinner 
> Cc: Christoph Hellwig 
> Cc: Shiyang Ruan 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Ira Weiny 
> Cc: Matthew Wilcox 
> Cc: Jan Kara 
> Cc: Andrew Morton 
> Cc: Naoya Horiguchi 
> Cc: "Darrick J. Wong" 
> Signed-off-by: Dan Williams 
> ---
>  drivers/dax/super.c  |   15 +++
>  drivers/nvdimm/pmem.c|   10 +-
>  drivers/nvdimm/pmem.h|1 +
>  include/linux/dax.h  |5 +
>  include/linux/memremap.h |5 +
>  include/linux/mm.h   |3 +++
>  mm/memory-failure.c  |   11 +--
>  mm/memremap.c|   11 +++
>  8 files changed, 58 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 5fa6ae9dbc8b..5ebcedf4a68c 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -624,6 +624,21 @@ void put_dax(struct dax_device *dax_dev)
>  }
>  EXPORT_SYMBOL_GPL(put_dax);
>  
> +bool dax_is_idle(struct dax_device *dax_dev)
> +{
> + struct inode *inode;
> +
> + if (!dax_dev)
> + return true;
> +
> + WARN_ONCE(test_bit(DAXDEV_ALIVE, _dev->flags),
> +   "dax idle check on live device.\n");
> +
> + inode = _dev->inode;
> + return atomic_read(>i_count) < 2;
> +}
> +EXPORT_SYMBOL_GPL(dax_is_idle);
> +
>  /**
>   * dax_get_by_host() - temporary lookup mechanism for filesystem-dax
>   * @host: alternate name for the device registered by a dax driver
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index b8a85bfb2e95..e8822c9262ee 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -348,15 +348,21 @@ static void pmem_pagemap_kill(struct dev_pagemap *pgmap)
>  {
>   struct request_queue *q =
>   container_of(pgmap->ref, struct request_queue, q_usage_counter);
> + struct pmem_device *pmem = q->queuedata;
>  
>   blk_freeze_queue_start(q);
> + kill_dax(pmem->dax_dev);
> + if (!dax_is_idle(pmem->dax_dev)) {
> + dev_warn(pmem->dev,
> +  "DAX active at remove, trigger mass memory failure\n");
> + dev_pagemap_failure(pgmap);
> + }
>  }
>  
>  static void pmem_release_disk(void *__pmem)
>  {
>   struct pmem_device *pmem = __pmem;
>  
> - kill_dax(pmem->dax_dev);
>   put_dax(pmem->dax_dev);
>   del_gendisk(pmem->disk);
>   put_disk(pmem->disk);
> @@ -406,6 +412,7 @@ static int pmem_attach_disk(struct device *dev,
>   

[PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

2021-03-17 Thread Dan Williams
Jason wondered why the get_user_pages_fast() path takes references on a
@pgmap object. The rationale was to protect against accessing a 'struct
page' that might be in the process of being removed by the driver, but
he rightly points out that should be solved the same way all gup-fast
synchronization is solved which is invalidate the mapping and let the
gup slow path do @pgmap synchronization [1].

To achieve that it means that new user mappings need to stop being
created and all existing user mappings need to be invalidated.

For device-dax this is already the case as kill_dax() prevents future
faults from installing a pte, and the single device-dax inode
address_space can be trivially unmapped.

The situation is different for filesystem-dax where device pages could
be mapped by any number of inode address_space instances. An initial
thought was to treat the device removal event like a drop_pagecache_sb()
event that walks superblocks and unmaps all inodes. However, Dave points
out that it is not just the filesystem user-mappings that need to react
to global DAX page-unmap events, it is also filesystem metadata
(proposed DAX metadata access), and other drivers (upstream
DM-writecache) that need to react to this event [2].

The only kernel facility that is meant to globally broadcast the loss of
a page (via corruption or surprise remove) is memory_failure(). The
downside of memory_failure() is that it is a pfn-at-a-time interface.
However, the events that would trigger the need to call memory_failure()
over a full PMEM device should be rare. Remove should always be
coordinated by the administrator with the filesystem. If someone force
removes a device from underneath a mounted filesystem the driver assumes
they have a good reason, or otherwise get to keep the pieces. Since
->remove() callbacks can not fail the only option is to trigger the mass
memory_failure().

The mechanism to determine whether memory_failure() triggers at
pmem->remove() time is whether the associated dax_device has an elevated
reference at @pgmap ->kill() time.

With this in place the get_user_pages_fast() path can drop its
half-measure synchronization with an @pgmap reference.

Link: http://lore.kernel.org/r/20210224010017.gq2643...@ziepe.ca [1]
Link: http://lore.kernel.org/r/20210302075736.gj4...@dread.disaster.area [2]
Reported-by: Jason Gunthorpe 
Cc: Dave Chinner 
Cc: Christoph Hellwig 
Cc: Shiyang Ruan 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Ira Weiny 
Cc: Matthew Wilcox 
Cc: Jan Kara 
Cc: Andrew Morton 
Cc: Naoya Horiguchi 
Cc: "Darrick J. Wong" 
Signed-off-by: Dan Williams 
---
 drivers/dax/super.c  |   15 +++
 drivers/nvdimm/pmem.c|   10 +-
 drivers/nvdimm/pmem.h|1 +
 include/linux/dax.h  |5 +
 include/linux/memremap.h |5 +
 include/linux/mm.h   |3 +++
 mm/memory-failure.c  |   11 +--
 mm/memremap.c|   11 +++
 8 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 5fa6ae9dbc8b..5ebcedf4a68c 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -624,6 +624,21 @@ void put_dax(struct dax_device *dax_dev)
 }
 EXPORT_SYMBOL_GPL(put_dax);
 
+bool dax_is_idle(struct dax_device *dax_dev)
+{
+   struct inode *inode;
+
+   if (!dax_dev)
+   return true;
+
+   WARN_ONCE(test_bit(DAXDEV_ALIVE, _dev->flags),
+ "dax idle check on live device.\n");
+
+   inode = _dev->inode;
+   return atomic_read(>i_count) < 2;
+}
+EXPORT_SYMBOL_GPL(dax_is_idle);
+
 /**
  * dax_get_by_host() - temporary lookup mechanism for filesystem-dax
  * @host: alternate name for the device registered by a dax driver
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index b8a85bfb2e95..e8822c9262ee 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -348,15 +348,21 @@ static void pmem_pagemap_kill(struct dev_pagemap *pgmap)
 {
struct request_queue *q =
container_of(pgmap->ref, struct request_queue, q_usage_counter);
+   struct pmem_device *pmem = q->queuedata;
 
blk_freeze_queue_start(q);
+   kill_dax(pmem->dax_dev);
+   if (!dax_is_idle(pmem->dax_dev)) {
+   dev_warn(pmem->dev,
+"DAX active at remove, trigger mass memory failure\n");
+   dev_pagemap_failure(pgmap);
+   }
 }
 
 static void pmem_release_disk(void *__pmem)
 {
struct pmem_device *pmem = __pmem;
 
-   kill_dax(pmem->dax_dev);
put_dax(pmem->dax_dev);
del_gendisk(pmem->disk);
put_disk(pmem->disk);
@@ -406,6 +412,7 @@ static int pmem_attach_disk(struct device *dev,
devm_namespace_disable(dev, ndns);
 
dev_set_drvdata(dev, pmem);
+   pmem->dev = dev;
pmem->phys_addr = res->start;
pmem->size = resource_size(res);
fua = nvdimm_has_flush(nd_region);
@@ -467,6 +474,7 @@ static int pmem_attach_disk(struct device