Re: [PATCH] acpi/nfit: badrange report spill over to clean range

2022-07-14 Thread Dan Williams
Jane Chu wrote:
> I meant to say there would be 8 calls to the nfit_handle_mce() callback,
> one call for each poison with accurate address.
> 
> Also, short ARS would find 2 poisons.
> 
> I attached the console output, my annotation is prefixed with "<==".

[29078.634817] {4}[Hardware Error]:   physical_address: 0x0040a0602600  
<== 2nd poison @ 0x600
[29078.642200] {4}[Hardware Error]:   physical_address_mask: 0xff00

Why is nfit_handle_mce() seeing a 4K address mask when the CPER record
shows a 256-byte address mask?

Sigh, is this "firmware-first" causing the kernel to get bad information
via the native mechanisms?

I would expect that if this test was truly worried about minimizing BIOS
latency it would disable firmware-first error reporting. I wonder if
that fixes the observed problem?



Re: [PATCH] acpi/nfit: badrange report spill over to clean range

2022-07-14 Thread Dan Williams
Jane Chu wrote:
> On 7/13/2022 5:24 PM, Dan Williams wrote:
> > Jane Chu wrote:
> >> On 7/12/2022 5:48 PM, Dan Williams wrote:
> >>> Jane Chu wrote:
>  Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
>  granularity") changed the nfit_handle_mce() callback to report badrange
>  for each poison at an alignment indicated by
>  1ULL << MCI_MISC_ADDR_LSB(mce->misc) instead of the hardcoded
>  L1_CACHE_BYTES. However, recently on a server populated with Intel
>  DCPMEM v2 dimms, it appears that 1UL << MCI_MISC_ADDR_LSB(mce->misc)
>  turns out to be 4KiB, or 8 512-byte blocks. Consequently, injecting 2
>  back-to-back poisons via ndctl results in a report of 8 poisons.
> 
>  [29076.590281] {3}[Hardware Error]:   physical_address: 0x0040a0602400
>  [..]
>  [29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
>  [29076.627519] mce: [Hardware Error]: Machine check events logged
>  [29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x408000, 0x1f8000)
>  [29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
>  [..]
>  [29078.634817] {4}[Hardware Error]:   physical_address: 0x0040a0602600
>  [..]
>  [29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x408000, 0x1f8000)
>  [29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
>  [..]
>  [..]
>  {
>  "dev":"namespace0.0",
>  "mode":"fsdax",
>  "map":"dev",
>  "size":33820770304,
>  "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
>  "sector_size":512,
>  "align":2097152,
>  "blockdev":"pmem0",
>  "badblock_count":8,
>  "badblocks":[
>    {
>  "offset":8208,
>  "length":8,
>  "dimms":[
>    "nmem0"
>  ]
>    }
>  ]
>  }
> 
>  So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator for
>  poison radius and shouldn't be used.  Moreover, as each injected poison
>  is reported independently, any alignment under 512 bytes appears to
>  work: L1_CACHE_BYTES (though inaccurate), or 256 bytes (as ars->length
>  reports), or 512 bytes.
> 
>  To get around this issue, 512 bytes is chosen as the alignment because
>  a. it happens to be the badblock granularity,
>  b. ndctl inject-error cannot inject more than one poison into a
>  512-byte block,
>  c. it is architecture agnostic.
> >>>
> >>> I am failing to see the kernel bug? Yes, you injected less than 8
> >>> "badblocks" of poison and the hardware reported 8 blocks of poison, but
> >>> that's not the kernel's fault, that's the hardware. What happens when
> >>> hardware really does detect 8 blocks of consecutive poison and this
> >>> implementation decides to only record 1 at a time?
> >>
> >> In that case, there will be 8 reports of the poisons by APEI GHES,
> > 
> > Why would there be 8 reports for just one poison consumption event?
> 
> I meant to say there would be 8 calls to the nfit_handle_mce() callback,
> one call for each poison with accurate address.
> 
> Also, short ARS would find 2 poisons.
> 
> I attached the console output, my annotation is prefixed with "<==".
> 
> It is from this information that I concluded no poison will be missed
> in terms of reporting.
> 
> > 
> >> ARS scan will also report 8 poisons, each will get to be added to the
> >> bad range via nvdimm_bus_add_badrange(), so none of them will be missed.
> > 
> > Right, that's what I'm saying about the proposed change, trim the
> > reported poison by what is returned from a "short" ARS. Recall that
> > short-ARS just reads from a staging buffer that the BIOS knows about, it
> > need not go all the way to hardware.
> 
> Okay, that confirms my understanding of your proposal. More below.
> 
> > 
> >> In the above 2 poison example, the poison in 0x0040a0602400 and in
> >> 0x0040a0602600 were separately reported.
> > 
> > Separately reported, each with a 4K alignment?
> 
> Yes, and so nfit_handle_mce() calls nvdimm_bus_add_badrange() twice
> with addr: 0x40a0602000, length: 0x1000, a complete overlap.
> 
> > 
> >>> It seems the fix you want is for the hardware to report the precise
> >>> error bounds and that 1UL << MCI_MISC_ADDR_LSB(mce->misc) does not have
> >>> that precision in this case.
> >>
> >> That field describes a 4K range even for a single poison, which
> >> confuses people unnecessarily.
> > 
> > I agree with you on the problem statement, it's the fix where I have
> > questions.
> > 
> >>> However, the ARS engine likely can return the precise error ranges so I
> >>> think the fix is to just use the address range indicated by 1UL <<
> >>> MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
>  scrub request to ask the device for the precise error list.

Re: [PATCH] acpi/nfit: badrange report spill over to clean range

2022-07-14 Thread Jane Chu
On 7/13/2022 5:24 PM, Dan Williams wrote:
> Jane Chu wrote:
>> On 7/12/2022 5:48 PM, Dan Williams wrote:
>>> Jane Chu wrote:
 Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
 granularity") changed the nfit_handle_mce() callback to report badrange
 for each poison at an alignment indicated by
 1ULL << MCI_MISC_ADDR_LSB(mce->misc) instead of the hardcoded
 L1_CACHE_BYTES. However, recently on a server populated with Intel
 DCPMEM v2 dimms, it appears that 1UL << MCI_MISC_ADDR_LSB(mce->misc)
 turns out to be 4KiB, or 8 512-byte blocks. Consequently, injecting 2
 back-to-back poisons via ndctl results in a report of 8 poisons.

 [29076.590281] {3}[Hardware Error]:   physical_address: 0x0040a0602400
 [..]
 [29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
 [29076.627519] mce: [Hardware Error]: Machine check events logged
 [29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x408000, 0x1f8000)
 [29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
 [..]
 [29078.634817] {4}[Hardware Error]:   physical_address: 0x0040a0602600
 [..]
 [29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x408000, 0x1f8000)
 [29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
 [..]
 [..]
 {
 "dev":"namespace0.0",
 "mode":"fsdax",
 "map":"dev",
 "size":33820770304,
 "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
 "sector_size":512,
 "align":2097152,
 "blockdev":"pmem0",
 "badblock_count":8,
 "badblocks":[
   {
 "offset":8208,
 "length":8,
 "dimms":[
   "nmem0"
 ]
   }
 ]
 }

 So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator for
 poison radius and shouldn't be used.  Moreover, as each injected poison
 is reported independently, any alignment under 512 bytes appears to
 work: L1_CACHE_BYTES (though inaccurate), or 256 bytes (as ars->length
 reports), or 512 bytes.

 To get around this issue, 512 bytes is chosen as the alignment because
 a. it happens to be the badblock granularity,
 b. ndctl inject-error cannot inject more than one poison into a
 512-byte block,
 c. it is architecture agnostic.
>>>
>>> I am failing to see the kernel bug? Yes, you injected less than 8
>>> "badblocks" of poison and the hardware reported 8 blocks of poison, but
>>> that's not the kernel's fault, that's the hardware. What happens when
>>> hardware really does detect 8 blocks of consecutive poison and this
>>> implementation decides to only record 1 at a time?
>>
>> In that case, there will be 8 reports of the poisons by APEI GHES,
> 
> Why would there be 8 reports for just one poison consumption event?

I meant to say there would be 8 calls to the nfit_handle_mce() callback,
one call for each poison with accurate address.

Also, short ARS would find 2 poisons.

I attached the console output, my annotation is prefixed with "<==".

It is from this information that I concluded no poison will be missed
in terms of reporting.

> 
>> ARS scan will also report 8 poisons, each will get to be added to the
>> bad range via nvdimm_bus_add_badrange(), so none of them will be missed.
> 
> Right, that's what I'm saying about the proposed change, trim the
> reported poison by what is returned from a "short" ARS. Recall that
> short-ARS just reads from a staging buffer that the BIOS knows about, it
> need not go all the way to hardware.

Okay, that confirms my understanding of your proposal. More below.

> 
>> In the above 2 poison example, the poison in 0x0040a0602400 and in
>> 0x0040a0602600 were separately reported.
> 
> Separately reported, each with a 4K alignment?

Yes, and so nfit_handle_mce() calls nvdimm_bus_add_badrange() twice
with addr: 0x40a0602000, length: 0x1000, a complete overlap.

> 
>>> It seems the fix you want is for the hardware to report the precise
>>> error bounds and that 1UL << MCI_MISC_ADDR_LSB(mce->misc) does not have
>>> that precision in this case.
>>
>> That field describes a 4K range even for a single poison, which
>> confuses people unnecessarily.
> 
> I agree with you on the problem statement, it's the fix where I have
> questions.
> 
>>> However, the ARS engine likely can return the precise error ranges so I
>>> think the fix is to just use the address range indicated by 1UL <<
>>> MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
>>> scrub request to ask the device for the precise error list.
>>
>> You mean for the nfit_handle_mce() callback to issue a short ARS for
>> each poison report over a 4K range
> 
> Over a L1_CACHE_BYTES range...
> 
>> in order to decide the precise range as a workaround for the hardware
>> issue?  If there are 8 poisons detected, 

Re: [PATCH v1 1/1] nvdimm/namespace: return uuid_null only once in nd_dev_to_uuid()

2022-07-14 Thread Dan Williams
Andy Shevchenko wrote:
> On Thu, Jul 14, 2022 at 11:24:05AM -0700, Dan Williams wrote:
> > Andy Shevchenko wrote:
> > > Refactor nd_dev_to_uuid() in order to make code shorter and cleaner
> > > by joining conditions and hence returning uuid_null only once.
> > 
> > Apologies for the delay, applied for v5.20.
> 
> No problem and thanks!
> 
> P.S. One patch out of three is a fix, would be nice to have it in v5.19
> release.

Found it, applied it.



RE: [PATCH v1 1/1] nvdimm/namespace: drop nested variable in create_namespace_pmem()

2022-07-14 Thread Dan Williams
Andy Shevchenko wrote:
> Kernel build bot reported:
> 
>   namespace_devs.c:1991:10: warning: Local variable 'uuid' shadows outer variable [shadowVariable]
> 
> Refactor create_namespace_pmem() by dropping a nested version of
> the same variable.

Applied for 5.19-rc.

Thanks!



Re: [PATCH v1 1/1] nvdimm/namespace: return uuid_null only once in nd_dev_to_uuid()

2022-07-14 Thread Andy Shevchenko
On Thu, Jul 14, 2022 at 11:24:05AM -0700, Dan Williams wrote:
> Andy Shevchenko wrote:
> > Refactor nd_dev_to_uuid() in order to make code shorter and cleaner
> > by joining conditions and hence returning uuid_null only once.
> 
> Apologies for the delay, applied for v5.20.

No problem and thanks!

P.S. One patch out of three is a fix, would be nice to have it in v5.19
release.

-- 
With Best Regards,
Andy Shevchenko





RE: [PATCH v1 1/1] nvdimm/namespace: drop unneeded temporary variable in size_store()

2022-07-14 Thread Dan Williams
Andy Shevchenko wrote:
> Refactor size_store() in order to remove temporary variable on stack
> by joining conditionals.

Looks good, applied.



RE: [PATCH v1 1/1] nvdimm/namespace: return uuid_null only once in nd_dev_to_uuid()

2022-07-14 Thread Dan Williams
Andy Shevchenko wrote:
> Refactor nd_dev_to_uuid() in order to make code shorter and cleaner
> by joining conditions and hence returning uuid_null only once.

Apologies for the delay, applied for v5.20.



RE: [RFC PATCH v6] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-07-14 Thread Dan Williams
ruansy.f...@fujitsu.com wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask the filesystem
> (or mapped device) on it to unmap all files in use and notify processes
> who are using those files.
> 
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>-> devres_release_all()   # was pmem driver ->remove() in v1
> -> kill_dax()
>  -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>   -> xfs_dax_notify_failure()
> 
> Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
> event.  So do not shut down the filesystem directly if something is not
> supported, or if the failure range includes the metadata area.  Make sure all
> files and processes are handled correctly.
> 
> ==
> Changes since v5:
>   1. Renamed MF_MEM_REMOVE to MF_MEM_PRE_REMOVE
>   2. hold s_umount before sync_filesystem()
>   3. move sync_filesystem() after SB_BORN check
>   4. Rebased on next-20220714
> 
> Changes since v4:
>   1. sync_filesystem() at the beginning when MF_MEM_REMOVE
>   2. Rebased on next-20220706
> 
> [1]: 
> https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  drivers/dax/super.c |  3 ++-
>  fs/xfs/xfs_notify_failure.c | 15 +++
>  include/linux/mm.h  |  1 +
>  3 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 9b5e2a5eb0ae..cf9a64563fbe 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>   return;
>  
>   if (dax_dev->holder_data != NULL)
> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> + MF_MEM_PRE_REMOVE);
>  
>   clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>   synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 69d9c83ea4b2..6da6747435eb 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -76,6 +76,9 @@ xfs_dax_failure_fn(
>  
>   if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>   (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + /* Do not shutdown so early when device is to be removed */
> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
>   xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>   return -EFSCORRUPTED;
>   }
> @@ -174,12 +177,22 @@ xfs_dax_notify_failure(
>   struct xfs_mount*mp = dax_holder(dax_dev);
>   u64 ddev_start;
>   u64 ddev_end;
> + int error;
>  
>   if (!(mp->m_sb.sb_flags & SB_BORN)) {
>   xfs_warn(mp, "filesystem is not ready for notify_failure()!");
>   return -EIO;
>   }
>  
> + if (mf_flags & MF_MEM_PRE_REMOVE) {
> + xfs_info(mp, "device is about to be removed!");
> + down_write(&mp->m_super->s_umount);
> + error = sync_filesystem(mp->m_super);
> + up_write(&mp->m_super->s_umount);

Are all mappings invalidated after this point?

The goal of the removal notification is to invalidate all DAX mappings
that are now pointing to pfns that do not exist anymore, so just syncing
does not seem like enough, and the shutdown is skipped above. What am I
missing?

Notice that kill_dev_dax() does unmap_mapping_range() after invalidating
the dax device and that ensures that all existing mappings are gone and
cannot be re-established. As far as I can see a process with an existing
dax mapping will still be able to use it after this runs, no?



Re: [RFC PATCH v6] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-07-14 Thread Darrick J. Wong
On Thu, Jul 14, 2022 at 10:34:29AM +, ruansy.f...@fujitsu.com wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask the filesystem
> (or mapped device) on it to unmap all files in use and notify processes
> who are using those files.
> 
> Call trace:
> trigger unbind
>  -> unbind_store()
>   -> ... (skip)
>-> devres_release_all()   # was pmem driver ->remove() in v1
> -> kill_dax()
>  -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>   -> xfs_dax_notify_failure()
> 
> Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
> event.  So do not shut down the filesystem directly if something is not
> supported, or if the failure range includes the metadata area.  Make sure all
> files and processes are handled correctly.
> 
> ==
> Changes since v5:
>   1. Renamed MF_MEM_REMOVE to MF_MEM_PRE_REMOVE
>   2. hold s_umount before sync_filesystem()
>   3. move sync_filesystem() after SB_BORN check
>   4. Rebased on next-20220714
> 
> Changes since v4:
>   1. sync_filesystem() at the beginning when MF_MEM_REMOVE
>   2. Rebased on next-20220706
> 
> [1]: 
> https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/
> 
> Signed-off-by: Shiyang Ruan 

Looks reasonable to me now,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  drivers/dax/super.c |  3 ++-
>  fs/xfs/xfs_notify_failure.c | 15 +++
>  include/linux/mm.h  |  1 +
>  3 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 9b5e2a5eb0ae..cf9a64563fbe 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>   return;
>  
>   if (dax_dev->holder_data != NULL)
> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> + MF_MEM_PRE_REMOVE);
>  
>   clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>   synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 69d9c83ea4b2..6da6747435eb 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -76,6 +76,9 @@ xfs_dax_failure_fn(
>  
>   if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>   (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + /* Do not shutdown so early when device is to be removed */
> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
>   xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>   return -EFSCORRUPTED;
>   }
> @@ -174,12 +177,22 @@ xfs_dax_notify_failure(
>   struct xfs_mount*mp = dax_holder(dax_dev);
>   u64 ddev_start;
>   u64 ddev_end;
> + int error;
>  
>   if (!(mp->m_sb.sb_flags & SB_BORN)) {
>   xfs_warn(mp, "filesystem is not ready for notify_failure()!");
>   return -EIO;
>   }
>  
> + if (mf_flags & MF_MEM_PRE_REMOVE) {
> + xfs_info(mp, "device is about to be removed!");
> + down_write(&mp->m_super->s_umount);
> + error = sync_filesystem(mp->m_super);
> + up_write(&mp->m_super->s_umount);
> + if (error)
> + return error;
> + }
> +
>   if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
>   xfs_warn(mp,
>"notify_failure() not supported on realtime device!");
> @@ -188,6 +201,8 @@ xfs_dax_notify_failure(
>  
>   if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>   mp->m_logdev_targp != mp->m_ddev_targp) {
> + if (mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
>   xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>   xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>   return -EFSCORRUPTED;
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 4287bec50c28..2ddfb76c8a83 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3188,6 +3188,7 @@ enum mf_flags {
>   MF_SOFT_OFFLINE = 1 << 3,
>   MF_UNPOISON = 1 << 4,
>   MF_SW_SIMULATED = 1 << 5,
> + MF_MEM_PRE_REMOVE = 1 << 6,
>  };
>  int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> unsigned long count, int mf_flags);
> -- 
> 2.37.0



[RFC PATCH v6] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind

2022-07-14 Thread ruansy.f...@fujitsu.com
This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1].  With the help of dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask the filesystem
(or mapped device) on it to unmap all files in use and notify processes
who are using those files.

Call trace:
trigger unbind
 -> unbind_store()
  -> ... (skip)
   -> devres_release_all()   # was pmem driver ->remove() in v1
-> kill_dax()
 -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
  -> xfs_dax_notify_failure()

Introduce MF_MEM_PRE_REMOVE to let the filesystem know this is a remove
event.  So do not shut down the filesystem directly if something is not
supported, or if the failure range includes the metadata area.  Make sure all
files and processes are handled correctly.

==
Changes since v5:
  1. Renamed MF_MEM_REMOVE to MF_MEM_PRE_REMOVE
  2. hold s_umount before sync_filesystem()
  3. move sync_filesystem() after SB_BORN check
  4. Rebased on next-20220714

Changes since v4:
  1. sync_filesystem() at the beginning when MF_MEM_REMOVE
  2. Rebased on next-20220706

[1]: 
https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.st...@dwillia2-desk3.amr.corp.intel.com/

Signed-off-by: Shiyang Ruan 
---
 drivers/dax/super.c |  3 ++-
 fs/xfs/xfs_notify_failure.c | 15 +++
 include/linux/mm.h  |  1 +
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 9b5e2a5eb0ae..cf9a64563fbe 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
return;
 
if (dax_dev->holder_data != NULL)
-   dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+   dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+   MF_MEM_PRE_REMOVE);
 
clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 69d9c83ea4b2..6da6747435eb 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -76,6 +76,9 @@ xfs_dax_failure_fn(
 
if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+   /* Do not shutdown so early when device is to be removed */
+   if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+   return 0;
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
}
@@ -174,12 +177,22 @@ xfs_dax_notify_failure(
struct xfs_mount*mp = dax_holder(dax_dev);
u64 ddev_start;
u64 ddev_end;
+   int error;
 
if (!(mp->m_sb.sb_flags & SB_BORN)) {
xfs_warn(mp, "filesystem is not ready for notify_failure()!");
return -EIO;
}
 
+   if (mf_flags & MF_MEM_PRE_REMOVE) {
+   xfs_info(mp, "device is about to be removed!");
+   down_write(&mp->m_super->s_umount);
+   error = sync_filesystem(mp->m_super);
+   up_write(&mp->m_super->s_umount);
+   if (error)
+   return error;
+   }
+
if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
xfs_warn(mp,
 "notify_failure() not supported on realtime device!");
@@ -188,6 +201,8 @@ xfs_dax_notify_failure(
 
if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
mp->m_logdev_targp != mp->m_ddev_targp) {
+   if (mf_flags & MF_MEM_PRE_REMOVE)
+   return 0;
xfs_err(mp, "ondisk log corrupt, shutting down fs!");
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4287bec50c28..2ddfb76c8a83 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3188,6 +3188,7 @@ enum mf_flags {
MF_SOFT_OFFLINE = 1 << 3,
MF_UNPOISON = 1 << 4,
MF_SW_SIMULATED = 1 << 5,
+   MF_MEM_PRE_REMOVE = 1 << 6,
 };
 int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
  unsigned long count, int mf_flags);
-- 
2.37.0