Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dave Chinner
On Mon, Mar 01, 2021 at 09:41:02PM -0800, Dan Williams wrote:
> On Mon, Mar 1, 2021 at 7:28 PM Darrick J. Wong  wrote:
> > > > I really don't see why you seem to be telling us that invalidation
> > > > is an either/or choice. There are more ways to convert physical block
> > > > address -> inode file offset and mapping index than brute force
> > > > inode cache walks...
> > >
> > > Yes, but I was trying to map it to an existing mechanism and the
> > > internals of drop_pagecache_sb() are, in coarse terms, close to what
> > > needs to happen here.
> >
> > Yes.  XFS (with rmap enabled) can do all the iteration and walking in
> > that function except for the invalidate_mapping_* call itself.  The goal
> > of this series is first to wire up a callback within both the block and
> > pmem subsystems so that they can take notifications and reverse-map them
> > through the storage stack until they reach an fs superblock.
> 
> I'm chuckling because this "reverse map all the way up the block
> layer" is the opposite of what Dave said at the first reaction to my
> proposal, "can't the mm map pfns to fs inode address_spaces?".

Ah, no, I never said that the filesystem can't do reverse maps. I
was asking if the mm could directly (brute-force) invalidate PTEs
pointing at physical pmem ranges without needing to walk the inode
mappings. That would be far more efficient if it could be done.

> Today whenever the pmem driver receives new corrupted range
> notification from the lower level nvdimm
> infrastructure (nd_pmem_notify) it updates the 'badblocks' instance
> associated with the pmem gendisk and then notifies userspace that
> there are new badblocks. This seems like a perfect place to signal an upper
> level stacked block device that may also be watching disk->bb. Then
> each gendisk in a stacked topology is responsible for watching the
> badblock notifications of the next level and storing a remapped
> instance of those blocks until ultimately the filesystem mounted on
> the top-level block device is responsible for registering for those
> top-level disk->bb events.
> 
> The device gone notification does not map cleanly onto 'struct badblocks'.
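
To make the disk->bb watching model described above concrete, here is a
minimal sketch of how an upper-level driver could consult the badblocks
instance of the gendisk below it. badblocks_check() is the existing
in-kernel API; the stacked_remap() helper and the trigger for calling it
are purely illustrative assumptions, not code from any posted series.

#include <linux/badblocks.h>
#include <linux/genhd.h>

/* hypothetical helper: record the bad range in this layer's own table */
extern void stacked_remap(struct gendisk *lower, sector_t s, int n);

static void stacked_watch_bb(struct gendisk *lower, sector_t start,
                             int sectors)
{
        sector_t first_bad;
        int bad_sectors;

        /* non-zero: [start, start+sectors) overlaps entries in lower->bb;
         * 1 means all overlapping ranges are acked, -1 means some are not */
        if (badblocks_check(lower->bb, start, sectors,
                            &first_bad, &bad_sectors))
                stacked_remap(lower, first_bad, bad_sectors);
}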

Filesystems are not allowed to interact with the gendisk
infrastructure - that's for supporting the device side of a block
device. It's a layering violation, and many a filesystem developer
has been shouted at for trying to do this. At most we can peek
through it to query functionality support from the request queue,
but otherwise filesystems do not interact with anything under
bdev->bd_disk.

As it is, badblocks are used by devices to manage internal state.
e.g. md for recording stripes that need recovery if the system
crashes while they are being written out.

> If an upper level agent really cared about knowing about ->remove()
> events before they happened it could maybe do something like:
> 
> dev = disk_to_dev(bdev->bd_disk)->parent;
> bus_register_notifier(dev->bus, &host_device_notifier_block);
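
Fleshed out, that two-liner would look something like the sketch below.
bus_register_notifier(), BUS_NOTIFY_DEL_DEVICE and disk_to_dev() are the
existing driver-core/block APIs; the notifier body is an illustrative
assumption (and, per the reply that follows, exactly the kind of hook
filesystems are discouraged from using).

#include <linux/blkdev.h>
#include <linux/device.h>
#include <linux/genhd.h>
#include <linux/notifier.h>

static int host_device_event(struct notifier_block *nb,
                             unsigned long action, void *data)
{
        struct device *dev = data;

        /* the bus notifier fires for every device on the bus; react to
         * removal events only */
        if (action == BUS_NOTIFY_DEL_DEVICE)
                pr_info("%s is being removed\n", dev_name(dev));

        return NOTIFY_OK;
}

static struct notifier_block host_device_notifier_block = {
        .notifier_call = host_device_event,
};

static int watch_parent_bus(struct block_device *bdev)
{
        struct device *dev = disk_to_dev(bdev->bd_disk)->parent;

        return bus_register_notifier(dev->bus, &host_device_notifier_block);
}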

Yeah, that's exactly the sort of thing that filesystems have been
aggressively discouraged from doing for years.

Part of the reason for this is that gendisk based mechanisms are not
very good for stacked device error reporting. Part of the problem
here is that every layer of the stacked device has to hook the
notifier of the block devices underneath it, then translate the
event to match the upper block device map, then regenerate the
notification for the next layer up. This isn't an efficient way to
pass a notification through a series of stacked devices and it is
messy and cumbersome to maintain.

It can be effective for getting notifications to userspace about
something that happens to a specific block device. But userspace
still ends up having to solve the "what does this error resolve to"
problem. i.e. Userspace still needs to map that notification to a
filesystem, and for data loss events map it to objects within the
filesystem, which can be extremely expensive to do from userspace.

This is exactly the sort of userspace error reporting mess that
various projects have asked us to try to fix. Plumbing errors
internally through the kernel up to the filesystem where the
filesystem can point directly to the user data that is affected is a
simple, effective solution to the problem. Especially if we then
have a generic error notification mechanism for filesystems to emit
errors to registered userspace watchers...

> I still don't think that solves the need for a separate mechanism for
> global dax_device pte invalidation.

It's just another type of media error.

> I think that global dax_device invalidation needs new kernel
> infrastructure to allow internal users, like dm-writecache and future
> filesystems using dax for metadata, to take a fault when pmem is
> offlined.

If userspace has directly mapped into the cache, and the cache
storage goes away, the userspace app has to be killed because we
have no idea if the device going away has caused data loss or not.

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dan Williams
On Mon, Mar 1, 2021 at 9:38 PM Dave Chinner  wrote:
>
> On Mon, Mar 01, 2021 at 07:33:28PM -0800, Dan Williams wrote:
> > On Mon, Mar 1, 2021 at 6:42 PM Dave Chinner  wrote:
> > [..]
> > > We do not need a DAX specific mechanism to tell us "DAX device
> > > gone", we need a generic block device interface that tells us "range
> > > of block device is gone".
> >
> > This is the crux of the disagreement. The block_device is going away
> > *and* the dax_device is going away.
>
> No, that is not the disagreement I have with what you are saying.
> You still haven't understood that it's even more basic and generic
> than devices going away. At the simplest form, all the filesystem
> wants is to be notified when *unrecoverable media errors*
> occur in the persistent storage that underlies the filesystem.
>
> The filesystem does not care what that media is built from - PMEM,
> flash, corroded spinning disks, MRAM, or any other persistent media
> you can think of. It just doesn't matter.
>
> What we care about is that the contents of a *specific LBA range* no
> longer contain *valid data*. IOWs, the data in that range of the
> block device has been lost, cannot be retrieved and/or cannot be
> written to any more.
>
> PMEM taking a MCE because ECC tripped is a media error because data
> is lost and inaccessible until recovery actions are taken.
>
> MD RAID failing a scrub is a media error and data is lost and
> unrecoverable at that layer.
>
> A device disappearing is a media error because the storage media is
> now permanently inaccessible to the higher layers.
>
> This "media error" categorisation is a fundamental property of
> persistent storage and, as such, is a property of the block devices
> used to access said persistent storage.
>
> That's the disagreement here - that you and Christoph are saying
> ->corrupted_range is not a block device property because only a
> pmem/DAX device currently generates it.
>
> You both seem to be NACKing a generic interface because it's only
> implemented for the first subsystem that needs it. AFAICT, you
> either don't understand or are completely ignoring the architectural
> need for it to be provided across the rest of the storage stack that
> *block device based filesystems depend on*.

No, I'm NAKing it because it's the wrong interface. See my 'struct
badblocks' argument in the reply to Darrick. That 'struct badblocks'
infrastructure arose from MD and is shared with PMEM.

>
> Sure, there might be dax device based filesystems around the corner.
> They just require a different pmem device ->corrupted_range callout
> to implement the notification - one that directs to the dax device
> rather than the block device. That's simple and trivial to
> implement, but such functionality for DAX devices does not replace
> the need for the same generic functionality to be provided across a
> *range of different block devices* as required by *block device
> based filesystems*.
>
> And that's fundamentally the problem. XFS is block device based, not
> DAX device based. We require errors to be reported through block
> device mechanisms. fs-dax does not change this - it is based on pmem
> being presented as a primarily as a block device to the block device
> based filesystems and only secondarily as a dax device. Hence if it
> can be trivially implemented as a block device interface, that's
> where it should go, because then all the other block devices that
> the filesystem runs on can provide the same functionality for similar
> media error events.

Sure, use 'struct badblocks' not struct block_device and
block_device_operations.
>
> > The dax_device removal implies one
> > set of actions (direct accessed pfns invalid); the block device removal
> > implies another (block layer sector access offline).
>
> There you go again, saying DAX requires an action, while the block
> device notification is a -state change- (i.e. goes offline).

There you go reacting to the least generous interpretation of what I said.

s/pfns invalid/pfns offline/

>
> This is exactly what I said was wrong in my last email.
>
> > corrupted_range
> > is blurring the notification for 2 different failure domains. Look at
> > the nascent idea to mount a filesystem on dax sans a block device.
> > Look at the existing plumbing for DM to map dax_operations through a
> > device stack.
>
> Ummm, it just maps the direct_access call to the underlying device
> and calls its ->direct_access method. All it's doing is LBA
> mapping. That's all it needs to do for ->corrupted_range, too.
> I have no clue why you think this is a problem for error
> notification...
>
> > Look at the pushback Ruan got for adding a new
> > block_device operation for corrupted_range().
>
> One person said "no". That's hardly pushback. Especially as I think
> Christoph's objection about this being dax specific functionality
> is simply wrong, as per above.

It's not wrong when we have a perfectly suitable object for sector
based error notification 

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dan Williams
On Mon, Mar 1, 2021 at 7:28 PM Darrick J. Wong  wrote:
>
> On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote:
> > On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner  wrote:
> > >
> > > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner  
> > > > wrote:
> > > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner  
> > > > > > wrote:
> > > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > > > it points to, check if it points to the PMEM that is being removed,
> > > > > grab the page it points to, map that to the relevant struct page,
> > > > > run collect_procs() on that page, then kill the user processes that
> > > > > map that page.
> > > > >
> > > > > So why can't we walk the ptes, check the physical pages that they
> > > > > map to, and if they map to a pmem page we go poison that
> > > > > page and that kills any user process that maps it.
> > > > >
> > > > > i.e. I can't see how unexpected pmem device unplug is any different
> > > > > to an MCE delivering a hwpoison event to a DAX mapped page.
> > > >
> > > > I guess the tradeoff is walking a long list of inodes vs walking a
> > > > large array of pages.
> > >
> > > Not really. You're assuming all a filesystem has to do is invalidate
> > > everything if a device goes away, and that's not true. Finding if an
> > > inode has a mapping that spans a specific device in a multi-device
> > > filesystem can be a lot more complex than that. Just walking inodes
> > > is easy - determining which inodes need invalidation is the hard
> > > part.
> >
> > That inode-to-device level of specificity is not needed for the same
> > reason that drop_caches does not need to be specific. If the wrong
> > page is unmapped a re-fault will bring it back, and re-fault will fail
> > for the pages that are successfully removed.
> >
> > > That's where ->corrupt_range() comes in - the filesystem is already
> > > set up to do reverse mapping from physical range to inode(s)
> > > offsets...
> >
> > Sure, but what is the need to get to that level of specificity with
> > the filesystem for something that should rarely happen in the course
> > of normal operation outside of a mistake?
>
> I can't tell if we're conflating the "a bunch of your pmem went bad"
> case with the "all your dimms fell out of the machine" case.

From the pmem driver perspective it has the media scanning to find
some small handful of cachelines that have gone bad, and it has the
driver ->remove() callback to tell it a bunch of pmem is now offline.
The NVDIMM device "range has gone bad" mechanism has no way to
communicate multiple terabytes have gone bad at once.

In fact I think the distinction is important that ->remove() is not
treated as ->corrupted_range() because I expect the level of freakout
is much worse for a "your storage is offline" notification vs "your
storage is corrupted" notification.

> If, say, a single cacheline's worth of pmem goes bad on a node with 2TB
> of pmem, I certainly want that level of specificity.  Just notify the
> users of the dead piece, don't flush the whole machine down the drain.

Right, something like corrupted_range() is there to say, "keep going
upper layers, but note that this handful of sectors now has
indeterminate data and will return -EIO on access until repaired. The
repair for device-offline is device-online.

>
> > > > There's likely always more pages than inodes, but perhaps it's more
> > > > efficient to walk the 'struct page' array than sb->s_inodes?
> > >
> > > I really don't see why you seem to be telling us that invalidation is
> > > an either/or choice. There are more ways to convert physical block
> > > address -> inode file offset and mapping index than brute force
> > > inode cache walks...
> >
> > Yes, but I was trying to map it to an existing mechanism and the
> > internals of drop_pagecache_sb() are, in coarse terms, close to what
> > needs to happen here.
>
> Yes.  XFS (with rmap enabled) can do all the iteration and walking in
> that function except for the invalidate_mapping_* call itself.  The goal
> of this series is first to wire up a callback within both the block and
> pmem subsystems so that they can take notifications and reverse-map them
> through the storage stack until they reach an fs superblock.

I'm chuckling because this "reverse map all the way up the block
layer" is the opposite of what Dave said at the first reaction to my
proposal, "can't the mm map pfns to fs inode  address_spaces?".

I think dax unmap is distinct from corrupted_range() precisely because
they are events happening in two different domains, block device
sectors vs dax device pfns.

Let's step back. I think a chain of ->corrupted_range() callbacks up
the block stack terminating in the filesystem with dax implications
tacked on is the wrong abstraction. Why not use the existing generic
object for communicating bad 

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dave Chinner
On Mon, Mar 01, 2021 at 07:33:28PM -0800, Dan Williams wrote:
> On Mon, Mar 1, 2021 at 6:42 PM Dave Chinner  wrote:
> [..]
> > We do not need a DAX specific mechanism to tell us "DAX device
> > gone", we need a generic block device interface that tells us "range
> > of block device is gone".
> 
> This is the crux of the disagreement. The block_device is going away
> *and* the dax_device is going away.

No, that is not the disagreement I have with what you are saying.
You still haven't understood that it's even more basic and generic
than devices going away. At the simplest form, all the filesystem
wants is to be notified when *unrecoverable media errors*
occur in the persistent storage that underlies the filesystem.

The filesystem does not care what that media is built from - PMEM,
flash, corroded spinning disks, MRAM, or any other persistent media
you can think of. It just doesn't matter.

What we care about is that the contents of a *specific LBA range* no
longer contain *valid data*. IOWs, the data in that range of the
block device has been lost, cannot be retrieved and/or cannot be
written to any more.

PMEM taking a MCE because ECC tripped is a media error because data
is lost and inaccessible until recovery actions are taken.

MD RAID failing a scrub is a media error and data is lost and
unrecoverable at that layer.

A device disappearing is a media error because the storage media is
now permanently inaccessible to the higher layers.

This "media error" categorisation is a fundamental property of
persistent storage and, as such, is a property of the block devices
used to access said persistent storage.

That's the disagreement here - that you and Christoph are saying
->corrupted_range is not a block device property because only a
pmem/DAX device currently generates it.

You both seem to be NACKing a generic interface because it's only
implemented for the first subsystem that needs it. AFAICT, you
either don't understand or are completely ignoring the architectural
need for it to be provided across the rest of the storage stack that
*block device based filesystems depend on*.

Sure, there might be dax device based filesystems around the corner.
They just require a different pmem device ->corrupted_range callout
to implement the notification - one that directs to the dax device
rather than the block device. That's simple and trivial to
implement, but such functionality for DAX devices does not replace
the need for the same generic functionality to be provided across a
*range of different block devices* as required by *block device
based filesystems*.

And that's fundamentally the problem. XFS is block device based, not
DAX device based. We require errors to be reported through block
device mechanisms. fs-dax does not change this - it is based on pmem
being presented as a primarily as a block device to the block device
based filesystems and only secondarily as a dax device. Hence if it
can be trivially implemented as a block device interface, that's
where it should go, because then all the other block devices that
the filesystem runs on can provide the same functionality for similar
media error events.

> The dax_device removal implies one
> set of actions (direct accessed pfns invalid); the block device removal
> implies another (block layer sector access offline).

There you go again, saying DAX requires an action, while the block
device notification is a -state change- (i.e. goes offline).

This is exactly what I said was wrong in my last email.

> corrupted_range
> is blurring the notification for 2 different failure domains. Look at
> the nascent idea to mount a filesystem on dax sans a block device.
> Look at the existing plumbing for DM to map dax_operations through a
> device stack.

Ummm, it just maps the direct_access call to the underlying device
and calls its ->direct_access method. All it's doing is LBA
mapping. That's all it needs to do for ->corrupted_range, too.
I have no clue why you think this is a problem for error
notification...
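
As a hedged sketch of the LBA mapping being described here: for a
linear DM-style target, translating a component device's bad range into
the range the stacked device exposes is a single offset calculation.
The ->corrupted_range() op is from Ruan's proposed series rather than
an upstream interface, and the target layout and upcall below are
simplified assumptions for illustration.

#include <linux/blkdev.h>
#include <linux/types.h>

/* simplified linear target: exposes [0, len) and maps it onto
 * [start, start + len) of the component device below */
struct linear_target {
        struct block_device *lower;
        sector_t start;
};

/* hypothetical upcall to the next layer up (or the filesystem on top) */
extern int notify_corrupted_range(sector_t sector, sector_t len);

static int linear_corrupted_range(struct linear_target *lt,
                                  sector_t lower_sector, sector_t len)
{
        /* translate the component device's bad range into the LBA range
         * this stacked device exposes, then pass the notification up */
        sector_t upper_sector = lower_sector - lt->start;

        return notify_corrupted_range(upper_sector, len);
}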

> Look at the pushback Ruan got for adding a new
> block_device operation for corrupted_range().

One person said "no". That's hardly pushback. Especially as I think
Christoph's objection about this being dax specific functionality
is simply wrong, as per above.

> > This is why we need to communicate what error occurred, not what
> > action a device driver thinks needs to be taken.
> 
> The driver is only an event producer in this model, whatever the
> consumer does at the other end is not its concern. There may be a
> generic consumer and a filesystem specific consumer.



That's why these are all ops functions that can provide multiple
implementations to different device types. So that when we get a new
use case, the ops function structure can be replaced with one that
directs the notification to the new user instead of to the existing
one. It's a design pattern we use all over the kernel code.

Cheers,

Dave.
-- 
Dave Chinner

Re: [ndctl PATCH 2/2] ndctl/test: add checking the presence of jq command ahead

2021-03-01 Thread Santosh Sivaraj
QI Fuli  writes:

> Due to the lack of the jq command, the result of the test will be 'fail'.
> This patch adds a check for the presence of the jq command ahead of time.
> If there is no jq command in the system, the test will be marked as 'skip'.
>
> Signed-off-by: QI Fuli 
> Link: https://github.com/pmem/ndctl/issues/141

Though it looks slightly redundant after having the check in configure.ac, I'm
not against having additional checks.

Reviewed-by: Santosh Sivaraj 

Thanks,
Santosh

> ---
>  test/daxdev-errors.sh   | 1 +
>  test/inject-error.sh| 2 ++
>  test/inject-smart.sh| 1 +
>  test/label-compat.sh| 1 +
>  test/max_available_extent_ns.sh | 1 +
>  test/monitor.sh | 2 ++
>  test/multi-dax.sh   | 1 +
>  test/sector-mode.sh | 2 ++
>  8 files changed, 11 insertions(+)
>
> diff --git a/test/daxdev-errors.sh b/test/daxdev-errors.sh
> index 6281f32..9547d78 100755
> --- a/test/daxdev-errors.sh
> +++ b/test/daxdev-errors.sh
> @@ -9,6 +9,7 @@ rc=77
>  . $(dirname $0)/common
>
>  check_min_kver "4.12" || do_skip "lacks dax dev error handling"
> +check_prereq "jq"
>
>  trap 'err $LINENO' ERR
>
> diff --git a/test/inject-error.sh b/test/inject-error.sh
> index c636033..7d0b826 100755
> --- a/test/inject-error.sh
> +++ b/test/inject-error.sh
> @@ -11,6 +11,8 @@ err_count=8
>
>  . $(dirname $0)/common
>
> +check_prereq "jq"
> +
>  trap 'err $LINENO' ERR
>
>  # sample json:
> diff --git a/test/inject-smart.sh b/test/inject-smart.sh
> index 94705df..4ca83b8 100755
> --- a/test/inject-smart.sh
> +++ b/test/inject-smart.sh
> @@ -166,6 +166,7 @@ do_tests()
>  }
>
>  check_min_kver "4.19" || do_skip "kernel $KVER may not support smart 
> (un)injection"
> +check_prereq "jq"
>  modprobe nfit_test
>  rc=1
>
> diff --git a/test/label-compat.sh b/test/label-compat.sh
> index 340b93d..8ab2858 100755
> --- a/test/label-compat.sh
> +++ b/test/label-compat.sh
> @@ -10,6 +10,7 @@ BASE=$(dirname $0)
>  . $BASE/common
>
>  check_min_kver "4.11" || do_skip "may not provide reliable isetcookie values"
> +check_prereq "jq"
>
>  trap 'err $LINENO' ERR
>
> diff --git a/test/max_available_extent_ns.sh b/test/max_available_extent_ns.sh
> index 14d741d..343f3c9 100755
> --- a/test/max_available_extent_ns.sh
> +++ b/test/max_available_extent_ns.sh
> @@ -9,6 +9,7 @@ rc=77
>  trap 'err $LINENO' ERR
>
>  check_min_kver "4.19" || do_skip "kernel $KVER may not support 
> max_available_size"
> +check_prereq "jq"
>
>  init()
>  {
> diff --git a/test/monitor.sh b/test/monitor.sh
> index cdab5e1..28c5541 100755
> --- a/test/monitor.sh
> +++ b/test/monitor.sh
> @@ -13,6 +13,8 @@ smart_supported_bus=""
>
>  . $(dirname $0)/common
>
> +check_prereq "jq"
> +
>  trap 'err $LINENO' ERR
>
>  check_min_kver "4.15" || do_skip "kernel $KVER may not support monitor 
> service"
> diff --git a/test/multi-dax.sh b/test/multi-dax.sh
> index e932569..8496619 100755
> --- a/test/multi-dax.sh
> +++ b/test/multi-dax.sh
> @@ -9,6 +9,7 @@ rc=77
>  . $(dirname $0)/common
>
>  check_min_kver "4.13" || do_skip "may lack multi-dax support"
> +check_prereq "jq"
>
>  trap 'err $LINENO' ERR
>
> diff --git a/test/sector-mode.sh b/test/sector-mode.sh
> index dd7013e..54fa806 100755
> --- a/test/sector-mode.sh
> +++ b/test/sector-mode.sh
> @@ -6,6 +6,8 @@ rc=77
>
>  . $(dirname $0)/common
>
> +check_prereq "jq"
> +
>  set -e
>  trap 'err $LINENO' ERR
>
> --
> 2.29.2
> ___
> Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
> To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [ndctl PATCH 1/2] configure: add checking jq command

2021-03-01 Thread Santosh Sivaraj


Hi Qi,

QI Fuli  writes:

> Add a check for the jq command since it is needed to validate tests
>
> Cc: Santosh Sivaraj 
> Signed-off-by: QI Fuli 
> Link: https://github.com/pmem/ndctl/issues/141
> ---
>  configure.ac | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/configure.ac b/configure.ac
> index 5ec8d2f..839836b 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -65,6 +65,12 @@ fi
>  AC_SUBST([XMLTO])
>  fi
>
> +AC_CHECK_PROG(JQ, [jq], [$(which jq)], [missing])
> +if test "x$JQ" = xmissing; then
> + AC_MSG_ERROR([jq command needed to validate tests])
> +fi
> +AC_SUBST([JQ])
> +
>  AC_C_TYPEOF
>  AC_DEFINE([HAVE_STATEMENT_EXPR], 1, [Define to 1 if you have statement expressions.])
>
> --
> 2.29.2

Acked-by: Santosh Sivaraj 

Thanks,
Santosh
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[RFC PATCH v1 6/6] badblocks: switch to the improved badblock handling code

2021-03-01 Thread Coly Li
This patch removes the old code of badblocks_set(), badblocks_clear() and
badblocks_check(), and turns them into wrappers that call _badblocks_set(),
_badblocks_clear() and _badblocks_check().

With this change the bad block handling switches to the improved algorithms
in _badblocks_set(), _badblocks_clear() and _badblocks_check().

This patch only contains the deletion of the old code; the new code for the
improved algorithms was added in the previous patches.

Signed-off-by: Coly Li 
---
 block/badblocks.c | 310 +-
 1 file changed, 3 insertions(+), 307 deletions(-)

diff --git a/block/badblocks.c b/block/badblocks.c
index 304b91159a42..904c6ed0de6d 100644
--- a/block/badblocks.c
+++ b/block/badblocks.c
@@ -1386,75 +1386,7 @@ static int _badblocks_check(struct badblocks *bb, sector_t s, int sectors,
 int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
sector_t *first_bad, int *bad_sectors)
 {
-   int hi;
-   int lo;
-   u64 *p = bb->page;
-   int rv;
-   sector_t target = s + sectors;
-   unsigned seq;
-
-   if (bb->shift > 0) {
-   /* round the start down, and the end up */
-   s >>= bb->shift;
-   target += (1<<bb->shift) - 1;
-   target >>= bb->shift;
-   sectors = target - s;
-   }
-   /* 'target' is now the first block after the bad range */
-
-retry:
-   seq = read_seqbegin(&bb->lock);
-   lo = 0;
-   rv = 0;
-   hi = bb->count;
-
-   /* Binary search between lo and hi for 'target'
-* i.e. for the last range that starts before 'target'
-*/
-   /* INVARIANT: ranges before 'lo' and at-or-after 'hi'
-* are known not to be the last range before target.
-* VARIANT: hi-lo is the number of possible
-* ranges, and decreases until it reaches 1
-*/
-   while (hi - lo > 1) {
-   int mid = (lo + hi) / 2;
-   sector_t a = BB_OFFSET(p[mid]);
-
-   if (a < target)
-   /* This could still be the one, earlier ranges
-* could not.
-*/
-   lo = mid;
-   else
-   /* This and later ranges are definitely out. */
-   hi = mid;
-   }
-   /* 'lo' might be the last that started before target, but 'hi' isn't */
-   if (hi > lo) {
-   /* need to check all range that end after 's' to see if
-* any are unacknowledged.
-*/
-   while (lo >= 0 &&
-  BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
-   if (BB_OFFSET(p[lo]) < target) {
-   /* starts before the end, and finishes after
-* the start, so they must overlap
-*/
-   if (rv != -1 && BB_ACK(p[lo]))
-   rv = 1;
-   else
-   rv = -1;
-   *first_bad = BB_OFFSET(p[lo]);
-   *bad_sectors = BB_LEN(p[lo]);
-   }
-   lo--;
-   }
-   }
-
-   if (read_seqretry(&bb->lock, seq))
-   goto retry;
-
-   return rv;
+   return _badblocks_check(bb, s, sectors, first_bad, bad_sectors);
 }
 EXPORT_SYMBOL_GPL(badblocks_check);
 
@@ -1476,154 +1408,7 @@ EXPORT_SYMBOL_GPL(badblocks_check);
 int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
int acknowledged)
 {
-   u64 *p;
-   int lo, hi;
-   int rv = 0;
-   unsigned long flags;
-
-   if (bb->shift < 0)
-   /* badblocks are disabled */
-   return 1;
-
-   if (bb->shift) {
-   /* round the start down, and the end up */
-   sector_t next = s + sectors;
-
-   s >>= bb->shift;
-   next += (1<<bb->shift) - 1;
-   next >>= bb->shift;
-   sectors = next - s;
-   }
-
-   write_seqlock_irqsave(&bb->lock, flags);
-
-   p = bb->page;
-   lo = 0;
-   hi = bb->count;
-   /* Find the last range that starts at-or-before 's' */
-   while (hi - lo > 1) {
-   int mid = (lo + hi) / 2;
-   sector_t a = BB_OFFSET(p[mid]);
-
-   if (a <= s)
-   lo = mid;
-   else
-   hi = mid;
-   }
-   if (hi > lo && BB_OFFSET(p[lo]) > s)
-   hi = lo;
-
-   if (hi > lo) {
-   /* we found a range that might merge with the start
-* of our new range
-*/
-   sector_t a = BB_OFFSET(p[lo]);
-   sector_t e = a + BB_LEN(p[lo]);
-   int ack = BB_ACK(p[lo]);
-
-   if (e >= s) {

[RFC PATCH v1 5/6] badblocks: improve badblocks_check() for multiple ranges handling

2021-03-01 Thread Coly Li
This patch rewrites badblocks_check() in a similar coding style to
_badblocks_set() and _badblocks_clear(). The only difference is that bad
block checking may now handle multiple ranges in the bad table.

If a checking range covers multiple bad block ranges in the bad block table,
like the following condition (C is the checking range; E1, E2, E3 are
three bad block ranges in the bad block table),
   +------------------------------------+
   |                  C                 |
   +------------------------------------+
     +----+      +----+      +----+
     | E1 |      | E2 |      | E3 |
     +----+      +----+      +----+
The improved badblocks_check() algorithm will divide checking range C
into multiple parts, and handle them in 7 runs of a while-loop,
  +--+ +----+ +----+ +----+ +----+ +----+ +----+
  |C1| | C2 | | C3 | | C4 | | C5 | | C6 | | C7 |
  +--+ +----+ +----+ +----+ +----+ +----+ +----+
       +----+        +----+        +----+
       | E1 |        | E2 |        | E3 |
       +----+        +----+        +----+
And the start LBA and length of range E1 will be set as first_bad and
bad_sectors for the caller.

The return value rule is consistent for multiple ranges. For example, if
there are the following bad block ranges in the bad block table,
   Index No.   Start   Len   Ack
       0        400     20    1
       1        500     50    1
       2        650     20    0
the return value, first_bad and bad_sectors from calling badblocks_check()
with different checking ranges can be the following values,
   Checking Start, Len   Return Value   first_bad   bad_sectors
       100, 100               0            N/A          N/A
       100, 310               1            400          10
       100, 440               1            400          10
       100, 540               1            400          10
       100, 600              -1            400          10
       100, 800              -1            400          10
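
For reference, a hedged caller-side sketch of the second table row (the
surrounding declarations are illustrative; the badblocks_check()
signature is the one declared in include/linux/badblocks.h):

        sector_t first_bad;
        int bad_sectors;
        int rv;

        /* check sectors [100, 410) against the example table above */
        rv = badblocks_check(bb, 100, 310, &first_bad, &bad_sectors);
        /*
         * Per the table: rv == 1 (the overlapping ranges are all acked),
         * first_bad == 400 and bad_sectors == 10. rv == 0 would mean no
         * overlap; rv == -1 would mean an unacknowledged range overlaps.
         */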

In order to make code review easier, this patch names the improved bad
block range checking routine _badblocks_check() and does not change the
existing badblocks_check() code yet. A later patch will delete the old
code of badblocks_check() and turn it into a wrapper that calls
_badblocks_check(). Then the newly added code won't be mixed up with the
old deleted code, which makes it clearer and easier for code review.

Signed-off-by: Coly Li 
---
 block/badblocks.c | 99 +++
 1 file changed, 99 insertions(+)

diff --git a/block/badblocks.c b/block/badblocks.c
index 4db6d1adff42..304b91159a42 100644
--- a/block/badblocks.c
+++ b/block/badblocks.c
@@ -1249,6 +1249,105 @@ static int _badblocks_clear(struct badblocks *bb, sector_t s, int sectors)
return rv;
 }
 
+/* Do the exact work to check bad blocks range from the bad block table */
+static int _badblocks_check(struct badblocks *bb, sector_t s, int sectors,
+   sector_t *first_bad, int *bad_sectors)
+{
+   u64 *p;
+   struct bad_context bad;
+   int prev = -1, hint = -1, set = 0;
+   int unacked_badblocks, acked_badblocks;
+   int len, rv;
+   unsigned int seq;
+
+   WARN_ON(bb->shift < 0 || sectors == 0);
+
+   if (bb->shift > 0) {
+   sector_t target;
+
+   /* round the start down, and the end up */
+   target = s + sectors;
+   rounddown(s, bb->shift);
+   roundup(target, bb->shift);
+   sectors = target - s;
+   }
+
+retry:
+   seq = read_seqbegin(&bb->lock);
+
+   bad.orig_start = s;
+   bad.orig_len = sectors;
+   p = bb->page;
+   unacked_badblocks = 0;
+   acked_badblocks = 0;
+
+re_check:
+   bad.start = s;
+   bad.len = sectors;
+
+   if (badblocks_empty(bb)) {
+   len = sectors;
+   goto update_sectors;
+   }
+
+   prev = prev_badblocks(bb, &bad, hint);
+
+   /* start after all badblocks */
+   if ((prev + 1) >= bb->count && !overlap_front(bb, prev, &bad)) {
+   len = sectors;
+   goto update_sectors;
+   }
+
+   if (overlap_front(bb, prev, &bad)) {
+   if (BB_ACK(p[prev]))
+   acked_badblocks++;
+   else
+   unacked_badblocks++;
+
+   if (BB_END(p[prev]) >= (s + sectors))
+   len = sectors;
+   else
+   len = BB_END(p[prev]) - s;
+
+   if (set == 0) {
+   *first_bad = BB_OFFSET(p[prev]);
+   *bad_sectors = BB_LEN(p[prev]);
+   set = 1;
+   }
+   goto update_sectors;
+   }
+
+   /* Not front overlap, but behind overlap */
+   if ((prev + 1) < bb->count && overlap_behind(bb, &bad, prev + 1)) {
+   len = BB_OFFSET(p[prev + 1]) - bad.start;
+   hint = prev + 1;
+   goto 

[RFC PATCH v1 4/6] badblocks: improve badblocks_clear() for multiple ranges handling

2021-03-01 Thread Coly Li
With the fundamental ideas and helper routines from the badblocks_set()
improvement, clearing bad blocks for multiple ranges is much simpler.

With a similar idea to the badblocks_set() improvement, this patch
simplifies bad block range clearing into 5 situations. No matter how
complicated the clearing condition is, we just look at the head part of
the clearing range against the already set bad block ranges from the
bad block table. The rest will be handled in the next run of the
while-loop.

Based on the existing helpers added for badblocks_set(), this patch adds
two more helpers,
- front_clear()
  Clear the bad block range from the bad block table which is front
  overlapped with the clearing range.
- front_splitting_clear()
  Handle the condition that the clearing range hits the middle of an
  already set bad block range from the bad block table.

Similar to badblocks_set(), the first part of the clearing range is
handled against the relevant bad block range, which is found by
prev_badblocks(). In most cases a valid hint is provided to
prev_badblocks() to avoid unnecessary bad block table iteration.

This patch also adds detailed algorithm comments at the beginning of
badblocks.c, explaining which five simplified situations are categorized
and how all the bad block range clearing conditions are handled by these
five situations.

Again, in order to make the code review easier and avoid mixing the code
changes together, this patch does not modify badblocks_clear(); it
implements another routine called _badblocks_clear() for the improvement.
A later patch will delete the current code of badblocks_clear() and turn
it into a wrapper for _badblocks_clear(), so the code change is much
clearer for review.

Signed-off-by: Coly Li 
---
 block/badblocks.c | 319 ++
 1 file changed, 319 insertions(+)

diff --git a/block/badblocks.c b/block/badblocks.c
index b0a7780d75c2..4db6d1adff42 100644
--- a/block/badblocks.c
+++ b/block/badblocks.c
@@ -330,6 +330,123 @@
  * avoided. In my test with the hint to prev_badblocks(), except for the first
  * loop, all rested calls to prev_badblocks() can go into the fast path and
  * return correct bad blocks table index immediately.
+ *
+ *
+ * Clearing a bad blocks range from the bad block table follows a similar
+ * idea to setting, but is much simpler. The only thing to be noticed is
+ * that when the clearing range hits the middle of a bad block range, the
+ * existing bad block range will split into two, and one more item should
+ * be added into the bad block table. The simplified situations to be
+ * considered are (the already set bad block ranges in the bad block table
+ * are named with prefix E, and the clearing bad blocks range is named
+ * with prefix C):
+ *
+ * 1) A clearing range is not overlapped to any already set ranges in the bad
+ *    block table.
+ *    +-----+        |          +-----+          |              +-----+
+ *    |  C  |        |          |  C  |          |              |  C  |
+ *    +-----+   or   |          +-----+    or    |              +-----+
+ *            +---+  |   +----+         +----+   |   +---+
+ *            | E |  |   | E1 |         | E2 |   |   | E |
+ *            +---+  |   +----+         +----+   |   +---+
+ *    For the above situations, no bad block is cleared and no failure
+ *    happens; simply return 0.
+ * 2) The clearing range hits the middle of an already set bad blocks range in
+ *    the bad block table.
+ *            +---+
+ *            | C |
+ *            +---+
+ *    +-----------------+
+ *    |        E        |
+ *    +-----------------+
+ *    In this situation if the bad block table is not full, the range E will be
+ *    split into two ranges E1 and E2. The result is,
+ *    +------+      +------+
+ *    |  E1  |      |  E2  |
+ *    +------+      +------+
+ * 3) The clearing range starts at exactly the same LBA as an already set bad
+ *    block range from the bad block table.
+ * 3.1) Partially covered at head part
+ *    +----------+
+ *    |     C    |
+ *    +----------+
+ *    +-----------------+
+ *    |        E        |
+ *    +-----------------+
+ *    For this situation, the overlapped already set range will update its
+ *    start LBA to the end of C and shrink the range to BB_LEN(E) -
+ *    BB_LEN(C). No item is deleted from the bad block table. The result is,
+ *               +------+
+ *               |  E1  |
+ *               +------+
+ * 3.2) Exact fully covered
+ *    +-----------------+
+ *    |        C        |
+ *    +-----------------+
+ *    +-----------------+
+ *    |        E        |
+ *    +-----------------+
+ *    For this situation the whole bad block range E will be cleared and its
+ *    corresponding item is deleted from the bad block table.
+ * 4) The clearing range ends at exactly the same LBA as an already set bad
+ *    block range.
+ *            +-------+
+ *            |   C   |
+ *            +-------+

[RFC PATCH v1 3/6] badblocks: improvement badblocks_set() for multiple ranges handling

2021-03-01 Thread Coly Li
Recently I received a bug report that current badblocks code does not
properly handle multiple ranges. For example,
badblocks_set(bb, 32, 1, true);
badblocks_set(bb, 34, 1, true);
badblocks_set(bb, 36, 1, true);
badblocks_set(bb, 32, 12, true);
Then indeed badblocks_show() reports,
32 3
36 1
But the expected bad blocks table should be,
32 12
Obviously only the first two ranges are merged; badblocks_set() returns
and ignores the rest of the setting range.

This behavior is improper. If the caller of badblocks_set() wants to set
a range of blocks in the bad blocks table, all of the blocks in the range
should be handled, even if a previous part encounters failure.

The desired way for badblocks_set() to set a bad blocks range is,
- Set as many blocks from the setting range as possible in the bad
  blocks table.
- Merge the bad block ranges so they occupy as few slots as possible in
  the bad blocks table.
- Be fast.
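
A hedged userspace model of these desired merge semantics is sketched
below (names and layout are illustrative, not the kernel implementation;
it deliberately ignores the acknowledged flag, BB_MAX_LEN and the
table-full handling that the kernel code must also respect):

#include <stdio.h>

struct range { long start, len; };

/* toy "bad table", sorted and non-overlapping: the ranges from the report */
static struct range tbl[16] = { {32, 1}, {34, 1}, {36, 1} };
static int count = 3;

/* insert [s, s+len), merging every overlapping or adjacent entry */
static void set_range(long s, long len)
{
        struct range out[16];
        long end = s + len;
        int n = 0, i = 0;

        /* keep entries that end strictly before the new range starts */
        while (i < count && tbl[i].start + tbl[i].len < s)
                out[n++] = tbl[i++];

        /* swallow every entry that overlaps or touches [s, end) */
        while (i < count && tbl[i].start <= end) {
                if (tbl[i].start < s)
                        s = tbl[i].start;
                if (tbl[i].start + tbl[i].len > end)
                        end = tbl[i].start + tbl[i].len;
                i++;
        }
        out[n].start = s;
        out[n].len = end - s;
        n++;

        /* keep the rest */
        while (i < count)
                out[n++] = tbl[i++];

        for (i = 0; i < n; i++)
                tbl[i] = out[i];
        count = n;
}

int main(void)
{
        int i;

        set_range(32, 12);      /* the failing case reported above */
        for (i = 0; i < count; i++)
                printf("%ld %ld\n", tbl[i].start, tbl[i].len);
        return 0;               /* prints "32 12", the expected table */
}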

Indeed the above proposal is complicated, especially with the following
restrictions,
- The setting bad blocks range can be acknowledged or not acknowledged.
- The bad blocks table size is limited.
- Memory allocation should be avoided.

The basic idea of the patch is to categorize all possible bad block
range setting combinations into a much smaller set of simplified, less
special conditions. Inside badblocks_set() there is an implicit loop
composed by jumping between the labels 're_insert' and 'update_sectors'.
No matter how large the setting bad blocks range is, in every loop
iteration just a minimized range from the head is handled by a
pre-defined behavior from one of the categorized conditions. The logic
is simple and the code flow is manageable.

The different relative layouts between the setting range and the
existing bad block ranges are checked and handled (merge, combine,
overwrite, insert) by the helpers in the previous patch. This patch
makes all the helpers work together with the above idea.

This patch only contains the algorithm improvement for badblocks_set().
The following patches contain the improvements for badblocks_clear() and
badblocks_check(). But the algorithm in badblocks_set() is fundamental
and typical; the other improvements in the clear and check routines are
based on all the helpers and ideas in this patch.

In order to make the change clearer for code review, this patch does not
directly modify the existing badblocks_set(); it just adds a new routine
named _badblocks_set(). A later patch will remove the current
badblocks_set() code and turn it into a wrapper for _badblocks_set(). So
the newly added change won't be mixed with deleted code, and the code
review can be easier.

Signed-off-by: Coly Li 
---
 block/badblocks.c | 561 --
 1 file changed, 541 insertions(+), 20 deletions(-)

diff --git a/block/badblocks.c b/block/badblocks.c
index fd76bbe7b5a2..b0a7780d75c2 100644
--- a/block/badblocks.c
+++ b/block/badblocks.c
@@ -16,6 +16,322 @@
 #include 
 #include 
 
+/*
+ * The purpose of badblocks set/clear is to manage bad blocks ranges which are
+ * identified by LBA addresses.
+ *
+ * When the caller of badblocks_set() wants to set a range of bad blocks, the
+ * setting range can be acked or unacked. And the setting range may merge,
+ * overwrite, or skip the overlapped already set range, depending on how the
+ * ranges overlap or are adjacent, and the acknowledgment type of the ranges.
+ * It can be more complicated when the setting range covers multiple already
+ * set bad block ranges, with restrictions of the maximum length of each bad
+ * range and the bad table space limitation.
+ *
+ * It is difficult and unnecessary to take care of all the possible situations;
+ * for setting a large range of bad blocks, we can handle it by dividing the
+ * large range into smaller ones when encountering overlap, max range length or
+ * bad table full conditions. Every time only a smaller piece of the bad range
+ * is handled, with a limited number of conditions for how it interacts with
+ * possible overlapped or adjacent already set bad block ranges. Then the hard
+ * complicated problem can be much simpler to handle in a proper way.
+ *
+ * When setting a range of bad blocks in the bad table, the simplified
+ * situations to be considered are (the already set bad block ranges are named
+ * with prefix E, and the setting bad blocks range is named with prefix S):
+ *
+ * 1) A setting range is not overlapped or adjacent to any other already set
+ *    bad block range.
+ *                  +--------+
+ *                  |    S   |
+ *                  +--------+
+ *    +-------+                  +-------+
+ *    |   E1  |                  |   E2  |
+ *    +-------+                  +-------+
+ *    For this situation if the bad blocks table is not full, just allocate a
+ *    free slot from the bad blocks table to mark the setting range S. The
+ *    result is,
+ *+-+  ++   

[RFC PATCH v1 2/6] badblocks: add helper routines for badblock ranges handling

2021-03-01 Thread Coly Li
This patch adds several helper routines to improve badblock ranges
handling. These helper routines will be used later in the improved
version of badblocks_set()/badblocks_clear()/badblocks_check().

- Helpers prev_by_hint() and prev_badblocks() are used to find the bad
  range in the bad table which the search range starts at or after.

- The following helpers are to decide the relative layout between the
  manipulating range and an existing bad block range from the bad table.
  - can_merge_behind()
    Return 'true' if the manipulating range can backward merge with the
    bad block range.
  - can_merge_front()
    Return 'true' if the manipulating range can forward merge with the
    bad block range.
  - can_combine_front()
    Return 'true' if two adjacent bad block ranges before the
    manipulating range can be merged.
  - overlap_front()
    Return 'true' if the manipulating range exactly overlaps with the
    bad block range in front of its range.
  - overlap_behind()
    Return 'true' if the manipulating range exactly overlaps with the
    bad block range behind its range.
  - can_front_overwrite()
    Return 'true' if the manipulating range can forward overwrite the
    bad block range in front of its range.

- The following helpers are to add the manipulating range into the bad
  block table. A different routine is called for each specific relative
  layout between the manipulating range and the other bad block ranges
  in the bad block table.
  - behind_merge()
    Merge the manipulating range with the bad block range behind its
    range, and return the merged length in units of sectors.
  - front_merge()
    Merge the manipulating range with the bad block range in front of
    its range, and return the merged length in units of sectors.
  - front_combine()
    Combine the two adjacent bad block ranges before the manipulating
    range into a larger one.
  - front_overwrite()
    Overwrite part or all of the bad block range which is in front of
    the manipulating range. The overwrite may split the existing bad
    block range and generate more bad block ranges in the bad block
    table.
  - insert_at()
    Insert the manipulating range at a specific location in the bad
    block table.

All the above helpers are used in later patches to improve the bad block
ranges handling for badblocks_set()/badblocks_clear()/badblocks_check().

Signed-off-by: Coly Li 
---
 block/badblocks.c | 374 ++
 1 file changed, 374 insertions(+)

diff --git a/block/badblocks.c b/block/badblocks.c
index d39056630d9c..fd76bbe7b5a2 100644
--- a/block/badblocks.c
+++ b/block/badblocks.c
@@ -16,6 +16,380 @@
 #include 
 #include 
 
+/*
+ * Find the range starts at-or-before 's' from bad table. The search
+ * starts from index 'hint' and stops at index 'hint_end' from the bad
+ * table.
+ */
+static int prev_by_hint(struct badblocks *bb, sector_t s, int hint)
+{
+   u64 *p = bb->page;
+   int ret = -1;
+   int hint_end = hint + 2;
+
+   while ((hint < hint_end) && ((hint + 1) <= bb->count) &&
+  (BB_OFFSET(p[hint]) <= s)) {
+   if ((hint + 1) == bb->count || BB_OFFSET(p[hint + 1]) > s) {
+   ret = hint;
+   break;
+   }
+   hint++;
+   }
+
+   return ret;
+}
+
+/*
+ * Find the range starts at-or-before bad->start. If 'hint' is provided
+ * (hint >= 0) then search in the bad table from hint firstly. It is
+ * very probably the wanted bad range can be found from the hint index,
+ * then the unnecessary while-loop iteration can be avoided.
+ */
+static int prev_badblocks(struct badblocks *bb, struct bad_context *bad,
+ int hint)
+{
+   u64 *p;
+   int lo, hi;
+   sector_t s = bad->start;
+   int ret = -1;
+
+   if (!bb->count)
+   goto out;
+
+   if (hint >= 0) {
+   ret = prev_by_hint(bb, s, hint);
+   if (ret >= 0)
+   goto out;
+   }
+
+   lo = 0;
+   hi = bb->count;
+   p = bb->page;
+
+   while (hi - lo > 1) {
+   int mid = (lo + hi)/2;
+   sector_t a = BB_OFFSET(p[mid]);
+
+   if (a <= s)
+   lo = mid;
+   else
+   hi = mid;
+   }
+
+   if (BB_OFFSET(p[lo]) <= s)
+   ret = lo;
+out:
+   return ret;
+}
+
+/*
+ * Return 'true' if the range indicated by 'bad' can be backward merged
+ * with the bad range (from the bad table) indexed by 'behind'.
+ */
+static bool can_merge_behind(struct badblocks *bb, struct bad_context *bad,
+int behind)
+{
+   u64 *p = bb->page;
+   sector_t s = bad->start;
+   sector_t sectors = bad->len;
+   int ack = bad->ack;
+
+   if ((s <= BB_OFFSET(p[behind])) &&
+   ((s + sectors) >= BB_OFFSET(p[behind])) &&
+   ((BB_END(p[behind]) - s) <= BB_MAX_LEN) &&
+

[RFC PATCH v1 1/6] badblocks: add more helper structure and routines in badblocks.h

2021-03-01 Thread Coly Li
This patch adds the following helper structure and routines into
badblocks.h,
- struct bad_context
  This structure is used in improved badblocks code for bad table
  iteration.
- BB_END()
  The macro to calculate the end LBA of a bad range record from the bad
  table.
- badblocks_full() and badblocks_empty()
  The inline routines to check whether bad table is full or empty.
- set_changed() and clear_changed()
  The inline routines to set and clear 'changed' tag from struct
  badblocks.

These new helper structures and routines help make the code clearer;
they will be used in the improved badblocks code in the following
patches.

Signed-off-by: Coly Li 
---
 include/linux/badblocks.h | 32 
 1 file changed, 32 insertions(+)

diff --git a/include/linux/badblocks.h b/include/linux/badblocks.h
index 2426276b9bd3..166161842d1f 100644
--- a/include/linux/badblocks.h
+++ b/include/linux/badblocks.h
@@ -15,6 +15,7 @@
 #define BB_OFFSET(x)   (((x) & BB_OFFSET_MASK) >> 9)
 #define BB_LEN(x)  (((x) & BB_LEN_MASK) + 1)
 #define BB_ACK(x)  (!!((x) & BB_ACK_MASK))
+#define BB_END(x)  (BB_OFFSET(x) + BB_LEN(x))
 #define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))
 
 /* Bad block numbers are stored sorted in a single page.
@@ -41,6 +42,14 @@ struct badblocks {
sector_t size;  /* in sectors */
 };
 
+struct bad_context {
+   sector_t        start;
+   sector_t        len;
+   int             ack;
+   sector_t        orig_start;
+   sector_t        orig_len;
+};
+
 int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
   sector_t *first_bad, int *bad_sectors);
 int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
@@ -63,4 +72,27 @@ static inline void devm_exit_badblocks(struct device *dev, struct badblocks *bb)
}
badblocks_exit(bb);
 }
+
+static inline int badblocks_full(struct badblocks *bb)
+{
+   return (bb->count >= MAX_BADBLOCKS);
+}
+
+static inline int badblocks_empty(struct badblocks *bb)
+{
+   return (bb->count == 0);
+}
+
+static inline void set_changed(struct badblocks *bb)
+{
+   if (bb->changed != 1)
+   bb->changed = 1;
+}
+
+static inline void clear_changed(struct badblocks *bb)
+{
+   if (bb->changed != 0)
+   bb->changed = 0;
+}
+
 #endif
-- 
2.26.2
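
A self-contained sanity check of the bit layout behind the BB_* macros
above (a hedged sketch: the mask values are restated assumptions taken
to match include/linux/badblocks.h, since they sit outside the quoted
hunk — bits 9..62 hold the offset, bits 0..8 hold len-1, bit 63 is the
ack flag):

#include <stdio.h>
#include <stdint.h>

typedef uint64_t u64;

/* layout restated from include/linux/badblocks.h (assumption noted in
 * the lead-in): bits 9..62 = offset, bits 0..8 = len - 1, bit 63 = ack */
#define BB_OFFSET_MASK  (((u64)1 << 63) - (1 << 9))
#define BB_LEN_MASK     ((u64)0x1ff)
#define BB_ACK_MASK     ((u64)1 << 63)

#define BB_OFFSET(x)    (((x) & BB_OFFSET_MASK) >> 9)
#define BB_LEN(x)       (((x) & BB_LEN_MASK) + 1)
#define BB_ACK(x)       (!!((x) & BB_ACK_MASK))
#define BB_END(x)       (BB_OFFSET(x) + BB_LEN(x))
#define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))

int main(void)
{
        u64 e = BB_MAKE((u64)400, 20, 1);       /* 20 bad sectors at LBA 400 */

        printf("offset=%llu len=%llu end=%llu ack=%d\n",
               (unsigned long long)BB_OFFSET(e),
               (unsigned long long)BB_LEN(e),
               (unsigned long long)BB_END(e),
               BB_ACK(e));
        return 0;       /* prints: offset=400 len=20 end=420 ack=1 */
}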
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[RFC PATCH v1 0/6] badblocks improvement for multiple bad block ranges

2021-03-01 Thread Coly Li
This is the first completed effort to improve the badblocks code to
handle multiple ranges in the bad block table.

There is neither an in-memory nor an on-disk format change in this
series; all existing APIs and data structures are unchanged. This series
only improves the code algorithm to handle more corner cases; the
interfaces are the same and remain consistent for all existing callers
(md raid and nvdimm drivers).

The original motivation for the change is a requirement from our
customer: the current badblocks routines don't handle multiple ranges.
For example, if the bad block setting range covers multiple ranges in
the bad block table, only the first two bad block ranges are merged and
the rest are left intact. The expected behavior is for all the covered
ranges to be handled.

All the patches are tested with modified user space code and the code
logic works as expected. Kernel space testing and debugging is under way
while I am asking for help with code review at the same time.

The whole change is divided into 6 patches to make the code review more
clear and easier. If people prefer, I'd like to post a single large
patch finally after the code review accomplished.

Thank you in advance for any review comment and suggestion.

Coly Li (6):
  badblocks: add more helper structure and routines in badblocks.h
  badblocks: add helper routines for badblock ranges handling
  badblocks: improvement badblocks_set() for multiple ranges handling
  badblocks: improve badblocks_clear() for multiple ranges handling
  badblocks: improve badblocks_check() for multiple ranges handling
  badblocks: switch to the improved badblock handling code

 block/badblocks.c | 1591 ++---
 include/linux/badblocks.h |   32 +
 2 files changed, 1332 insertions(+), 291 deletions(-)

-- 
2.26.2
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dan Williams
On Mon, Mar 1, 2021 at 6:42 PM Dave Chinner  wrote:
[..]
> We do not need a DAX specific mechanism to tell us "DAX device
> gone", we need a generic block device interface that tells us "range
> of block device is gone".

This is the crux of the disagreement. The block_device is going away
*and* the dax_device is going away. The dax_device removal implies one
set of actions (direct accessed pfns invalid) the block device removal
implies another (block layer sector access offline). corrupted_range
is blurring the notification for 2 different failure domains. Look at
the nascent idea to mount a filesystem on dax sans a block device.
Look at the existing plumbing for DM to map dax_operations through a
device stack. Look at the pushback Ruan got for adding a new
block_device operation for corrupted_range().

> The reason that the block device is gone is irrelevant to the
> filesystem. The type of block device is also irrelevant. If the
> filesystem isn't using DAX (e.g. dax=never mount option) even when
> it is on a DAX capable device, then it just doesn't care about
> invalidating DAX mappings because it has none. But it still may care
> about shutting down the filesystem because the block device is gone.

Sure, let's have a discussion about a block_device gone notification,
and a dax_device gone notification.

> This is why we need to communicate what error occurred, not what
> action a device driver thinks needs to be taken.

The driver is only an event producer in this model, whatever the
consumer does at the other end is not its concern. There may be a
generic consumer and a filesystem specific consumer.

> The error is
> important to the filesystem, the action might be completely
> irrelevant. And, as we know now, shutdown on DAX enable filesystems
> needs to imply DAX mapping invalidation in all cases, not just when
> the device disappears from under the filesystem.

Sure.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Darrick J. Wong
On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote:
> On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner  wrote:
> >
> > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner  wrote:
> > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner  
> > > > > wrote:
> > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > > it points to, check if it points to the PMEM that is being removed,
> > > > grab the page it points to, map that to the relevant struct page,
> > > > run collect_procs() on that page, then kill the user processes that
> > > > map that page.
> > > >
> > > > So why can't we walk the ptes, check the physical pages that they
> > > > map to, and if they map to a pmem page we go poison that
> > > > page and that kills any user process that maps it.
> > > >
> > > > i.e. I can't see how unexpected pmem device unplug is any different
> > > > to an MCE delivering a hwpoison event to a DAX mapped page.
> > >
> > > I guess the tradeoff is walking a long list of inodes vs walking a
> > > large array of pages.
> >
> > Not really. You're assuming all a filesystem has to do is invalidate
> > everything if a device goes away, and that's not true. Finding if an
> > inode has a mapping that spans a specific device in a multi-device
> > filesystem can be a lot more complex than that. Just walking inodes
> > is easy - determining which inodes need invalidation is the hard
> > part.
> 
> That inode-to-device level of specificity is not needed for the same
> reason that drop_caches does not need to be specific. If the wrong
> page is unmapped a re-fault will bring it back, and re-fault will fail
> for the pages that are successfully removed.
> 
> > That's where ->corrupt_range() comes in - the filesystem is already
> > set up to do reverse mapping from physical range to inode(s)
> > offsets...
> 
> Sure, but what is the need to get to that level of specificity with
> the filesystem for something that should rarely happen in the course
> of normal operation outside of a mistake?

I can't tell if we're conflating the "a bunch of your pmem went bad"
case with the "all your dimms fell out of the machine" case.

If, say, a single cacheline's worth of pmem goes bad on a node with 2TB
of pmem, I certainly want that level of specificity.  Just notify the
users of the dead piece, don't flush the whole machine down the drain.

> > > There's likely always more pages than inodes, but perhaps it's more
> > > efficient to walk the 'struct page' array than sb->s_inodes?
> >
> > I really don't see you seem to be telling us that invalidation is an
> > either/or choice. There's more ways to convert physical block
> > address -> inode file offset and mapping index than brute force
> > inode cache walks
> 
> Yes, but I was trying to map it to an existing mechanism and the
> internals of drop_pagecache_sb() are, in coarse terms, close to what
> needs to happen here.

Yes.  XFS (with rmap enabled) can do all the iteration and walking in
that function except for the invalidate_mapping_* call itself.  The goal
of this series is first to wire up a callback within both the block and
pmem subsystems so that they can take notifications and reverse-map them
through the storage stack until they reach an fs superblock.

Once the information has reached XFS, it can use its own reverse
mappings to figure out which pages of which inodes are now targeted.
The future of DAX hw error handling can be that you throw the spitwad at
us, and it's our problem to distill that into mm invalidation calls.
XFS' reverse mapping data is indexed by storage location and isn't
sharded by address_space, so (except for the DIMMs falling out), we
don't need to walk the entire inode list or scan the entire mapping.

Between XFS and DAX and mm, the mm already has the invalidation calls,
xfs already has the distiller, and so all we need is that first bit.
The current mm code doesn't fully solve the problem, nor does it need
to, since it handles DRAM errors acceptably* already.

* Actually, the hwpoison code should _also_ be calling ->corrupted_range
when DRAM goes bad so that we can detect metadata failures and either
reload the buffer or (if it was dirty) shut down.

> >
> > .
> >
> > > > IOWs, what needs to happen at this point is very filesystem
> > > > specific. Assuming that "device unplug == filesystem dead" is not
> > > > correct, nor is specifying a generic action that assumes the
> > > > filesystem is dead because a device it is using went away.
> > >
> > > Ok, I think I set this discussion in the wrong direction implying any
> > > mapping of this action to a "filesystem dead" event. It's just a "zap
> > > all ptes" event and upper layers recover from there.
> >
> > Yes, that's exactly what ->corrupt_range() is intended for. It
> > allows the filesystem to lock out access to the bad range
> > and then recover the data. Or metadata, if that's where the bad
> > range lands. If that recovery fails, it can then report a data
> > loss/filesystem shutdown event to userspace and kill user procs
> > that span the bad range...

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dave Chinner
On Mon, Mar 01, 2021 at 04:32:36PM -0800, Dan Williams wrote:
> On Mon, Mar 1, 2021 at 2:47 PM Dave Chinner  wrote:
> > Now we have the filesystem people providing a mechanism for the pmem
> > devices to tell the filesystems about physical device failures so
> > they can handle such failures correctly themselves. Having the
> > device go away unexpectedly from underneath a mounted and active
> > filesystem is a *device failure*, not an "unplug event".
> 
> It's the same difference to the physical page, all mappings to that
> page need to be torn down. I'm happy to call an fs callback and let
> each filesystem do what it wants with a "every pfn in this dax device
> needs to be unmapped".

You keep talking like this is something specific to a DAX device.
It isn't - the filesystem needs to take specific actions if any type
of block device reports that it has a corrupted range, not just DAX.
A DAX device simply adds "and invalidate direct mappings" to the
list of stuff that needs to be done.

And as far as a filesystem is concerned, there is no difference
between "this 4kB range is bad" and "the range of this entire device
is bad". We have to do the same things in both situations.

> I'm looking at the ->corrupted_range() patches trying to map it to
> this use case and I don't see how, for example a realtime-xfs over DM
> over multiple PMEM gets the notification to the right place.
> bd_corrupted_range() uses get_super() which gets the wrong answer for
> both realtime-xfs and DM.

I'm not sure I follow your logic. What is generating the wrong
answer?

We already have infrastructure for the block device to look up the
superblock mounted on top of it, and DM already uses that for things
like "dmsetup suspend" to freeze the filesystem before it does
something.  This "superblock lookup" only occurs for the top level
DM device, not for the component pmem devices that make up the DM
device.


IOWs, if there's a DM device that maps multiple pmem devices, then
it should be stacking the bd_corrupted_range() callbacks to map the
physical device range to the range in the higher level DM device
that it belongs to. This mapping of ranges is what DM exists to do -
the filesystem has no clue about what devices make up a DM device,
so the DM device *must* translate ranges for component devices into
the LBA range it exposes to the filesystem.
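
A sketch of that stacking (helper names are hypothetical; DM's real
table-walking machinery is more involved):

  /*
   * Hypothetical: a DM device hooks the corrupted-range callbacks of
   * its component devices, translates the component's bad range into
   * its own LBA space, and re-reports it to the layer above.
   */
  static int dm_corrupted_range(struct block_device *component,
                                loff_t offset, size_t len, void *data)
  {
          struct mapped_device *md;
          loff_t lba;

          md = dm_md_from_component(component);             /* hypothetical */
          lba = dm_translate_to_lba(md, component, offset); /* hypothetical */

          /* recurse up the stack via the generic notifier */
          return bd_corrupted_range(dm_md_to_bdev(md),      /* hypothetical */
                                    lba, len, data);
  }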

> I'd flip that arrangement around and have the FS tell the block device
> "if something happens to you, here is the super_block to notify".

We already have a mechanism for this that the block device calls:
get_active_super(bdev). There can be only one superblock per block
device - the superblock has exclusive ownership of the block device
while the filesystem is mounted.

get_active_super() returns the superblock that sits on top of the
bdev with an active reference, allowing the caller to safely access
and operate on the superblock without having to worry about the
superblock going away in the middle of whatever operation the block
device needs to perform.
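
For illustration, a bd_corrupted_range() built on that mechanism
could be as simple as the following sketch (error handling elided,
->corrupted_range signature assumed as above):

  int bd_corrupted_range(struct block_device *bdev, loff_t offset,
                         size_t len, void *data)
  {
          struct super_block *sb = get_active_super(bdev);
          int ret = -EOPNOTSUPP;

          if (!sb)
                  return ret;
          if (sb->s_op->corrupted_range)
                  ret = sb->s_op->corrupted_range(sb, bdev, offset,
                                                  len, data);
          deactivate_super(sb);   /* drop the active reference */
          return ret;
  }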

If this isn't working, then existing storage stack functionality
doesn't work as it should and this needs fixing independently of
the PMEM/DAX stuff we are talking about here.

> So
> to me this looks like a fs_dax_register_super() helper that plumbs the
> superblock through an arbitrary stack of block devices to the leaf
> block-device that might want to send a notification up when a global
> unmap operation needs to be performed.

No, this is just wrong. The filesystem has no clue what block device
is at the leaf level of a block device stack, nor what LBA block
range represents that device within the address space the stacked
block devices present to the filesystem.

> > Please listen when we say "that is
> > not sufficient" because we don't want to be backed into a corner
> > that we have to fix ourselves again before we can enable some basic
> > filesystem functionality that we should have been able to support on
> > DAX from the start...
> 
> That's some revisionist interpretation of how the discovery of the
> reflink+dax+memory-error-handling collision went down.
> 
> The whole point of this discussion is to determine if
> ->corrupted_range() is suitable for this notification, and looking at
> the code as is, it isn't. Perhaps you have a different implementation
> of ->corrupted_range() in mind that allows this to be plumbed
> correctly?

So rather than try to make the generic ->corrupted_range
infrastructure be able to report "DAX range is invalid" (which is
the very definition of a corrupted block device range!), you want
to introduce a DAX specific notification to tell us that a range in
the block device contains invalid/corrupt data?

We're talking about a patchset that is in development. The proposed
notification path is supposed to be generic and *not PMEM specific*,
and is intended to handle exactly the use case that you raised.
The implementation may not be perfect yet, so rather 

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dan Williams
On Mon, Mar 1, 2021 at 2:47 PM Dave Chinner  wrote:
>
> On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote:
> > On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner  wrote:
> > >
> > > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner  
> > > > wrote:
> > > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner  
> > > > > > wrote:
> > > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > > > it points to, check if it points to the PMEM that is being removed,
> > > > > grab the page it points to, map that to the relevant struct page,
> > > > > run collect_procs() on that page, then kill the user processes that
> > > > > map that page.
> > > > >
> > > > > So why can't we walk the ptes, check the physical pages that
> > > > > they map to, and if they map to a pmem page we poison that
> > > > > page and that kills any user process that maps it.
> > > > >
> > > > > i.e. I can't see how unexpected pmem device unplug is any different
> > > > > to an MCE delivering a hwpoison event to a DAX mapped page.
> > > >
> > > > I guess the tradeoff is walking a long list of inodes vs walking a
> > > > large array of pages.
> > >
> > > Not really. You're assuming all a filesystem has to do is invalidate
> > > everything if a device goes away, and that's not true. Finding if an
> > > inode has a mapping that spans a specific device in a multi-device
> > > filesystem can be a lot more complex than that. Just walking inodes
> > > is easy - determining which inodes need invalidation is the hard
> > > part.
> >
> > That inode-to-device level of specificity is not needed for the same
> > reason that drop_caches does not need to be specific. If the wrong
> > page is unmapped a re-fault will bring it back, and re-fault will fail
> > for the pages that are successfully removed.
> >
> > > That's where ->corrupt_range() comes in - the filesystem is already
> > > set up to do reverse mapping from physical range to inode(s)
> > > offsets...
> >
> > Sure, but what is the need to get to that level of specificity with
> > the filesystem for something that should rarely happen in the course
> > of normal operation outside of a mistake?
>
> Dan, you made this mistake with the hwpoisoning code that we're
> trying to fix here. Hard coding a 1:1 physical address to
> inode/offset into the DAX mapping was a bad mistake. It's also one
> that should never have occurred because it's *obviously wrong* to
> filesystem developers and has been for a long time.

I admit that mistake. The traditional memory error handling model
assumptions around page->mapping were broken by DAX, I'm not trying to
repeat that mistake. I feel we're talking past each other on the
discussion of the proposals.

> Now we have the filesystem people providing a mechanism for the pmem
> devices to tell the filesystems about physical device failures so
> they can handle such failures correctly themselves. Having the
> device go away unexpectedly from underneath a mounted and active
> filesystem is a *device failure*, not an "unplug event".

It's the same difference to the physical page, all mappings to that
page need to be torn down. I'm happy to call an fs callback and let
each filesystem do what it wants with a "every pfn in this dax device
needs to be unmapped".

I'm looking at the ->corrupted_range() patches trying to map it to
this use case and I don't see how, for example a realtime-xfs over DM
over multiple PMEM gets the notification to the right place.
bd_corrupted_range() uses get_super() which gets the wrong answer for
both realtime-xfs and DM.

I'd flip that arrangement around and have the FS tell the block device
"if something happens to you, here is the super_block to notify". So
to me this looks like a fs_dax_register_super() helper that plumbs the
superblock through an arbitrary stack of block devices to the leaf
block-device that might want to send a notification up when a global
unmap operation needs to be performed.

I naively think that "for_each_inode()
unmap_mapping_range(inode->i_mapping)" is sufficient as a generic
implementation; that does not preclude XFS from overriding that
generic implementation and handling it directly if it so chooses.
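
A minimal sketch of that generic fallback, modelled on the inode walk
in drop_pagecache_sb() (simplified, but the locking and inode
refcounting follow that function):

  static void generic_dax_zap_mappings(struct super_block *sb)
  {
          struct inode *inode, *toput = NULL;

          spin_lock(&sb->s_inode_list_lock);
          list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                  spin_lock(&inode->i_lock);
                  if (!IS_DAX(inode) || (inode->i_state &
                      (I_FREEING | I_WILL_FREE | I_NEW))) {
                          spin_unlock(&inode->i_lock);
                          continue;
                  }
                  __iget(inode);
                  spin_unlock(&inode->i_lock);
                  spin_unlock(&sb->s_inode_list_lock);

                  /* holelen == 0 zaps from offset 0 to end of file;
                   * even_cows == 1 tears down private copies too */
                  unmap_mapping_range(inode->i_mapping, 0, 0, 1);

                  iput(toput);
                  toput = inode;
                  spin_lock(&sb->s_inode_list_lock);
          }
          spin_unlock(&sb->s_inode_list_lock);
          iput(toput);
  }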

> The mistake you made was not understanding how filesystems work,
> nor actually asking filesystem developers what they actually needed.

You're going too far here, but that's off topic.

> You're doing the same thing here - you're telling us what you think
> the solution filesystems need is.

No, I'm not, I'm trying to understand tradeoffs. I apologize if this
is coming across as not listening.

> Please listen when we say "that is
> not sufficient" because we don't want to be backed into a corner
> that we have to fix ourselves again before we can enable some basic
> filesystem functionality that we should have been able to support on
> DAX from the start...

That's some revisionist interpretation of how the discovery of the
reflink+dax+memory-error-handling collision went down.

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dave Chinner
On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote:
> On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner  wrote:
> >
> > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner  wrote:
> > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner  
> > > > > wrote:
> > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > > it points to, check if it points to the PMEM that is being removed,
> > > > grab the page it points to, map that to the relevant struct page,
> > > > run collect_procs() on that page, then kill the user processes that
> > > > map that page.
> > > >
> > > > So why can't we walk the ptes, check the physical pages that
> > > > they map to, and if they map to a pmem page we poison that
> > > > page and that kills any user process that maps it.
> > > >
> > > > i.e. I can't see how unexpected pmem device unplug is any different
> > > > to an MCE delivering a hwpoison event to a DAX mapped page.
> > >
> > > I guess the tradeoff is walking a long list of inodes vs walking a
> > > large array of pages.
> >
> > Not really. You're assuming all a filesystem has to do is invalidate
> > everything if a device goes away, and that's not true. Finding if an
> > inode has a mapping that spans a specific device in a multi-device
> > filesystem can be a lot more complex than that. Just walking inodes
> > is easy - determining which inodes need invalidation is the hard
> > part.
> 
> That inode-to-device level of specificity is not needed for the same
> reason that drop_caches does not need to be specific. If the wrong
> page is unmapped a re-fault will bring it back, and re-fault will fail
> for the pages that are successfully removed.
> 
> > That's where ->corrupt_range() comes in - the filesystem is already
> > set up to do reverse mapping from physical range to inode(s)
> > offsets...
> 
> Sure, but what is the need to get to that level of specificity with
> the filesystem for something that should rarely happen in the course
> of normal operation outside of a mistake?

Dan, you made this mistake with the hwpoisoning code that we're
trying to fix here. Hard coding a 1:1 physical address to
inode/offset into the DAX mapping was a bad mistake. It's also one
that should never have occurred because it's *obviously wrong* to
filesystem developers and has been for a long time.

Now we have the filesystem people providing a mechanism for the pmem
devices to tell the filesystems about physical device failures so
they can handle such failures correctly themselves. Having the
device go away unexpectedly from underneath a mounted and active
filesystem is a *device failure*, not an "unplug event".

The mistake you made was not understanding how filesystems work,
nor actually asking filesystem developers what they actually needed.
You're doing the same thing here - you're telling us what you think
the solution filesystems need is. Please listen when we say "that is
not sufficient" because we don't want to be backed into a corner
that we have to fix ourselves again before we can enable some basic
filesystem functionality that we should have been able to support on
DAX from the start...

> > > There's likely always more pages than inodes, but perhaps it's more
> > > efficient to walk the 'struct page' array than sb->s_inodes?
> >
> > I really don't see why you seem to be telling us that invalidation
> > is an either/or choice. There's more ways to convert physical block
> > address -> inode file offset and mapping index than brute force
> > inode cache walks.
> 
> Yes, but I was trying to map it to an existing mechanism and the
> internals of drop_pagecache_sb() are, in coarse terms, close to what
> needs to happen here.

No.

drop_pagecache_sb() is not a relevant model for telling a filesystem
that the block device underneath has gone away, nor for a device to
ensure that access protections that *are managed by the filesystem*
are enforced/revoked sanely.

drop_pagecache_sb() is a brute-force model for invalidating user
data mappings that the filesystem performs in response to such a
notification. It only needs this brute-force approach if it has no
other way to find active DAX mappings across the range of the device
that has gone away.

But this model doesn't work for direct mapped metadata, journals or
any other internal direct filesystem mappings that aren't referenced
by inodes that the filesystem might be using. The filesystem still
needs to invalidate all those mappings and prevent further access to
them, even from within the kernel itself.

Filesystems are way more complex than pure DAX devices, and hence
handle errors and failure events differently. Unlike DAX devices, we
have both internal and external references to the DAX device, and we
can have both external and internal direct maps.  Invalidating user
data mappings is all dax devices need to do on unplug, 

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dan Williams
On Sun, Feb 28, 2021 at 11:27 PM Yasunori Goto  wrote:
>
> Hello, Dan-san,
>
> On 2021/02/27 4:24, Dan Williams wrote:
> > On Fri, Feb 26, 2021 at 11:05 AM Darrick J. Wong  wrote:
> >>
> >> On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com wrote:
> >>> Hi, guys
> >>>
> >>> Beside this patchset, I'd like to confirm something about the
> >>> "EXPERIMENTAL" tag for dax in XFS.
> >>>
> >>> In XFS, the "EXPERIMENTAL" tag, which is reported in a warning
> >>> message when we mount a pmem device with the dax option, has
> >>> existed for a while.  It's a bit annoying when using the fsdax
> >>> feature.  So, my initial intention was to remove this tag.  And I
> >>> started to find out and solve the problems which prevent it from
> >>> being removed.
> >>>
> >>> As discussed before, there are 3 main problems.  The first one is
> >>> "dax semantics", which has been resolved.  The remaining two are
> >>> "RMAP for fsdax" and "support dax reflink for filesystem", which I
> >>> have been working on.
> >>
> >> 
> >>
> >>> So, what I want to confirm is: does it mean that we can remove the
> >>> "EXPERIMENTAL" tag when the remaining two problems are solved?
> >>
> >> Yes.  I'd keep the experimental tag for a cycle or two to make sure that
> >> nothing new pops up, but otherwise the two patchsets you've sent close
> >> those two big remaining gaps.  Thank you for working on this!
> >>
> >>> Or maybe there are other important problems that need to be fixed
> >>> before removing it?  If there are, could you please show them to me?
> >>
> >> That remains to be seen through QA/validation, but I think that's it.
> >>
> >> Granted, I still have to read through the two patchsets...
> >
> > I've been meaning to circle back here as well.
> >
> > My immediate concern is the issue Jason recently highlighted [1] with
> > respect to invalidating all dax mappings when / if the device is
> > ripped out from underneath the fs. I don't think that will collide
> > with Ruan's implementation, but it does need new communication from
> > driver to fs about removal events.
> >
> > [1]: 
> > http://lore.kernel.org/r/capcyv4i+pzhyziepf2pah0dt5jdfkmkdx-3usqy1fahf6lp...@mail.gmail.com
> >
>
> I'm not sure yet why there is a race condition between the unbinding
> operation and accessing an mmaped file on filesystem dax.
>
> Maybe a silly question, but could you tell me why the "unbinding"
> operation on a namespace which is mounted with filesystem dax must be
> allowed?

The unbind operation is used to switch the mode of a namespace between
fsdax and devdax. There is no way to fail unbind. At most it can be
delayed for a short while to perform cleanup, but it can't be blocked
indefinitely. There is the option to specify 'suppress_bind_attrs' to
the driver to preclude software triggered device removal, but that
would disable mode changes for the device.
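
For reference, this is what opting in to that static knob looks like
in a driver (structure abbreviated for illustration; the real nd_pmem
driver deliberately does not set it, for exactly the reason above):

  static struct nd_device_driver nd_pmem_driver = {
          .probe  = nd_pmem_probe,
          .remove = nd_pmem_remove,
          .drv = {
                  .name = "nd_pmem",
                  /* set at registration time: removes the sysfs
                   * bind/unbind attributes for this driver */
                  .suppress_bind_attrs = true,
          },
  };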

> If "unbinding" is rejected when the filesystem is mounted with dax
> enabled, what is the inconvenience?

It would be interesting (read difficult) to introduce the concept of
dynamic 'suppress_bind_attrs'. Today the decision is static at driver
registration time, not in response to how the device is being used.

I think global invalidation of all inodes that might be affected by a
dax-capable device being ripped away from the filesystem is sufficient
and avoids per-fs enabling, but I'm willing to be convinced that
->corrupted_range() is the proper vehicle for this.

>
> I can imagine that if a device like a usb memory stick is removed
> unexpectedly, the kernel/filesystem needs to reject writeback at that
> time and discard the page cache. So I can understand that the
> unbinding operation is essential for such a case.

For usb the system is protected by the fact that all future block-i/o
submissions to the old block-device will fail, and a new usb-device
being plugged in will get a new block-device. I.e. the old security
model is invalidated / all holes are closed by blk_cleanup_queue().

> But I don't know why a PMEM device/namespace allows an unbinding
> operation that acts like a surprise removal event.

DAX hands out direct mappings to physical pages; if the security
model fronting those physical pages changes, the mappings attained
via the old security model need to be invalidated.
blk_cleanup_queue() does not invalidate DAX mappings.

The practical value of fixing that hole is small given that physical
unplug is not implemented for NVDIMMs today, but the get_user_pages()
path can be optimized if this invalidation is implemented, and future
hotplug-capable NVDIMM buses like CXL will need this.


Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Dan Williams
On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner  wrote:
>
> On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner  wrote:
> > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner  
> > > > wrote:
> > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > it points to, check if it points to the PMEM that is being removed,
> > > grab the page it points to, map that to the relevant struct page,
> > > run collect_procs() on that page, then kill the user processes that
> > > map that page.
> > >
> > > So why can't we walk the ptes, check the physical pages that
> > > they map to, and if they map to a pmem page we poison that
> > > page and that kills any user process that maps it.
> > >
> > > i.e. I can't see how unexpected pmem device unplug is any different
> > > to an MCE delivering a hwpoison event to a DAX mapped page.
> >
> > I guess the tradeoff is walking a long list of inodes vs walking a
> > large array of pages.
>
> Not really. You're assuming all a filesystem has to do is invalidate
> everything if a device goes away, and that's not true. Finding if an
> inode has a mapping that spans a specific device in a multi-device
> filesystem can be a lot more complex than that. Just walking inodes
> is easy - determining which inodes need invalidation is the hard
> part.

That inode-to-device level of specificity is not needed for the same
reason that drop_caches does not need to be specific. If the wrong
page is unmapped a re-fault will bring it back, and re-fault will fail
for the pages that are successfully removed.

> That's where ->corrupt_range() comes in - the filesystem is already
> set up to do reverse mapping from physical range to inode(s)
> offsets...

Sure, but what is the need to get to that level of specificity with
the filesystem for something that should rarely happen in the course
of normal operation outside of a mistake?

>
> > There's likely always more pages than inodes, but perhaps it's more
> > efficient to walk the 'struct page' array than sb->s_inodes?
>
> I really don't see why you seem to be telling us that invalidation
> is an either/or choice. There's more ways to convert physical block
> address -> inode file offset and mapping index than brute force
> inode cache walks.

Yes, but I was trying to map it to an existing mechanism and the
internals of drop_pagecache_sb() are, in coarse terms, close to what
needs to happen here.

>
> .
>
> > > IOWs, what needs to happen at this point is very filesystem
> > > specific. Assuming that "device unplug == filesystem dead" is not
> > > correct, nor is specifying a generic action that assumes the
> > > filesystem is dead because a device it is using went away.
> >
> > Ok, I think I set this discussion in the wrong direction implying any
> > mapping of this action to a "filesystem dead" event. It's just a "zap
> > all ptes" event and upper layers recover from there.
>
> Yes, that's exactly what ->corrupt_range() is intended for. It
> allows the filesystem to lock out access to the bad range
> and then recover the data. Or metadata, if that's where the bad
> range lands. If that recovery fails, it can then report a data
> loss/filesystem shutdown event to userspace and kill user procs that
> span the bad range...
>
> FWIW, is this notification going to occur before or after the device
> has been physically unplugged?

Before. These will be operations that happen in the pmem driver
->remove() callback.
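
In sketch form (the struct fields and helpers here are hypothetical):

  static int pmem_remove(struct device *dev)
  {
          struct pmem_device *pmem = dev_get_drvdata(dev);

          /* tell the fs that every pfn in this device is going away,
           * before the queue and mappings are torn down */
          bd_corrupted_range(pmem->bdev, 0, pmem->size, NULL); /* hypothetical */

          /* then the normal software unbind teardown */
          blk_cleanup_queue(pmem->q);
          return 0;
  }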

> i.e. what do we do about the
> time-of-unplug-to-time-of-invalidation window where userspace can
> still attempt to access the missing pmem though the
> not-yet-invalidated ptes? It may not be likely that people just yank
> pmem nvdimms out of machines, but with NVMe persistent memory
> spaces, there's every chance that someone pulls the wrong device...

The physical removal aspect is only theoretical today; while the pmem
driver has a ->remove() path, it is purely a software unbind
operation. That said, the vulnerability window today is if a process
acquires a dax mapping, the pmem device hosting that filesystem goes
through an unbind / bind cycle, and then a new filesystem is created /
mounted. That old pte may be able to access data that is outside its
intended protection domain.

Going forward, for buses like CXL, there will be a managed physical
remove operation via PCIE native hotplug. The flow there is that the
PCIE hotplug driver will notify the OS of a pending removal, trigger
->remove() on the pmem driver, and then notify the technician (slot
status LED) that the card is safe to pull.


[ndctl PATCH 2/2] ndctl/test: add checking the presence of jq command ahead

2021-03-01 Thread QI Fuli
Due to the lack of the jq command, the result of the test will be
'fail'.  This patch adds a check for the presence of the jq command
ahead of the test.  If there is no jq command in the system, the test
will be marked as 'skip'.

Signed-off-by: QI Fuli 
Link: https://github.com/pmem/ndctl/issues/141
---
 test/daxdev-errors.sh   | 1 +
 test/inject-error.sh| 2 ++
 test/inject-smart.sh| 1 +
 test/label-compat.sh| 1 +
 test/max_available_extent_ns.sh | 1 +
 test/monitor.sh | 2 ++
 test/multi-dax.sh   | 1 +
 test/sector-mode.sh | 2 ++
 8 files changed, 11 insertions(+)

diff --git a/test/daxdev-errors.sh b/test/daxdev-errors.sh
index 6281f32..9547d78 100755
--- a/test/daxdev-errors.sh
+++ b/test/daxdev-errors.sh
@@ -9,6 +9,7 @@ rc=77
 . $(dirname $0)/common

 check_min_kver "4.12" || do_skip "lacks dax dev error handling"
+check_prereq "jq"

 trap 'err $LINENO' ERR

diff --git a/test/inject-error.sh b/test/inject-error.sh
index c636033..7d0b826 100755
--- a/test/inject-error.sh
+++ b/test/inject-error.sh
@@ -11,6 +11,8 @@ err_count=8

 . $(dirname $0)/common

+check_prereq "jq"
+
 trap 'err $LINENO' ERR

 # sample json:
diff --git a/test/inject-smart.sh b/test/inject-smart.sh
index 94705df..4ca83b8 100755
--- a/test/inject-smart.sh
+++ b/test/inject-smart.sh
@@ -166,6 +166,7 @@ do_tests()
 }

 check_min_kver "4.19" || do_skip "kernel $KVER may not support smart 
(un)injection"
+check_prereq "jq"
 modprobe nfit_test
 rc=1

diff --git a/test/label-compat.sh b/test/label-compat.sh
index 340b93d..8ab2858 100755
--- a/test/label-compat.sh
+++ b/test/label-compat.sh
@@ -10,6 +10,7 @@ BASE=$(dirname $0)
 . $BASE/common

 check_min_kver "4.11" || do_skip "may not provide reliable isetcookie values"
+check_prereq "jq"

 trap 'err $LINENO' ERR

diff --git a/test/max_available_extent_ns.sh b/test/max_available_extent_ns.sh
index 14d741d..343f3c9 100755
--- a/test/max_available_extent_ns.sh
+++ b/test/max_available_extent_ns.sh
@@ -9,6 +9,7 @@ rc=77
 trap 'err $LINENO' ERR

 check_min_kver "4.19" || do_skip "kernel $KVER may not support 
max_available_size"
+check_prereq "jq"

 init()
 {
diff --git a/test/monitor.sh b/test/monitor.sh
index cdab5e1..28c5541 100755
--- a/test/monitor.sh
+++ b/test/monitor.sh
@@ -13,6 +13,8 @@ smart_supported_bus=""

 . $(dirname $0)/common

+check_prereq "jq"
+
 trap 'err $LINENO' ERR

 check_min_kver "4.15" || do_skip "kernel $KVER may not support monitor service"
diff --git a/test/multi-dax.sh b/test/multi-dax.sh
index e932569..8496619 100755
--- a/test/multi-dax.sh
+++ b/test/multi-dax.sh
@@ -9,6 +9,7 @@ rc=77
 . $(dirname $0)/common

 check_min_kver "4.13" || do_skip "may lack multi-dax support"
+check_prereq "jq"

 trap 'err $LINENO' ERR

diff --git a/test/sector-mode.sh b/test/sector-mode.sh
index dd7013e..54fa806 100755
--- a/test/sector-mode.sh
+++ b/test/sector-mode.sh
@@ -6,6 +6,8 @@ rc=77

 . $(dirname $0)/common

+check_prereq "jq"
+
 set -e
 trap 'err $LINENO' ERR

--
2.29.2


[ndctl PATCH 1/2] configure: add checking jq command

2021-03-01 Thread QI Fuli
Add a check for the jq command since it is needed to validate tests

Cc: Santosh Sivaraj 
Signed-off-by: QI Fuli 
Link: https://github.com/pmem/ndctl/issues/141
---
 configure.ac | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/configure.ac b/configure.ac
index 5ec8d2f..839836b 100644
--- a/configure.ac
+++ b/configure.ac
@@ -65,6 +65,12 @@ fi
 AC_SUBST([XMLTO])
 fi

+AC_CHECK_PROG(JQ, [jq], [$(which jq)], [missing])
+if test "x$JQ" = xmissing; then
+   AC_MSG_ERROR([jq command needed to validate tests])
+fi
+AC_SUBST([JQ])
+
 AC_C_TYPEOF
 AC_DEFINE([HAVE_STATEMENT_EXPR], 1, [Define to 1 if you have statement expressions.])

--
2.29.2