Re: [Virtio-fs] [PATCH v3 2/3] dax: Add a wakeup mode parameter to put_unlocked_entry()

2021-04-22 Thread Christoph Hellwig
On Wed, Apr 21, 2021 at 12:09:54PM -0700, Dan Williams wrote:
> Can you get in the habit of not replying inline with new patches like
> this? Collect the review feedback, take a pause, and resend the full
> series so tooling like b4 and patchwork can track when a new posting
> supersedes a previous one. As is, this inline style inflicts manual
> effort on the maintainer.

Honestly I don't mind it at all.  If you shiny new tooling can't handle
it maybe you should fix your shiny new tooling instead of changing
everyones workflow?
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v3 00/10] fsdax,xfs: Add reflink support for fsdax

2021-04-02 Thread Christoph Hellwig
Shiyang, Dan:

given that the whole reflink+dax thing is going to take a while and thus
not going to happen for this merge window, what about queueing up the
cleanup patches 1,2 and 3 so that we can reduce the patch load a little?
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v3 02/10] fsdax: Factor helper: dax_fault_actor()

2021-04-02 Thread Christoph Hellwig
> + if (!pmd)
> + return dax_load_hole(xas, mapping, , vmf);
> + else
> + return dax_pmd_load_hole(xas, vmf, iomap, );

> + if (pmd)
> + return vmf_insert_pfn_pmd(vmf, pfn, write);
> + if (write)
> + return vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
> + else
> + return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
> +}

No need for else statements after returning.

Otherwise looks good:

Reviewed-by: Christoph Hellwig 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()

2021-03-24 Thread Christoph Hellwig
On Wed, Mar 24, 2021 at 09:37:01AM -0700, Dan Williams wrote:
> > Eww.  As I said I think the right way is that the file system (or
> > other consumer) can register a set of callbacks for opening the device.
> 
> How does that solve the problem of the driver being notified of all
> pfn failure events?

Ok, I probably just showed I need to spend more time looking at
your proposal vs the actual code..

Don't we have a proper way how one of the nvdimm layers own a
spefific memory range and call directly into that instead of through
a notifier?

> Today pmem only finds out about the ones that are
> notified via native x86 machine check error handling via a notifier
> (yes "firmware-first" error handling fails to do the right thing for
> the pmem driver),

Did any kind of firmware-first error handling ever get anything
right?  I wish people would have learned that by now.

> or the ones that are eventually reported via address
> range scrub, but only for the nvdimms that implement range scrubbing.
> memory_failure() seems a reasonable catch all point to route pfn
> failure events, in an arch independent way, to interested drivers.

Yeah.

> I'm fine swapping out dax_device blocking_notiier chains for your
> proposal, but that does not address all the proposed reworks in my
> list which are:
> 
> - delete "drivers/acpi/nfit/mce.c"
> 
> - teach memory_failure() to be able to communicate range failure
> 
> - enable memory_failure() to defer to a filesystem that can say
> "critical metadata is impacted, no point in trying to do file-by-file
> isolation, bring the whole fs down".

This all sounds sensible.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()

2021-03-24 Thread Christoph Hellwig
On Tue, Mar 23, 2021 at 07:19:28PM -0700, Dan Williams wrote:
> So I think the path forward is:
> 
> - teach memory_failure() to allow for ranged failures
> 
> - let interested drivers register for memory failure events via a
> blocking_notifier_head

Eww.  As I said I think the right way is that the file system (or
other consumer) can register a set of callbacks for opening the device.
I have a series I need to finish and send out to do that for block
devices.  We probably also need the concept of a holder for the dax
device to make it work nicely, as otherwise we're going to have a bit
of a mess.

> This obviously does not solve Dave's desire to get this type of error
> reporting on block_devices, but I think there's nothing stopping a
> parallel notifier chain from being created for block-devices, but
> that's orthogonal to requirements and capabilities provided by
> dax-devices.

FYI, my series could easily accomodate that if we ever get a block
driver that actually could report such errors.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [RESEND PATCH v2.1 07/10] iomap: Introduce iomap_apply2() for operations on two files

2021-03-11 Thread Christoph Hellwig
On Thu, Mar 04, 2021 at 01:41:42PM +0800, Shiyang Ruan wrote:
> Some operations, such as comparing a range of data in two files under
> fsdax mode, requires nested iomap_open()/iomap_end() on two file.  Thus,
> we introduce iomap_apply2() to accept arguments from two files and
> iomap_actor2_t for actions on two files.

I still wonder if adding the iter based iomap API that willy proposed
would be a better fit here.  In that case we might not even need
a special API for the double iteration.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2] include: Remove pagemap.h from blkdev.h

2021-03-10 Thread Christoph Hellwig
On Tue, Mar 09, 2021 at 07:57:47PM +, Matthew Wilcox (Oracle) wrote:
> My UEK-derived config has 1030 files depending on pagemap.h before
> this change.  Afterwards, just 326 files need to be rebuilt when I
> touch pagemap.h.  I think blkdev.h is probably included too widely,
> but untangling that dependency is harder and this solves my problem.
> x86 allmodconfig builds, but there may be implicit include problems
> on other architectures.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
> v2: Fix CONFIG_SWAP=n implicit use of pagemap.h by swap.h.  Increases
> the number of files from 240, but that's still a big win -- 68%
> reduction instead of 77%.

Looks good.  I suspect blkdev.h also has penty of other includes that
aren't needed either..

Reviewed-by: Christoph Hellwig 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2] libnvdimm: Notify disk drivers to revalidate region read-only

2021-03-09 Thread Christoph Hellwig
Looks good to me:

Reviewed-by: Christoph Hellwig 

Question on the pre-existing code: given that nvdimm_check_and_set_ro is
the only caller of set_disk_ro for nvdimm devices, we'll also get
the message when initially setting up any read-only disk.  Is that
intentional?
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH] libnvdimm: Let revalidate_disk() revalidate region read-only

2021-03-08 Thread Christoph Hellwig
On Mon, Mar 08, 2021 at 10:54:22PM -0800, Dan Williams wrote:
> Previous kernels allowed the BLKROSET to override the disk's read-only
> status. With that situation fixed the pmem driver needs to rely on
> revalidate_disk() to clear the disk read-only status after the host
> region has been marked read-write.
> 
> Recall that when libnvdimm determines that the persistent memory has
> lost persistence (for example lack of energy to flush from DRAM to FLASH
> on an NVDIMM-N device) it marks the region read-only, but that state can
> be overridden by the user via:
> 
>echo 0 > /sys/bus/nd/devices/regionX/read_only
> 
> ...to date there is no notification that the region has restored
> persistence, so the user override is the only recovery.

I've just resent my series to kill of ->revalidate_disk for good, so this
obvious makes me a little unhappy.  Given that ->revalidate_disk
only ends up beeing called from the same path that ->open is called,
why can't you just hook this up from the open method?

Also any reason the sysfs attribute can't just directly propagate the
information to the disk?
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v3 02/11] blk: Introduce ->corrupted_range() for block device

2021-03-04 Thread Christoph Hellwig
On Thu, Mar 04, 2021 at 02:42:50PM -0800, Darrick J. Wong wrote:
> My vision here, however, is to establish upcalls for /both/ types of
> stroage.

I already have patches for doing these kinds of callbacks properly
for the block layer. They will be posted shortly.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 09/10] fs/xfs: Handle CoW for fsdax write() path

2021-03-03 Thread Christoph Hellwig
On Wed, Mar 03, 2021 at 09:57:48AM +, ruansy.f...@fujitsu.com wrote:
> > What is the advantage of the ioemap_end handler here?  It adds another
> > indirect funtion call to the fast path, so if we can avoid it, I'd
> > rather do that.
> 
> These code were in xfs_file_dax_write().  I moved them into the iomap_end
> because the mmaped CoW need this.
> 
> I know this is not so good, but I could not find another better way. Do you
> have any ideas? 

mmaped copy is the copy_edge case?  Maybe just use different iomap_ops for
that case vs plain write?
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 05/10] fsdax: Replace mmap entry in case of CoW

2021-03-03 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 05/10] fsdax: Replace mmap entry in case of CoW

2021-03-03 Thread Christoph Hellwig
On Wed, Mar 03, 2021 at 09:41:54AM +, ruansy.f...@fujitsu.com wrote:
> 
> > >
> > >   if (dirty)
> > >   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > 
> > I still think the __mark_inode_dirty should just be moved into the one
> > caller that needs it.
> 
> I found that the dirty flag will be used in the next few lines, so I keep
> this function inside. If I move it outside, the drity flag should be passed
> in as well. 
> 
> @@ -774,6 +780,9 @@ static void *dax_insert_entry(struct xa_state *xas,
>  if (dirty)
>  xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
>  
> +   if (cow)
> +   xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
> +
>  xas_unlock_irq(xas);
>  return entry;
> }
> 
> 
> So, may I ask what's your purpose for doing in that way?

Oh, true.  We can't just move that out as the xas needs to stay
locked.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 09/10] fs/xfs: Handle CoW for fsdax write() path

2021-03-03 Thread Christoph Hellwig
On Fri, Feb 26, 2021 at 08:20:29AM +0800, Shiyang Ruan wrote:
>   error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
> - _buffered_write_iomap_ops);
> +   IS_DAX(VFS_I(ip)) ?
> +   _dax_write_iomap_ops : _buffered_write_iomap_ops);

Please add a xfs_zero_range helper that picks the right iomap_ops
instead of open coding this in a few places.

> +static int
> +xfs_dax_write_iomap_end(
> + struct inode*inode,
> + loff_t  pos,
> + loff_t  length,
> + ssize_t written,
> + unsigned intflags,
> + struct iomap*iomap)
> +{
> + int error = 0;
> + xfs_inode_t *ip = XFS_I(inode);
> +
> + if (pos + written > i_size_read(inode)) {
> + i_size_write(inode, pos + written);
> + error = xfs_setfilesize(ip, pos, written);
> + }
> + if (xfs_is_cow_inode(ip))
> + error = xfs_reflink_end_cow(ip, pos, written);
> +
> + return error;

What is the advantage of the ioemap_end handler here?  It adds another
indirect funtion call to the fast path, so if we can avoid it, I'd
rather do that.

Also, shouldn't we cancel the COW rather than finishing it when setting
the file size fails?
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 06/10] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero

2021-03-03 Thread Christoph Hellwig
On Fri, Feb 26, 2021 at 08:20:26AM +0800, Shiyang Ruan wrote:
> Punch hole on a reflinked file needs dax_copy_edge() too.  Otherwise,
> data in not aligned area will be not correct.  So, add the srcmap to
> dax_iomap_zero() and replace memset() as dax_copy_edge().
> 
> Signed-off-by: Shiyang Ruan 

Looks good,

Reviewed-by: Christoph Hellwig 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 05/10] fsdax: Replace mmap entry in case of CoW

2021-03-03 Thread Christoph Hellwig
On Fri, Feb 26, 2021 at 08:20:25AM +0800, Shiyang Ruan wrote:
> We replace the existing entry to the newly allocated one in case of CoW.
> Also, we mark the entry as PAGECACHE_TAG_TOWRITE so writeback marks this
> entry as writeprotected.  This helps us snapshots so new write
> pagefaults after snapshots trigger a CoW.
> 
> Signed-off-by: Goldwyn Rodrigues 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/dax.c | 37 ++---
>  1 file changed, 26 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 748dfb89fb41..ec4b733e0b59 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -722,6 +722,9 @@ static int copy_cow_page_dax(struct block_device *bdev, 
> struct dax_device *dax_d
>   return 0;
>  }
>  
> +#define DAX_IF_DIRTY (1 << 0)
> +#define DAX_IF_COW   (1 << 1)
> +
>  /*
>   * By this point grab_mapping_entry() has ensured that we have a locked entry
>   * of the appropriate size so we don't have to worry about downgrading PMDs 
> to
> @@ -729,16 +732,19 @@ static int copy_cow_page_dax(struct block_device *bdev, 
> struct dax_device *dax_d
>   * already in the tree, we will skip the insertion and just dirty the PMD as
>   * appropriate.
>   */
> -static void *dax_insert_entry(struct xa_state *xas,
> - struct address_space *mapping, struct vm_fault *vmf,
> - void *entry, pfn_t pfn, unsigned long flags, bool dirty)
> +static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
> + void *entry, pfn_t pfn, unsigned long flags,
> + unsigned int insert_flags)
>  {
> + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
>   void *new_entry = dax_make_entry(pfn, flags);
> + bool dirty = insert_flags & DAX_IF_DIRTY;
> + bool cow = insert_flags & DAX_IF_COW;
>  
>   if (dirty)
>   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

I still think the __mark_inode_dirty should just be moved into the one
caller that needs it.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 04/10] fsdax: Introduce dax_iomap_cow_copy()

2021-03-03 Thread Christoph Hellwig
On Fri, Feb 26, 2021 at 08:20:24AM +0800, Shiyang Ruan wrote:
> + if (!copy_edge) {
> + ret = copy_mc_to_kernel(daddr, saddr, length);
> + return ret;

No need for the ret variable here, this can be:

if (!copy_edge)
return copy_mc_to_kernel(daddr, saddr, length);

Otherwise looks good:

Reviewed-by: Christoph Hellwig 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 02/10] fsdax: Factor helper: dax_fault_actor()

2021-03-03 Thread Christoph Hellwig
On Fri, Feb 26, 2021 at 08:20:22AM +0800, Shiyang Ruan wrote:
> The core logic in the two dax page fault functions is similar. So, move
> the logic into a common helper function. Also, to facilitate the
> addition of new features, such as CoW, switch-case is no longer used to
> handle different iomap types.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/dax.c | 211 ++-
>  1 file changed, 117 insertions(+), 94 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 7031e4302b13..9dea1572868e 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1289,6 +1289,93 @@ static int dax_fault_cow_page(struct vm_fault *vmf, 
> struct iomap *iomap,
>   return 0;
>  }
>  
> +static vm_fault_t dax_fault_insert_pfn(struct vm_fault *vmf, pfn_t pfn,
> + bool pmd, bool write)
> +{
> + vm_fault_t ret;
> +
> + if (!pmd) {
> + struct vm_area_struct *vma = vmf->vma;
> + unsigned long address = vmf->address;
> +
> + if (write)
> + ret = vmf_insert_mixed_mkwrite(vma, address, pfn);
> + else
> + ret = vmf_insert_mixed(vma, address, pfn);
> + } else
> + ret = vmf_insert_pfn_pmd(vmf, pfn, write);

What about simplifying this a little bit more, something like:

if (pmd)
return vmf_insert_pfn_pmd(vmf, pfn, write);

if (write)
return vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
return vmf_insert_mixed(vmf->vma, vmf->address, pfn);

also given that this only has a single user, why not keep open coding
it in the caller?

> +#ifdef CONFIG_FS_DAX_PMD
> +static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault 
> *vmf,
> + struct iomap *iomap, void **entry);
> +#else
> +static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault 
> *vmf,
> + struct iomap *iomap, void **entry)
> +{
> + return VM_FAULT_FALLBACK;
> +}
> +#endif

Can we try to avoid the forward declaration?  Also is there a reason
dax_pmd_load_hole does not compile for the !CONFIG_FS_DAX_PMD case?
If it compiles fine we can just rely on IS_ENABLED() based dead code
elimination entirely.

> + /* if we are reading UNWRITTEN and HOLE, return a hole. */
> + if (!write &&
> + (iomap->type == IOMAP_UNWRITTEN || iomap->type == IOMAP_HOLE)) {
> + if (!pmd)
> + return dax_load_hole(xas, mapping, , vmf);
> + else
> + return dax_pmd_load_hole(xas, vmf, iomap, );
> + }
> +
> + if (iomap->type != IOMAP_MAPPED) {
> + WARN_ON_ONCE(1);
> + return VM_FAULT_SIGBUS;
> + }

Nit: I'd use a switch statement here for a clarity:

switch (iomap->type) {
case IOMAP_MAPPED:
break;
case IOMAP_UNWRITTEN:
case IOMAP_HOLE:
if (!write) {
if (!pmd)
return dax_load_hole(xas, mapping, , vmf);
return dax_pmd_load_hole(xas, vmf, iomap, );
}
break;
default:
WARN_ON_ONCE(1);
return VM_FAULT_SIGBUS;
}


> + err = dax_iomap_pfn(iomap, pos, size, );
> + if (err)
> + goto error_fault;
> +
> + entry = dax_insert_entry(xas, mapping, vmf, entry, pfn, 0,
> +  write && !sync);
> +
> + if (sync)
> + return dax_fault_synchronous_pfnp(pfnp, pfn);
> +
> + ret = dax_fault_insert_pfn(vmf, pfn, pmd, write);
> +
> +error_fault:
> + if (err)
> + ret = dax_fault_return(err);
> +
> + return ret;

It seems like the only place that sets err is the dax_iomap_pfn case
above.  So I'd move the dax_fault_return there, which then allows a direct
return for everyone else, including the open coded version of
dax_fault_insert_pfn.

I really like where this is going!
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 01/10] fsdax: Factor helpers to simplify dax fault code

2021-03-03 Thread Christoph Hellwig
On Fri, Feb 26, 2021 at 08:20:21AM +0800, Shiyang Ruan wrote:
> The dax page fault code is too long and a bit difficult to read. And it
> is hard to understand when we trying to add new features. Some of the
> PTE/PMD codes have similar logic. So, factor them as helper functions to
> simplify the code.
> 
> Signed-off-by: Shiyang Ruan 

Looks good,

Reviewed-by: Christoph Hellwig 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 5/7] fsdax: Dedup file range to use a compare function

2021-02-24 Thread Christoph Hellwig
On Thu, Feb 18, 2021 at 08:20:18AM -0800, Darrick J. Wong wrote:
> > I think a nested call like this is necessary.  That's why I use the open
> > code way.
> 
> This might be a good place to implement an iomap_apply2() loop that
> actually /does/ walk all the extents of file1 and file2.  There's now
> two users of this idiom.

Why do we need a special helper for that?

> (Possibly structured as a "get next mappings from both" generator
> function like Matthew Wilcox keeps asking for. :))

OTOH this might be a good first use for that.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v3 05/11] mm, fsdax: Refactor memory-failure handler for dax mapping

2021-02-18 Thread Christoph Hellwig
On Wed, Feb 17, 2021 at 10:56:11AM +0800, Ruan Shiyang wrote:
> I'd like to confirm one thing...  I have checked all of this patchset by 
> checkpatch.pl and it did not report the overly long line warning.  So, I 
> should still obey the rule of 80 chars one line?

checkpatch.pl is completely broken, I would not rely on it.

Here is the quote from the coding style document:

"The preferred limit on the length of a single line is 80 columns.

Statements longer than 80 columns should be broken into sensible chunks,
unless exceeding 80 columns significantly increases readability and does
not hide information."
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v3 10/11] xfs: Implement ->corrupted_range() for XFS

2021-02-10 Thread Christoph Hellwig


> + if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> + (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + // TODO check and try to fix metadata
> + rc = -EFSCORRUPTED;
> + xfs_force_shutdown(cur->bc_mp, SHUTDOWN_CORRUPT_META);

Just return early here so that we can avoid the else later.

> + /*
> +  * Get files that incore, filter out others that are not in use.
> +  */
> + rc = xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner,
> +   XFS_IGET_INCORE, 0, );

Can we rename rc to error?

> + if (rc || !ip)
> + return rc;

No need to check for ip here.

> + if (!VFS_I(ip)->i_mapping)
> + goto out;

This can't happen either.

> +
> + mapping = VFS_I(ip)->i_mapping;
> + if (IS_DAX(VFS_I(ip)))
> + rc = mf_dax_mapping_kill_procs(mapping, rec->rm_offset,
> +*flags);
> + else {
> + rc = -EIO;
> + mapping_set_error(mapping, rc);
> + }

By passing the method directly to the DAX device we should never get
this called for the non-DAX case.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v3 06/11] mm, pmem: Implement ->memory_failure() in pmem driver

2021-02-10 Thread Christoph Hellwig
> +static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
> + unsigned long pfn, int flags)
> +{
> + struct pmem_device *pdev;
> + struct gendisk *disk;
> + loff_t disk_offset;
> + int rc = 0;
> + unsigned long size = page_size(pfn_to_page(pfn));
> +
> + pdev = container_of(pgmap, struct pmem_device, pgmap);
> + disk = pdev->disk;

Would be nice to initialize this at the time of declaration:

struct pmem_device *pdev =
container_of(pgmap, struct pmem_device, pgmap);
struct gendisk *disk = pdev->disk
unsigned long size = page_size(pfn_to_page(pfn));

> + if (!disk)
> + return -ENXIO;
> +
> + disk_offset = PFN_PHYS(pfn) - pdev->phys_addr - pdev->data_offset;
> + if (disk->fops->corrupted_range) {
> + rc = disk->fops->corrupted_range(disk, NULL, disk_offset,
> +  size, );
> + if (rc == -ENODEV)
> + rc = -ENXIO;
> + } else
> + rc = -EOPNOTSUPP;

Why do we need the disk and fops check here? A pgmap registered by pmem.c
should always have a disk with pmem_fops.  And more importantly this
has no business going through the block layer.

Instead the file system should deposit a callback when starting to use
the dax_device using fs_dax_get_by_bdev / dax_get_by_host and a private
data (the superblock), and we avoid all the lookup problems.

> +int mf_generic_kill_procs(unsigned long long pfn, int flags)

This function seems to be only used inside of memory-failure.c, so it
could be marked static.  Also I'd name it dax_generic_memory_failure
or something like that to match the naming of the ->memory_failure
pgmap operation.

Also maybe just splitting this out into a helper would be a nice prep
patch.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v3 05/11] mm, fsdax: Refactor memory-failure handler for dax mapping

2021-02-10 Thread Christoph Hellwig
> +extern int mf_dax_mapping_kill_procs(struct address_space *mapping, pgoff_t 
> index, int flags);

No nee for the extern, please avoid the overly long line.

> @@ -120,6 +121,13 @@ static int hwpoison_filter_dev(struct page *p)
>   if (PageSlab(p))
>   return -EINVAL;
>  
> + if (pfn_valid(page_to_pfn(p))) {
> + if (is_device_fsdax_page(p))
> + return 0;
> + else
> + return -EINVAL;
> + }
> +

This looks odd.  For one there is no need for an else after a return.
But more importantly page_mapping() as called below pretty much assumes
a valid PFN, so something is fishy in this function.

> + if (is_zone_device_page(p)) {
> + if (is_device_fsdax_page(p))
> + tk->addr = vma->vm_start +
> + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);

The arithmetics here scream for a common helper, I suspect there might
be other places that could use the same helper.

> +int mf_dax_mapping_kill_procs(struct address_space *mapping, pgoff_t index, 
> int flags)

Overly long line.  Also the naming scheme with the mf_ seems rather
unusual. Maybe dax_kill_mapping_procs?  Also please add a kerneldoc
comment describing the function given that it exported.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v3 02/11] blk: Introduce ->corrupted_range() for block device

2021-02-10 Thread Christoph Hellwig
On Mon, Feb 08, 2021 at 06:55:21PM +0800, Shiyang Ruan wrote:
> In fsdax mode, the memory failure happens on block device.  So, it is
> needed to introduce an interface for block devices.  Each kind of block
> device can handle the memory failure in ther own ways.

As told before: DAX operations please do not add anything to the block
device.  We've been working very hard to decouple DAX from the block
device, and while we're not done regressing the split should not happen.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v3 01/11] pagemap: Introduce ->memory_failure()

2021-02-10 Thread Christoph Hellwig
On Mon, Feb 08, 2021 at 06:55:20PM +0800, Shiyang Ruan wrote:
> When memory-failure occurs, we call this function which is implemented
> by each kind of devices.  For the fsdax case, pmem device driver
> implements it.  Pmem device driver will find out the block device where
> the error page locates in, and try to get the filesystem on this block
> device.  And finally call filesystem handler to deal with the error.
> The filesystem will try to recover the corrupted data if possiable.

I'm not sure adding just a method without any of the support code
is a useful patch on its own.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 5/7] fsdax: Dedup file range to use a compare function

2021-02-10 Thread Christoph Hellwig
On Tue, Feb 09, 2021 at 05:46:13PM +0800, Ruan Shiyang wrote:
>
>
> On 2021/2/9 下午5:34, Christoph Hellwig wrote:
>> On Tue, Feb 09, 2021 at 05:15:13PM +0800, Ruan Shiyang wrote:
>>> The dax dedupe comparison need the iomap_ops pointer as argument, so my
>>> understanding is that we don't modify the argument list of
>>> generic_remap_file_range_prep(), but move its code into
>>> __generic_remap_file_range_prep() whose argument list can be modified to
>>> accepts the iomap_ops pointer.  Then it looks like this:
>>
>> I'd say just add the iomap_ops pointer to
>> generic_remap_file_range_prep and do away with the extra wrappers.  We
>> only have three callers anyway.
>
> OK.

So looking at this again I think your proposal actaully is better,
given that the iomap variant is still DAX specific.  Sorry for
the noise.

Also I think dax_file_range_compare should use iomap_apply instead
of open coding it.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v3 11/11] fs/dax: Remove useless functions

2021-02-10 Thread Christoph Hellwig
On Mon, Feb 08, 2021 at 06:55:30PM +0800, Shiyang Ruan wrote:
> Since owner tarcking is triggerred by pmem device, these functions are

s/tarcking/tracking/

> useless.  So remove them.

Note that this patch does not apply for me when applying your two series
on top of 5.11-rc5.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 5/7] fsdax: Dedup file range to use a compare function

2021-02-09 Thread Christoph Hellwig
On Tue, Feb 09, 2021 at 05:15:13PM +0800, Ruan Shiyang wrote:
> The dax dedupe comparison need the iomap_ops pointer as argument, so my 
> understanding is that we don't modify the argument list of 
> generic_remap_file_range_prep(), but move its code into 
> __generic_remap_file_range_prep() whose argument list can be modified to 
> accepts the iomap_ops pointer.  Then it looks like this:

I'd say just add the iomap_ops pointer to
generic_remap_file_range_prep and do away with the extra wrappers.  We
only have three callers anyway.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 6/7] fs/xfs: Handle CoW for fsdax write() path

2021-02-08 Thread Christoph Hellwig
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -977,10 +977,14 @@ xfs_free_file_space(
>   if (offset + len > XFS_ISIZE(ip))
>   len = XFS_ISIZE(ip) - offset;
>   error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
> - _buffered_write_iomap_ops);
> +   IS_DAX(VFS_I(ip)) ?
> +   _direct_write_iomap_ops : _buffered_write_iomap_ops);
>   if (error)
>   return error;
> + if (xfs_is_reflink_inode(ip))
> + xfs_reflink_end_cow(ip, offset, len);

Maybe we need to add (back) and xfs_zero_range helper that encapsulates
the details?

>   trace_xfs_file_dax_write(ip, count, pos);
>   ret = dax_iomap_rw(iocb, from, _direct_write_iomap_ops);
> - if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
> - i_size_write(inode, iocb->ki_pos);
> - error = xfs_setfilesize(ip, pos, ret);
> + if (ret > 0) {
> + if (iocb->ki_pos > i_size_read(inode)) {
> + i_size_write(inode, iocb->ki_pos);
> + error = xfs_setfilesize(ip, pos, ret);
> + }
> + if (xfs_is_cow_inode(ip))
> + xfs_reflink_end_cow(ip, pos, ret);

Nitpick, but I'd just goto out for ret <= 0 to reduce the indentation a
bit.

>   }
>  out:
>   xfs_iunlock(ip, iolock);
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 7b9ff824e82d..d6d4cc0f084e 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -765,13 +765,14 @@ xfs_direct_write_iomap_begin(
>   goto out_unlock;
>  
>   if (imap_needs_cow(ip, flags, , nimaps)) {
> + bool need_convert = flags & IOMAP_DIRECT || IS_DAX(inode);
>   error = -EAGAIN;
>   if (flags & IOMAP_NOWAIT)
>   goto out_unlock;
>  
>   /* may drop and re-acquire the ilock */
>   error = xfs_reflink_allocate_cow(ip, , , ,
> - , flags & IOMAP_DIRECT);
> + , need_convert);

Why not:

error = xfs_reflink_allocate_cow(ip, , , ,
,
(flags & IOMAP_DIRECT) || IS_DAX(inode));

?
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 5/7] fsdax: Dedup file range to use a compare function

2021-02-08 Thread Christoph Hellwig
On Mon, Feb 08, 2021 at 01:09:22AM +0800, Shiyang Ruan wrote:
> With dax we cannot deal with readpage() etc. So, we create a
> funciton callback to perform the file data comparison and pass

s/funciton/function/g

> +#define MIN(a, b) (((a) < (b)) ? (a) : (b))

This should use the existing min or min_t helpers.


>  int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> -   loff_t *len, unsigned int remap_flags)
> +   loff_t *len, unsigned int remap_flags,
> +   compare_range_t compare_range_fn)

Can we keep generic_remap_file_range_prep as-is, and add a new
dax_remap_file_range_prep, both sharing a low-level
__generic_remap_file_range_prep implementation?  And for that
implementation I'd also go for classic if/else instead of the function
pointer.

> +extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> +  struct inode *dest, loff_t destoff,
> +  loff_t len, bool *is_same);

no need for the extern.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 4/7] fsdax: Replace mmap entry in case of CoW

2021-02-08 Thread Christoph Hellwig
>  static void *dax_insert_entry(struct xa_state *xas,
>   struct address_space *mapping, struct vm_fault *vmf,
> - void *entry, pfn_t pfn, unsigned long flags, bool dirty)
> + void *entry, pfn_t pfn, unsigned long flags, bool insert_flags)
>  {
>   void *new_entry = dax_make_entry(pfn, flags);
> + bool dirty = insert_flags & DAX_IF_DIRTY;
> + bool cow = insert_flags & DAX_IF_COW;
>  
>   if (dirty)
>   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

Can we just move the __mark_inode_dirty to the one caller that needs it
in a prep patch?
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 3/7] fsdax: Copy data before write

2021-02-08 Thread Christoph Hellwig
>   switch (iomap.type) {
>   case IOMAP_MAPPED:
> +cow:
>   if (iomap.flags & IOMAP_F_NEW) {
>   count_vm_event(PGMAJFAULT);
>   count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
>   major = VM_FAULT_MAJOR;
>   }
>   error = dax_iomap_direct_access(, pos, PAGE_SIZE,
> - NULL, );
> + , );

Any chance you could look into factoring out this code into a helper
to avoid the goto magic, which is a little too convoluted?

>   switch (iomap.type) {
>   case IOMAP_MAPPED:
> +cow:
>   error = dax_iomap_direct_access(, pos, PMD_SIZE,
> - NULL, );
> + , );
>   if (error < 0)
>   goto finish_iomap;
>  
>   entry = dax_insert_entry(, mapping, vmf, entry, pfn,
>   DAX_PMD, write && !sync);
>  
> + if (srcmap.type != IOMAP_HOLE) {

Same here.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 2/7] fsdax: Introduce dax_copy_edges() for CoW

2021-02-08 Thread Christoph Hellwig
On Mon, Feb 08, 2021 at 01:09:19AM +0800, Shiyang Ruan wrote:
> dax_copy_edges() is a helper functions performs a copy from one part of
> the device to another for data not page aligned.

The function looks good to me, but adding a static function without a
user is not very bisection friendly.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 1/7] fsdax: Output address in dax_iomap_pfn() and rename it

2021-02-08 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 03/14] cxl/mem: Find device capabilities

2021-02-03 Thread Christoph Hellwig
On Wed, Feb 03, 2021 at 01:23:31PM -0800, Dan Williams wrote:
> > I'd prefer to keep the helpers for now as I do find them helpful, and so far
> > nobody else who has touched the code has complained. If you feel strongly, I
> > will change it.
> 
> After seeing the options, I think I'd prefer to not have to worry what
> extra magic is happening with cxl_read_mbox_reg32()
> 
> cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> 
> readl(cxlm->mbox_regs + CXLDEV_MB_CAPS_OFFSET);
> 
> The latter is both shorter and more idiomatic.

Same here.  That being said I know some driver maintainers like
wrappers, my real main irk was the macro magic to generate them.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 03/14] cxl/mem: Find device capabilities

2021-02-03 Thread Christoph Hellwig
On Tue, Feb 02, 2021 at 10:24:18AM -0800, Ben Widawsky wrote:
> > > + /* Cap 4000h - CXL_CAP_CAP_ID_MEMDEV */
> > > + struct {
> > > + void __iomem *regs;
> > > + } mem;
> > 
> > This style looks massively obsfucated.  For one the comments look like
> > absolute gibberish, but also what is the point of all these anonymous
> > structures?
> 
> They're not anonymous, and their names are for the below register functions. 
> The
> comments are connected spec reference 'Cap h' to definitions in cxl.h. I 
> can
> articulate that if it helps.

But why no simply a

void __iomem *mem_regs;

field vs the extra struct?

> The register space for CXL devices is a bit weird since it's all subdivided
> under 1 BAR for now. To clearly distinguish over the different subregions, 
> these
> helpers exist. It's really easy to mess this up as the developer and I 
> actually
> would disagree that it makes debugging quite a bit easier. It also gets more
> convoluted when you add the other 2 BARs which also each have their own
> subregions.
> 
> For example. if my mailbox function does:
> cxl_read_status_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> 
> instead of:
> cxl_read_mbox_reg32(cxlm, CXLDEV_MB_CAPS_OFFSET);
> 
> It's easier to spot than:
> readl(cxlm->regs + cxlm->status_offset + CXLDEV_MB_CAPS_OFFSET)

Well, what I think would be the most obvious is:

readl(cxlm->status_regs + CXLDEV_MB_CAPS_OFFSET);

> > > + /* 8.2.8.4.3 */
> > 
> > 
> > 
> 
> I had been trying to be consistent with 'CXL2.0 - ' in front of all spec
> reference. I obviously missed this one.

FYI, I generally find section names much easier to find than section
numbers.  Especially as the numbers change very frequently, some times
even for very minor updates to the spec.  E.g. in NVMe the numbers might
even change from NVMe 1.X to NVMe 1.Xa because an errata had to add
a clarification as its own section.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 02/14] cxl/mem: Map memory device registers

2021-02-03 Thread Christoph Hellwig
On Tue, Feb 02, 2021 at 10:31:51AM -0800, Ben Widawsky wrote:
> > > + if (reg_type == CXL_REGLOC_RBI_MEMDEV) {
> > > + rc = 0;
> > > + cxlm = cxl_mem_create(pdev, reg_lo, reg_hi);
> > > + if (!cxlm)
> > > + rc = -ENODEV;
> > > + break;
> > 
> > And given that we're going to grow more types eventually, why not start
> > out with a switch here?  Also why return the structure when nothing
> > uses it?
> 
>  We've (Intel) already started working on the libnvdimm integration which does
>  change this around a bit. In order to go with what's best tested though, I've
>  chosen to use this as is for merge. Many different people not just in Intel
>  have tested these codepaths. The resulting code moves the check on register
>  type out of this function entirely.
> 
>  If you'd like me to make it a switch, I can, but it's going to be extracted
>  later anyway.

This was just a suggestion.  No hard feelings, it's just that the code
looks a little odd to me.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 06/14] cxl/mem: Add basic IOCTL interface

2021-02-02 Thread Christoph Hellwig
> +#if defined(__cplusplus)
> +extern "C" {
> +#endif

This has no business in a kernel header.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 05/14] cxl/mem: Register CXL memX devices

2021-02-02 Thread Christoph Hellwig
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 25e08e5f40bd..33432a4cbe23 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -3179,6 +3179,20 @@ struct device *get_device(struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(get_device);
>  
> +/**
> + * get_live_device() - increment reference count for device iff !dead
> + * @dev: device.
> + *
> + * Forward the call to get_device() if the device is still alive. If
> + * this is called with the device_lock() held then the device is
> + * guaranteed to not die until the device_lock() is dropped.
> + */
> +struct device *get_live_device(struct device *dev)
> +{
> + return dev && !dev->p->dead ? get_device(dev) : NULL;
> +}
> +EXPORT_SYMBOL_GPL(get_live_device);

Err, if you want to add new core functionality that needs to be in a
separate well documented prep patch, and also CCed to the relevant
maintainers.

>   mutex_unlock(>mbox.mutex);
>  }
>  
> +static int cxl_memdev_open(struct inode *inode, struct file *file)
> +{
> + struct cxl_memdev *cxlmd =
> + container_of(inode->i_cdev, typeof(*cxlmd), cdev);
> +
> + file->private_data = cxlmd;

There is no good reason to ever mirror stuff from the inode into
file->private_data, as you can just trivially get at the original
location using file_inode(file).
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 03/14] cxl/mem: Find device capabilities

2021-02-02 Thread Christoph Hellwig
On Fri, Jan 29, 2021 at 04:24:27PM -0800, Ben Widawsky wrote:
>  #ifndef __CXL_H__
>  #define __CXL_H__
>  
> +#include 
> +#include 
> +#include 
> +
> +#define CXL_SET_FIELD(value, field)  
>   \
> + ({ \
> + WARN_ON(!FIELD_FIT(field##_MASK, value));  \
> + FIELD_PREP(field##_MASK, value);   \
> + })
> +
> +#define CXL_GET_FIELD(word, field) FIELD_GET(field##_MASK, word)

This looks like some massive obsfucation.  What is the intent
here?

> + /* Cap 0001h - CXL_CAP_CAP_ID_DEVICE_STATUS */
> + struct {
> + void __iomem *regs;
> + } status;
> +
> + /* Cap 0002h - CXL_CAP_CAP_ID_PRIMARY_MAILBOX */
> + struct {
> + void __iomem *regs;
> + size_t payload_size;
> + } mbox;
> +
> + /* Cap 4000h - CXL_CAP_CAP_ID_MEMDEV */
> + struct {
> + void __iomem *regs;
> + } mem;

This style looks massively obsfucated.  For one the comments look like
absolute gibberish, but also what is the point of all these anonymous
structures?

> +#define cxl_reg(type)
>   \
> + static inline void cxl_write_##type##_reg32(struct cxl_mem *cxlm,  \
> + u32 reg, u32 value)\
> + {  \
> + void __iomem *reg_addr = cxlm->type.regs;  \
> + writel(value, reg_addr + reg); \
> + }  \
> + static inline void cxl_write_##type##_reg64(struct cxl_mem *cxlm,  \
> + u32 reg, u64 value)\
> + {  \
> + void __iomem *reg_addr = cxlm->type.regs;  \
> + writeq(value, reg_addr + reg); \
> + }  \

What is the value add of all this obsfucation over the trivial
calls to the write*/read* functions, possible with a locally
declarate "void __iomem *" variable in the callers like in all
normall drivers?  Except for making the life of the poor soul trying
to debug this code some time in the future really hard, of course.

> + /* 8.2.8.4.3 */


___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 02/14] cxl/mem: Map memory device registers

2021-02-02 Thread Christoph Hellwig
Any reason not to merge a bunch of patches?  Both this one and
the previous one are rather useless on their own, making review
harder than necessary.

> + * cxl_mem_create() - Create a new  cxl_mem.
> + * @pdev: The pci device associated with the new  cxl_mem.
> + * @reg_lo: Lower 32b of the register locator
> + * @reg_hi: Upper 32b of the register locator.
> + *
> + * Return: The new  cxl_mem on success, NULL on failure.
> + *
> + * Map the BAR for a CXL memory device. This BAR has the memory device's
> + * registers for the device as specified in CXL specification.
> + */

A lot of text with almost no value over just reading the function.
What's that fetish with kerneldoc comments for trivial static functions?

> + reg_type =
> + (reg_lo >> CXL_REGLOC_RBI_SHIFT) & CXL_REGLOC_RBI_MASK;

OTOH this screams for a helper that would make the code a lot more
self documenting.

> + if (reg_type == CXL_REGLOC_RBI_MEMDEV) {
> + rc = 0;
> + cxlm = cxl_mem_create(pdev, reg_lo, reg_hi);
> + if (!cxlm)
> + rc = -ENODEV;
> + break;

And given that we're going to grow more types eventually, why not start
out with a switch here?  Also why return the structure when nothing
uses it?
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 01/14] cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints

2021-02-02 Thread Christoph Hellwig
> +static int cxl_mem_dvsec(struct pci_dev *pdev, int dvsec)
> +{
> + int pos;
> +
> + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_DVSEC);
> + if (!pos)
> + return 0;
> +
> + while (pos) {
> + u16 vendor, id;
> +
> + pci_read_config_word(pdev, pos + PCI_DVSEC_VENDOR_ID_OFFSET,
> +  );
> + pci_read_config_word(pdev, pos + PCI_DVSEC_ID_OFFSET, );
> + if (vendor == PCI_DVSEC_VENDOR_ID_CXL && dvsec == id)
> + return pos;
> +
> + pos = pci_find_next_ext_capability(pdev, pos,
> +PCI_EXT_CAP_ID_DVSEC);
> + }
> +
> + return 0;
> +}

There should be a patch with a generic version of this find a vendor
specific capability helper on linux-pci.

> +#define PCI_CLASS_MEMORY_CXL 0x050210

I think this should go into linux/pci_ids.h.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 01/14] cxl/mem: Introduce a driver for CXL-2.0-Type-3 endpoints

2021-02-02 Thread Christoph Hellwig
On Mon, Feb 01, 2021 at 12:34:11PM -0500, Konrad Rzeszutek Wilk wrote:
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Copyright(c) 2020 Intel Corporation. All rights reserved. */
> 
> Can those two comments have the same type? As in either
> stay with // or do /*.

No.  // is only intended for the SPDX comments specifically so that
they stand out.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 1/3] cdev: Finish the cdev api with queued mode support

2021-01-20 Thread Christoph Hellwig
The subject doesn't make any sense to me.

But thn again queued sound really weird.  You just have a managed
API with a refcount and synchronization, right?

procfs and debugfs already support these kind of managed ops, kinda sad
to duplicate this concept yet another time.

> +static long cdev_queued_ioctl(struct file *file, unsigned int cmd, unsigned 
> long arg)

Overly long line.

> +__must_check int __cdev_register_queued(struct cdev *cdev, struct module 
> *owner,
> + dev_t dev, unsigned count,
> + const struct cdev_operations *qops)
> +{
> + int rc;
> +
> + if (!qops->ioctl || !owner)
> + return -EINVAL;

Why is the ioctl method mandatory?
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 03/10] fs: Introduce ->corrupted_range() for superblock

2021-01-08 Thread Christoph Hellwig
On Thu, Dec 31, 2020 at 12:55:54AM +0800, Shiyang Ruan wrote:
> Memory failure occurs in fsdax mode will finally be handled in
> filesystem.  We introduce this interface to find out files or metadata
> affected by the corrupted range, and try to recover the corrupted data
> if possiable.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  include/linux/fs.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 8667d0cdc71e..282e2139b23e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1965,6 +1965,8 @@ struct super_operations {
> struct shrink_control *);
>   long (*free_cached_objects)(struct super_block *,
>   struct shrink_control *);
> + int (*corrupted_range)(struct super_block *sb, struct block_device 
> *bdev,

This adds an overly long line.  But more importantly it must work on
the dax device and not the block device.  I'd also structure the callback
so that it is called on the dax device only, with the file system storing
the super block in a private data member.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 02/10] blk: Introduce ->corrupted_range() for block device

2021-01-08 Thread Christoph Hellwig
It happens on a dax_device.  We should not interwind dax and block_device
even more after a lot of good work has happened to detangle them.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH RFC 9/9] mm: Add follow_devmap_page() for devdax vmas

2020-12-09 Thread Christoph Hellwig
On Tue, Dec 08, 2020 at 03:57:54PM -0400, Jason Gunthorpe wrote:
> What we've talked about is changing the calling convention across all
> of this to something like:
> 
> struct gup_output {
>struct page **cur;
>struct page **end;
>unsigned long vaddr;
>[..]
> }
> 
> And making the manipulator like you saw for GUP common:
> 
> gup_output_single_page()
> gup_output_pages()
> 
> Then putting this eveywhere. This is the pattern that we ended up with
> in hmm_range_fault, and it seems to be working quite well.
> 
> fast/slow should be much more symmetric in code than they are today,
> IMHO.. I think those differences mainly exist because it used to be
> siloed in arch code. Some of the differences might be bugs, we've seen
> that a few times at least..

something like this:

http://git.infradead.org/users/hch/misc.git/commitdiff/c3d019802dbde5a4cc4160e7ec8ccba479b19f97

from this old and not fully working series:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/gup-bvec
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v3 3/6] mm: support THP migration to device private memory

2020-12-02 Thread Christoph Hellwig
[adding a few of the usual suspects]

On Wed, Nov 11, 2020 at 03:38:42PM -0800, Ralph Campbell wrote:
> There are 4 types of ZONE_DEVICE struct pages:
> MEMORY_DEVICE_PRIVATE, MEMORY_DEVICE_FS_DAX, MEMORY_DEVICE_GENERIC, and
> MEMORY_DEVICE_PCI_P2PDMA.
>
> Currently, memremap_pages() allocates struct pages for a physical address 
> range
> with a page_ref_count(page) of one and increments the pgmap->ref per CPU
> reference count by the number of pages created since each ZONE_DEVICE struct
> page has a pointer to the pgmap.
>
> The struct pages are not freed until memunmap_pages() is called which
> calls put_page() which calls put_dev_pagemap() which releases a reference to
> pgmap->ref. memunmap_pages() blocks waiting for pgmap->ref reference count
> to be zero. As far as I can tell, the put_page() in memunmap_pages() has to
> be the *last* put_page() (see MEMORY_DEVICE_PCI_P2PDMA).
> My RFC [1] breaks this put_page() -> put_dev_pagemap() connection so that
> the struct page reference count can go to zero and back to non-zero without
> changing the pgmap->ref reference count.
>
> Q1: Is that safe? Is there some code that depends on put_page() dropping
> the pgmap->ref reference count as part of memunmap_pages()?
> My testing of [1] seems OK but I'm sure there are lots of cases I didn't test.

It should be safe, but the audit you've done is important to make sure
we do not miss anything important.

> MEMORY_DEVICE_PCI_P2PDMA:
> Struct pages are created in pci_p2pdma_add_resource() and represent device
> memory accessible by PCIe bar address space. Memory is allocated with
> pci_alloc_p2pmem() based on a byte length but the gen_pool_alloc_owner()
> call will allocate memory in a minimum of PAGE_SIZE units.
> Reference counting is +1 per *allocation* on the pgmap->ref reference count.
> Note that this is not +1 per page which is what put_page() expects. So
> currently, a get_page()/put_page() works OK because the page reference count
> only goes 1->2 and 2->1. If it went to zero, the pgmap->ref reference count
> would be incorrect if the allocation size was greater than one page.
>
> I see pci_alloc_p2pmem() is called by nvme_alloc_sq_cmds() and
> pci_p2pmem_alloc_sgl() to create a command queue and a struct scatterlist *.
> Looks like sg_page(sg) returns the ZONE_DEVICE struct page of the scatterlist.
> There are a huge number of places sg_page() is called so it is hard to tell
> whether or not get_page()/put_page() is ever called on 
> MEMORY_DEVICE_PCI_P2PDMA
> pages.

Nothing should call get_page/put_page on them, as they are not treated
as refcountable memory.  More importantly nothing is allowed to keep
a reference longer than the time of the I/O.

> pci_p2pmem_virt_to_bus() will return the physical address and I guess
> pfn_to_page(physaddr >> PAGE_SHIFT) could return the struct page.
>
> Since there is a clear allocation/free, pci_alloc_p2pmem() can probably be
> modified to increment/decrement the MEMORY_DEVICE_PCI_P2PDMA struct page
> reference count. Or maybe just leave it at one like it is now.

And yes, doing that is probably a sensible safe guard.

> MEMORY_DEVICE_FS_DAX:
> Struct pages are created in pmem_attach_disk() and virtio_fs_setup_dax() with
> an initial reference count of one.
> The problem I see is that there are 3 states that are important:
> a) memory is free and not allocated to any file (page_ref_count() == 0).
> b) memory is allocated to a file and in the page cache (page_ref_count() == 
> 1).
> c) some gup() or I/O has a reference even after calling unmap_mapping_pages()
>(page_ref_count() > 1). ext4_break_layouts() basically waits until the
>page_ref_count() == 1 with put_page() calling wake_up_var(>_refcount)
>to wake up ext4_break_layouts().
> The current code doesn't seem to distinguish (a) and (b). If we want to use
> the 0->1 reference count to signal (c), then the page cache would have hold
> entries with a page_ref_count() == 0 which doesn't match the general page 
> cache

I think the sensible model here is to grab a reference when it is
added to the page cache.  That is exactly how normal system memory pages
work.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 2/2] mm: simplify follow_pte{,pmd}

2020-11-11 Thread Christoph Hellwig
On Wed, Nov 11, 2020 at 09:26:20AM +0100, Christian Borntraeger wrote:
> 
> On 11.11.20 09:18, Christoph Hellwig wrote:
> > On Tue, Nov 10, 2020 at 06:21:22PM -0800, Nick Desaulniers wrote:
> >> Sorry, I think this patch may be causing a regression for us for s390?
> >> https://travis-ci.com/github/ClangBuiltLinux/continuous-integration/jobs/432129279#L768
> >>
> >> (via https://lore.kernel.org/linux-mm/20201029101432.47011-3-...@lst.de)
> > 
> > Hmm, the call to follow_pte_pmd in the s390 code does not actually exist
> > in my tree.
> 
> This is a mid-air collision in linux-next between
> 
> b2ff5796a934 ("mm: simplify follow_pte{,pmd}")
> a67a88b0b8de ("s390/pci: remove races against pte updates")

Ah.  The fixup is trivial: just s/follow_pte_pmd/follow_pte/.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 2/2] mm: simplify follow_pte{,pmd}

2020-11-11 Thread Christoph Hellwig
On Tue, Nov 10, 2020 at 06:21:22PM -0800, Nick Desaulniers wrote:
> Sorry, I think this patch may be causing a regression for us for s390?
> https://travis-ci.com/github/ClangBuiltLinux/continuous-integration/jobs/432129279#L768
> 
> (via https://lore.kernel.org/linux-mm/20201029101432.47011-3-...@lst.de)

Hmm, the call to follow_pte_pmd in the s390 code does not actually exist
in my tree.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: simplify follow_pte a bit

2020-11-10 Thread Christoph Hellwig
On Thu, Oct 29, 2020 at 11:14:30AM +0100, Christoph Hellwig wrote:
> Hi Andrew,
> 
> this small series drops the not needed follow_pte_pmd exports, and
> simplifies the follow_pte family of functions a bit.

any comments?
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH] x86/mm: Fix phys_to_target_node() export

2020-11-03 Thread Christoph Hellwig
This version looks sensible to me:

Reviewed-by: Christoph Hellwig 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH] x86/mm: Fix phys_to_target_node() export

2020-10-31 Thread Christoph Hellwig
On Thu, Oct 29, 2020 at 07:29:45PM -0700, Dan Williams wrote:
> The core-mm has a default __weak implementation of phys_to_target_node()
> when the architecture does not override it. That symbol is exported
> for modules. However, while the export in mm/memory_hotplug.c exported
> the symbol in the configuration cases of:

Which just means that we should never export weak symbols.  So instead
of hacking around this introduce a symbol that indicates that the
architecture impements phys_to_target_node, and don't defined it at all
in common code for that case.

> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -365,9 +365,14 @@ int __weak phys_to_target_node(u64 start)
>   start);
>   return 0;
>  }
> +
> +/* If the arch did not export a strong symbol, export the weak one. */
> +#ifndef CONFIG_NUMA_KEEP_MEMINFO
>  EXPORT_SYMBOL_GPL(phys_to_target_node);
>  #endif
>  
> +#endif

i.e. move the ifdef to include the actual phys_to_target_node
definition, and remove the __weak from it here.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 2/2] mm: simplify follow_pte{,pmd}

2020-10-29 Thread Christoph Hellwig
Merge __follow_pte_pmd, follow_pte_pmd and follow_pte into a single
follow_pte function and just pass two additional NULL arguments for
the two previous follow_pte callers.

Signed-off-by: Christoph Hellwig 
---
 fs/dax.c   |  9 -
 include/linux/mm.h |  6 +++---
 mm/memory.c| 35 +--
 3 files changed, 12 insertions(+), 38 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5b47834f2e1bb5..26d5dcd2d69e5c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -810,12 +810,11 @@ static void dax_entry_mkclean(struct address_space 
*mapping, pgoff_t index,
address = pgoff_address(index, vma);
 
/*
-* Note because we provide range to follow_pte_pmd it will
-* call mmu_notifier_invalidate_range_start() on our behalf
-* before taking any lock.
+* Note because we provide range to follow_pte it will call
+* mmu_notifier_invalidate_range_start() on our behalf before
+* taking any lock.
 */
-   if (follow_pte_pmd(vma->vm_mm, address, ,
-  , , ))
+   if (follow_pte(vma->vm_mm, address, , , , ))
continue;
 
/*
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef360fe70aafcf..113b0b4fd90af5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1655,9 +1655,9 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long 
addr,
unsigned long end, unsigned long floor, unsigned long ceiling);
 int
 copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct 
*src_vma);
-int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
-  struct mmu_notifier_range *range,
-  pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp);
+int follow_pte(struct mm_struct *mm, unsigned long address,
+   struct mmu_notifier_range *range, pte_t **ptepp, pmd_t **pmdpp,
+   spinlock_t **ptlp);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
unsigned long *pfn);
 int follow_phys(struct vm_area_struct *vma, unsigned long address,
diff --git a/mm/memory.c b/mm/memory.c
index 00458e7b49fef8..fa00682f7a4312 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4696,9 +4696,9 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, 
unsigned long address)
 }
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
-   struct mmu_notifier_range *range,
-   pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp)
+int follow_pte(struct mm_struct *mm, unsigned long address,
+  struct mmu_notifier_range *range, pte_t **ptepp, pmd_t **pmdpp,
+  spinlock_t **ptlp)
 {
pgd_t *pgd;
p4d_t *p4d;
@@ -4763,31 +4763,6 @@ static int __follow_pte_pmd(struct mm_struct *mm, 
unsigned long address,
return -EINVAL;
 }
 
-static inline int follow_pte(struct mm_struct *mm, unsigned long address,
-pte_t **ptepp, spinlock_t **ptlp)
-{
-   int res;
-
-   /* (void) is needed to make gcc happy */
-   (void) __cond_lock(*ptlp,
-  !(res = __follow_pte_pmd(mm, address, NULL,
-   ptepp, NULL, ptlp)));
-   return res;
-}
-
-int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
-  struct mmu_notifier_range *range,
-  pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp)
-{
-   int res;
-
-   /* (void) is needed to make gcc happy */
-   (void) __cond_lock(*ptlp,
-  !(res = __follow_pte_pmd(mm, address, range,
-   ptepp, pmdpp, ptlp)));
-   return res;
-}
-
 /**
  * follow_pfn - look up PFN at a user virtual address
  * @vma: memory mapping
@@ -4808,7 +4783,7 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long 
address,
if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
return ret;
 
-   ret = follow_pte(vma->vm_mm, address, , );
+   ret = follow_pte(vma->vm_mm, address, NULL, , NULL, );
if (ret)
return ret;
*pfn = pte_pfn(*ptep);
@@ -4829,7 +4804,7 @@ int follow_phys(struct vm_area_struct *vma,
if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
goto out;
 
-   if (follow_pte(vma->vm_mm, address, , ))
+   if (follow_pte(vma->vm_mm, address, NULL, , NULL, ))
goto out;
pte = *ptep;
 
-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 1/2] mm: unexport follow_pte_pmd

2020-10-29 Thread Christoph Hellwig
follow_pte_pmd is only used by the DAX code, which can't be modular.

Signed-off-by: Christoph Hellwig 
---
 mm/memory.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index c48f8df6e50268..00458e7b49fef8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4787,7 +4787,6 @@ int follow_pte_pmd(struct mm_struct *mm, unsigned long 
address,
ptepp, pmdpp, ptlp)));
return res;
 }
-EXPORT_SYMBOL(follow_pte_pmd);
 
 /**
  * follow_pfn - look up PFN at a user virtual address
-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


simplify follow_pte a bit

2020-10-29 Thread Christoph Hellwig
Hi Andrew,

this small series drops the not needed follow_pte_pmd exports, and
simplifies the follow_pte family of functions a bit.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH] mm/mremap_pages: Fix static key devmap_managed_key updates

2020-10-24 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH] mm/mremap_pages: Fix static key devmap_managed_key updates

2020-10-23 Thread Christoph Hellwig
On Thu, Oct 22, 2020 at 11:19:43AM -0700, Ralph Campbell wrote:
> 
> On 10/22/20 8:41 AM, Ira Weiny wrote:
> > On Thu, Oct 22, 2020 at 11:37:53AM +0530, Aneesh Kumar K.V wrote:
> > > commit 6f42193fd86e ("memremap: don't use a separate devm action for
> > > devmap_managed_enable_get") changed the static key updates such that we
> > > now call devmap_managed_enable_put() without doing the equivalent
> > > devmap_managed_enable_get().
> > > 
> > > devmap_managed_enable_get() is only called for MEMORY_DEVICE_PRIVATE and
> > > MEMORY_DEVICE_FS_DAX, But memunmap_pages() get called for other pgmap
> > > types too. This results in the below warning when switching between
> > > system-ram and devdax mode for devdax namespace.
> > > 
> > >   jump label: negative count!
> > >   WARNING: CPU: 52 PID: 1335 at kernel/jump_label.c:235 
> > > static_key_slow_try_dec+0x88/0xa0
> > >   Modules linked in:
> > >   
> > > 
> > >   NIP [c0433318] static_key_slow_try_dec+0x88/0xa0
> > >   LR [c0433314] static_key_slow_try_dec+0x84/0xa0
> > >   Call Trace:
> > >   [c00025c1f660] [c0433314] static_key_slow_try_dec+0x84/0xa0 
> > > (unreliable)
> > >   [c00025c1f6d0] [c0433664] 
> > > __static_key_slow_dec_cpuslocked+0x34/0xd0
> > >   [c00025c1f700] [c04337a4] static_key_slow_dec+0x54/0xf0
> > >   [c00025c1f770] [c059c49c] memunmap_pages+0x36c/0x500
> > >   [c00025c1f820] [c0d91d10] devm_action_release+0x30/0x50
> > >   [c00025c1f840] [c0d92e34] release_nodes+0x2f4/0x3e0
> > >   [c00025c1f8f0] [c0d8b15c] 
> > > device_release_driver_internal+0x17c/0x280
> > >   [c00025c1f930] [c0d883a4] bus_remove_device+0x124/0x210
> > >   [c00025c1f9b0] [c0d80ef4] device_del+0x1d4/0x530
> > >   [c00025c1fa70] [c0e341e8] unregister_dev_dax+0x48/0xe0
> > >   [c00025c1fae0] [c0d91d10] devm_action_release+0x30/0x50
> > >   [c00025c1fb00] [c0d92e34] release_nodes+0x2f4/0x3e0
> > >   [c00025c1fbb0] [c0d8b15c] 
> > > device_release_driver_internal+0x17c/0x280
> > >   [c00025c1fbf0] [c0d87000] unbind_store+0x130/0x170
> > >   [c00025c1fc30] [c0d862a0] drv_attr_store+0x40/0x60
> > >   [c00025c1fc50] [c06d316c] sysfs_kf_write+0x6c/0xb0
> > >   [c00025c1fc90] [c06d2328] kernfs_fop_write+0x118/0x280
> > >   [c00025c1fce0] [c05a79f8] vfs_write+0xe8/0x2a0
> > >   [c00025c1fd30] [c05a7d94] ksys_write+0x84/0x140
> > >   [c00025c1fd80] [c003a430] system_call_exception+0x120/0x270
> > >   [c00025c1fe20] [c000c540] system_call_common+0xf0/0x27c
> > > 
> > > Cc: Christoph Hellwig 
> > > Cc: Dan Williams 
> > > Cc: Sachin Sant 
> > > Cc: linux-nvdimm@lists.01.org
> > > Cc: Ira Weiny 
> > > Cc: Jason Gunthorpe 
> > > Signed-off-by: Aneesh Kumar K.V 
> > > ---
> > >   mm/memremap.c | 19 +++
> > >   1 file changed, 15 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/mm/memremap.c b/mm/memremap.c
> > > index 73a206d0f645..d4402ff3e467 100644
> > > --- a/mm/memremap.c
> > > +++ b/mm/memremap.c
> > > @@ -158,6 +158,16 @@ void memunmap_pages(struct dev_pagemap *pgmap)
> > >   {
> > >   unsigned long pfn;
> > >   int i;
> > > + bool need_devmap_managed = false;
> > > +
> > > + switch (pgmap->type) {
> > > + case MEMORY_DEVICE_PRIVATE:
> > > + case MEMORY_DEVICE_FS_DAX:
> > > + need_devmap_managed = true;
> > > + break;
> > > + default:
> > > + break;
> > > + }
> > 
> > Is it overkill to avoid duplicating this switch logic in
> > page_is_devmap_managed() by creating another call which can be used here?
> 
> Perhaps. I can imagine a helper defined in include/linux/mm.h which
> page_is_devmap_managed() could also call but that would impact a lot of
> places that include mm.h. Since memremap.c already has to have intimate
> knowledge of the pgmap->type, I think limiting the change to just what
> is needed is better for now. So the patch looks OK to me.
> 
> Looking at this some more, I would suggest changing 
> devmap_managed_enable_get()
> and devmap_managed_enable_put() to do the special case checking instead of
> doing it in memremap_pages() and memunmap_pages().
> Then devmap_managed_enable_get() doesn't need to return an error if
> CONFIG_DEV_PAGEMAP_OPS isn't defined. I have only compile tested the
> following.
> 
> Signed-off-by: Ralph Campbell 

This looks ok as well.  Can you submit it as a proper standalone patch?
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH] mm/mremap_pages: Fix static key devmap_managed_key updates

2020-10-22 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [ndctl PATCH] Clarify COPYING

2020-10-16 Thread Christoph Hellwig
On Thu, Oct 15, 2020 at 06:12:22PM -0700, Dan Williams wrote:
> The tools and the libraries this project produces are licensed under the
> GPLv2 and LGPLv2.1 licenses respectively. Add a COPYING file that
> highlights this arrangement and fixup files in */lib that mistakenly had
> a GPLv2 header applied.

Maybe it is time to move to formal SPDX annotations and the LICENSES/
infrastructure from the kernel?
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 00/22] add Object Storage Media Pool (mpool)

2020-10-15 Thread Christoph Hellwig
I don't think this belongs into the kernel.  It is a classic case for
infrastructure that should be built in userspace.  If anything is
missing to implement it in userspace with equivalent performance we
need to improve out interfaces, although io_uring should cover pretty
much everything you need.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH RFC PKS/PMEM 24/58] fs/freevxfs: Utilize new kmap_thread()

2020-10-13 Thread Christoph Hellwig
> - kaddr = kmap(pp);
> + kaddr = kmap_thread(pp);
>   memcpy(kaddr, vip->vii_immed.vi_immed + offset, PAGE_SIZE);
> - kunmap(pp);
> + kunmap_thread(pp);

You only Cced me on this particular patch, which means I have absolutely
no idea what kmap_thread and kunmap_thread actually do, and thus can't
provide an informed review.

That being said I think your life would be a lot easier if you add
helpers for the above code sequence and its counterpart that copies
to a potential hughmem page first, as that hides the implementation
details from most users.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache

2020-09-17 Thread Christoph Hellwig
On Wed, Sep 16, 2020 at 10:40:13AM -0700, Dan Williams wrote:
> > Before nvfs gets included in the kernel, I need to distribute it as a
> > module. So, it would make my maintenance easier. But if you don't want to
> > export it now, no problem, I can just copy __copy_user_flushcache from the
> > kernel to the module.
> 
> That sounds a better plan than exporting symbols with no in-kernel consumer.

Exporting symbols without a user is a complete no-go.

> > > My first question about nvfs is how it compares to a daxfs with
> > > executables and other binaries configured to use page cache with the
> > > new per-file dax facility?
> >
> > nvfs is faster than dax-based filesystems on metadata-heavy operations
> > because it doesn't have the overhead of the buffer cache and bios. See
> > this: http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS
> 
> ...and that metadata problem is intractable upstream? Christoph poked
> at bypassing the block layer for xfs metadata operations [1], I just
> have not had time to carry that further.
> 
> [1]: "xfs: use dax_direct_access for log writes", although it seems
> he's dropped that branch from his xfs.git

I've pushed the old branch out again:

http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-log-dax

The main sticking points here are:

 - currently all our nvdimm/DAX code does totally pointless pmem_flush
   calls just to be on the safe side.  That probably is one of the big
   speedups of nova and other academic snake oil projects over our
   stack.  We need to handle this properly
 - what do we do about write error handling?  That is the other big
   thing in the pmem/dax stack that all of the direct writers (including
   MAP_SYNC mmaps) pretty much ignore

Once that is sorted out we can not just put the log changes like above
in, but also move the buffer cache over to do a direct access and
basically stop using the block layer for a pure DAX XFS.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 9/9] iomap: Change calling convention for zeroing

2020-09-11 Thread Christoph Hellwig
On Fri, Sep 11, 2020 at 12:47:07AM +0100, Matthew Wilcox (Oracle) wrote:
> Pass the full length to iomap_zero() and dax_iomap_zero(), and have
> them return how many bytes they actually handled.  This is preparatory
> work for handling THP, although it looks like DAX could actually take
> advantage of it if there's a larger contiguous area.

Looks good,

Reviewed-by: Christoph Hellwig 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 6/9] iomap: Convert read_count to read_bytes_pending

2020-09-10 Thread Christoph Hellwig
On Fri, Sep 11, 2020 at 12:47:04AM +0100, Matthew Wilcox (Oracle) wrote:
> Instead of counting bio segments, count the number of bytes submitted.
> This insulates us from the block layer's definition of what a 'same page'
> is, which is not necessarily clear once THPs are involved.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 

Looks good,

Reviewed-by: Christoph Hellwig 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 5/9] iomap: Support arbitrarily many blocks per page

2020-09-10 Thread Christoph Hellwig
On Fri, Sep 11, 2020 at 12:47:03AM +0100, Matthew Wilcox (Oracle) wrote:
> Size the uptodate array dynamically to support larger pages in the
> page cache.  With a 64kB page, we're only saving 8 bytes per page today,
> but with a 2MB maximum page size, we'd have to allocate more than 4kB
> per page.  Add a few debugging assertions.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Dave Chinner 

Looks good,

Reviewed-by: Christoph Hellwig 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 2/2] xfs: don't update mtime on COW faults

2020-09-07 Thread Christoph Hellwig
> +static bool
> +xfs_is_write_fault(
> + struct vm_fault *vmf)
> +{
> + return vmf->flags & FAULT_FLAG_WRITE && vmf->vma->vm_flags & VM_SHARED;
> +}

This function does not look xfs specific at all.  Why isn't it it in
fs.h?  While we're at it the name sounds rather generic, and there are
no good comments.

Maybe we just need to split FAULT_FLAG_WRITE into two and check those
instead of such crazy workarounds?
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 8/9] nvdimm: simplify revalidate_disk handling

2020-09-01 Thread Christoph Hellwig
The nvdimm block driver abuse revalidate_disk in a strange way, and
totally unrelated to what other drivers do.  Simplify this by just
calling nvdimm_revalidate_disk (which seems rather misnamed) from the
probe routines, as the additional bdev size revalidation is pointless
at this point, and remove the revalidate_disk methods given that
it can only be triggered from add_disk, which is right before the
manual calls.

Signed-off-by: Christoph Hellwig 
---
 drivers/nvdimm/blk.c  | 3 +--
 drivers/nvdimm/btt.c  | 3 +--
 drivers/nvdimm/bus.c  | 9 +++--
 drivers/nvdimm/nd.h   | 2 +-
 drivers/nvdimm/pmem.c | 3 +--
 5 files changed, 7 insertions(+), 13 deletions(-)

diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 1f718381a04553..22e5617b2cea14 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -226,7 +226,6 @@ static int nsblk_rw_bytes(struct nd_namespace_common *ndns,
 static const struct block_device_operations nd_blk_fops = {
.owner = THIS_MODULE,
.submit_bio =  nd_blk_submit_bio,
-   .revalidate_disk = nvdimm_revalidate_disk,
 };
 
 static void nd_blk_release_queue(void *q)
@@ -284,7 +283,7 @@ static int nsblk_attach_disk(struct nd_namespace_blk *nsblk)
 
set_capacity(disk, available_disk_size >> SECTOR_SHIFT);
device_add_disk(dev, disk, NULL);
-   revalidate_disk(disk);
+   nvdimm_check_and_set_ro(disk);
return 0;
 }
 
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 0ff610e728ff6f..0d710140bf93be 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1513,7 +1513,6 @@ static const struct block_device_operations btt_fops = {
.submit_bio =   btt_submit_bio,
.rw_page =  btt_rw_page,
.getgeo =   btt_getgeo,
-   .revalidate_disk =  nvdimm_revalidate_disk,
 };
 
 static int btt_blk_init(struct btt *btt)
@@ -1558,7 +1557,7 @@ static int btt_blk_init(struct btt *btt)
set_capacity(btt->btt_disk, btt->nlba * btt->sector_size >> 9);
device_add_disk(>nd_btt->dev, btt->btt_disk, NULL);
btt->nd_btt->size = btt->nlba * (u64)btt->sector_size;
-   revalidate_disk(btt->btt_disk);
+   nvdimm_check_and_set_ro(btt->btt_disk);
 
return 0;
 }
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 955265656b96c7..2304c6183822e9 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -628,7 +628,7 @@ int __nd_driver_register(struct nd_device_driver *nd_drv, 
struct module *owner,
 }
 EXPORT_SYMBOL(__nd_driver_register);
 
-int nvdimm_revalidate_disk(struct gendisk *disk)
+void nvdimm_check_and_set_ro(struct gendisk *disk)
 {
struct device *dev = disk_to_dev(disk)->parent;
struct nd_region *nd_region = to_nd_region(dev->parent);
@@ -639,16 +639,13 @@ int nvdimm_revalidate_disk(struct gendisk *disk)
 * read-only if the disk is already read-only.
 */
if (disk_ro || nd_region->ro == disk_ro)
-   return 0;
+   return;
 
dev_info(dev, "%s read-only, marking %s read-only\n",
dev_name(_region->dev), disk->disk_name);
set_disk_ro(disk, 1);
-
-   return 0;
-
 }
-EXPORT_SYMBOL(nvdimm_revalidate_disk);
+EXPORT_SYMBOL(nvdimm_check_and_set_ro);
 
 static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
char *buf)
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 85c1ae813ea318..72740108ba4206 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -361,7 +361,7 @@ u64 nd_region_interleave_set_altcookie(struct nd_region 
*nd_region);
 void nvdimm_bus_lock(struct device *dev);
 void nvdimm_bus_unlock(struct device *dev);
 bool is_nvdimm_bus_locked(struct device *dev);
-int nvdimm_revalidate_disk(struct gendisk *disk);
+void nvdimm_check_and_set_ro(struct gendisk *disk);
 void nvdimm_drvdata_release(struct kref *kref);
 void put_ndd(struct nvdimm_drvdata *ndd);
 int nd_label_reserve_dpa(struct nvdimm_drvdata *ndd);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index fab29b514372d0..140cf3b9000c60 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -281,7 +281,6 @@ static const struct block_device_operations pmem_fops = {
.owner =THIS_MODULE,
.submit_bio =   pmem_submit_bio,
.rw_page =  pmem_rw_page,
-   .revalidate_disk =  nvdimm_revalidate_disk,
 };
 
 static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
@@ -501,7 +500,7 @@ static int pmem_attach_disk(struct device *dev,
if (devm_add_action_or_reset(dev, pmem_release_disk, pmem))
return -ENOMEM;
 
-   revalidate_disk(disk);
+   nvdimm_check_and_set_ro(disk);
 
pmem->bb_state = sysfs_get_dirent(disk_to_dev(disk)->kobj.sd,
  "badblock

[PATCH 5/9] block: use revalidate_disk_size in set_capacity_revalidate_and_notify

2020-09-01 Thread Christoph Hellwig
Only virtio_blk and xen-blkfront set the revalidate argument to true,
and both do not implement the ->revalidate_disk method.  So switch
to the helper that just updates the size instead.

Signed-off-by: Christoph Hellwig 
---
 block/genhd.c | 7 +++
 include/linux/genhd.h | 4 ++--
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index a2c0ec694918e5..431d4081b50ec7 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -50,14 +50,13 @@ static void disk_release_events(struct gendisk *disk);
  * zero and will not be set to zero
  */
 void set_capacity_revalidate_and_notify(struct gendisk *disk, sector_t size,
-   bool revalidate)
+   bool update_bdev)
 {
sector_t capacity = get_capacity(disk);
 
set_capacity(disk, size);
-
-   if (revalidate)
-   revalidate_disk(disk);
+   if (update_bdev)
+   revalidate_disk_size(disk, true);
 
if (capacity != size && capacity != 0 && size != 0) {
char *envp[] = { "RESIZE=1", NULL };
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 8e9c9d3a493fae..c340b392452ce6 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -315,8 +315,8 @@ static inline int get_disk_ro(struct gendisk *disk)
 extern void disk_block_events(struct gendisk *disk);
 extern void disk_unblock_events(struct gendisk *disk);
 extern void disk_flush_events(struct gendisk *disk, unsigned int mask);
-extern void set_capacity_revalidate_and_notify(struct gendisk *disk,
-   sector_t size, bool revalidate);
+void set_capacity_revalidate_and_notify(struct gendisk *disk, sector_t size,
+   bool update_bdev);
 extern unsigned int disk_clear_events(struct gendisk *disk, unsigned int mask);
 
 /* drivers/char/random.c */
-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 6/9] nvme: opencode revalidate_disk in nvme_validate_ns

2020-09-01 Thread Christoph Hellwig
Keep control in the NVMe driver instead of going through an indirect
call back into ->revalidate_disk.  Also reorder the function a bit to be
easier to follow with the additional code.

And now that we have removed all callers of revalidate_disk() in the nvme
code, ->revalidate_disk is only called from the open code when first
opening the device.  Which is of course totally pointless as we have
a valid size since the initial scan, and will get an updated view
through the asynchronous notifiation everytime the size changes.

Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/core.c | 16 ++--
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index bc18523774b4be..9428e7deb68b09 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2323,7 +2323,6 @@ static const struct block_device_operations nvme_fops = {
.open   = nvme_open,
.release= nvme_release,
.getgeo = nvme_getgeo,
-   .revalidate_disk= nvme_revalidate_disk,
.report_zones   = nvme_report_zones,
.pr_ops = _pr_ops,
 };
@@ -4020,14 +4019,19 @@ static void nvme_ns_remove_by_nsid(struct nvme_ctrl 
*ctrl, u32 nsid)
 static void nvme_validate_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 {
struct nvme_ns *ns;
+   int ret;
 
ns = nvme_find_get_ns(ctrl, nsid);
-   if (ns) {
-   if (revalidate_disk(ns->disk))
-   nvme_ns_remove(ns);
-   nvme_put_ns(ns);
-   } else
+   if (!ns) {
nvme_alloc_ns(ctrl, nsid);
+   return;
+   }
+
+   ret = nvme_revalidate_disk(ns->disk);
+   revalidate_disk_size(ns->disk, ret == 0);
+   if (ret)
+   nvme_ns_remove(ns);
+   nvme_put_ns(ns);
 }
 
 static void nvme_remove_invalid_namespaces(struct nvme_ctrl *ctrl,
-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 4/9] block: add a new revalidate_disk_size helper

2020-09-01 Thread Christoph Hellwig
revalidate_disk is a relative awkward helper for driver use, as it first
calls an optional driver method and then updates the block device size,
while most callers either don't need the method call at all, or want to
keep state between the caller and the called method.

Add a revalidate_disk_size helper that just performs the update of the
block device size from the gendisk one, and switch all drivers that do
not implement ->revalidate_disk to use the new helper instead of
revalidate_disk()

Signed-off-by: Christoph Hellwig 
---
 drivers/block/rbd.c   |  2 +-
 drivers/block/rnbd/rnbd-clt.c | 10 ++---
 drivers/block/virtio_blk.c|  2 +-
 drivers/block/zram/zram_drv.c |  4 ++--
 drivers/md/dm-raid.c  |  2 +-
 drivers/md/md-cluster.c   |  6 ++---
 drivers/md/md-linear.c|  2 +-
 drivers/md/md.c   | 10 -
 fs/block_dev.c| 42 ---
 include/linux/genhd.h |  1 +
 10 files changed, 46 insertions(+), 35 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 01153903969321..5d3923c0997ce0 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -4921,7 +4921,7 @@ static void rbd_dev_update_size(struct rbd_device 
*rbd_dev)
size = (sector_t)rbd_dev->mapping.size / SECTOR_SIZE;
dout("setting size to %llu sectors", (unsigned long long)size);
set_capacity(rbd_dev->disk, size);
-   revalidate_disk(rbd_dev->disk);
+   revalidate_disk_size(rbd_dev->disk, true);
}
 }
 
diff --git a/drivers/block/rnbd/rnbd-clt.c b/drivers/block/rnbd/rnbd-clt.c
index cc6a4e2587aec2..157538fc8be515 100644
--- a/drivers/block/rnbd/rnbd-clt.c
+++ b/drivers/block/rnbd/rnbd-clt.c
@@ -102,18 +102,12 @@ static int rnbd_clt_set_dev_attr(struct rnbd_clt_dev *dev,
 static int rnbd_clt_change_capacity(struct rnbd_clt_dev *dev,
size_t new_nsectors)
 {
-   int err = 0;
-
rnbd_clt_info(dev, "Device size changed from %zu to %zu sectors\n",
   dev->nsectors, new_nsectors);
dev->nsectors = new_nsectors;
set_capacity(dev->gd, dev->nsectors);
-   err = revalidate_disk(dev->gd);
-   if (err)
-   rnbd_clt_err(dev,
- "Failed to change device size from %zu to %zu, 
err: %d\n",
- dev->nsectors, new_nsectors, err);
-   return err;
+   revalidate_disk_size(dev->gd, true);
+   return 0;
 }
 
 static int process_msg_open_rsp(struct rnbd_clt_dev *dev,
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index ca63a41059d68e..a314b9382442b6 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -598,7 +598,7 @@ static void virtblk_update_cache_mode(struct virtio_device 
*vdev)
struct virtio_blk *vblk = vdev->priv;
 
blk_queue_write_cache(vblk->disk->queue, writeback, false);
-   revalidate_disk(vblk->disk);
+   revalidate_disk_size(vblk->disk, true);
 }
 
 static const char *const virtblk_cache_types[] = {
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 9100ac36670afc..a356275605b104 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1739,7 +1739,7 @@ static ssize_t disksize_store(struct device *dev,
zram->disksize = disksize;
set_capacity(zram->disk, zram->disksize >> SECTOR_SHIFT);
 
-   revalidate_disk(zram->disk);
+   revalidate_disk_size(zram->disk, true);
up_write(>init_lock);
 
return len;
@@ -1786,7 +1786,7 @@ static ssize_t reset_store(struct device *dev,
/* Make sure all the pending I/O are finished */
fsync_bdev(bdev);
zram_reset_device(zram);
-   revalidate_disk(zram->disk);
+   revalidate_disk_size(zram->disk, true);
bdput(bdev);
 
mutex_lock(>bd_mutex);
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 8d2b835d7a108c..56b723d012ac15 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -701,7 +701,7 @@ static void rs_set_capacity(struct raid_set *rs)
struct gendisk *gendisk = dm_disk(dm_table_get_md(rs->ti->table));
 
set_capacity(gendisk, rs->md.array_sectors);
-   revalidate_disk(gendisk);
+   revalidate_disk_size(gendisk, true);
 }
 
 /*
diff --git a/drivers/md/md-cluster.c b/drivers/md/md-cluster.c
index d50737ec403946..0580b51a156a1c 100644
--- a/drivers/md/md-cluster.c
+++ b/drivers/md/md-cluster.c
@@ -582,7 +582,7 @@ static int process_recvd_msg(struct mddev *mddev, struct 
cluster_msg *msg)
break;
case CHANGE_CAPACITY:
set_capacity(mddev->gendisk, mddev->array_sectors);
-   revalidate_disk(mddev->gendisk);
+   revalidate_di

[PATCH 3/9] block: rename bd_invalidated

2020-09-01 Thread Christoph Hellwig
Replace bd_invalidate with a new BDEV_NEED_PART_SCAN flag in a bd_flags
variable to better describe the condition.

Signed-off-by: Christoph Hellwig 
---
 block/genhd.c |  2 +-
 drivers/block/nbd.c   |  8 
 fs/block_dev.c| 10 +-
 include/linux/blk_types.h |  4 +++-
 4 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 99c64641c3148c..a2c0ec694918e5 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -734,7 +734,7 @@ static void register_disk(struct device *parent, struct 
gendisk *disk,
if (!bdev)
goto exit;
 
-   bdev->bd_invalidated = 1;
+   set_bit(BDEV_NEED_PART_SCAN, >bd_flags);
err = blkdev_get(bdev, FMODE_READ, NULL);
if (err < 0)
goto exit;
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index a54f2d155a31a5..15eed210feeff4 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -315,7 +315,7 @@ static void nbd_size_update(struct nbd_device *nbd)
bd_set_nr_sectors(bdev, nr_sectors);
set_blocksize(bdev, config->blksize);
} else
-   bdev->bd_invalidated = 1;
+   set_bit(BDEV_NEED_PART_SCAN, >bd_flags);
bdput(bdev);
}
kobject_uevent(_to_dev(nbd)->kobj, KOBJ_CHANGE);
@@ -1322,7 +1322,7 @@ static int nbd_start_device_ioctl(struct nbd_device *nbd, 
struct block_device *b
return ret;
 
if (max_part)
-   bdev->bd_invalidated = 1;
+   set_bit(BDEV_NEED_PART_SCAN, >bd_flags);
mutex_unlock(>config_lock);
ret = wait_event_interruptible(config->recv_wq,
 atomic_read(>recv_threads) == 
0);
@@ -1500,9 +1500,9 @@ static int nbd_open(struct block_device *bdev, fmode_t 
mode)
refcount_set(>config_refs, 1);
refcount_inc(>refs);
mutex_unlock(>config_lock);
-   bdev->bd_invalidated = 1;
+   set_bit(BDEV_NEED_PART_SCAN, >bd_flags);
} else if (nbd_disconnected(nbd->config)) {
-   bdev->bd_invalidated = 1;
+   set_bit(BDEV_NEED_PART_SCAN, >bd_flags);
}
 out:
mutex_unlock(_index_mutex);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2760292045c082..0207623769715d 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -881,7 +881,7 @@ struct block_device *bdget(dev_t dev)
bdev->bd_super = NULL;
bdev->bd_inode = inode;
bdev->bd_part_count = 0;
-   bdev->bd_invalidated = 0;
+   bdev->bd_flags = 0;
inode->i_mode = S_IFBLK;
inode->i_rdev = dev;
inode->i_bdev = bdev;
@@ -1365,7 +1365,7 @@ int check_disk_change(struct block_device *bdev)
if (__invalidate_device(bdev, true))
pr_warn("VFS: busy inodes on changed media %s\n",
disk->disk_name);
-   bdev->bd_invalidated = 1;
+   set_bit(BDEV_NEED_PART_SCAN, >bd_flags);
if (bdops->revalidate_disk)
bdops->revalidate_disk(bdev->bd_disk);
return 1;
@@ -1390,7 +1390,7 @@ int bdev_disk_changed(struct block_device *bdev, bool 
invalidate)
 
lockdep_assert_held(>bd_mutex);
 
-   bdev->bd_invalidated = 0;
+   clear_bit(BDEV_NEED_PART_SCAN, >bd_flags);
 
 rescan:
ret = blk_drop_partitions(bdev);
@@ -1528,7 +1528,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, void *holder,
 * The latter is necessary to prevent ghost
 * partitions on a removed medium.
 */
-   if (bdev->bd_invalidated &&
+   if (test_bit(BDEV_NEED_PART_SCAN, >bd_flags) &&
(!ret || ret == -ENOMEDIUM))
bdev_disk_changed(bdev, ret == -ENOMEDIUM);
 
@@ -1558,7 +1558,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, void *holder,
if (bdev->bd_disk->fops->open)
ret = bdev->bd_disk->fops->open(bdev, mode);
/* the same as first opener case, read comment there */
-   if (bdev->bd_invalidated &&
+   if (test_bit(BDEV_NEED_PART_SCAN, >bd_flags) &&
(!ret || ret == -ENOMEDIUM))
bdev_disk_changed(bdev, ret == -ENOMEDIUM);
if (ret)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 63a39e47fc6047..c21eff7efda237 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types

[PATCH 2/9] block: don't clear bd_invalidated in check_disk_size_change

2020-09-01 Thread Christoph Hellwig
bd_invalidated is set by check_disk_change or in add_disk to initiate a
partition scan.  Move it from check_disk_size_change which is called
from both revalidate_disk() and bdev_disk_changed() to only the latter,
as that is what is called from the block device open code (and nbd) to
deal with the bd_invalidated event.  revalidate_disk() on the other hand
is mostly used to propagate a size update from the gendisk to the block
device, which is entirely unrelated.

Signed-off-by: Christoph Hellwig 
---
 fs/block_dev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 08158bb2e76c85..2760292045c082 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1302,7 +1302,6 @@ static void check_disk_size_change(struct gendisk *disk,
}
i_size_write(bdev->bd_inode, disk_size);
}
-   bdev->bd_invalidated = 0;
spin_unlock(>bd_size_lock);
 
if (bdev_size > disk_size) {
@@ -1391,6 +1390,8 @@ int bdev_disk_changed(struct block_device *bdev, bool 
invalidate)
 
lockdep_assert_held(>bd_mutex);
 
+   bdev->bd_invalidated = 0;
+
 rescan:
ret = blk_drop_partitions(bdev);
if (ret)
-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 7/9] sd: open code revalidate_disk

2020-09-01 Thread Christoph Hellwig
Instead of calling revalidate_disk just do the work directly by
calling sd_revalidate_disk, and revalidate_disk_size where needed.

Signed-off-by: Christoph Hellwig 
---
 drivers/scsi/sd.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 95018e650f2d0c..2bec8cd526164d 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -217,7 +217,7 @@ cache_type_store(struct device *dev, struct 
device_attribute *attr,
sd_print_sense_hdr(sdkp, );
return -EINVAL;
}
-   revalidate_disk(sdkp->disk);
+   sd_revalidate_disk(sdkp->disk);
return count;
 }
 
@@ -1706,8 +1706,10 @@ static int sd_sync_cache(struct scsi_disk *sdkp, struct 
scsi_sense_hdr *sshdr)
 static void sd_rescan(struct device *dev)
 {
struct scsi_disk *sdkp = dev_get_drvdata(dev);
+   int ret;
 
-   revalidate_disk(sdkp->disk);
+   ret = sd_revalidate_disk(sdkp->disk);
+   revalidate_disk_size(sdkp->disk, ret == 0);
 }
 
 static int sd_ioctl(struct block_device *bdev, fmode_t mode,
-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 9/9] block: remove revalidate_disk()

2020-09-01 Thread Christoph Hellwig
Remove the now unused helper.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/md.h   |  2 +-
 fs/block_dev.c| 19 ---
 include/linux/genhd.h |  1 -
 3 files changed, 1 insertion(+), 21 deletions(-)

diff --git a/drivers/md/md.h b/drivers/md/md.h
index d9c4e6b7e9398d..f9e2ccdd22c478 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -397,7 +397,7 @@ struct mddev {
 * These locks are separate due to conflicting interactions
 * with bdev->bd_mutex.
 * Lock ordering is:
-*  reconfig_mutex -> bd_mutex : e.g. do_md_run -> revalidate_disk
+*  reconfig_mutex -> bd_mutex
 *  bd_mutex -> open_mutex:  e.g. __blkdev_get -> md_open
 */
struct mutexopen_mutex;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 85f013315d48b3..0771836d0220bd 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1339,25 +1339,6 @@ void revalidate_disk_size(struct gendisk *disk, bool 
verbose)
 }
 EXPORT_SYMBOL(revalidate_disk_size);
 
-/**
- * revalidate_disk - wrapper for lower-level driver's revalidate_disk call-back
- * @disk: struct gendisk to be revalidated
- *
- * This routine is a wrapper for lower-level driver's revalidate_disk
- * call-backs.  It is used to do common pre and post operations needed
- * for all revalidate_disk operations.
- */
-int revalidate_disk(struct gendisk *disk)
-{
-   int ret = 0;
-
-   if (disk->fops->revalidate_disk)
-   ret = disk->fops->revalidate_disk(disk);
-   revalidate_disk_size(disk, ret == 0);
-   return ret;
-}
-EXPORT_SYMBOL(revalidate_disk);
-
 /*
  * This routine checks whether a removable media has been changed,
  * and invalidates all buffer-cache-entries in that case. This
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index c340b392452ce6..2cdc41a3fb6a57 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -372,7 +372,6 @@ extern void blk_unregister_region(dev_t devt, unsigned long 
range);
 int register_blkdev(unsigned int major, const char *name);
 void unregister_blkdev(unsigned int major, const char *name);
 
-int revalidate_disk(struct gendisk *disk);
 void revalidate_disk_size(struct gendisk *disk, bool verbose);
 int check_disk_change(struct block_device *bdev);
 int __invalidate_device(struct block_device *bdev, bool kill_dirty);
-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


remove revalidate_disk()

2020-09-01 Thread Christoph Hellwig
Hi Jens,

this series removes the revalidate_disk() function, which has been a
really odd duck in the last years.  The prime reason why most people
use it is because it propagates a size change from the gendisk to
the block_device structure.  But it also calls into the rather ill
defined ->revalidate_disk method which is rather useless for the
callers.  So this adds a new helper to just propagate the size, and
cleans up all kinds of mess around this area.  Follow on patches
will eventuall kill of ->revalidate_disk entirely, but ther are a lot
more patches needed for that.

Diffstat:
 Documentation/filesystems/locking.rst |3 --
 block/genhd.c |9 ++
 drivers/block/nbd.c   |8 ++---
 drivers/block/rbd.c   |2 -
 drivers/block/rnbd/rnbd-clt.c |   10 +--
 drivers/block/virtio_blk.c|2 -
 drivers/block/zram/zram_drv.c |4 +-
 drivers/md/dm-raid.c  |2 -
 drivers/md/md-cluster.c   |6 ++--
 drivers/md/md-linear.c|2 -
 drivers/md/md.c   |   10 +++
 drivers/md/md.h   |2 -
 drivers/nvdimm/blk.c  |3 --
 drivers/nvdimm/btt.c  |3 --
 drivers/nvdimm/bus.c  |9 ++
 drivers/nvdimm/nd.h   |2 -
 drivers/nvdimm/pmem.c |3 --
 drivers/nvme/host/core.c  |   16 +++
 drivers/scsi/sd.c |6 ++--
 fs/block_dev.c|   46 --
 include/linux/blk_types.h |4 ++
 include/linux/genhd.h |6 ++--
 22 files changed, 74 insertions(+), 84 deletions(-)
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 1/9] Documentation/filesystems/locking.rst: remove an incorrect sentence

2020-09-01 Thread Christoph Hellwig
unlock_native_capacity is never called from check_disk_change(), and
while revalidate_disk can be called from it, it can also be called
from two other places at the moment.

Signed-off-by: Christoph Hellwig 
---
 Documentation/filesystems/locking.rst | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/Documentation/filesystems/locking.rst 
b/Documentation/filesystems/locking.rst
index 64f94a18d97e75..c0f2c7586531b0 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -488,9 +488,6 @@ getgeo: no
 swap_slot_free_notify: no  (see below)
 === ===
 
-unlock_native_capacity and revalidate_disk are called only from
-check_disk_change().
-
 swap_slot_free_notify is called with swap_lock and sometimes the page lock
 held.
 
-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 8/9] iomap: Convert iomap_write_end types

2020-08-27 Thread Christoph Hellwig
On Mon, Aug 24, 2020 at 03:55:09PM +0100, Matthew Wilcox (Oracle) wrote:
> iomap_write_end cannot return an error, so switch it to return
> size_t instead of int and remove the error checking from the callers.
> Also convert the arguments to size_t from unsigned int, in case anyone
> ever wants to support a page size larger than 2GB.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 

Looks good:

Reviewed-by: Christoph Hellwig 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 9/9] iomap: Change calling convention for zeroing

2020-08-27 Thread Christoph Hellwig
On Tue, Aug 25, 2020 at 03:23:55PM -0700, Darrick J. Wong wrote:
> Sorry for my ultra-slow response to this.  The u64 length seems ok to me
> (or uint64_t, I don't care all /that/ much), but using loff_t as a
> return type bothers me because I see that and think that this function
> is returning a new file offset, e.g. (pos + number of bytes zeroed).
> 
> So please, let's use s64 or something that isn't so misleading.
> 
> FWIW, Linus also[0] doesn't[1] like using loff_t for the number of bytes
> copied.

Let's just switch to u64 and s64 then.  Unless we want to come up with
our own typedefs.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 7/9] iomap: Convert write_count to byte count

2020-08-27 Thread Christoph Hellwig
On Mon, Aug 24, 2020 at 03:55:08PM +0100, Matthew Wilcox (Oracle) wrote:
> Instead of counting bio segments, count the number of bytes submitted.
> This insulates us from the block layer's definition of what a 'same page'
> is, which is not necessarily clear once THPs are involved.

Looks good (module the field naming as comment on the previous patch):

Reviewed-by: Christoph Hellwig 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 6/9] iomap: Convert read_count to byte count

2020-08-27 Thread Christoph Hellwig
> @@ -269,20 +263,17 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>   if (ctx->bio && bio_end_sector(ctx->bio) == sector)
>   is_contig = true;
>  
> -
>   /*
> -  * If we start a new segment we need to increase the read count, and we
> -  * need to do so before submitting any previous full bio to make sure
> -  * that we don't prematurely unlock the page.
> +  * We need to increase the read count before submitting any
> +  * previous bio to make sure that we don't prematurely unlock
> +  * the page.
>*/
>   if (iop)
> - atomic_inc(>read_count);
> + atomic_add(plen, >read_count);
> +
> + if (is_contig &&
> + __bio_try_merge_page(ctx->bio, page, plen, poff, _page))
> + goto done;
>  
>   if (!ctx->bio || !is_contig || bio_full(ctx->bio, plen)) {

I think we can simplify this a bit by reordering the checks:

if (iop)
atomic_add(plen, >read_count);

if (ctx->bio && bio_end_sector(ctx->bio) == sector)
if (__bio_try_merge_page(ctx->bio, page, plen, poff,
_page))
goto done;
is_contig = true;
}

if (!is_contig || bio_full(ctx->bio, plen)) {
// the existing case to allocate a new bio
}

Also I think read_count should be renamed to be more descriptive,
something like read_bytes_pending?  Same for the write side.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 5/9] iomap: Support arbitrarily many blocks per page

2020-08-27 Thread Christoph Hellwig
>  static inline struct iomap_page *to_iomap_page(struct page *page)
>  {
> + VM_BUG_ON_PGFLAGS(PageTail(page), page);
>   if (page_has_private(page))
>   return (struct iomap_page *)page_private(page);
>   return NULL;

Nit: can you add an empty line after the VM_BUG_ON_PGFLAGS assert to
keep the function readable?  Maybe also add a comment on the assert,
as it isn't totally obvious.

Otherwise looks good:

Reviewed-by: Christoph Hellwig 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 1/9] iomap: Fix misplaced page flushing

2020-08-27 Thread Christoph Hellwig
On Mon, Aug 24, 2020 at 03:55:02PM +0100, Matthew Wilcox (Oracle) wrote:
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index bcfc288dba3f..cffd575e57b6 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -715,6 +715,7 @@ iomap_write_end_inline(struct inode *inode, struct page 
> *page,
>  {
>   void *addr;
>  
> + flush_dcache_page(page);
>   WARN_ON_ONCE(!PageUptodate(page));
>   BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));

Please move the call down below the asserts.

Otherwise looks good:

Reviewed-by: Christoph Hellwig 
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH v2 11/12] PM, libnvdimm: Add 'mem-quiet' state and callback for firmware activation

2020-07-09 Thread Christoph Hellwig
On Mon, Jul 06, 2020 at 06:59:32PM -0700, Dan Williams wrote:
> The runtime firmware activation capability of Intel NVDIMM devices
> requires memory transactions to be disabled for 100s of microseconds.
> This timeout is large enough to cause in-flight DMA to fail and other
> application detectable timeouts. Arrange for firmware activation to be
> executed while the system is "quiesced", all processes and device-DMA
> frozen.
> 
> It is already required that invoking device ->freeze() callbacks is
> sufficient to cease DMA. A device that continues memory writes outside
> of user-direction violates expectations of the PM core to be to
> establish a coherent hibernation image.
> 
> That said, RDMA devices are an example of a device that access memory
> outside of user process direction. RDMA drivers also typically assume
> the system they are operating in will never be hibernated. A solution
> for RDMA collisions with firmware activation is outside the scope of
> this change and may need to rely on being able to survive the platform
> imposed memory controller quiesce period.

Yikes.  I don't think we should support such a broken runtime firmware
activation.
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


Re: [PATCH 18/20] block: refator submit_bio_noacct

2020-07-02 Thread Christoph Hellwig
On Thu, Jul 02, 2020 at 10:10:10AM -0400, Qian Cai wrote:
> On Mon, Jun 29, 2020 at 09:39:45PM +0200, Christoph Hellwig wrote:
> > Split out a __submit_bio_noacct helper for the actual de-recursion
> > algorithm, and simplify the loop by using a continue when we can't
> > enter the queue for a bio.
> > 
> > Signed-off-by: Christoph Hellwig 
> 
> Reverting this commit and its dependencies,
> 
> 5a6c35f9af41 block: remove direct_make_request
> ff93ea0ce763 block: shortcut __submit_bio_noacct for blk-mq drivers
> 
> fixed the stack-out-of-bounds during boot,
> 
> https://lore.kernel.org/linux-block/bcdeaa05a9728...@google.com/

Yikes.  bio_alloc_bioset pokes into bio_list[1] in a totally
undocumented way.  But even with that the problem should only show
up with "block: shortcut __submit_bio_noacct for blk-mq drivers".

Can you try this patch?

diff --git a/block/blk-core.c b/block/blk-core.c
index bf882b8d84450c..9f1bf8658b611a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1155,11 +1155,10 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
 static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
 {
struct gendisk *disk = bio->bi_disk;
-   struct bio_list bio_list;
+   struct bio_list bio_list[2] = { };
blk_qc_t ret = BLK_QC_T_NONE;
 
-   bio_list_init(_list);
-   current->bio_list = _list;
+   current->bio_list = bio_list;
 
do {
WARN_ON_ONCE(bio->bi_disk != disk);
@@ -1174,7 +1173,7 @@ static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
}
 
ret = blk_mq_submit_bio(bio);
-   } while ((bio = bio_list_pop(_list)));
+   } while ((bio = bio_list_pop(_list[0])));
 
current->bio_list = NULL;
return ret;
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 12/20] block: remove the request_queue argument from blk_queue_split

2020-07-01 Thread Christoph Hellwig
The queue can be trivially derived from the bio, so pass one less
argument.

Signed-off-by: Christoph Hellwig 
---
 block/blk-merge.c | 21 ++---
 block/blk-mq.c|  2 +-
 block/blk.h   |  3 +--
 drivers/block/drbd/drbd_req.c |  2 +-
 drivers/block/pktcdvd.c   |  2 +-
 drivers/block/ps3vram.c   |  2 +-
 drivers/block/rsxx/dev.c  |  2 +-
 drivers/block/umem.c  |  2 +-
 drivers/lightnvm/pblk-init.c  |  4 ++--
 drivers/md/dm.c   |  2 +-
 drivers/md/md.c   |  2 +-
 drivers/nvme/host/multipath.c |  9 -
 drivers/s390/block/dcssblk.c  |  2 +-
 drivers/s390/block/xpram.c|  2 +-
 include/linux/blkdev.h|  2 +-
 15 files changed, 28 insertions(+), 31 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 9c9fb21584b64e..20fa2290604105 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -283,20 +283,20 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
 
 /**
  * __blk_queue_split - split a bio and submit the second half
- * @q:   [in] request queue pointer
  * @bio: [in, out] bio to be split
  * @nr_segs: [out] number of segments in the first bio
  *
  * Split a bio into two bios, chain the two bios, submit the second half and
  * store a pointer to the first half in *@bio. If the second bio is still too
  * big it will be split by a recursive call to this function. Since this
- * function may allocate a new bio from @q->bio_split, it is the responsibility
- * of the caller to ensure that @q is only released after processing of the
+ * function may allocate a new bio from @bio->bi_disk->queue->bio_split, it is
+ * the responsibility of the caller to ensure that
+ * @bio->bi_disk->queue->bio_split is only released after processing of the
  * split bio has finished.
  */
-void __blk_queue_split(struct request_queue *q, struct bio **bio,
-   unsigned int *nr_segs)
+void __blk_queue_split(struct bio **bio, unsigned int *nr_segs)
 {
+   struct request_queue *q = (*bio)->bi_disk->queue;
struct bio *split = NULL;
 
switch (bio_op(*bio)) {
@@ -345,20 +345,19 @@ void __blk_queue_split(struct request_queue *q, struct 
bio **bio,
 
 /**
  * blk_queue_split - split a bio and submit the second half
- * @q:   [in] request queue pointer
  * @bio: [in, out] bio to be split
  *
  * Split a bio into two bios, chains the two bios, submit the second half and
  * store a pointer to the first half in *@bio. Since this function may allocate
- * a new bio from @q->bio_split, it is the responsibility of the caller to
- * ensure that @q is only released after processing of the split bio has
- * finished.
+ * a new bio from @bio->bi_disk->queue->bio_split, it is the responsibility of
+ * the caller to ensure that @bio->bi_disk->queue->bio_split is only released
+ * after processing of the split bio has finished.
  */
-void blk_queue_split(struct request_queue *q, struct bio **bio)
+void blk_queue_split(struct bio **bio)
 {
unsigned int nr_segs;
 
-   __blk_queue_split(q, bio, _segs);
+   __blk_queue_split(bio, _segs);
 }
 EXPORT_SYMBOL(blk_queue_split);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 65e0846fd06519..dbadb7defd618a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2166,7 +2166,7 @@ blk_qc_t blk_mq_make_request(struct request_queue *q, 
struct bio *bio)
blk_status_t ret;
 
blk_queue_bounce(q, );
-   __blk_queue_split(q, , _segs);
+   __blk_queue_split(, _segs);
 
if (!bio_integrity_prep(bio))
goto queue_exit;
diff --git a/block/blk.h b/block/blk.h
index 0184a31fe4dfaf..0114fd92c8a05b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -220,8 +220,7 @@ ssize_t part_timeout_show(struct device *, struct 
device_attribute *, char *);
 ssize_t part_timeout_store(struct device *, struct device_attribute *,
const char *, size_t);
 
-void __blk_queue_split(struct request_queue *q, struct bio **bio,
-   unsigned int *nr_segs);
+void __blk_queue_split(struct bio **bio, unsigned int *nr_segs);
 int ll_back_merge_fn(struct request *req, struct bio *bio,
unsigned int nr_segs);
 int ll_front_merge_fn(struct request *req,  struct bio *bio,
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index 3f09b2ab977822..9368680474223a 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1598,7 +1598,7 @@ blk_qc_t drbd_make_request(struct request_queue *q, 
struct bio *bio)
struct drbd_device *device = bio->bi_disk->private_data;
unsigned long start_jif;
 
-   blk_queue_split(q, );
+   blk_queue_split();
 
start_jif = jiffies;
 
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 27a33adc41e487..29b0c62dc86c1f 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c

[PATCH 07/20] umem: stop using ->queuedata

2020-07-01 Thread Christoph Hellwig
Instead of setting up the queuedata as well just use one private data
field.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/umem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index 1e2aa5ae27963c..5498f1cf36b3fe 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -521,7 +521,8 @@ static int mm_check_plugged(struct cardinfo *card)
 
 static blk_qc_t mm_make_request(struct request_queue *q, struct bio *bio)
 {
-   struct cardinfo *card = q->queuedata;
+   struct cardinfo *card = bio->bi_disk->private_data;
+
pr_debug("mm_make_request %llu %u\n",
 (unsigned long long)bio->bi_iter.bi_sector,
 bio->bi_iter.bi_size);
@@ -888,7 +889,6 @@ static int mm_pci_probe(struct pci_dev *dev, const struct 
pci_device_id *id)
card->queue = blk_alloc_queue(mm_make_request, NUMA_NO_NODE);
if (!card->queue)
goto failed_alloc;
-   card->queue->queuedata = card;
 
tasklet_init(>tasklet, process_page, (unsigned long)card);
 
-- 
2.26.2
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 16/20] block: move ->make_request_fn to struct block_device_operations

2020-07-01 Thread Christoph Hellwig
The make_request_fn is a little weird in that it sits directly in
struct request_queue instead of an operation vector.  Replace it with
a block_device_operations method called submit_bio (which describes much
better what it does).  Also remove the request_queue argument to it, as
the queue can be derived pretty trivially from the bio.

Signed-off-by: Christoph Hellwig 
---
 Documentation/block/biodoc.rst|  2 +-
 .../block/writeback_cache_control.rst |  2 +-
 arch/m68k/emu/nfblock.c   |  5 +-
 arch/xtensa/platforms/iss/simdisk.c   |  5 +-
 block/blk-cgroup.c|  2 +-
 block/blk-core.c  | 53 +++
 block/blk-mq.c| 10 ++--
 block/blk.h   |  2 -
 drivers/block/brd.c   |  5 +-
 drivers/block/drbd/drbd_int.h |  2 +-
 drivers/block/drbd/drbd_main.c|  9 ++--
 drivers/block/drbd/drbd_req.c |  2 +-
 drivers/block/null_blk_main.c | 17 --
 drivers/block/pktcdvd.c   | 11 ++--
 drivers/block/ps3vram.c   | 15 +++---
 drivers/block/rsxx/dev.c  |  7 ++-
 drivers/block/umem.c  |  5 +-
 drivers/block/zram/zram_drv.c | 11 ++--
 drivers/lightnvm/core.c   |  8 +--
 drivers/lightnvm/pblk-init.c  | 12 +++--
 drivers/md/bcache/request.c   |  4 +-
 drivers/md/bcache/request.h   |  4 +-
 drivers/md/bcache/super.c | 23 +---
 drivers/md/dm.c   | 23 
 drivers/md/md.c   |  5 +-
 drivers/nvdimm/blk.c  |  5 +-
 drivers/nvdimm/btt.c  |  5 +-
 drivers/nvdimm/pmem.c |  5 +-
 drivers/nvme/host/core.c  |  1 +
 drivers/nvme/host/multipath.c |  5 +-
 drivers/nvme/host/nvme.h  |  1 +
 drivers/s390/block/dcssblk.c  |  9 ++--
 drivers/s390/block/xpram.c|  6 +--
 include/linux/blk-mq.h|  2 +-
 include/linux/blkdev.h|  7 +--
 include/linux/lightnvm.h  |  3 +-
 36 files changed, 153 insertions(+), 140 deletions(-)

diff --git a/Documentation/block/biodoc.rst b/Documentation/block/biodoc.rst
index b964796ec9c780..267384159bf793 100644
--- a/Documentation/block/biodoc.rst
+++ b/Documentation/block/biodoc.rst
@@ -1036,7 +1036,7 @@ Now the generic block layer performs partition-remapping 
early and thus
 provides drivers with a sector number relative to whole device, rather than
 having to take partition number into account in order to arrive at the true
 sector number. The routine blk_partition_remap() is invoked by
-generic_make_request even before invoking the queue specific make_request_fn,
+generic_make_request even before invoking the queue specific ->submit_bio,
 so the i/o scheduler also gets to operate on whole disk sector numbers. This
 should typically not require changes to block drivers, it just never gets
 to invoke its own partition sector offset calculations since all bios
diff --git a/Documentation/block/writeback_cache_control.rst 
b/Documentation/block/writeback_cache_control.rst
index 2c752c57c14c62..b208488d0aae85 100644
--- a/Documentation/block/writeback_cache_control.rst
+++ b/Documentation/block/writeback_cache_control.rst
@@ -47,7 +47,7 @@ the Forced Unit Access is implemented.  The REQ_PREFLUSH and 
REQ_FUA flags
 may both be set on a single bio.
 
 
-Implementation details for make_request_fn based block drivers
+Implementation details for bio based block drivers
 --
 
 These drivers will always see the REQ_PREFLUSH and REQ_FUA bits as they sit
diff --git a/arch/m68k/emu/nfblock.c b/arch/m68k/emu/nfblock.c
index 87e8b1700acd28..92d26c81244134 100644
--- a/arch/m68k/emu/nfblock.c
+++ b/arch/m68k/emu/nfblock.c
@@ -59,7 +59,7 @@ struct nfhd_device {
struct gendisk *disk;
 };
 
-static blk_qc_t nfhd_make_request(struct request_queue *queue, struct bio *bio)
+static blk_qc_t nfhd_submit_bio(struct bio *bio)
 {
struct nfhd_device *dev = bio->bi_disk->private_data;
struct bio_vec bvec;
@@ -93,6 +93,7 @@ static int nfhd_getgeo(struct block_device *bdev, struct 
hd_geometry *geo)
 
 static const struct block_device_operations nfhd_ops = {
.owner  = THIS_MODULE,
+   .submit_bio = nfhd_submit_bio,
.getgeo = nfhd_getgeo,
 };
 
@@ -118,7 +119,7 @@ static int __init nfhd_init_one(int id, u32 blocks, u32 
bsize)
dev->bsize = bsize;
dev->bshift = ffs(bsize) - 10;
 
-   dev->queue = blk_alloc_queue(nfhd_make_request, NUMA_NO_NODE);
+   dev->qu

[PATCH 03/20] drbd: stop using ->queuedata

2020-07-01 Thread Christoph Hellwig
Instead of setting up the queuedata as well just use one private data
field.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/drbd/drbd_main.c | 1 -
 drivers/block/drbd/drbd_req.c  | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 45fbd526c453bc..26f4e0aa7393b4 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2805,7 +2805,6 @@ enum drbd_ret_code drbd_create_device(struct 
drbd_config_context *adm_ctx, unsig
if (!q)
goto out_no_q;
device->rq_queue = q;
-   q->queuedata   = device;
 
disk = alloc_disk(1);
if (!disk)
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index c80a2f1c3c2a73..3f09b2ab977822 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1595,7 +1595,7 @@ void do_submit(struct work_struct *ws)
 
 blk_qc_t drbd_make_request(struct request_queue *q, struct bio *bio)
 {
-   struct drbd_device *device = (struct drbd_device *) q->queuedata;
+   struct drbd_device *device = bio->bi_disk->private_data;
unsigned long start_jif;
 
blk_queue_split(q, );
-- 
2.26.2
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 20/20] block: remove direct_make_request

2020-07-01 Thread Christoph Hellwig
Now that submit_bio_noacct has a decent blk-mq fast path there is no
more need for this bypass.

Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c  | 28 
 drivers/md/dm.c   |  5 +
 drivers/nvme/host/multipath.c |  2 +-
 include/linux/blkdev.h|  1 -
 4 files changed, 2 insertions(+), 34 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2ff166f0d24ee3..bf882b8d84450c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1211,34 +1211,6 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
 }
 EXPORT_SYMBOL(submit_bio_noacct);
 
-/**
- * direct_make_request - hand a buffer directly to its device driver for I/O
- * @bio:  The bio describing the location in memory and on the device.
- *
- * This function behaves like submit_bio_noacct(), but does not protect
- * against recursion.  Must only be used if the called driver is known
- * to be blk-mq based.
- */
-blk_qc_t direct_make_request(struct bio *bio)
-{
-   struct gendisk *disk = bio->bi_disk;
-
-   if (WARN_ON_ONCE(!disk->queue->mq_ops)) {
-   bio_io_error(bio);
-   return BLK_QC_T_NONE;
-   }
-   if (!submit_bio_checks(bio))
-   return BLK_QC_T_NONE;
-   if (unlikely(bio_queue_enter(bio)))
-   return BLK_QC_T_NONE;
-   if (!blk_crypto_bio_prep()) {
-   blk_queue_exit(disk->queue);
-   return BLK_QC_T_NONE;
-   }
-   return blk_mq_submit_bio(bio);
-}
-EXPORT_SYMBOL_GPL(direct_make_request);
-
 /**
  * submit_bio - submit a bio to the block device layer for I/O
  * @bio: The  bio which describes the I/O
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b32b539dbace56..2cb33896198c4c 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1302,10 +1302,7 @@ static blk_qc_t __map_bio(struct dm_target_io *tio)
/* the bio has been remapped so dispatch it */
trace_block_bio_remap(clone->bi_disk->queue, clone,
  bio_dev(io->orig_bio), sector);
-   if (md->type == DM_TYPE_NVME_BIO_BASED)
-   ret = direct_make_request(clone);
-   else
-   ret = submit_bio_noacct(clone);
+   ret = submit_bio_noacct(clone);
break;
case DM_MAPIO_KILL:
free_tio(tio);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index f07fa47c251d9d..a986ac52c4cc7f 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -314,7 +314,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
trace_block_bio_remap(bio->bi_disk->queue, bio,
  disk_devt(ns->head->disk),
  bio->bi_iter.bi_sector);
-   ret = direct_make_request(bio);
+   ret = submit_bio_noacct(bio);
} else if (nvme_available_path(head)) {
dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index b73cfa6a5141df..1cc913ffdbe21e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -853,7 +853,6 @@ static inline void rq_flush_dcache_pages(struct request *rq)
 extern int blk_register_queue(struct gendisk *disk);
 extern void blk_unregister_queue(struct gendisk *disk);
 blk_qc_t submit_bio_noacct(struct bio *bio);
-extern blk_qc_t direct_make_request(struct bio *bio);
 extern void blk_rq_init(struct request_queue *q, struct request *rq);
 extern void blk_put_request(struct request *);
 extern struct request *blk_get_request(struct request_queue *, unsigned int op,
-- 
2.26.2
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 06/20] rsxx: stop using ->queuedata

2020-07-01 Thread Christoph Hellwig
Instead of setting up the queuedata as well just use one private data
field.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/rsxx/dev.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index 3ba07ab30c84f5..6a4d8d26e32cbd 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -119,7 +119,7 @@ static void bio_dma_done_cb(struct rsxx_cardinfo *card,
 
 static blk_qc_t rsxx_make_request(struct request_queue *q, struct bio *bio)
 {
-   struct rsxx_cardinfo *card = q->queuedata;
+   struct rsxx_cardinfo *card = bio->bi_disk->private_data;
struct rsxx_bio_meta *bio_meta;
blk_status_t st = BLK_STS_IOERR;
 
@@ -267,8 +267,6 @@ int rsxx_setup_dev(struct rsxx_cardinfo *card)
card->queue->limits.discard_alignment   = RSXX_HW_BLK_SIZE;
}
 
-   card->queue->queuedata = card;
-
snprintf(card->gendisk->disk_name, sizeof(card->gendisk->disk_name),
 "rsxx%d", card->disk_id);
card->gendisk->major = card->major;
@@ -289,7 +287,6 @@ void rsxx_destroy_dev(struct rsxx_cardinfo *card)
card->gendisk = NULL;
 
blk_cleanup_queue(card->queue);
-   card->queue->queuedata = NULL;
unregister_blkdev(card->major, DRIVER_NAME);
 }
 
-- 
2.26.2
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 19/20] block: shortcut __submit_bio_noacct for blk-mq drivers

2020-07-01 Thread Christoph Hellwig
For blk-mq drivers bios can only be inserted for the same queue.  So
bypass the complicated sorting logic in __submit_bio_noacct with
a blk-mq simpler submission helper.

Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 57b5dc00d44cb1..2ff166f0d24ee3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1152,6 +1152,34 @@ static blk_qc_t __submit_bio_noacct(struct bio *bio)
return ret;
 }
 
+static blk_qc_t __submit_bio_noacct_mq(struct bio *bio)
+{
+   struct gendisk *disk = bio->bi_disk;
+   struct bio_list bio_list;
+   blk_qc_t ret = BLK_QC_T_NONE;
+
+   bio_list_init(_list);
+   current->bio_list = _list;
+
+   do {
+   WARN_ON_ONCE(bio->bi_disk != disk);
+
+   if (unlikely(bio_queue_enter(bio) != 0))
+   continue;
+
+   if (!blk_crypto_bio_prep()) {
+   blk_queue_exit(disk->queue);
+   ret = BLK_QC_T_NONE;
+   continue;
+   }
+
+   ret = blk_mq_submit_bio(bio);
+   } while ((bio = bio_list_pop(_list)));
+
+   current->bio_list = NULL;
+   return ret;
+}
+
 /**
  * submit_bio_noacct - re-submit a bio to the block device layer for I/O
  * @bio:  The bio describing the location in memory and on the device.
@@ -1177,6 +1205,8 @@ blk_qc_t submit_bio_noacct(struct bio *bio)
return BLK_QC_T_NONE;
}
 
+   if (!bio->bi_disk->fops->submit_bio)
+   return __submit_bio_noacct_mq(bio);
return __submit_bio_noacct(bio);
 }
 EXPORT_SYMBOL(submit_bio_noacct);
-- 
2.26.2
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


rename ->make_request_fn and move it to the block_device_operations v2

2020-07-01 Thread Christoph Hellwig
Hi Jens,

this series moves the make_request_fn method into block_device_operations
with the much more descriptive ->submit_bio name.  It then also gives
generic_make_request a more descriptive name, and further optimize the
path to issue to blk-mq, removing the need for the direct_make_request
bypass.

Changes since v1:
 - fix a null pointer dereference when dispatching from bio to request
   based drivers
 - clean up a few more comments

Diffstat:
 Documentation/block/biodoc.rst|2 
 Documentation/block/writeback_cache_control.rst   |2 
 Documentation/fault-injection/fault-injection.rst |2 
 Documentation/trace/ftrace.rst|4 
 arch/m68k/emu/nfblock.c   |8 
 arch/xtensa/platforms/iss/simdisk.c   |9 
 block/bio.c   |   14 -
 block/blk-cgroup.c|2 
 block/blk-core.c  |  255 +-
 block/blk-crypto-fallback.c   |2 
 block/blk-crypto.c|2 
 block/blk-merge.c |   23 -
 block/blk-mq.c|   12 -
 block/blk-throttle.c  |4 
 block/blk.h   |5 
 block/bounce.c|2 
 drivers/block/brd.c   |5 
 drivers/block/drbd/drbd_int.h |8 
 drivers/block/drbd/drbd_main.c|   12 -
 drivers/block/drbd/drbd_receiver.c|2 
 drivers/block/drbd/drbd_req.c |8 
 drivers/block/drbd/drbd_worker.c  |2 
 drivers/block/null_blk_main.c |   19 +
 drivers/block/pktcdvd.c   |   15 -
 drivers/block/ps3vram.c   |   20 -
 drivers/block/rsxx/dev.c  |   14 -
 drivers/block/umem.c  |   11 
 drivers/block/zram/zram_drv.c |   14 -
 drivers/lightnvm/core.c   |8 
 drivers/lightnvm/pblk-init.c  |   16 -
 drivers/lightnvm/pblk-read.c  |2 
 drivers/md/bcache/bcache.h|2 
 drivers/md/bcache/btree.c |2 
 drivers/md/bcache/request.c   |   11 
 drivers/md/bcache/request.h   |4 
 drivers/md/bcache/super.c |   24 +-
 drivers/md/dm-cache-target.c  |6 
 drivers/md/dm-clone-target.c  |   10 
 drivers/md/dm-crypt.c |6 
 drivers/md/dm-delay.c |2 
 drivers/md/dm-era-target.c|2 
 drivers/md/dm-integrity.c |4 
 drivers/md/dm-mpath.c |2 
 drivers/md/dm-raid1.c |2 
 drivers/md/dm-snap-persistent.c   |2 
 drivers/md/dm-snap.c  |6 
 drivers/md/dm-thin.c  |4 
 drivers/md/dm-verity-target.c |2 
 drivers/md/dm-writecache.c|2 
 drivers/md/dm-zoned-target.c  |2 
 drivers/md/dm.c   |   41 +--
 drivers/md/md-faulty.c|4 
 drivers/md/md-linear.c|4 
 drivers/md/md-multipath.c |4 
 drivers/md/md.c   |7 
 drivers/md/raid0.c|8 
 drivers/md/raid1.c|   14 -
 drivers/md/raid10.c   |   28 +-
 drivers/md/raid5.c|   10 
 drivers/nvdimm/blk.c  |5 
 drivers/nvdimm/btt.c  |5 
 drivers/nvdimm/pmem.c |5 
 drivers/nvme/host/core.c  |1 
 drivers/nvme/host/multipath.c |   18 -
 drivers/nvme/host/nvme.h  |1 
 drivers/s390/block/dcssblk.c  |   11 
 drivers/s390/block/xpram.c|8 
 fs/buffer.c   |5 
 include/linux/blk-mq.h|2 
 include/linux/blkdev.h|   12 -
 include/linux/lightnvm.h  |3 
 71 files changed, 387 insertions(+), 408 deletions(-)
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 02/20] simdisk: stop using ->queuedata

2020-07-01 Thread Christoph Hellwig
Instead of setting up the queuedata as well just use one private data
field.

Signed-off-by: Christoph Hellwig 
---
 arch/xtensa/platforms/iss/simdisk.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/xtensa/platforms/iss/simdisk.c 
b/arch/xtensa/platforms/iss/simdisk.c
index 49322b66cda931..31b5020077a059 100644
--- a/arch/xtensa/platforms/iss/simdisk.c
+++ b/arch/xtensa/platforms/iss/simdisk.c
@@ -103,7 +103,7 @@ static void simdisk_transfer(struct simdisk *dev, unsigned 
long sector,
 
 static blk_qc_t simdisk_make_request(struct request_queue *q, struct bio *bio)
 {
-   struct simdisk *dev = q->queuedata;
+   struct simdisk *dev = bio->bi_disk->private_data;
struct bio_vec bvec;
struct bvec_iter iter;
sector_t sector = bio->bi_iter.bi_sector;
@@ -273,8 +273,6 @@ static int __init simdisk_setup(struct simdisk *dev, int 
which,
goto out_alloc_queue;
}
 
-   dev->queue->queuedata = dev;
-
dev->gd = alloc_disk(SIMDISK_MINORS);
if (dev->gd == NULL) {
pr_err("alloc_disk failed\n");
-- 
2.26.2
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 05/20] ps3vram: stop using ->queuedata

2020-07-01 Thread Christoph Hellwig
Instead of setting up the queuedata as well just use one private data
field.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/ps3vram.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index 821d4d8b1d763e..5a1d1d137c7248 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -587,7 +587,7 @@ static struct bio *ps3vram_do_bio(struct 
ps3_system_bus_device *dev,
 
 static blk_qc_t ps3vram_make_request(struct request_queue *q, struct bio *bio)
 {
-   struct ps3_system_bus_device *dev = q->queuedata;
+   struct ps3_system_bus_device *dev = bio->bi_disk->private_data;
struct ps3vram_priv *priv = ps3_system_bus_get_drvdata(dev);
int busy;
 
@@ -745,7 +745,6 @@ static int ps3vram_probe(struct ps3_system_bus_device *dev)
}
 
priv->queue = queue;
-   queue->queuedata = dev;
blk_queue_max_segments(queue, BLK_MAX_SEGMENTS);
blk_queue_max_segment_size(queue, BLK_MAX_SEGMENT_SIZE);
blk_queue_max_hw_sectors(queue, BLK_SAFE_MAX_SECTORS);
-- 
2.26.2
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 10/20] dm: stop using ->queuedata

2020-07-01 Thread Christoph Hellwig
Instead of setting up the queuedata as well just use one private data
field.

Signed-off-by: Christoph Hellwig 
---
 drivers/md/dm.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e44473fe0f4873..c8d91f271c272e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1789,7 +1789,7 @@ static blk_qc_t dm_process_bio(struct mapped_device *md,
 
 static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio)
 {
-   struct mapped_device *md = q->queuedata;
+   struct mapped_device *md = bio->bi_disk->private_data;
blk_qc_t ret = BLK_QC_T_NONE;
int srcu_idx;
struct dm_table *map;
@@ -1995,7 +1995,6 @@ static struct mapped_device *alloc_dev(int minor)
md->queue = blk_alloc_queue(dm_make_request, numa_node_id);
if (!md->queue)
goto bad;
-   md->queue->queuedata = md;
 
md->disk = alloc_disk_node(1, md->numa_node_id);
if (!md->disk)
-- 
2.26.2
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 04/20] null_blk: stop using ->queuedata for bio mode

2020-07-01 Thread Christoph Hellwig
Instead of setting up the queuedata as well just use one private data
field.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/null_blk_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/null_blk_main.c b/drivers/block/null_blk_main.c
index 82259242b9b5c9..93ce0a00b2ae01 100644
--- a/drivers/block/null_blk_main.c
+++ b/drivers/block/null_blk_main.c
@@ -1392,7 +1392,7 @@ static blk_qc_t null_queue_bio(struct request_queue *q, 
struct bio *bio)
 {
sector_t sector = bio->bi_iter.bi_sector;
sector_t nr_sectors = bio_sectors(bio);
-   struct nullb *nullb = q->queuedata;
+   struct nullb *nullb = bio->bi_disk->private_data;
struct nullb_queue *nq = nullb_to_queue(nullb);
struct nullb_cmd *cmd;
 
-- 
2.26.2
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 08/20] zram: stop using ->queuedata

2020-07-01 Thread Christoph Hellwig
Instead of setting up the queuedata as well just use one private data
field.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/zram/zram_drv.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 6e2ad90b17a376..0564e3f384089e 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1586,7 +1586,7 @@ static void __zram_make_request(struct zram *zram, struct 
bio *bio)
  */
 static blk_qc_t zram_make_request(struct request_queue *queue, struct bio *bio)
 {
-   struct zram *zram = queue->queuedata;
+   struct zram *zram = bio->bi_disk->private_data;
 
if (!valid_io_request(zram, bio->bi_iter.bi_sector,
bio->bi_iter.bi_size)) {
@@ -1912,7 +1912,6 @@ static int zram_add(void)
zram->disk->first_minor = device_id;
zram->disk->fops = _devops;
zram->disk->queue = queue;
-   zram->disk->queue->queuedata = zram;
zram->disk->private_data = zram;
snprintf(zram->disk->disk_name, 16, "zram%d", device_id);
 
-- 
2.26.2
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


[PATCH 14/20] block: remove the NULL queue check in generic_make_request_checks

2020-07-01 Thread Christoph Hellwig
All registers disks must have a valid queue pointer, so don't bother to
log a warning for that case.

Signed-off-by: Christoph Hellwig 
---
 block/blk-core.c | 12 +---
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 95dca74534ff73..37435d0d433564 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -973,22 +973,12 @@ static inline blk_status_t blk_check_zone_append(struct 
request_queue *q,
 static noinline_for_stack bool
 generic_make_request_checks(struct bio *bio)
 {
-   struct request_queue *q;
+   struct request_queue *q = bio->bi_disk->queue;
int nr_sectors = bio_sectors(bio);
blk_status_t status = BLK_STS_IOERR;
-   char b[BDEVNAME_SIZE];
 
might_sleep();
 
-   q = bio->bi_disk->queue;
-   if (unlikely(!q)) {
-   printk(KERN_ERR
-  "generic_make_request: Trying to access "
-   "nonexistent block-device %s (%Lu)\n",
-   bio_devname(bio, b), (long long)bio->bi_iter.bi_sector);
-   goto end_io;
-   }
-
/*
 * For a REQ_NOWAIT based request, return -EOPNOTSUPP
 * if queue is not a request based queue.
-- 
2.26.2
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org


  1   2   3   4   5   6   7   8   9   >