Re: [dm-devel] [PATCH] nvme: allow ANA support to be independent of native multipathing

2018-11-16 Thread Hannes Reinecke

On 11/16/18 10:14 AM, Christoph Hellwig wrote:

On Thu, Nov 15, 2018 at 12:46:05PM -0500, Mike Snitzer wrote:

Whether or not ANA is present is a choice of the target implementation;
the host (and whether it supports multipathing) has _zero_ influence on
this.  If the target declares a path as 'inaccessible' the path _is_
inaccessible to the host.  As such, ANA support should be functional
even if native multipathing is not.

Introduce ability to always re-read ANA log page as required due to ANA
error and make current ANA state available via sysfs -- even if native
multipathing is disabled on the host (e.g. nvme_core.multipath=N).


The first part I could see, but I still want to make it conditional
in some way as nvme is going into deeply embedded setups, and I don't
want to carry the weight of the ANA code around for everyone.


Can you clarify this a bit?
We _do_ have the NVME multipath config option to deconfigure the whole 
thing during compile time; that isn't influenced by this patch.

So are you worried about the size of the ANA implementation itself?
Or are you worried about the size of the ANA structures?


The second I fundamentally disagree with.  And even if you found agreement
it would have to be in a separate patch as it is a separate feature.

Why? Where's the problem with re-reading the ANA log pages if we get an 
event indicating that we should?


Cheers,

Hannes
--
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [dm-devel] [PATCH] nvme: allow ANA support to be independent of native multipathing

2018-11-16 Thread Christoph Hellwig
On Thu, Nov 15, 2018 at 12:46:05PM -0500, Mike Snitzer wrote:
> Whether or not ANA is present is a choice of the target implementation;
> the host (and whether it supports multipathing) has _zero_ influence on
> this.  If the target declares a path as 'inaccessible' the path _is_
> inaccessible to the host.  As such, ANA support should be functional
> even if native multipathing is not.
> 
> Introduce ability to always re-read ANA log page as required due to ANA
> error and make current ANA state available via sysfs -- even if native
> multipathing is disabled on the host (e.g. nvme_core.multipath=N).

The first part I could see, but I still want to make it conditional
in some way as nvme is going into deeply embedded setups, and I don't
want to carry the weight of the ANA code around for everyone.

The second I fundamentally disagree with.  And even if you found agreement
it would have to be in a separate patch as it is a separate feature.



Re: [dm-devel] [PATCH V10 03/19] block: use bio_for_each_bvec() to compute multi-page bvec count

2018-11-16 Thread Christoph Hellwig
On Thu, Nov 15, 2018 at 02:18:47PM -0800, Omar Sandoval wrote:
> My only reason to prefer unsigned int is consistency. unsigned int is
> much more common in the kernel:
> 
> $ ag --cc -s 'unsigned\s+int' | wc -l
> 129632
> $ ag --cc -s 'unsigned\s+(?!char|short|int|long)' | wc -l
> 22435
> 
> checkpatch also warns on plain unsigned.

Talk about chicken and egg.  unsigned is perfectly valid C, and being
shorter often helps readability.  checkpatch is, as so often,
wrongly opinionated.



Re: [dm-devel] [PATCH 3/3] block: use a driver-specific handler for the "inflight" value

2018-11-16 Thread Christoph Hellwig
On Fri, Nov 16, 2018 at 01:04:19AM +0100, Mikulas Patocka wrote:
> Device mapper was converted to percpu inflight counters. In order to
> display the correct values in the "inflight" sysfs file and in
> /proc/diskstats, we need a custom callback that sums the percpu counters.
> 
> The function part_round_stats calculates the number of in-flight I/Os
> every jiffy and uses this to calculate the counters time_in_queue and
> io_ticks. In order to avoid excessive memory traffic on systems with high
> number of CPUs, this functionality is disabled when percpu inflight values
> are used and the values time_in_queue and io_ticks are calculated
> differently - the result is less precise.

And none of that is device mapper specific.  Please submit this code
to the block layer for use by all make_request based drivers.  Depending
on what Jens as the maintainer thinks of the tradeoffs, we can discuss
whether the summing should be on or off by default, or whether we maybe
even need the current code as a fallback.
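For reference, the scheme being discussed -- per-cpu in-flight counters plus a
summing callback the block layer could invoke for the sysfs "inflight" file and
/proc/diskstats -- might be sketched roughly like this; the structure and
function names below are illustrative assumptions, not the code from Mikulas'
patch:

/* Illustrative only: per-cpu in-flight counters, summed on demand. */
struct percpu_inflight {
        unsigned int __percpu *counters[2];     /* [0] = READ, [1] = WRITE */
};

/* Candidate "inflight" callback: sum the per-cpu counters. */
static void percpu_inflight_sum(struct percpu_inflight *pi,
                                unsigned int inflight[2])
{
        int cpu;

        inflight[0] = inflight[1] = 0;
        for_each_possible_cpu(cpu) {
                inflight[0] += *per_cpu_ptr(pi->counters[0], cpu);
                inflight[1] += *per_cpu_ptr(pi->counters[1], cpu);
        }
}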




Re: [dm-devel] [PATCH V10 01/19] block: introduce multi-page page bvec helpers

2018-11-16 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 04:52:48PM +0800, Ming Lei wrote:
> This patch introduces helpers of 'mp_bvec_iter_*' for multipage
> bvec support.
> 
> The introduced helpers treat one bvec as a real multi-page segment,
> which may include more than one page.
> 
> The existing bvec_iter_* helpers are interfaces for the current bvec
> iterator, which drivers, fs, dm, etc. treat as single-page. The new
> helpers build single-page bvecs in flight, so this change won't break
> current bio/bvec users, which need no changes.
> 
> Cc: Dave Chinner 
> Cc: Kent Overstreet 
> Cc: Mike Snitzer 
> Cc: dm-devel@redhat.com
> Cc: Alexander Viro 
> Cc: linux-fsde...@vger.kernel.org
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: linux-er...@lists.ozlabs.org
> Cc: David Sterba 
> Cc: linux-bt...@vger.kernel.org
> Cc: Darrick J. Wong 
> Cc: linux-...@vger.kernel.org
> Cc: Gao Xiang 
> Cc: Christoph Hellwig 
> Cc: Theodore Ts'o 
> Cc: linux-e...@vger.kernel.org
> Cc: Coly Li 
> Cc: linux-bca...@vger.kernel.org
> Cc: Boaz Harrosh 
> Cc: Bob Peterson 
> Cc: cluster-de...@redhat.com

Reviewed-by: Omar Sandoval 

But a couple of comments below.

> Signed-off-by: Ming Lei 
> ---
>  include/linux/bvec.h | 63 
> +---
>  1 file changed, 60 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> index 02c73c6aa805..8ef904a50577 100644
> --- a/include/linux/bvec.h
> +++ b/include/linux/bvec.h
> @@ -23,6 +23,44 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +
> +/*
> + * What is multi-page bvecs?
> + *
> + * - bvecs stored in bio->bi_io_vec is always multi-page(mp) style
> + *
> + * - bvec(struct bio_vec) represents one physically contiguous I/O
> + *   buffer, now the buffer may include more than one pages after
> + *   multi-page(mp) bvec is supported, and all these pages represented
> + *   by one bvec is physically contiguous. Before mp support, at most
> + *   one page is included in one bvec, we call it single-page(sp)
> + *   bvec.
> + *
> + * - .bv_page of the bvec represents the 1st page in the mp bvec
> + *
> + * - .bv_offset of the bvec represents offset of the buffer in the bvec
> + *
> + * The effect on the current drivers/filesystem/dm/bcache/...:
> + *
> + * - almost everyone supposes that one bvec only includes one single
> + *   page, so we keep the sp interface not changed, for example,
> + *   bio_for_each_segment() still returns bvec with single page
> + *
> + * - bio_for_each_segment*() will be changed to return single-page
> + *   bvec too
> + *
> + * - during iterating, iterator variable(struct bvec_iter) is always
> + *   updated in multipage bvec style and that means bvec_iter_advance()
> + *   is kept not changed
> + *
> + * - returned(copied) single-page bvec is built in flight by bvec
> + *   helpers from the stored multipage bvec
> + *
> + * - In case that some components(such as iov_iter) need to support
> + *   multi-page bvec, we introduce new helpers(mp_bvec_iter_*) for
> + *   them.
> + */

This comment sounds more like a commit message (i.e., how were things
before, and how are we changing them). In a couple of years when I read
this code, I probably won't care how it was changed, just how it works.
So I think a comment explaining the concepts of multi-page and
single-page bvecs is very useful, but please move all of the "foo was
changed" and "before mp support" type stuff to the commit message.

>  /*
>   * was unsigned short, but we might as well be ready for > 64kB I/O pages
> @@ -50,16 +88,35 @@ struct bvec_iter {
>   */
>  #define __bvec_iter_bvec(bvec, iter) (&(bvec)[(iter).bi_idx])
>  
> -#define bvec_iter_page(bvec, iter)   \
> +#define mp_bvec_iter_page(bvec, iter)\
>   (__bvec_iter_bvec((bvec), (iter))->bv_page)
>  
> -#define bvec_iter_len(bvec, iter)\
> +#define mp_bvec_iter_len(bvec, iter) \
>   min((iter).bi_size, \
>   __bvec_iter_bvec((bvec), (iter))->bv_len - (iter).bi_bvec_done)
>  
> -#define bvec_iter_offset(bvec, iter) \
> +#define mp_bvec_iter_offset(bvec, iter)  \
>   (__bvec_iter_bvec((bvec), (iter))->bv_offset + (iter).bi_bvec_done)
>  
> +#define mp_bvec_iter_page_idx(bvec, iter)\
> + (mp_bvec_iter_offset((bvec), (iter)) / PAGE_SIZE)
> +
> +/*
> + *  of single-page(sp) segment.
> + *
> + * This helpers are for building sp bvec in flight.
> + */
> +#define bvec_iter_offset(bvec, iter) \
> + (mp_bvec_iter_offset((bvec), (iter)) % PAGE_SIZE)
> +
> +#define bvec_iter_len(bvec, iter)\
> + min_t(unsigned, mp_bvec_iter_len((bvec), (iter)),   \
> + (PAGE_SIZE - (bvec_iter_offset((bvec), (iter)

The parentheses around 

Re: [dm-devel] [PATCH V10 03/19] block: use bio_for_each_bvec() to compute multi-page bvec count

2018-11-16 Thread Gao Xiang


On 2018/11/16 17:19, Christoph Hellwig wrote:
> On Thu, Nov 15, 2018 at 02:18:47PM -0800, Omar Sandoval wrote:
>> My only reason to prefer unsigned int is consistency. unsigned int is
>> much more common in the kernel:
>>
>> $ ag --cc -s 'unsigned\s+int' | wc -l
>> 129632
>> $ ag --cc -s 'unsigned\s+(?!char|short|int|long)' | wc -l
>> 22435
>>
>> checkpatch also warns on plain unsigned.
> 
> Talk about chicken and egg.  unsigned is perfectly valid C, and being
> shorter often helps readability.  checkpatch is, as so often,
> wrongly opinionated.
> 

sigh... I personally tend to use "unsigned" instead of "unsigned int" as well,
but checkpatch.pl also suggests that erofs use "unsigned int" :-(

Thanks,
Gao Xiang



Re: [dm-devel] [PATCH V10 03/19] block: use bio_for_each_bvec() to compute multi-page bvec count

2018-11-16 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 04:52:50PM +0800, Ming Lei wrote:
> First it is more efficient to use bio_for_each_bvec() in both
> blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how
> many multi-page bvecs there are in the bio.
> 
> Secondly once bio_for_each_bvec() is used, the bvec may need to be
> split because its length can be much longer than the max segment size,
> so we have to split the big bvec into several segments.
> 
> Thirdly when splitting multi-page bvec into segments, the max segment
> limit may be reached, so the bio split needs to be considered in
> this situation too.
> 
> Cc: Dave Chinner 
> Cc: Kent Overstreet 
> Cc: Mike Snitzer 
> Cc: dm-devel@redhat.com
> Cc: Alexander Viro 
> Cc: linux-fsde...@vger.kernel.org
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: linux-er...@lists.ozlabs.org
> Cc: David Sterba 
> Cc: linux-bt...@vger.kernel.org
> Cc: Darrick J. Wong 
> Cc: linux-...@vger.kernel.org
> Cc: Gao Xiang 
> Cc: Christoph Hellwig 
> Cc: Theodore Ts'o 
> Cc: linux-e...@vger.kernel.org
> Cc: Coly Li 
> Cc: linux-bca...@vger.kernel.org
> Cc: Boaz Harrosh 
> Cc: Bob Peterson 
> Cc: cluster-de...@redhat.com
> Signed-off-by: Ming Lei 
> ---
>  block/blk-merge.c | 90 
> ++-
>  1 file changed, 76 insertions(+), 14 deletions(-)
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 91b2af332a84..6f7deb94a23f 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -160,6 +160,62 @@ static inline unsigned get_max_io_size(struct 
> request_queue *q,
>   return sectors;
>  }
>  
> +/*
> + * Split the bvec @bv into segments, and update all kinds of
> + * variables.
> + */
> +static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
> + unsigned *nsegs, unsigned *last_seg_size,
> + unsigned *front_seg_size, unsigned *sectors)
> +{
> + bool need_split = false;
> + unsigned len = bv->bv_len;
> + unsigned total_len = 0;
> + unsigned new_nsegs = 0, seg_size = 0;

"unsigned int" here and everywhere else.

> + if ((*nsegs >= queue_max_segments(q)) || !len)
> + return need_split;
> +
> + /*
> +  * Multipage bvec may be too big to hold in one segment,
> +  * so the current bvec has to be splitted as multiple
> +  * segments.
> +  */
> + while (new_nsegs + *nsegs < queue_max_segments(q)) {
> + seg_size = min(queue_max_segment_size(q), len);
> +
> + new_nsegs++;
> + total_len += seg_size;
> + len -= seg_size;
> +
> + if ((queue_virt_boundary(q) && ((bv->bv_offset +
> + total_len) & queue_virt_boundary(q))) || !len)
> + break;

Checking queue_virt_boundary(q) != 0 is superfluous, and the len check
could just control the loop, i.e.,

while (len && new_nsegs + *nsegs < queue_max_segments(q)) {
seg_size = min(queue_max_segment_size(q), len);

new_nsegs++;
total_len += seg_size;
len -= seg_size;

if ((bv->bv_offset + total_len) & queue_virt_boundary(q))
break;
}

And if you rewrite it this way, I _think_ you can get rid of this
special case:

if ((*nsegs >= queue_max_segments(q)) || !len)
return need_split;

above.

> + }
> +
> + /* split in the middle of the bvec */
> + if (len)
> + need_split = true;

need_split is unnecessary, just return len != 0.
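
Putting both suggestions together -- let the remaining length control the loop
and express the result as "is there anything left" -- the helper could be
reshaped roughly as below. This is a sketch against the quoted patch, not the
code as posted:

static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
                            unsigned int *nsegs, unsigned int *last_seg_size,
                            unsigned int *front_seg_size, unsigned int *sectors)
{
        unsigned int len = bv->bv_len;
        unsigned int total_len = 0;
        unsigned int new_nsegs = 0, seg_size = 0;

        while (len && new_nsegs + *nsegs < queue_max_segments(q)) {
                seg_size = min(queue_max_segment_size(q), len);

                new_nsegs++;
                total_len += seg_size;
                len -= seg_size;

                if ((bv->bv_offset + total_len) & queue_virt_boundary(q))
                        break;
        }

        if (!new_nsegs)
                return false;

        /* update front segment size */
        if (!*nsegs) {
                unsigned int first_seg_size = seg_size;

                if (new_nsegs > 1)
                        first_seg_size = queue_max_segment_size(q);
                if (*front_seg_size < first_seg_size)
                        *front_seg_size = first_seg_size;
        }

        /* update other variables */
        *last_seg_size = seg_size;
        *nsegs += new_nsegs;
        if (sectors)
                *sectors += total_len >> 9;

        /* a non-zero remainder means the bvec must be split here */
        return len != 0;
}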

> +
> + /* update front segment size */
> + if (!*nsegs) {
> + unsigned first_seg_size = seg_size;
> +
> + if (new_nsegs > 1)
> + first_seg_size = queue_max_segment_size(q);
> + if (*front_seg_size < first_seg_size)
> + *front_seg_size = first_seg_size;
> + }
> +
> + /* update other varibles */
> + *last_seg_size = seg_size;
> + *nsegs += new_nsegs;
> + if (sectors)
> + *sectors += total_len >> 9;
> +
> + return need_split;
> +}
> +
>  static struct bio *blk_bio_segment_split(struct request_queue *q,
>struct bio *bio,
>struct bio_set *bs,
> @@ -173,7 +229,7 @@ static struct bio *blk_bio_segment_split(struct 
> request_queue *q,
>   struct bio *new = NULL;
>   const unsigned max_sectors = get_max_io_size(q, bio);
>  
> - bio_for_each_segment(bv, bio, iter) {
> + bio_for_each_bvec(bv, bio, iter) {
>   /*
>* If the queue doesn't support SG gaps and adding this
>* offset would create a gap, disallow it.
> @@ -188,8 +244,12 @@ static struct bio *blk_bio_segment_split(struct 
> request_queue *q,
>*/
>   if (nsegs < queue_max_segments(q) &&
>   sectors < max_sectors) {
> -

Re: [dm-devel] [PATCH V10 08/19] btrfs: move bio_pages_all() to btrfs

2018-11-16 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 04:52:55PM +0800, Ming Lei wrote:
> BTRFS is the only user of this helper, so move this helper into
> BTRFS, and implement it via bio_for_each_segment_all(), since
> bio->bi_vcnt may not equal the number of pages after multipage bvec
> is enabled.

Shouldn't you also get rid of bio_pages_all() in this patch?

> Cc: Dave Chinner 
> Cc: Kent Overstreet 
> Cc: Mike Snitzer 
> Cc: dm-devel@redhat.com
> Cc: Alexander Viro 
> Cc: linux-fsde...@vger.kernel.org
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: linux-er...@lists.ozlabs.org
> Cc: David Sterba 
> Cc: linux-bt...@vger.kernel.org
> Cc: Darrick J. Wong 
> Cc: linux-...@vger.kernel.org
> Cc: Gao Xiang 
> Cc: Christoph Hellwig 
> Cc: Theodore Ts'o 
> Cc: linux-e...@vger.kernel.org
> Cc: Coly Li 
> Cc: linux-bca...@vger.kernel.org
> Cc: Boaz Harrosh 
> Cc: Bob Peterson 
> Cc: cluster-de...@redhat.com
> Signed-off-by: Ming Lei 
> ---
>  fs/btrfs/extent_io.c | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 5d5965297e7e..874bb9aeebdc 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2348,6 +2348,18 @@ struct bio *btrfs_create_repair_bio(struct inode 
> *inode, struct bio *failed_bio,
>   return bio;
>  }
>  
> +static unsigned btrfs_bio_pages_all(struct bio *bio)
> +{
> + unsigned i;
> + struct bio_vec *bv;
> +
> + WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
> +
> + bio_for_each_segment_all(bv, bio, i)
> + ;
> + return i;
> +}
> +
>  /*
>   * this is a generic handler for readpage errors (default
>   * readpage_io_failed_hook). if other copies exist, read those and write back
> @@ -2368,7 +2380,7 @@ static int bio_readpage_error(struct bio *failed_bio, 
> u64 phy_offset,
>   int read_mode = 0;
>   blk_status_t status;
>   int ret;
> - unsigned failed_bio_pages = bio_pages_all(failed_bio);
> + unsigned failed_bio_pages = btrfs_bio_pages_all(failed_bio);
>  
>   BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
>  
> -- 
> 2.9.5
> 



Re: [dm-devel] [PATCH V10 17/19] block: don't use bio->bi_vcnt to figure out segment number

2018-11-16 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 04:53:04PM +0800, Ming Lei wrote:
> It is wrong to use bio->bi_vcnt to figure out how many segments
> there are in the bio even though the CLONED flag isn't set on this bio,
> because this bio may have been split or advanced.
> 
> So always use bio_segments() in blk_recount_segments(), and it shouldn't
> cause any performance loss now because the physical segment number is figured
> out in blk_queue_split() and BIO_SEG_VALID is set at the same time, since
> bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting").
> 
> Cc: Dave Chinner 
> Cc: Kent Overstreet 
> Fixes: 7f60dcaaf91 ("block: blk-merge: fix blk_recount_segments()")

From what I can tell, the problem was originally introduced by
76d8137a3113 ("blk-merge: recaculate segment if it isn't less than max segments")

Is that right?

> Cc: Mike Snitzer 
> Cc: dm-devel@redhat.com
> Cc: Alexander Viro 
> Cc: linux-fsde...@vger.kernel.org
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: linux-er...@lists.ozlabs.org
> Cc: David Sterba 
> Cc: linux-bt...@vger.kernel.org
> Cc: Darrick J. Wong 
> Cc: linux-...@vger.kernel.org
> Cc: Gao Xiang 
> Cc: Christoph Hellwig 
> Cc: Theodore Ts'o 
> Cc: linux-e...@vger.kernel.org
> Cc: Coly Li 
> Cc: linux-bca...@vger.kernel.org
> Cc: Boaz Harrosh 
> Cc: Bob Peterson 
> Cc: cluster-de...@redhat.com
> Signed-off-by: Ming Lei 
> ---
>  block/blk-merge.c | 8 +---
>  1 file changed, 1 insertion(+), 7 deletions(-)
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index cb9f49bcfd36..153a659fde74 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -429,13 +429,7 @@ void blk_recalc_rq_segments(struct request *rq)
>  
>  void blk_recount_segments(struct request_queue *q, struct bio *bio)
>  {
> - unsigned short seg_cnt;
> -
> - /* estimate segment number by bi_vcnt for non-cloned bio */
> - if (bio_flagged(bio, BIO_CLONED))
> - seg_cnt = bio_segments(bio);
> - else
> - seg_cnt = bio->bi_vcnt;
> + unsigned short seg_cnt = bio_segments(bio);
>  
>   if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags) &&
>   (seg_cnt < queue_max_segments(q)))
> -- 
> 2.9.5
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH V10 14/19] block: enable multipage bvecs

2018-11-16 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 04:53:01PM +0800, Ming Lei wrote:
> This patch pulls the trigger for multi-page bvecs.
> 
> Now any request queue which supports queue cluster will see multi-page
> bvecs.
> 
> Cc: Dave Chinner 
> Cc: Kent Overstreet 
> Cc: Mike Snitzer 
> Cc: dm-devel@redhat.com
> Cc: Alexander Viro 
> Cc: linux-fsde...@vger.kernel.org
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: linux-er...@lists.ozlabs.org
> Cc: David Sterba 
> Cc: linux-bt...@vger.kernel.org
> Cc: Darrick J. Wong 
> Cc: linux-...@vger.kernel.org
> Cc: Gao Xiang 
> Cc: Christoph Hellwig 
> Cc: Theodore Ts'o 
> Cc: linux-e...@vger.kernel.org
> Cc: Coly Li 
> Cc: linux-bca...@vger.kernel.org
> Cc: Boaz Harrosh 
> Cc: Bob Peterson 
> Cc: cluster-de...@redhat.com
> Signed-off-by: Ming Lei 
> ---
>  block/bio.c | 24 ++--
>  1 file changed, 18 insertions(+), 6 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 6486722d4d4b..ed6df6f8e63d 100644
> --- a/block/bio.c
> +++ b/block/bio.c

This comment above __bio_try_merge_page() doesn't make sense after this
change:

 This is a
 a useful optimisation for file systems with a block size smaller than the
 page size.

Can you please get rid of it in this patch?

> @@ -767,12 +767,24 @@ bool __bio_try_merge_page(struct bio *bio, struct page 
> *page,
>  
>   if (bio->bi_vcnt > 0) {
>   struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
> -
> - if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
> - bv->bv_len += len;
> - bio->bi_iter.bi_size += len;
> - return true;
> - }
> + struct request_queue *q = NULL;
> +
> + if (page == bv->bv_page && off == (bv->bv_offset + bv->bv_len)
> + && (off + len) <= PAGE_SIZE)
> + goto merge;

The parentheses around (bv->bv_offset + bv->bv_len) and (off + len) are
unnecessary noise.

What's the point of the new (off + len) <= PAGE_SIZE check?

> +
> + if (bio->bi_disk)
> + q = bio->bi_disk->queue;
> +
> + /* disable multi-page bvec too if cluster isn't enabled */
> + if (!q || !blk_queue_cluster(q) ||
> + ((page_to_phys(bv->bv_page) + bv->bv_offset + bv->bv_len) !=
> +  (page_to_phys(page) + off)))

More unnecessary parentheses here.

> + return false;
> + merge:
> + bv->bv_len += len;
> + bio->bi_iter.bi_size += len;
> + return true;
>   }
>   return false;
>  }
> -- 
> 2.9.5
> 



Re: [dm-devel] [PATCH V10 18/19] block: kill QUEUE_FLAG_NO_SG_MERGE

2018-11-16 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 04:53:05PM +0800, Ming Lei wrote:
> Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting"),
> physical segment number is mainly figured out in blk_queue_split() for
> fast path, and the flag of BIO_SEG_VALID is set there too.
> 
> Now only blk_recount_segments() and blk_recalc_rq_segments() use this
> flag.
> 
> Basically blk_recount_segments() is bypassed in fast path given BIO_SEG_VALID
> is set in blk_queue_split().
> 
> For another user of blk_recalc_rq_segments():
> 
> - run in partial completion branch of blk_update_request, which is an unusual case
> 
> - run in blk_cloned_rq_check_limits(), still not a big problem if the flag is killed
>   since dm-rq is the only user.
> 
> Multi-page bvec is enabled now, QUEUE_FLAG_NO_SG_MERGE doesn't make sense any more.

This commit message wasn't very clear. Is it the case that
QUEUE_FLAG_NO_SG_MERGE is no longer set by any drivers?



Re: [dm-devel] [PATCH V10 07/19] btrfs: use bvec_last_segment to get bio's last page

2018-11-16 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 04:52:54PM +0800, Ming Lei wrote:
> Preparing for supporting multi-page bvec.
> 
> Cc: Dave Chinner 
> Cc: Kent Overstreet 
> Cc: Mike Snitzer 
> Cc: dm-devel@redhat.com
> Cc: Alexander Viro 
> Cc: linux-fsde...@vger.kernel.org
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: linux-er...@lists.ozlabs.org
> Cc: David Sterba 
> Cc: linux-bt...@vger.kernel.org
> Cc: Darrick J. Wong 
> Cc: linux-...@vger.kernel.org
> Cc: Gao Xiang 
> Cc: Christoph Hellwig 
> Cc: Theodore Ts'o 
> Cc: linux-e...@vger.kernel.org
> Cc: Coly Li 
> Cc: linux-bca...@vger.kernel.org
> Cc: Boaz Harrosh 
> Cc: Bob Peterson 
> Cc: cluster-de...@redhat.com

Reviewed-by: Omar Sandoval 

> Signed-off-by: Ming Lei 
> ---
>  fs/btrfs/compression.c | 5 -
>  fs/btrfs/extent_io.c   | 5 +++--
>  2 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 2955a4ea2fa8..161e14b8b180 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -400,8 +400,11 @@ blk_status_t btrfs_submit_compressed_write(struct inode 
> *inode, u64 start,
>  static u64 bio_end_offset(struct bio *bio)
>  {
>   struct bio_vec *last = bio_last_bvec_all(bio);
> + struct bio_vec bv;
>  
> - return page_offset(last->bv_page) + last->bv_len + last->bv_offset;
> + bvec_last_segment(last, &bv);
> +
> + return page_offset(bv.bv_page) + bv.bv_len + bv.bv_offset;
>  }
>  
>  static noinline int add_ra_bio_pages(struct inode *inode,
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index d228f706ff3e..5d5965297e7e 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2720,11 +2720,12 @@ static int __must_check submit_one_bio(struct bio 
> *bio, int mirror_num,
>  {
>   blk_status_t ret = 0;
>   struct bio_vec *bvec = bio_last_bvec_all(bio);
> - struct page *page = bvec->bv_page;
> + struct bio_vec bv;
>   struct extent_io_tree *tree = bio->bi_private;
>   u64 start;
>  
> - start = page_offset(page) + bvec->bv_offset;
> + bvec_last_segment(bvec, &bv);
> + start = page_offset(bv.bv_page) + bv.bv_offset;
>  
>   bio->bi_private = NULL;
>  
> -- 
> 2.9.5
> 



Re: [dm-devel] [PATCH V10 16/19] block: document usage of bio iterator helpers

2018-11-16 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 04:53:03PM +0800, Ming Lei wrote:
> Now multi-page bvec is supported, some helpers may return page by
> page, meantime some may return segment by segment, this patch
> documents the usage.
> 
> Cc: Dave Chinner 
> Cc: Kent Overstreet 
> Cc: Mike Snitzer 
> Cc: dm-devel@redhat.com
> Cc: Alexander Viro 
> Cc: linux-fsde...@vger.kernel.org
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: linux-er...@lists.ozlabs.org
> Cc: David Sterba 
> Cc: linux-bt...@vger.kernel.org
> Cc: Darrick J. Wong 
> Cc: linux-...@vger.kernel.org
> Cc: Gao Xiang 
> Cc: Christoph Hellwig 
> Cc: Theodore Ts'o 
> Cc: linux-e...@vger.kernel.org
> Cc: Coly Li 
> Cc: linux-bca...@vger.kernel.org
> Cc: Boaz Harrosh 
> Cc: Bob Peterson 
> Cc: cluster-de...@redhat.com
> Signed-off-by: Ming Lei 
> ---
>  Documentation/block/biovecs.txt | 26 ++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
> index 25689584e6e0..bfafb70d0d9e 100644
> --- a/Documentation/block/biovecs.txt
> +++ b/Documentation/block/biovecs.txt
> @@ -117,3 +117,29 @@ Other implications:
> size limitations and the limitations of the underlying devices. Thus
> there's no need to define ->merge_bvec_fn() callbacks for individual block
> drivers.
> +
> +Usage of helpers:
> +=
> +
> +* The following helpers whose names have the suffix of "_all" can only be used
> +on non-BIO_CLONED bio, and usually they are used by filesystem code, and driver
> +shouldn't use them because bio may have been split before they got to the driver:

Putting an english teacher hat on, this is quite the run-on sentence.
How about:

* The following helpers whose names have the suffix of "_all" can only be used
on non-BIO_CLONED bio. They are usually used by filesystem code. Drivers
shouldn't use them because the bio may have been split before it reached the
driver.

Maybe also an explanation of why the filesystem would want to use these?

> + bio_for_each_segment_all()
> + bio_first_bvec_all()
> + bio_first_page_all()
> + bio_last_bvec_all()
> +
> +* The following helpers iterate over single-page bvec, and the local
> +variable of 'struct bio_vec' or the reference records single-page IO
> +vector during the itearation:


* The following helpers iterate over single-page bvecs. The passed 'struct
bio_vec' will contain a single-page IO vector during the iteration.

> + bio_for_each_segment()
> + bio_for_each_segment_all()
> +
> +* The following helper iterates over multi-page bvec, and each bvec may
> +include multiple physically contiguous pages, and the local variable of
> +'struct bio_vec' or the reference records multi-page IO vector during the
> +itearation:

* The following helper iterates over multi-page bvecs. Each bvec may
include multiple physically contiguous pages. The passed 'struct bio_vec' will
contain a multi-page IO vector during the iteration.

> + bio_for_each_bvec()
> -- 
> 2.9.5
> 



Re: [dm-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-16 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 04:52:56PM +0800, Ming Lei wrote:
> There are still cases in which we need to use bio_bvecs() to get the
> number of multi-page segments, so introduce it.
> 
> Cc: Dave Chinner 
> Cc: Kent Overstreet 
> Cc: Mike Snitzer 
> Cc: dm-devel@redhat.com
> Cc: Alexander Viro 
> Cc: linux-fsde...@vger.kernel.org
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: linux-er...@lists.ozlabs.org
> Cc: David Sterba 
> Cc: linux-bt...@vger.kernel.org
> Cc: Darrick J. Wong 
> Cc: linux-...@vger.kernel.org
> Cc: Gao Xiang 
> Cc: Christoph Hellwig 
> Cc: Theodore Ts'o 
> Cc: linux-e...@vger.kernel.org
> Cc: Coly Li 
> Cc: linux-bca...@vger.kernel.org
> Cc: Boaz Harrosh 
> Cc: Bob Peterson 
> Cc: cluster-de...@redhat.com

Reviewed-by: Omar Sandoval 

> Signed-off-by: Ming Lei 
> ---
>  include/linux/bio.h | 30 +-
>  1 file changed, 25 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 1f0dcf109841..3496c816946e 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -196,7 +196,6 @@ static inline unsigned bio_segments(struct bio *bio)
>* We special case discard/write same/write zeroes, because they
>* interpret bi_size differently:
>*/
> -
>   switch (bio_op(bio)) {
>   case REQ_OP_DISCARD:
>   case REQ_OP_SECURE_ERASE:
> @@ -205,13 +204,34 @@ static inline unsigned bio_segments(struct bio *bio)
>   case REQ_OP_WRITE_SAME:
>   return 1;
>   default:
> - break;
> + bio_for_each_segment(bv, bio, iter)
> + segs++;
> + return segs;
>   }
> +}
>  
> - bio_for_each_segment(bv, bio, iter)
> - segs++;
> +static inline unsigned bio_bvecs(struct bio *bio)
> +{
> + unsigned bvecs = 0;
> + struct bio_vec bv;
> + struct bvec_iter iter;
>  
> - return segs;
> + /*
> +  * We special case discard/write same/write zeroes, because they
> +  * interpret bi_size differently:
> +  */
> + switch (bio_op(bio)) {
> + case REQ_OP_DISCARD:
> + case REQ_OP_SECURE_ERASE:
> + case REQ_OP_WRITE_ZEROES:
> + return 0;
> + case REQ_OP_WRITE_SAME:
> + return 1;
> + default:
> + bio_for_each_bvec(bv, bio, iter)
> + bvecs++;
> + return bvecs;
> + }
>  }
>  
>  /*
> -- 
> 2.9.5
> 



Re: [dm-devel] [PATCH] nvme: allow ANA support to be independent of native multipathing

2018-11-16 Thread Christoph Hellwig
On Fri, Nov 16, 2018 at 11:06:32AM +0100, Hannes Reinecke wrote:
> Ok, so would you be happy with making ANA support configurable?

I've looked a bit over the whole situation, and what I think we need
to do is:

 a) warn if we see an ANA capable device without multipath support
so people know it is not going to work properly.
 b) deprecate the multipath module option.  It was only intended as
a migration aid for any pre-existing PCIe multipath users, if there
were any, not to support any new functionality.  So for 4.20
put in a patch that prints a clear warning when it is used,
including a link to the nvme list, and then for 4.25 or so
remove it entirely unless something unexpected comes up.

This whole drama of optional multipath use has wasted way too much
of everyone's time already.
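
As a rough illustration of (a), the warning could be as small as the sketch
below; the exact hook point, CMIC bit and helper names are assumptions for
illustration, not a proposed patch:

/* Illustrative sketch: complain once if the controller reports ANA but
 * native multipathing is compiled out or disabled (nvme_core.multipath=N). */
static void nvme_warn_ana_without_mpath(struct nvme_ctrl *ctrl)
{
        if ((ctrl->subsys->cmic & (1 << 3)) &&  /* ANA-capable per CMIC (assumed bit) */
            !nvme_ctrl_use_ana(ctrl))
                dev_warn_once(ctrl->device,
                              "controller reports ANA but native multipathing is disabled; path states will be ignored\n");
}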



Re: [dm-devel] [PATCH V10 05/19] block: introduce bvec_last_segment()

2018-11-16 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 04:52:52PM +0800, Ming Lei wrote:
> BTRFS and guard_bio_eod() need to get the last singlepage segment
> from one multipage bvec, so introduce this helper to make them happy.
> 
> Cc: Dave Chinner 
> Cc: Kent Overstreet 
> Cc: Mike Snitzer 
> Cc: dm-devel@redhat.com
> Cc: Alexander Viro 
> Cc: linux-fsde...@vger.kernel.org
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: linux-er...@lists.ozlabs.org
> Cc: David Sterba 
> Cc: linux-bt...@vger.kernel.org
> Cc: Darrick J. Wong 
> Cc: linux-...@vger.kernel.org
> Cc: Gao Xiang 
> Cc: Christoph Hellwig 
> Cc: Theodore Ts'o 
> Cc: linux-e...@vger.kernel.org
> Cc: Coly Li 
> Cc: linux-bca...@vger.kernel.org
> Cc: Boaz Harrosh 
> Cc: Bob Peterson 
> Cc: cluster-de...@redhat.com

Reviewed-by: Omar Sandoval 

Minor comments below.

> Signed-off-by: Ming Lei 
> ---
>  include/linux/bvec.h | 25 +
>  1 file changed, 25 insertions(+)
> 
> diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> index 3d61352cd8cf..01616a0b6220 100644
> --- a/include/linux/bvec.h
> +++ b/include/linux/bvec.h
> @@ -216,4 +216,29 @@ static inline bool mp_bvec_iter_advance(const struct 
> bio_vec *bv,
>   .bi_bvec_done   = 0,\
>  }
>  
> +/*
> + * Get the last singlepage segment from the multipage bvec and store it
> + * in @seg
> + */
> +static inline void bvec_last_segment(const struct bio_vec *bvec,
> + struct bio_vec *seg)

Indentation is all messed up here.

> +{
> + unsigned total = bvec->bv_offset + bvec->bv_len;
> + unsigned last_page = total / PAGE_SIZE;
> +
> + if (last_page * PAGE_SIZE == total)
> + last_page--;

I think this could just be

unsigned int last_page = (total - 1) / PAGE_SIZE;
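
i.e. the whole helper could read as follows -- a sketch with that
simplification applied, not the submitted patch:

static inline void bvec_last_segment(const struct bio_vec *bvec,
                                     struct bio_vec *seg)
{
        unsigned int total = bvec->bv_offset + bvec->bv_len;
        unsigned int last_page = (total - 1) / PAGE_SIZE;

        seg->bv_page = nth_page(bvec->bv_page, last_page);

        /* the whole segment is inside the last page */
        if (bvec->bv_offset >= last_page * PAGE_SIZE) {
                seg->bv_offset = bvec->bv_offset % PAGE_SIZE;
                seg->bv_len = bvec->bv_len;
        } else {
                seg->bv_offset = 0;
                seg->bv_len = total - last_page * PAGE_SIZE;
        }
}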

> + seg->bv_page = nth_page(bvec->bv_page, last_page);
> +
> + /* the whole segment is inside the last page */
> + if (bvec->bv_offset >= last_page * PAGE_SIZE) {
> + seg->bv_offset = bvec->bv_offset % PAGE_SIZE;
> + seg->bv_len = bvec->bv_len;
> + } else {
> + seg->bv_offset = 0;
> + seg->bv_len = total - last_page * PAGE_SIZE;
> + }
> +}
> +
>  #endif /* __LINUX_BVEC_ITER_H */
> -- 
> 2.9.5
> 



Re: [dm-devel] [PATCH V10 03/19] block: use bio_for_each_bvec() to compute multi-page bvec count

2018-11-16 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 04:05:10PM -0500, Mike Snitzer wrote:
> On Thu, Nov 15 2018 at  3:20pm -0500,
> Omar Sandoval  wrote:
> 
> > On Thu, Nov 15, 2018 at 04:52:50PM +0800, Ming Lei wrote:
> > > First it is more efficient to use bio_for_each_bvec() in both
> > > blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how
> > > many multi-page bvecs there are in the bio.
> > > 
> > > Secondly once bio_for_each_bvec() is used, the bvec may need to be
> > > split because its length can be much longer than the max segment size,
> > > so we have to split the big bvec into several segments.
> > > 
> > > Thirdly when splitting multi-page bvec into segments, the max segment
> > > limit may be reached, so the bio split needs to be considered in
> > > this situation too.
> > > 
> > > Cc: Dave Chinner 
> > > Cc: Kent Overstreet 
> > > Cc: Mike Snitzer 
> > > Cc: dm-devel@redhat.com
> > > Cc: Alexander Viro 
> > > Cc: linux-fsde...@vger.kernel.org
> > > Cc: Shaohua Li 
> > > Cc: linux-r...@vger.kernel.org
> > > Cc: linux-er...@lists.ozlabs.org
> > > Cc: David Sterba 
> > > Cc: linux-bt...@vger.kernel.org
> > > Cc: Darrick J. Wong 
> > > Cc: linux-...@vger.kernel.org
> > > Cc: Gao Xiang 
> > > Cc: Christoph Hellwig 
> > > Cc: Theodore Ts'o 
> > > Cc: linux-e...@vger.kernel.org
> > > Cc: Coly Li 
> > > Cc: linux-bca...@vger.kernel.org
> > > Cc: Boaz Harrosh 
> > > Cc: Bob Peterson 
> > > Cc: cluster-de...@redhat.com
> > > Signed-off-by: Ming Lei 
> > > ---
> > >  block/blk-merge.c | 90 
> > > ++-
> > >  1 file changed, 76 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > > index 91b2af332a84..6f7deb94a23f 100644
> > > --- a/block/blk-merge.c
> > > +++ b/block/blk-merge.c
> > > @@ -160,6 +160,62 @@ static inline unsigned get_max_io_size(struct 
> > > request_queue *q,
> > >   return sectors;
> > >  }
> > >  
> > > +/*
> > > + * Split the bvec @bv into segments, and update all kinds of
> > > + * variables.
> > > + */
> > > +static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
> > > + unsigned *nsegs, unsigned *last_seg_size,
> > > + unsigned *front_seg_size, unsigned *sectors)
> > > +{
> > > + bool need_split = false;
> > > + unsigned len = bv->bv_len;
> > > + unsigned total_len = 0;
> > > + unsigned new_nsegs = 0, seg_size = 0;
> > 
> > "unsigned int" here and everywhere else.
> 
> Curious why?  I've wondered what governs use of "unsigned" vs "unsigned
> int" recently and haven't found _the_ reason to pick one over the other.

My only reason to prefer unsigned int is consistency. unsigned int is
much more common in the kernel:

$ ag --cc -s 'unsigned\s+int' | wc -l
129632
$ ag --cc -s 'unsigned\s+(?!char|short|int|long)' | wc -l
22435

checkpatch also warns on plain unsigned.



Re: [dm-devel] [PATCH V10 02/19] block: introduce bio_for_each_bvec()

2018-11-16 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 04:52:49PM +0800, Ming Lei wrote:
> This helper is used for iterating over multi-page bvec for bio
> split & merge code.
> 
> Cc: Dave Chinner 
> Cc: Kent Overstreet 
> Cc: Mike Snitzer 
> Cc: dm-devel@redhat.com
> Cc: Alexander Viro 
> Cc: linux-fsde...@vger.kernel.org
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: linux-er...@lists.ozlabs.org
> Cc: David Sterba 
> Cc: linux-bt...@vger.kernel.org
> Cc: Darrick J. Wong 
> Cc: linux-...@vger.kernel.org
> Cc: Gao Xiang 
> Cc: Christoph Hellwig 
> Cc: Theodore Ts'o 
> Cc: linux-e...@vger.kernel.org
> Cc: Coly Li 
> Cc: linux-bca...@vger.kernel.org
> Cc: Boaz Harrosh 
> Cc: Bob Peterson 
> Cc: cluster-de...@redhat.com

Reviewed-by: Omar Sandoval 

One comment below.

> Signed-off-by: Ming Lei 
> ---
>  include/linux/bio.h  | 34 +++---
>  include/linux/bvec.h | 36 
>  2 files changed, 63 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 056fb627edb3..1f0dcf109841 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -76,6 +76,9 @@
>  #define bio_data_dir(bio) \
>   (op_is_write(bio_op(bio)) ? WRITE : READ)
>  
> +#define bio_iter_mp_iovec(bio, iter) \
> + mp_bvec_iter_bvec((bio)->bi_io_vec, (iter))
> +
>  /*
>   * Check whether this bio carries any data or not. A NULL bio is allowed.
>   */
> @@ -135,18 +138,33 @@ static inline bool bio_full(struct bio *bio)
>  #define bio_for_each_segment_all(bvl, bio, i)
> \
>   for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)
>  
> -static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
> - unsigned bytes)
> +static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter 
> *iter,
> +   unsigned bytes, bool mp)
>  {
>   iter->bi_sector += bytes >> 9;
>  
>   if (bio_no_advance_iter(bio))
>   iter->bi_size -= bytes;
>   else
> - bvec_iter_advance(bio->bi_io_vec, iter, bytes);
> + if (!mp)
> + bvec_iter_advance(bio->bi_io_vec, iter, bytes);
> + else
> + mp_bvec_iter_advance(bio->bi_io_vec, iter, bytes);

if (!foo) {} else {} hurts my brain, please do

if (mp)
mp_bvec_iter_advance(bio->bi_io_vec, iter, bytes);
else
bvec_iter_advance(bio->bi_io_vec, iter, bytes);
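
Applied to the quoted helper, the whole function would then read as below
(sketch only, restructured per the comment above):

static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
                                      unsigned int bytes, bool mp)
{
        iter->bi_sector += bytes >> 9;

        if (bio_no_advance_iter(bio))
                iter->bi_size -= bytes;
        else if (mp)
                mp_bvec_iter_advance(bio->bi_io_vec, iter, bytes);
        else
                bvec_iter_advance(bio->bi_io_vec, iter, bytes);
}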



Re: [dm-devel] [PATCH V10 11/19] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()

2018-11-16 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 04:52:58PM +0800, Ming Lei wrote:
> bch_bio_alloc_pages() is always called on one new bio, so it is safe
> to access the bvec table directly. Given it is the only case of this
> kind, open code the bvec table access since bio_for_each_segment_all()
> will be changed to support iterating over multi-page bvecs.
> 
> Cc: Dave Chinner 
> Cc: Kent Overstreet 
> Acked-by: Coly Li 
> Cc: Mike Snitzer 
> Cc: dm-devel@redhat.com
> Cc: Alexander Viro 
> Cc: linux-fsde...@vger.kernel.org
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: linux-er...@lists.ozlabs.org
> Cc: David Sterba 
> Cc: linux-bt...@vger.kernel.org
> Cc: Darrick J. Wong 
> Cc: linux-...@vger.kernel.org
> Cc: Gao Xiang 
> Cc: Christoph Hellwig 
> Cc: Theodore Ts'o 
> Cc: linux-e...@vger.kernel.org
> Cc: Coly Li 
> Cc: linux-bca...@vger.kernel.org
> Cc: Boaz Harrosh 
> Cc: Bob Peterson 
> Cc: cluster-de...@redhat.com
> Signed-off-by: Ming Lei 
> ---
>  drivers/md/bcache/util.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
> index 20eddeac1531..8517aebcda2d 100644
> --- a/drivers/md/bcache/util.c
> +++ b/drivers/md/bcache/util.c
> @@ -270,7 +270,7 @@ int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
>   int i;
>   struct bio_vec *bv;
>  
> - bio_for_each_segment_all(bv, bio, i) {
> + for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++) {

This is missing an i++.
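
i.e. the open-coded loop presumably wants to read as below (a sketch; the body
shown follows the existing bch_bio_alloc_pages() error handling):

        for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++, i++) {
                bv->bv_page = alloc_page(gfp_mask);
                if (!bv->bv_page) {
                        while (--bv >= bio->bi_io_vec)
                                __free_page(bv->bv_page);
                        return -ENOMEM;
                }
        }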



Re: [dm-devel] [PATCH V10 10/19] block: loop: pass multi-page bvec to iov_iter

2018-11-16 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 04:52:57PM +0800, Ming Lei wrote:
> iov_iter is implemented with the bvec iterator, so it is safe to pass
> multi-page bvecs to it, and this way is much more efficient than
> passing one page in each bvec.
> 
> Cc: Dave Chinner 
> Cc: Kent Overstreet 
> Cc: Mike Snitzer 
> Cc: dm-devel@redhat.com
> Cc: Alexander Viro 
> Cc: linux-fsde...@vger.kernel.org
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: linux-er...@lists.ozlabs.org
> Cc: David Sterba 
> Cc: linux-bt...@vger.kernel.org
> Cc: Darrick J. Wong 
> Cc: linux-...@vger.kernel.org
> Cc: Gao Xiang 
> Cc: Christoph Hellwig 
> Cc: Theodore Ts'o 
> Cc: linux-e...@vger.kernel.org
> Cc: Coly Li 
> Cc: linux-bca...@vger.kernel.org
> Cc: Boaz Harrosh 
> Cc: Bob Peterson 
> Cc: cluster-de...@redhat.com

Reviewed-by: Omar Sandoval 

Comments below.

> Signed-off-by: Ming Lei 
> ---
>  drivers/block/loop.c | 23 ---
>  1 file changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index bf6bc35aaf88..a3fd418ec637 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -515,16 +515,16 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> loop_cmd *cmd,
>   struct bio *bio = rq->bio;
>   struct file *file = lo->lo_backing_file;
>   unsigned int offset;
> - int segments = 0;
> + int nr_bvec = 0;
>   int ret;
>  
>   if (rq->bio != rq->biotail) {
> - struct req_iterator iter;
> + struct bvec_iter iter;
>   struct bio_vec tmp;
>  
>   __rq_for_each_bio(bio, rq)
> - segments += bio_segments(bio);
> - bvec = kmalloc_array(segments, sizeof(struct bio_vec),
> + nr_bvec += bio_bvecs(bio);
> + bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec),
>GFP_NOIO);
>   if (!bvec)
>   return -EIO;
> @@ -533,13 +533,14 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> loop_cmd *cmd,
>   /*
>* The bios of the request may be started from the middle of
>* the 'bvec' because of bio splitting, so we can't directly
> -  * copy bio->bi_iov_vec to new bvec. The rq_for_each_segment
> +  * copy bio->bi_iov_vec to new bvec. The bio_for_each_bvec
>* API will take care of all details for us.
>*/
> - rq_for_each_segment(tmp, rq, iter) {
> - *bvec = tmp;
> - bvec++;
> - }
> + __rq_for_each_bio(bio, rq)
> + bio_for_each_bvec(tmp, bio, iter) {
> + *bvec = tmp;
> + bvec++;
> + }

Even if they're not strictly necessary, could you please include the
curly braces for __rq_for_each_bio() here?

>   bvec = cmd->bvec;
>   offset = 0;
>   } else {
> @@ -550,11 +551,11 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> loop_cmd *cmd,
>*/
>   offset = bio->bi_iter.bi_bvec_done;
>   bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
> - segments = bio_segments(bio);
> + nr_bvec = bio_bvecs(bio);

This scared me for a second, but it's fine to do here because we haven't
actually enabled multipage bvecs yet, right?

>   }
>   atomic_set(&cmd->ref, 2);
>  
> - iov_iter_bvec(&iter, rw, bvec, segments, blk_rq_bytes(rq));
> + iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq));
>   iter.iov_offset = offset;
>  
>   cmd->iocb.ki_pos = pos;
> -- 
> 2.9.5
> 



Re: [dm-devel] [PATCH V10 13/19] iomap & xfs: only account for new added page

2018-11-16 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 04:53:00PM +0800, Ming Lei wrote:
> After multi-page bvec is enabled, a new page may be merged into an existing
> segment even though it is a newly added page.
> 
> This patch deals with this issue with a post-check in the merge case, so
> only a freshly added page needs to be dealt with for iomap & xfs.
> 
> Cc: Dave Chinner 
> Cc: Kent Overstreet 
> Cc: Mike Snitzer 
> Cc: dm-devel@redhat.com
> Cc: Alexander Viro 
> Cc: linux-fsde...@vger.kernel.org
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: linux-er...@lists.ozlabs.org
> Cc: David Sterba 
> Cc: linux-bt...@vger.kernel.org
> Cc: Darrick J. Wong 
> Cc: linux-...@vger.kernel.org
> Cc: Gao Xiang 
> Cc: Christoph Hellwig 
> Cc: Theodore Ts'o 
> Cc: linux-e...@vger.kernel.org
> Cc: Coly Li 
> Cc: linux-bca...@vger.kernel.org
> Cc: Boaz Harrosh 
> Cc: Bob Peterson 
> Cc: cluster-de...@redhat.com
> Signed-off-by: Ming Lei 
> ---
>  fs/iomap.c  | 22 ++
>  fs/xfs/xfs_aops.c   | 10 --
>  include/linux/bio.h | 11 +++
>  3 files changed, 33 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index df0212560b36..a1b97a5c726a 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -288,6 +288,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>   loff_t orig_pos = pos;
>   unsigned poff, plen;
>   sector_t sector;
> + bool need_account = false;
>  
>   if (iomap->type == IOMAP_INLINE) {
>   WARN_ON_ONCE(pos);
> @@ -313,18 +314,15 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>*/
>   sector = iomap_sector(iomap, pos);
>   if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
> - if (__bio_try_merge_page(ctx->bio, page, plen, poff))
> + if (__bio_try_merge_page(ctx->bio, page, plen, poff)) {
> + need_account = iop && bio_is_last_segment(ctx->bio,
> + page, plen, poff);

It's redundant to make this iop && ... since you already check
iop && need_account below. Maybe rename it to added_page? Also, this
indentation is wack.

>   goto done;
> + }
>   is_contig = true;
>   }
>  
> - /*
> -  * If we start a new segment we need to increase the read count, and we
> -  * need to do so before submitting any previous full bio to make sure
> -  * that we don't prematurely unlock the page.
> -  */
> - if (iop)
> - atomic_inc(&iop->read_count);
> + need_account = true;
>  
>   if (!ctx->bio || !is_contig || bio_full(ctx->bio)) {
>   gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
> @@ -347,6 +345,14 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>   __bio_add_page(ctx->bio, page, plen, poff);
>  done:
>   /*
> +  * If we add a new page we need to increase the read count, and we
> +  * need to do so before submitting any previous full bio to make sure
> +  * that we don't prematurely unlock the page.
> +  */
> + if (iop && need_account)
> + atomic_inc(&iop->read_count);
> +
> + /*
>* Move the caller beyond our range so that it keeps making progress.
>* For that we have to include any leading non-uptodate ranges, but
>* we can skip trailing ones as they will be handled in the next
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 1f1829e506e8..d8e9cc9f751a 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -603,6 +603,7 @@ xfs_add_to_ioend(
>   unsigned        len = i_blocksize(inode);
>   unsigned        poff = offset & (PAGE_SIZE - 1);
>   sector_t        sector;
> + bool            need_account;
>  
>   sector = xfs_fsb_to_db(ip, wpc->imap.br_startblock) +
>   ((offset - XFS_FSB_TO_B(mp, wpc->imap.br_startoff)) >> 9);
> @@ -617,13 +618,18 @@ xfs_add_to_ioend(
>   }
>  
>   if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff)) {
> - if (iop)
> - atomic_inc(&iop->write_count);
> + need_account = true;
>   if (bio_full(wpc->ioend->io_bio))
>   xfs_chain_bio(wpc->ioend, wbc, bdev, sector);
>   __bio_add_page(wpc->ioend->io_bio, page, len, poff);
> + } else {
> + need_account = iop && bio_is_last_segment(wpc->ioend->io_bio,
> + page, len, poff);

Same here, no need for iop &&, rename it added_page, indentation is off.

>   }
>  
> + if (iop && need_account)
> + atomic_inc(&iop->write_count);
> +
>   wpc->ioend->io_size += len;
>  }
>  
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 1a2430a8b89d..5040e9a2eb09 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -341,6 +341,17 @@ static inline struct bio_vec 

Re: [dm-devel] [PATCH V10 04/19] block: use bio_for_each_bvec() to map sg

2018-11-16 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 04:52:51PM +0800, Ming Lei wrote:
> It is more efficient to use bio_for_each_bvec() to map sg; meanwhile
> we have to consider splitting multi-page bvecs as done in
> blk_bio_segment_split().
> 
> Cc: Dave Chinner 
> Cc: Kent Overstreet 
> Cc: Mike Snitzer 
> Cc: dm-devel@redhat.com
> Cc: Alexander Viro 
> Cc: linux-fsde...@vger.kernel.org
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: linux-er...@lists.ozlabs.org
> Cc: David Sterba 
> Cc: linux-bt...@vger.kernel.org
> Cc: Darrick J. Wong 
> Cc: linux-...@vger.kernel.org
> Cc: Gao Xiang 
> Cc: Christoph Hellwig 
> Cc: Theodore Ts'o 
> Cc: linux-e...@vger.kernel.org
> Cc: Coly Li 
> Cc: linux-bca...@vger.kernel.org
> Cc: Boaz Harrosh 
> Cc: Bob Peterson 
> Cc: cluster-de...@redhat.com

Reviewed-by: Omar Sandoval 

> Signed-off-by: Ming Lei 
> ---
>  block/blk-merge.c | 72 
> +++
>  1 file changed, 52 insertions(+), 20 deletions(-)
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 6f7deb94a23f..cb9f49bcfd36 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -473,6 +473,56 @@ static int blk_phys_contig_segment(struct request_queue 
> *q, struct bio *bio,
>   return biovec_phys_mergeable(q, _bv, _bv);
>  }
>  
> +static struct scatterlist *blk_next_sg(struct scatterlist **sg,
> + struct scatterlist *sglist)
> +{
> + if (!*sg)
> + return sglist;
> + else {
> + /*
> +  * If the driver previously mapped a shorter
> +  * list, we could see a termination bit
> +  * prematurely unless it fully inits the sg
> +  * table on each mapping. We KNOW that there
> +  * must be more entries here or the driver
> +  * would be buggy, so force clear the
> +  * termination bit to avoid doing a full
> +  * sg_init_table() in drivers for each command.
> +  */
> + sg_unmark_end(*sg);
> + return sg_next(*sg);
> + }
> +}
> +
> +static unsigned blk_bvec_map_sg(struct request_queue *q,
> + struct bio_vec *bvec, struct scatterlist *sglist,
> + struct scatterlist **sg)
> +{
> + unsigned nbytes = bvec->bv_len;
> + unsigned nsegs = 0, total = 0;
> +
> + while (nbytes > 0) {
> + unsigned seg_size;
> + struct page *pg;
> + unsigned offset, idx;
> +
> + *sg = blk_next_sg(sg, sglist);
> +
> + seg_size = min(nbytes, queue_max_segment_size(q));
> + offset = (total + bvec->bv_offset) % PAGE_SIZE;
> + idx = (total + bvec->bv_offset) / PAGE_SIZE;
> + pg = nth_page(bvec->bv_page, idx);
> +
> + sg_set_page(*sg, pg, seg_size, offset);
> +
> + total += seg_size;
> + nbytes -= seg_size;
> + nsegs++;
> + }
> +
> + return nsegs;
> +}
> +
>  static inline void
>  __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
>struct scatterlist *sglist, struct bio_vec *bvprv,
> @@ -490,25 +540,7 @@ __blk_segment_map_sg(struct request_queue *q, struct 
> bio_vec *bvec,
>   (*sg)->length += nbytes;
>   } else {
>  new_segment:
> - if (!*sg)
> - *sg = sglist;
> - else {
> - /*
> -  * If the driver previously mapped a shorter
> -  * list, we could see a termination bit
> -  * prematurely unless it fully inits the sg
> -  * table on each mapping. We KNOW that there
> -  * must be more entries here or the driver
> -  * would be buggy, so force clear the
> -  * termination bit to avoid doing a full
> -  * sg_init_table() in drivers for each command.
> -  */
> - sg_unmark_end(*sg);
> - *sg = sg_next(*sg);
> - }
> -
> - sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
> - (*nsegs)++;
> + (*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg);
>   }
>   *bvprv = *bvec;
>  }
> @@ -530,7 +562,7 @@ static int __blk_bios_map_sg(struct request_queue *q, 
> struct bio *bio,
>   int cluster = blk_queue_cluster(q), nsegs = 0;
>  
>   for_each_bio(bio)
> - bio_for_each_segment(bvec, bio, iter)
> + bio_for_each_bvec(bvec, bio, iter)
>   __blk_segment_map_sg(q, &bvec, sglist, &bvprv, sg,
>        &nsegs, &cluster);
>  
> -- 
> 2.9.5
> 



Re: [dm-devel] [PATCH V10 06/19] fs/buffer.c: use bvec iterator to truncate the bio

2018-11-16 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 04:52:53PM +0800, Ming Lei wrote:
> Once multi-page bvec is enabled, the last bvec may include more than one
> page, so this patch uses bvec_last_segment() to truncate the bio.
> 
> Cc: Dave Chinner 
> Cc: Kent Overstreet 
> Cc: Mike Snitzer 
> Cc: dm-devel@redhat.com
> Cc: Alexander Viro 
> Cc: linux-fsde...@vger.kernel.org
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: linux-er...@lists.ozlabs.org
> Cc: David Sterba 
> Cc: linux-bt...@vger.kernel.org
> Cc: Darrick J. Wong 
> Cc: linux-...@vger.kernel.org
> Cc: Gao Xiang 
> Cc: Christoph Hellwig 
> Cc: Theodore Ts'o 
> Cc: linux-e...@vger.kernel.org
> Cc: Coly Li 
> Cc: linux-bca...@vger.kernel.org
> Cc: Boaz Harrosh 
> Cc: Bob Peterson 
> Cc: cluster-de...@redhat.com

Reviewed-by: Omar Sandoval 

> Signed-off-by: Ming Lei 
> ---
>  fs/buffer.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 1286c2b95498..fa37ad52e962 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -3032,7 +3032,10 @@ void guard_bio_eod(int op, struct bio *bio)
>  
>   /* ..and clear the end of the buffer for reads */
>   if (op == REQ_OP_READ) {
> - zero_user(bvec->bv_page, bvec->bv_offset + bvec->bv_len,
> + struct bio_vec bv;
> +
> + bvec_last_segment(bvec, &bv);
> + zero_user(bv.bv_page, bv.bv_offset + bv.bv_len,
>   truncated_bytes);
>   }
>  }
> -- 
> 2.9.5
> 



Re: [dm-devel] [PATCH] nvme: allow ANA support to be independent of native multipathing

2018-11-16 Thread Christoph Hellwig
On Fri, Nov 16, 2018 at 10:40:40AM +0100, Hannes Reinecke wrote:
>>> Introduce ability to always re-read ANA log page as required due to ANA
>>> error and make current ANA state available via sysfs -- even if native
>>> multipathing is disabled on the host (e.g. nvme_core.multipath=N).
>>
>> The first part I could see, but I still want to make it conditional
>> in some way as nvme is going into deeply embedded setups, and I don't
>> want to carry the weight of the ANA code around for everyone.
>>
> Can you clarify this a bit?
> We _do_ have the NVME multipath config option to deconfigure the whole 
> thing during compile time; that isn't influenced by this patch.
> So are you worried about the size of the ANA implementation itself?
> Or are you worried about the size of the ANA structures?

I just see the next step of wanting to move ANA code into the core
which is implied above.
>
>> The second I fundamentally disagree with.  And even if you found agreement
>> it would have to be in a separate patch as it is a separate feature.
>>
> Why? Where's the problem with re-reading the ANA log pages if we get an 
> event indicating that we should?

"second" here means the sysfs file.



Re: [dm-devel] [PATCH] nvme: allow ANA support to be independent of native multipathing

2018-11-16 Thread Hannes Reinecke

On 11/16/18 10:49 AM, Christoph Hellwig wrote:

On Fri, Nov 16, 2018 at 10:40:40AM +0100, Hannes Reinecke wrote:

Introduce ability to always re-read ANA log page as required due to ANA
error and make current ANA state available via sysfs -- even if native
multipathing is disabled on the host (e.g. nvme_core.multipath=N).


The first part I could see, but I still want to make it conditional
in some way as nvme is going into deeply embedded setups, and I don't
want to carry the weight of the ANA code around for everyone.


Can you clarify this a bit?
We _do_ have the NVME multipath config option to deconfigure the whole
thing during compile time; that isn't influenced with this patch.
So are you worried about the size of the ANA implementation itself?
Or are you worried about the size of the ANA structures?


I just see the next step of wanting to move ANA code into the core
which is implied above.


Really, I couldn't care less _where_ the ANA code lives.
If its size is of any concern we can even make it configurable of sorts.





The second I fundamentally disagree with.  And even if you found agreement
it would have to be in a separate patch as it is a separate feature.


Why? Where's the problem with re-reading the ANA log pages if we get an
event indicating that we should?


"second" here means the sysfs file.


Ok, so would you be happy with making ANA support configurable?

Cheers,

Hannes
--
Dr. Hannes Reinecke        Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Re: [dm-devel] [PATCH V10 15/19] block: always define BIO_MAX_PAGES as 256

2018-11-16 Thread Omar Sandoval
On Thu, Nov 15, 2018 at 04:53:02PM +0800, Ming Lei wrote:
> Now multi-page bvec can cover CONFIG_THP_SWAP, so we don't need to
> increase BIO_MAX_PAGES for it.

You mentioned it in the cover letter, but this needs more explanation
in the commit message. Why did CONFIG_THP_SWAP require > 256? Why does
multipage bvecs remove that requirement?

> Cc: Dave Chinner 
> Cc: Kent Overstreet 
> Cc: Mike Snitzer 
> Cc: dm-devel@redhat.com
> Cc: Alexander Viro 
> Cc: linux-fsde...@vger.kernel.org
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: linux-er...@lists.ozlabs.org
> Cc: David Sterba 
> Cc: linux-bt...@vger.kernel.org
> Cc: Darrick J. Wong 
> Cc: linux-...@vger.kernel.org
> Cc: Gao Xiang 
> Cc: Christoph Hellwig 
> Cc: Theodore Ts'o 
> Cc: linux-e...@vger.kernel.org
> Cc: Coly Li 
> Cc: linux-bca...@vger.kernel.org
> Cc: Boaz Harrosh 
> Cc: Bob Peterson 
> Cc: cluster-de...@redhat.com
> Signed-off-by: Ming Lei 
> ---
>  include/linux/bio.h | 8 
>  1 file changed, 8 deletions(-)
> 
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 5040e9a2eb09..277921ad42e7 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -34,15 +34,7 @@
>  #define BIO_BUG_ON
>  #endif
>  
> -#ifdef CONFIG_THP_SWAP
> -#if HPAGE_PMD_NR > 256
> -#define BIO_MAX_PAGES        HPAGE_PMD_NR
> -#else
>  #define BIO_MAX_PAGES        256
> -#endif
> -#else
> -#define BIO_MAX_PAGES        256
> -#endif
>  
>  #define bio_prio(bio)        (bio)->bi_ioprio
>  #define bio_set_prio(bio, prio)        ((bio)->bi_ioprio = prio)
> -- 
> 2.9.5
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH V10 02/19] block: introduce bio_for_each_bvec()

2018-11-16 Thread Christoph Hellwig
> +static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter 
> *iter,
> +   unsigned bytes, bool mp)

I think these magic 'bool mp' arguments and wrappers over wrappers
don't help anyone to actually understand the code.  I'd vote for
removing as many wrappers as we really don't need, and passing the
actual segment limit instead of the magic bool flag.  Something like
this untested patch:

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 277921ad42e7..dcad0b69f57a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -138,30 +138,21 @@ static inline bool bio_full(struct bio *bio)
bvec_for_each_segment(bvl, &((bio)->bi_io_vec[iter_all.idx]), 
i, iter_all)
 
 static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
- unsigned bytes, bool mp)
+ unsigned bytes, unsigned max_segment)
 {
iter->bi_sector += bytes >> 9;
 
if (bio_no_advance_iter(bio))
iter->bi_size -= bytes;
else
-   if (!mp)
-   bvec_iter_advance(bio->bi_io_vec, iter, bytes);
-   else
-   mp_bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+   __bvec_iter_advance(bio->bi_io_vec, iter, bytes, max_segment);
/* TODO: It is reasonable to complete bio with error here. */
 }
 
 static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
unsigned bytes)
 {
-   __bio_advance_iter(bio, iter, bytes, false);
-}
-
-static inline void bio_advance_mp_iter(struct bio *bio, struct bvec_iter *iter,
-  unsigned bytes)
-{
-   __bio_advance_iter(bio, iter, bytes, true);
+   __bio_advance_iter(bio, iter, bytes, PAGE_SIZE);
 }
 
 #define __bio_for_each_segment(bvl, bio, iter, start)  \
@@ -177,7 +168,7 @@ static inline void bio_advance_mp_iter(struct bio *bio, 
struct bvec_iter *iter,
for (iter = (start);\
 (iter).bi_size &&  \
((bvl = bio_iter_mp_iovec((bio), (iter))), 1);  \
-bio_advance_mp_iter((bio), &(iter), (bvl).bv_len))
+__bio_advance_iter((bio), &(iter), (bvl).bv_len, 0))
 
 /* returns one real segment(multipage bvec) each time */
 #define bio_for_each_bvec(bvl, bio, iter)  \
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 02f26d2b59ad..5e2ed46c1c88 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -138,8 +138,7 @@ struct bvec_iter_all {
 })
 
 static inline bool __bvec_iter_advance(const struct bio_vec *bv,
-  struct bvec_iter *iter,
-  unsigned bytes, bool mp)
+   struct bvec_iter *iter, unsigned bytes, unsigned max_segment)
 {
if (WARN_ONCE(bytes > iter->bi_size,
 "Attempted to advance past end of bvec iter\n")) {
@@ -148,18 +147,18 @@ static inline bool __bvec_iter_advance(const struct 
bio_vec *bv,
}
 
while (bytes) {
-   unsigned len;
+   unsigned segment_len = mp_bvec_iter_len(bv, *iter);
 
-   if (mp)
-   len = mp_bvec_iter_len(bv, *iter);
-   else
-   len = bvec_iter_len(bv, *iter);
+   if (max_segment) {
+   max_segment -= bvec_iter_offset(bv, *iter);
+   segment_len = min(segment_len, max_segment);
+   }
 
-   len = min(bytes, len);
+   segment_len = min(bytes, segment_len);
 
-   bytes -= len;
-   iter->bi_size -= len;
-   iter->bi_bvec_done += len;
+   bytes -= segment_len;
+   iter->bi_size -= segment_len;
+   iter->bi_bvec_done += segment_len;
 
if (iter->bi_bvec_done == __bvec_iter_bvec(bv, *iter)->bv_len) {
iter->bi_bvec_done = 0;
@@ -197,14 +196,7 @@ static inline bool bvec_iter_advance(const struct bio_vec 
*bv,
 struct bvec_iter *iter,
 unsigned bytes)
 {
-   return __bvec_iter_advance(bv, iter, bytes, false);
-}
-
-static inline bool mp_bvec_iter_advance(const struct bio_vec *bv,
-   struct bvec_iter *iter,
-   unsigned bytes)
-{
-   return __bvec_iter_advance(bv, iter, bytes, true);
+   return __bvec_iter_advance(bv, iter, bytes, PAGE_SIZE);
 }
 
 #define for_each_bvec(bvl, bio_vec, iter, start)   \
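
(In this sketch a max_segment of 0 means "no per-page limit", i.e. the
iterator advances over the whole multi-page bvec in one go, while passing
PAGE_SIZE preserves the existing page-at-a-time behaviour of
bio_advance_iter()/bvec_iter_advance().)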

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH V10 04/19] block: use bio_for_each_bvec() to map sg

2018-11-16 Thread Christoph Hellwig
> + if (!*sg)
> + return sglist;
> + else {

No need for an else after an early return.
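
I.e. something like this (a sketch of the shape being asked for, assuming
the else branch does the usual sg_unmark_end()/sg_next() dance on the
previous entry):

        if (!*sg)
                return sglist;

        /*
         * No else needed: we only get here when a previous entry exists,
         * so just clear its termination bit and move on to the next one.
         */
        sg_unmark_end(*sg);
        return sg_next(*sg);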

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH V10 06/19] fs/buffer.c: use bvec iterator to truncate the bio

2018-11-16 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH V10 07/19] btrfs: use bvec_last_segment to get bio's last page

2018-11-16 Thread Christoph Hellwig
On Thu, Nov 15, 2018 at 04:52:54PM +0800, Ming Lei wrote:
> index 2955a4ea2fa8..161e14b8b180 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -400,8 +400,11 @@ blk_status_t btrfs_submit_compressed_write(struct inode 
> *inode, u64 start,
>  static u64 bio_end_offset(struct bio *bio)
>  {
>   struct bio_vec *last = bio_last_bvec_all(bio);
> + struct bio_vec bv;
>  
> - return page_offset(last->bv_page) + last->bv_len + last->bv_offset;
> + bvec_last_segment(last, &bv);
> +
> + return page_offset(bv.bv_page) + bv.bv_len + bv.bv_offset;

I don't think we need this.  If last is a multi-page bvec bv_offset
will already contain the correct offset from the first page.
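
In other words, under that reasoning the helper can simply stay as it was
(sketch):

        static u64 bio_end_offset(struct bio *bio)
        {
                struct bio_vec *last = bio_last_bvec_all(bio);

                /*
                 * Even for a multi-page bvec, bv_offset is relative to
                 * bv_page, so offset + len still points just past the data.
                 */
                return page_offset(last->bv_page) + last->bv_len + last->bv_offset;
        }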

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH V10 08/19] btrfs: move bio_pages_all() to btrfs

2018-11-16 Thread Christoph Hellwig
On Thu, Nov 15, 2018 at 04:52:55PM +0800, Ming Lei wrote:
> BTRFS is the only user of this helper, so move this helper into
> BTRFS, and implement it via bio_for_each_segment_all(), since
> bio->bi_vcnt may not equal the number of pages after multipage bvec
> is enabled.

btrfs only uses the value to check if it is larger than 1.  No amount
of multipage bio merging should ever make bi_vcnt go from 0 to 1 or
vice versa.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH V10 11/19] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()

2018-11-16 Thread Christoph Hellwig
> - bio_for_each_segment_all(bv, bio, i) {
> + for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++) {

This really needs a comment.  Otherwise it looks fine to me.
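
For what it's worth, the kind of comment being asked for might read
something like this (wording is a guess, not taken from the patch):

        /*
         * bio_for_each_segment_all() now visits single-page segments, but
         * this function allocates one page per bvec slot, so it has to
         * walk the bvec table (bi_io_vec[0..bi_vcnt)) directly.
         */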

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH V10 13/19] iomap & xfs: only account for new added page

2018-11-16 Thread Christoph Hellwig
I'd much rather have __bio_try_merge_page only do merges in
the same page, and have a new __bio_try_merge_segment that does
multi-page merges.  This will keep the accounting a lot simpler.
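
Roughly, that split might look like the following (a sketch using the
names suggested above, not an actual implementation):

        /* same-page merge only: the added range must stay within one page */
        static inline bool __bio_try_merge_page(struct bio *bio, struct page *page,
                                                unsigned int len, unsigned int off)
        {
                struct bio_vec *bv;

                if (!bio->bi_vcnt || bio_flagged(bio, BIO_CLONED))
                        return false;

                bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
                if (page != bv->bv_page || off != bv->bv_offset + bv->bv_len ||
                    off + len > PAGE_SIZE)
                        return false;

                bv->bv_len += len;
                bio->bi_iter.bi_size += len;
                return true;
        }

        /* multi-page (segment) merge: only physical contiguity matters */
        static inline bool __bio_try_merge_segment(struct bio *bio, struct page *page,
                                                   unsigned int len, unsigned int off)
        {
                struct bio_vec *bv;

                if (!bio->bi_vcnt || bio_flagged(bio, BIO_CLONED))
                        return false;

                bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
                if (page_to_phys(bv->bv_page) + bv->bv_offset + bv->bv_len !=
                    page_to_phys(page) + off)
                        return false;

                bv->bv_len += len;
                bio->bi_iter.bi_size += len;
                return true;
        }

With such a split the caller can tell whether a new page was actually
added, which is what the accounting above cares about.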

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-16 Thread Christoph Hellwig
On Thu, Nov 15, 2018 at 04:52:56PM +0800, Ming Lei wrote:
> There are still cases in which we need to use bio_bvecs() to get the
> number of multi-page segments, so introduce it.

The only user in your final tree seems to be the loop driver, and
even that one only uses the helper for read/write bios.

I think something like this would be much simpler in the end:

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d509902a8046..712511815ac6 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -514,16 +514,18 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
struct request *rq = blk_mq_rq_from_pdu(cmd);
struct bio *bio = rq->bio;
struct file *file = lo->lo_backing_file;
+   struct bvec_iter bvec_iter;
+   struct bio_vec tmp;
unsigned int offset;
int nr_bvec = 0;
int ret;
 
+   __rq_for_each_bio(bio, rq)
+   bio_for_each_bvec(tmp, bio, bvec_iter)
+   nr_bvec++;
+
if (rq->bio != rq->biotail) {
-   struct bvec_iter iter;
-   struct bio_vec tmp;
 
-   __rq_for_each_bio(bio, rq)
-   nr_bvec += bio_bvecs(bio);
bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec),
 GFP_NOIO);
if (!bvec)
@@ -537,7 +539,7 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
 * API will take care of all details for us.
 */
__rq_for_each_bio(bio, rq)
-   bio_for_each_bvec(tmp, bio, iter) {
+   bio_for_each_bvec(tmp, bio, bvec_iter) {
*bvec = tmp;
bvec++;
}
@@ -551,7 +553,6 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
 */
offset = bio->bi_iter.bi_bvec_done;
bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
-   nr_bvec = bio_bvecs(bio);
}
atomic_set(&cmd->ref, 2);
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index dcad0b69f57a..379440d1ced0 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -200,30 +200,6 @@ static inline unsigned bio_segments(struct bio *bio)
}
 }
 
-static inline unsigned bio_bvecs(struct bio *bio)
-{
-   unsigned bvecs = 0;
-   struct bio_vec bv;
-   struct bvec_iter iter;
-
-   /*
-* We special case discard/write same/write zeroes, because they
-* interpret bi_size differently:
-*/
-   switch (bio_op(bio)) {
-   case REQ_OP_DISCARD:
-   case REQ_OP_SECURE_ERASE:
-   case REQ_OP_WRITE_ZEROES:
-   return 0;
-   case REQ_OP_WRITE_SAME:
-   return 1;
-   default:
-   bio_for_each_bvec(bv, bio, iter)
-   bvecs++;
-   return bvecs;
-   }
-}
-
 /*
  * get a reference to a bio, so it won't disappear. the intended use is
  * something like:

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 3/3] block: use a driver-specific handler for the "inflight" value

2018-11-16 Thread Mike Snitzer
On Fri, Nov 16 2018 at  4:11am -0500,
Christoph Hellwig  wrote:

> On Fri, Nov 16, 2018 at 01:04:19AM +0100, Mikulas Patocka wrote:
> > Device mapper was converted to percpu inflight counters. In order to
> > display the correct values in the "inflight" sysfs file and in
> > /proc/diskstats, we need a custom callback that sums the percpu counters.
> > 
> > The function part_round_stats calculates the number of in-flight I/Os
> > every jiffy and uses this to calculate the counters time_in_queue and
> > io_ticks. In order to avoid excessive memory traffic on systems with high
> > number of CPUs, this functionality is disabled when percpu inflight values
> > are used and the values time_in_queue and io_ticks are calculated
> > differently - the result is less precise.
> 
> And none of that is device mapper specific.  Please submit this code
> to the block layer for use by all make_request based drivers.  Depending
> on what Jens as the maintainer thinks of the tradeoffs we can discuss
> if the summing should be on or off by default, or if we maybe even
> need the current code as a fallback.

I agree.

Mikulas, we discussed that the changes would be made to block core.  I
know that makes you less comfortable (I assume because you need to
consider more than DM) but it is the right way forward.

Now that the legacy IO path is gone we have less to consider; these
counters are only impacting bio-based.

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH V10 14/19] block: enable multipage bvecs

2018-11-16 Thread Christoph Hellwig
> -
> - if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
> - bv->bv_len += len;
> - bio->bi_iter.bi_size += len;
> - return true;
> - }
> + struct request_queue *q = NULL;
> +
> + if (page == bv->bv_page && off == (bv->bv_offset + bv->bv_len)
> + && (off + len) <= PAGE_SIZE)

How could the page struct be the same, but the range beyond PAGE_SIZE
(at least with the existing callers)?

Also no need for the inner braces, and the && always goes on the
first line.

> + if (bio->bi_disk)
> + q = bio->bi_disk->queue;
> +
> + /* disable multi-page bvec too if cluster isn't enabled */
> + if (!q || !blk_queue_cluster(q) ||
> + ((page_to_phys(bv->bv_page) + bv->bv_offset + bv->bv_len) !=
> +  (page_to_phys(page) + off)))
> + return false;
> + merge:
> + bv->bv_len += len;
> + bio->bi_iter.bi_size += len;
> + return true;

Ok, this is scary, as it will give different results depending on when
bi_disk is assigned.  But then again we shouldn't really do the cluster
check here, but rather when splitting the bio for the actual low-level
driver.

(and eventually we should kill this clustering setting off in favor
of our normal segment limits).

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH V10 17/19] block: don't use bio->bi_vcnt to figure out segment number

2018-11-16 Thread Christoph Hellwig
On Thu, Nov 15, 2018 at 04:53:04PM +0800, Ming Lei wrote:
> It is wrong to use bio->bi_vcnt to figure out how many segments
> there are in the bio even though CLONED flag isn't set on this bio,
> because this bio may be split or advanced.
> 
> So always use bio_segments() in blk_recount_segments(), and it shouldn't
> cause any performance loss now because the physical segment number is figured
> out in blk_queue_split() and BIO_SEG_VALID is set meantime since
> bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting").

Looks good, but shouldn't this go to the beginning of the series?

Reviewed-by: Christoph Hellwig 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH V10 01/19] block: introduce multi-page page bvec helpers

2018-11-16 Thread Christoph Hellwig
> -#define bvec_iter_page(bvec, iter)   \
> +#define mp_bvec_iter_page(bvec, iter)\
>   (__bvec_iter_bvec((bvec), (iter))->bv_page)
>  
> -#define bvec_iter_len(bvec, iter)\
> +#define mp_bvec_iter_len(bvec, iter) \

I'd much prefer if we would stick to the segment naming that
we also use in the higher level helper.

So segment_iter_page, segment_iter_len, etc.
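
I.e. something like (sketch of the rename only, body unchanged from the
hunk above):

        #define segment_iter_page(bvec, iter)                                \
                (__bvec_iter_bvec((bvec), (iter))->bv_page)

and the same pattern for _len, _offset and friends.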

> + * This helpers are for building sp bvec in flight.

Please spell out single page, sp is not easily understandable.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH V10 05/19] block: introduce bvec_last_segment()

2018-11-16 Thread Christoph Hellwig
> + unsigned last_page = total / PAGE_SIZE;
> +
> + if (last_page * PAGE_SIZE == total)

Not sure it matters much with modern compilers, but generally
we shift by PAGE_SHIFT instead of multiplying by or dividing
by PAGE_SIZE.
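
I.e. (the same two lines written with shifts, sketch only):

        -        unsigned last_page = total / PAGE_SIZE;
        +        unsigned last_page = total >> PAGE_SHIFT;

        -        if (last_page * PAGE_SIZE == total)
        +        if ((last_page << PAGE_SHIFT) == total)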

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH V10 19/19] block: kill BLK_MQ_F_SG_MERGE

2018-11-16 Thread Christoph Hellwig
On Thu, Nov 15, 2018 at 04:53:06PM +0800, Ming Lei wrote:
> QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too.

Looks fine,

Reviewed-by: Christoph Hellwig 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH V10 00/19] block: support multi-page bvec

2018-11-16 Thread Christoph Hellwig
It seems like bi_phys_segments is still around at the end of this series.
Shouldn't it be superfluous now?

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH V10 18/19] block: kill QUEUE_FLAG_NO_SG_MERGE

2018-11-16 Thread Christoph Hellwig
On Thu, Nov 15, 2018 at 04:53:05PM +0800, Ming Lei wrote:
> Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting"),
> physical segment number is mainly figured out in blk_queue_split() for
> fast path, and the flag of BIO_SEG_VALID is set there too.
> 
> Now only blk_recount_segments() and blk_recalc_rq_segments() use this
> flag.
> 
> Basically blk_recount_segments() is bypassed in fast path given BIO_SEG_VALID
> is set in blk_queue_split().
> 
> For another user of blk_recalc_rq_segments():
> 
> - run in partial completion branch of blk_update_request, which is an unusual 
> case
> 
> - run in blk_cloned_rq_check_limits(), still not a big problem if the flag is 
> killed
> since dm-rq is the only user.
> 
> Multi-page bvec is enabled now, QUEUE_FLAG_NO_SG_MERGE doesn't make sense any 
> more.
> 
> Cc: Dave Chinner 
> Cc: Kent Overstreet 
> Cc: Mike Snitzer 
> Cc: dm-devel@redhat.com
> Cc: Alexander Viro 
> Cc: linux-fsde...@vger.kernel.org
> Cc: Shaohua Li 
> Cc: linux-r...@vger.kernel.org
> Cc: linux-er...@lists.ozlabs.org
> Cc: David Sterba 
> Cc: linux-bt...@vger.kernel.org
> Cc: Darrick J. Wong 
> Cc: linux-...@vger.kernel.org
> Cc: Gao Xiang 
> Cc: Christoph Hellwig 
> Cc: Theodore Ts'o 
> Cc: linux-e...@vger.kernel.org
> Cc: Coly Li 
> Cc: linux-bca...@vger.kernel.org
> Cc: Boaz Harrosh 
> Cc: Bob Peterson 
> Cc: cluster-de...@redhat.com
> Signed-off-by: Ming Lei 
> ---
>  block/blk-merge.c  | 31 ++-
>  block/blk-mq-debugfs.c |  1 -
>  block/blk-mq.c |  3 ---
>  drivers/md/dm-table.c  | 13 -
>  include/linux/blkdev.h |  1 -
>  5 files changed, 6 insertions(+), 43 deletions(-)
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 153a659fde74..06be298be332 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -351,8 +351,7 @@ void blk_queue_split(struct request_queue *q, struct bio 
> **bio)
>  EXPORT_SYMBOL(blk_queue_split);
>  
>  static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
> -  struct bio *bio,
> -  bool no_sg_merge)
> +  struct bio *bio)
>  {
>   struct bio_vec bv, bvprv = { NULL };
>   int cluster, prev = 0;
> @@ -379,13 +378,6 @@ static unsigned int __blk_recalc_rq_segments(struct 
> request_queue *q,
>   nr_phys_segs = 0;
>   for_each_bio(bio) {
>   bio_for_each_bvec(bv, bio, iter) {
> - /*
> -  * If SG merging is disabled, each bio vector is
> -  * a segment
> -  */
> - if (no_sg_merge)
> - goto new_segment;
> -
>   if (prev && cluster) {
>   if (seg_size + bv.bv_len
>   > queue_max_segment_size(q))
> @@ -420,27 +412,16 @@ static unsigned int __blk_recalc_rq_segments(struct 
> request_queue *q,
>  
>  void blk_recalc_rq_segments(struct request *rq)
>  {
> - bool no_sg_merge = !!test_bit(QUEUE_FLAG_NO_SG_MERGE,
> - &rq->q->queue_flags);
> -
> - rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio,
> - no_sg_merge);
> + rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio);

Can we rename __blk_recalc_rq_segments to blk_recalc_rq_segments
and kill the old blk_recalc_rq_segments now?

Otherwise looks fine:

Reviewed-by: Christoph Hellwig 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH V10 18/19] block: kill QUEUE_FLAG_NO_SG_MERGE

2018-11-16 Thread Christoph Hellwig
On Thu, Nov 15, 2018 at 06:18:11PM -0800, Omar Sandoval wrote:
> This commit message wasn't very clear. Is it the case that
> QUEUE_FLAG_NO_SG_MERGE is no longer set by any drivers?

I think he wants to say that not doing S/G merging is rather pointless
with the current setup of the I/O path, as it isn't going to save
you a significant amount of cycles.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] nvme: allow ANA support to be independent of native multipathing

2018-11-16 Thread Mike Snitzer
On Fri, Nov 16 2018 at  4:14am -0500,
Christoph Hellwig  wrote:

> On Thu, Nov 15, 2018 at 12:46:05PM -0500, Mike Snitzer wrote:
> > Whether or not ANA is present is a choice of the target implementation;
> > the host (and whether it supports multipathing) has _zero_ influence on
> > this.  If the target declares a path as 'inaccessible' the path _is_
> > inaccessible to the host.  As such, ANA support should be functional
> > even if native multipathing is not.
> > 
> > Introduce ability to always re-read ANA log page as required due to ANA
> > error and make current ANA state available via sysfs -- even if native
> > multipathing is disabled on the host (e.g. nvme_core.multipath=N).
> 
> The first part I could see, but I still want to make it conditional
> in some way as nvme is going into deeply embedded setups, and I don't
> want to carry the weight of the ANA code around for everyone.

So disabling CONFIG_NVME_MULTIPATH isn't adequate?  You see a need for
enabling CONFIG_NVME_MULTIPATH but disallowing consideration of ANA
state management for these embedded cases?  That doesn't make much sense
given that ANA won't be enabled on the target, and if it is, then shouldn't
the NVMe host enable ANA support?
 
> The second I fundamentally disagree with.  And even if you found agreement
> it would have to be in a separate patch as it is a separate feature.

In a later mail you clarified that by "second" you meant the sysfs
export.

nvme_ns_id_attrs_are_visible() calls nvme_ctrl_use_ana() to know whether
to export the ana state via sysfs.  Makes no sense to enable ANA to
work independent of native multipathing, and then, even though ANA is
active, disallow exporting the associated ANA state.  We need
consistency: if the sysfs file exists, it means ANA state is being
managed.
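
In code terms the relevant check is roughly the following (simplified from
memory, not a verbatim quote of the driver):

        /* in nvme_ns_id_attrs_are_visible() */
        if (a == &dev_attr_ana_grpid.attr || a == &dev_attr_ana_state.attr) {
                if (!nvme_ctrl_use_ana(nvme_get_ns_from_dev(dev)->ctrl))
                        return 0;        /* keep the ANA attributes hidden */
        }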

All I can figure is that what you're suggesting is born out of
disallowing uses you fundamentally disagree with (e.g. multipath-tools
reading the ANA state from this sysfs file).

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] nvme: allow ANA support to be independent of native multipathing

2018-11-16 Thread Mike Snitzer
On Fri, Nov 16 2018 at  2:25am -0500,
Hannes Reinecke  wrote:

> On 11/15/18 6:46 PM, Mike Snitzer wrote:
> >Whether or not ANA is present is a choice of the target implementation;
> >the host (and whether it supports multipathing) has _zero_ influence on
> >this.  If the target declares a path as 'inaccessible' the path _is_
> >inaccessible to the host.  As such, ANA support should be functional
> >even if native multipathing is not.
> >
> >Introduce ability to always re-read ANA log page as required due to ANA
> >error and make current ANA state available via sysfs -- even if native
> >multipathing is disabled on the host (e.g. nvme_core.multipath=N).
> >
> >This affords userspace access to the current ANA state independent of
> >which layer might be doing multipathing.  It also allows multipath-tools
> >to rely on the NVMe driver for ANA support while dm-multipath takes care
> >of multipathing.
> >
> >While implementing these changes care was taken to preserve the exact
> >ANA functionality and code sequence native multipathing has provided.
> >This manifests as native multipathing's nvme_failover_req() being
> >tweaked to call __nvme_update_ana() which was factored out to allow
> >nvme_update_ana() to be called independent of nvme_failover_req().
> >
> >And as always, if embedded NVMe users do not want any performance
> >overhead associated with ANA or native NVMe multipathing they can
> >disable CONFIG_NVME_MULTIPATH.
> >
> >Signed-off-by: Mike Snitzer 
> >---
> >  drivers/nvme/host/core.c  | 10 +
> >  drivers/nvme/host/multipath.c | 49 
> > +--
> >  drivers/nvme/host/nvme.h  |  4 
> >  3 files changed, 48 insertions(+), 15 deletions(-)
> >
> >diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> >index fe957166c4a9..3df607905628 100644
> >--- a/drivers/nvme/host/core.c
> >+++ b/drivers/nvme/host/core.c
> >@@ -255,10 +255,12 @@ void nvme_complete_rq(struct request *req)
> > nvme_req(req)->ctrl->comp_seen = true;
> > if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) {
> >-if ((req->cmd_flags & REQ_NVME_MPATH) &&
> >-blk_path_error(status)) {
> >-nvme_failover_req(req);
> >-return;
> >+if (blk_path_error(status)) {
> >+if (req->cmd_flags & REQ_NVME_MPATH) {
> >+nvme_failover_req(req);
> >+return;
> >+}
> >+nvme_update_ana(req);
> > }
> > if (!blk_queue_dying(req->q)) {

...

> >diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> >index 8e03cda770c5..0adbcff5fba2 100644
> >--- a/drivers/nvme/host/multipath.c
> >+++ b/drivers/nvme/host/multipath.c
> >@@ -58,25 +87,22 @@ void nvme_failover_req(struct request *req)
> > spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
> > blk_mq_end_request(req, 0);
> >-switch (status & 0x7ff) {
> >-case NVME_SC_ANA_TRANSITION:
> >-case NVME_SC_ANA_INACCESSIBLE:
> >-case NVME_SC_ANA_PERSISTENT_LOSS:
> >+if (nvme_ana_error(status)) {
> > /*
> >  * If we got back an ANA error we know the controller is alive,
> >  * but not ready to serve this namespaces.  The spec suggests
> >  * we should update our general state here, but due to the fact
> >  * that the admin and I/O queues are not serialized that is
> >  * fundamentally racy.  So instead just clear the current path,
> >- * mark the the path as pending and kick of a re-read of the ANA
> >+ * mark the path as pending and kick off a re-read of the ANA
> >  * log page ASAP.
> >  */
> > nvme_mpath_clear_current_path(ns);
> >-if (ns->ctrl->ana_log_buf) {
> >-set_bit(NVME_NS_ANA_PENDING, &ns->flags);
> >-queue_work(nvme_wq, &ns->ctrl->ana_work);
> >-}
> >-break;
> >+__nvme_update_ana(ns);
> >+goto kick_requeue;
> >+}
> >+
> >+switch (status & 0x7ff) {
> > case NVME_SC_HOST_PATH_ERROR:
> > /*
> >  * Temporary transport disruption in talking to the controller.
> >@@ -93,6 +119,7 @@ void nvme_failover_req(struct request *req)
> > break;
> > }
> >+kick_requeue:
> > kblockd_schedule_work(&ns->head->requeue_work);
> >  }
> Doesn't this need to be protected by 'if (ns->head->disk)' or some such?

No.  nvme_failover_req() is only ever called by native multipathing; see
nvme_complete_rq()'s check for req->cmd_flags & REQ_NVME_MPATH as the
condition for calling nvme_failover_req().

The previous RFC-style patch I posted muddled ANA and multipathing in
nvme_update_ana(), but this final patch submission fixed that because I
saw a cleaner way forward by having nvme_failover_req() also do ANA work
just 

Re: [dm-devel] [PATCH 3/3] block: use a driver-specific handler for the "inflight" value

2018-11-16 Thread Jens Axboe
On 11/16/18 6:55 AM, Mike Snitzer wrote:
> On Fri, Nov 16 2018 at  4:11am -0500,
> Christoph Hellwig  wrote:
> 
>> On Fri, Nov 16, 2018 at 01:04:19AM +0100, Mikulas Patocka wrote:
>>> Device mapper was converted to percpu inflight counters. In order to
>>> display the correct values in the "inflight" sysfs file and in
>>> /proc/diskstats, we need a custom callback that sums the percpu counters.
>>>
>>> The function part_round_stats calculates the number of in-flight I/Os
>>> every jiffy and uses this to calculate the counters time_in_queue and
>>> io_ticks. In order to avoid excessive memory traffic on systems with high
>>> number of CPUs, this functionality is disabled when percpu inflight values
>>> are used and the values time_in_queue and io_ticks are calculated
>>> differently - the result is less precise.
>>
>> And none of that is device mapper specific.  Please submit this code
>> to the block layer for use by all make_request based drivers.  Depending
>> on what Jens as the maintainer thinks of the tradeoffs we can discuss
>> if the summing should be on or off by default, or if we maybe even
>> need the current code as a fallback.
> 
> I agree.
> 
> Mikulas, we discussed that the changes would be made to block core.  I
> know that makes you less comfortable (I assume because you need to
> consider more than DM) but it is the right way forward.
> 
> Now that the legacy IO path is gone we have less to consider; these
> counters are only impacting bio-based.

Agree, either the new code is good enough to be used in general, and
then it should be generally used, or it should not exist in the first
place. We've always worked very hard to provide the most efficient
helpers and infrastructure we can in the block layer itself, so that
drivers don't have to reinvent the wheel.

-- 
Jens Axboe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCH v2] nvme: allow ANA support to be independent of native multipathing

2018-11-16 Thread Mike Snitzer
Whether or not ANA is present is a choice of the target implementation;
the host (and whether it supports multipathing) has _zero_ influence on
this.  If the target declares a path as 'inaccessible' the path _is_
inaccessible to the host.  As such, ANA support should be functional
even if native multipathing is not.

Introduce ability to always re-read ANA log page as required due to ANA
error and make current ANA state available via sysfs -- even if native
multipathing is disabled on the host (via nvme_core.multipath=N).

While implementing these changes care was taken to preserve the exact
ANA functionality and code sequence native multipathing has provided.
This manifests as native multipathing's nvme_failover_req() being
tweaked to call __nvme_update_ana() which was factored out to allow
nvme_update_ana() to be called independent of nvme_failover_req().

Add new module param to allow ANA to be disabled via nvme_core.ana=N.
Also, emit warning if ANA is enabled but native multipathing isn't.

And as always, if embedded NVMe users do not want any performance
overhead associated with ANA or native NVMe multipathing they can
disable CONFIG_NVME_MULTIPATH.

Signed-off-by: Mike Snitzer 
---
 drivers/nvme/host/core.c  | 10 +---
 drivers/nvme/host/multipath.c | 59 ++-
 drivers/nvme/host/nvme.h  |  4 +++
 3 files changed, 57 insertions(+), 16 deletions(-)

v2: add nvme_core.ana modparam and emit warning if ana but !multipath

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fe957166c4a9..3df607905628 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -255,10 +255,12 @@ void nvme_complete_rq(struct request *req)
nvme_req(req)->ctrl->comp_seen = true;
 
if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) {
-   if ((req->cmd_flags & REQ_NVME_MPATH) &&
-   blk_path_error(status)) {
-   nvme_failover_req(req);
-   return;
+   if (blk_path_error(status)) {
+   if (req->cmd_flags & REQ_NVME_MPATH) {
+   nvme_failover_req(req);
+   return;
+   }
+   nvme_update_ana(req);
}
 
if (!blk_queue_dying(req->q)) {
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 8e03cda770c5..8b45cad2734d 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -18,11 +18,16 @@
 static bool multipath = true;
 module_param(multipath, bool, 0444);
 MODULE_PARM_DESC(multipath,
-   "turn on native support for multiple controllers per subsystem");
+   "toggle native support for multiple controllers per subsystem");
+
+static bool ana = true;
+module_param(ana, bool, 0444);
+MODULE_PARM_DESC(ana,
+   "toggle support for Asynchronous Namespace Access");
 
 inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
 {
-   return multipath && ctrl->subsys && (ctrl->subsys->cmic & (1 << 3));
+   return ana && ctrl->subsys && (ctrl->subsys->cmic & (1 << 3));
 }
 
 /*
@@ -47,6 +52,35 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
}
 }
 
+static bool nvme_ana_error(u16 status)
+{
+   switch (status & 0x7ff) {
+   case NVME_SC_ANA_TRANSITION:
+   case NVME_SC_ANA_INACCESSIBLE:
+   case NVME_SC_ANA_PERSISTENT_LOSS:
+   return true;
+   }
+   return false;
+}
+
+static void __nvme_update_ana(struct nvme_ns *ns)
+{
+   if (!ns->ctrl->ana_log_buf)
+   return;
+
+   set_bit(NVME_NS_ANA_PENDING, &ns->flags);
+   queue_work(nvme_wq, &ns->ctrl->ana_work);
+}
+
+void nvme_update_ana(struct request *req)
+{
+   struct nvme_ns *ns = req->q->queuedata;
+   u16 status = nvme_req(req)->status;
+
+   if (nvme_ana_error(status))
+   __nvme_update_ana(ns);
+}
+
 void nvme_failover_req(struct request *req)
 {
struct nvme_ns *ns = req->q->queuedata;
@@ -58,25 +92,22 @@ void nvme_failover_req(struct request *req)
spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
blk_mq_end_request(req, 0);
 
-   switch (status & 0x7ff) {
-   case NVME_SC_ANA_TRANSITION:
-   case NVME_SC_ANA_INACCESSIBLE:
-   case NVME_SC_ANA_PERSISTENT_LOSS:
+   if (nvme_ana_error(status)) {
/*
 * If we got back an ANA error we know the controller is alive,
 * but not ready to serve this namespaces.  The spec suggests
 * we should update our general state here, but due to the fact
 * that the admin and I/O queues are not serialized that is
 * fundamentally racy.  So instead just clear the current path,
-* mark the the path as pending and kick of a re-read of the ANA
+* mark the path as pending and kick off a re-read 

Re: [dm-devel] nvme: allow ANA support to be independent of native multipathing

2018-11-16 Thread Mike Snitzer
On Fri, Nov 16 2018 at  5:17am -0500,
Christoph Hellwig  wrote:

> On Fri, Nov 16, 2018 at 11:06:32AM +0100, Hannes Reinecke wrote:
> > Ok, so would you be happy with making ANA support configurable?
> 
> I've looked a bit over the whole situation, and what I think we need
> to do is:
> 
>  a) warn if we see an ANA capable device without multipath support
> so people know it is not going to work properly.

I disagree with your cynicism but v2 of this patch now emits a warning
accordingly.

>  b) deprecate the multipath module option.  It was only intended as
> a migration for any pre-existing PCIe multipath user if there
> were any, not to support any new functionality.  So for 4.20
> put in a patch that prints a clear warning when it is used,
> including a link to the nvme list, and then for 4.25 or so
> remove it entirely unless something unexpected comes up.

You rejected the idea of allowing fine-grained control over whether
native NVMe multipathing is enabled or not on a per-namespace basis.
All we have is the coarse-grained nvme_core.multipath=N knob.  Now
you're forecasting removing even that.  Please don't do that.

> This whole drama of optional multipath use has wasted way too much
> of everyones time already.

It has wasted _way_ too much time.

But the drama is born out of you rejecting that we need to preserve
multipath-tools and dm-multipath's ability to work across any
transport.  You don't need to do that work: Hannes, myself and others
have always been willing and able -- if you'd let us.

IIRC it was at 2016's LSF in Boston where Ewan Milne and I had a
face-to-face conversation with you in the hallway track where you agreed
that ANA support would be activated if the capability was advertised by
the target.  The model we discussed is that it would be comparable to
how ALUA gets enabled during SCSI LUN discovery.

I hope you can see your way forward to be more accommodating now.
Especially given the proposed changes are backed by NVMe standards.

Please, PLEASE take v2 of this patch.. please? ;)

Thanks,
Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] nvme: allow ANA support to be independent of native multipathing

2018-11-16 Thread Laurence Oberman
On Fri, 2018-11-16 at 14:28 -0500, Mike Snitzer wrote:
> On Fri, Nov 16 2018 at  5:17am -0500,
> Christoph Hellwig  wrote:
> 
> > On Fri, Nov 16, 2018 at 11:06:32AM +0100, Hannes Reinecke wrote:
> > > Ok, so would you be happy with making ANA support configurable?
> > 
> > I've looked a bit over the whole situation, and what I think we
> > need
> > to do is:
> > 
> >  a) warn if we see an ANA capable device without multipath support
> > so people know it is not going to work properly.
> 
> I disagree with your cynicism but v2 of this patch now emits a
> warning
> accordingly.
> 
> >  b) deprecate the multipath module option.  It was only intended as
> > a migration for any pre-existing PCIe multipath user if there
> > were any, not to support any new functionality.  So for 4.20
> > put in a patch that prints a clear warning when it is used,
> > including a link to the nvme list, and then for 4.25 or so
> > remove it entirely unless something unexpected comes up.
> 
> You rejected the idea of allowing fine-grained control over whether
> native NVMe multipathing is enabled or not on a per-namespace basis.
> All we have is the coarse-grained nvme_core.multipath=N knob.  Now
> you're forecasting removing even that.  Please don't do that.
> 
> > This whole drama of optional multipath use has wasted way too much
> > of everyones time already.
> 
> It has wasted _way_ too much time.
> 
> But the drama is born out of you rejecting that we need to preserve
> multipath-tools and dm-multipath's ability to work across any
> transport.  You don't need to do that work: Hannes, myself and others
> have always been willing and able -- if you'd let us.
> 
> IIRC it was at 2016's LSF in Boston where Ewan Milne and I had a
> face-to-face conversation with you in the hallway track where you
> agreed
> that ANA support would be activated if the capability was advertised
> by
> the target.  The model we discussed is that it would be comparable to
> how ALUA gets enabled during SCSI LUN discovery.
> 
> I hope you can see your way forward to be more accommodating now.
> Especially given the proposed changes are backed by NVMe standards.
> 
> Please, PLEASE take v2 of this patch.. please? ;)
> 
> Thanks,
> Mike

I am begging you to take it too, please.
Thanks
Laurence

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Re: [dm-devel] [PATCH V10 15/19] block: always define BIO_MAX_PAGES as 256

2018-11-16 Thread Christoph Hellwig
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 5040e9a2eb09..277921ad42e7 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -34,15 +34,7 @@
>  #define BIO_BUG_ON
>  #endif
>  
> -#ifdef CONFIG_THP_SWAP
> -#if HPAGE_PMD_NR > 256
> -#define BIO_MAX_PAGES        HPAGE_PMD_NR
> -#else
>  #define BIO_MAX_PAGES        256
> -#endif
> -#else
> -#define BIO_MAX_PAGES        256
> -#endif

Looks good.  This mess should have never gone in.

Reviewed-by: Christoph Hellwig 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH V10 00/19] block: support multi-page bvec

2018-11-16 Thread Ming Lei
On Fri, Nov 16, 2018 at 03:03:14PM +0100, Christoph Hellwig wrote:
> It seems like bi_phys_segments is still around at the end of this series.
> Shouldn't it be superfluous now?

Even though multi-page bvec is supported, the segment number doesn't
equal the actual bvec count yet; for example, one bvec may be too
big to be held in one single segment.
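
For example (illustrative numbers only): a single physically contiguous
256KB bvec queued against a device with a 64KB max_segment_size still has
to be mapped as 256/64 = 4 physical segments, so the bvec count (1) and
the physical segment count (4) differ.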


Thanks,
Ming

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel