date:20170501

Re: [PATCH 25/27] block: remove the discard_zeroes_data flag

2017-05-01 Thread Nicholas A. Bellinger

On Mon, 2017-05-01 at 20:45 +, Bart Van Assche wrote:
> On Wed, 2017-04-05 at 19:21 +0200, Christoph Hellwig wrote:
> > Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
> > kill this hack.
> > 
> > Signed-off-by: Christoph Hellwig 
> > Reviewed-by: Martin K. Petersen 
> > Reviewed-by: Hannes Reinecke 
> > [ ... ]
> > diff --git a/drivers/target/target_core_device.c 
> > b/drivers/target/target_core_device.c
> > index c754ae33bf7b..d2f089cfa9ae 100644
> > --- a/drivers/target/target_core_device.c
> > +++ b/drivers/target/target_core_device.c
> > @@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct 
> > se_dev_attrib *attrib,
> > attrib->unmap_granularity = q->limits.discard_granularity / block_size;
> > attrib->unmap_granularity_alignment = q->limits.discard_alignment /
> > block_size;
> > -   attrib->unmap_zeroes_data = q->limits.discard_zeroes_data;
> > +   attrib->unmap_zeroes_data = 0;
> > return true;
> >  }
> >  EXPORT_SYMBOL(target_configure_unmap_from_queue);
> 
> Hello Christoph,
> 
> Sorry that I hadn't noticed this before but I think that this patch
> introduces a significant performance regressions for LIO users. Before
> this patch the LBPRZ flag was reported correctly to initiator systems
> through the thin provisioning VPD. With this patch applied that flag
> will always be reported as zero, forcing initiators to submit WRITE
> commands with zeroed data buffers instead of submitting the SCSI UNMAP
> command to block devices for which discard_zeroes_data was set. From
> target_core_spc.c:
> 
> /* Thin Provisioning VPD */
> static sense_reason_t spc_emulate_evpd_b2(struct se_cmd *cmd, unsigned char 
> *buf)
> {
>   [ ... ]
>   /*
>* The unmap_zeroes_data set means that the underlying device supports
>* REQ_DISCARD and has the discard_zeroes_data bit set. This satisfies
>* the SBC requirements for LBPRZ, meaning that a subsequent read
>* will return zeroes after an UNMAP or WRITE SAME (16) to an LBA
>* See sbc4r36 6.6.4.
>*/
>   if (((dev->dev_attrib.emulate_tpu != 0) ||
>(dev->dev_attrib.emulate_tpws != 0)) &&
>(dev->dev_attrib.unmap_zeroes_data != 0))
>   buf[5] |= 0x04;
>   [ ... ]
> }
> 

According to sd_config_discard(), it's SD_LBP_WS16, SD_LBP_WS10 and
SD_LBP_ZERO that where ever setting unmap_zeros_data = 1 to begin with.
For UNMAP, q->limits.discard_zeroes_data was never set.

That said, it's pretty much implied that supporting DISCARD means
subsequent READs return zeros, so target_configure_unmap_from_queue()
should be setting attrib->unmap_zeroes_data = 1, or just dropping it
all-together.

Post -rc1, I'll push a patch to do the latter.

Re: [PATCH 7/7] Revert "mtip32xx: pass BLK_MQ_F_NO_SCHED"

2017-05-01 Thread Hannes Reinecke

On 04/28/2017 04:55 PM, Jens Axboe wrote:
> This reverts commit 4981d04dd8f1ab19e2cce008da556d7f099b6e68.
> 
> The driver has been converted to using the proper infrastructure
> for issuing internal commands. This means it's now safe to use with
> the scheduling infrastruture, so we can now revert the change
> that turned off scheduling for mtip32xx.
> 
> Reviewed-by: Ming Lei 
> Signed-off-by: Jens Axboe 
> ---
>  drivers/block/mtip32xx/mtip32xx.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/block/mtip32xx/mtip32xx.c 
> b/drivers/block/mtip32xx/mtip32xx.c
> index 9780c4d3697c..c598c5ac2fd0 100644
> --- a/drivers/block/mtip32xx/mtip32xx.c
> +++ b/drivers/block/mtip32xx/mtip32xx.c
> @@ -3964,7 +3964,7 @@ static int mtip_block_initialize(struct driver_data *dd)
>   dd->tags.reserved_tags = 1;
>   dd->tags.cmd_size = sizeof(struct mtip_cmd);
>   dd->tags.numa_node = dd->numa_node;
> - dd->tags.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_SCHED;
> + dd->tags.flags = BLK_MQ_F_SHOULD_MERGE;
>   dd->tags.driver_data = dd;
>   dd->tags.timeout = MTIP_NCQ_CMD_TIMEOUT_MS;
>  
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

Re: [PATCH 6/7] blk-mq-sched: remove hack that bypasses scheduler for reserved requests

2017-05-01 Thread Hannes Reinecke

On 04/28/2017 04:55 PM, Jens Axboe wrote:
> We have update the troublesome driver (mtip32xx) to deal with this
> appropriately. So kill the hack that bypassed scheduler allocation
> and insertion for reserved requests.
> 
updated?

> Signed-off-by: Jens Axboe 
> ---
>  block/blk-mq-sched.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 8b361e192e8a..e79e9f18d7c2 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -82,11 +82,7 @@ struct request *blk_mq_sched_get_request(struct 
> request_queue *q,
>   if (likely(!data->hctx))
>   data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
>  
> - /*
> -  * For a reserved tag, allocate a normal request since we might
> -  * have driver dependencies on the value of the internal tag.
> -  */
> - if (e && !(data->flags & BLK_MQ_REQ_RESERVED)) {
> + if (e) {
>   data->flags |= BLK_MQ_REQ_INTERNAL;
>  
>   /*
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

Re: [PATCH 5/7] mtip32xx: convert internal command issue to block IO path

2017-05-01 Thread Hannes Reinecke

On 04/28/2017 04:55 PM, Jens Axboe wrote:
> The driver special cases certain things for command issue, depending
> on whether it's an internal command or not. Make the internal commands
> use the regular infrastructure for issuing IO.
> 
> Since this is an 8-group souped up AHCI variant, we have to deal
> with NCQ vs non-queueable commands. Do this from the queue_rq
> handler, by backing off unless the drive is idle.
> 
> Signed-off-by: Jens Axboe 
> ---
>  drivers/block/mtip32xx/mtip32xx.c | 106 
> +++---
>  1 file changed, 76 insertions(+), 30 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

Re: [PATCH 4/7] blk-mq: add and export helper to tell if a request is reserved

2017-05-01 Thread Hannes Reinecke

On 04/28/2017 04:55 PM, Jens Axboe wrote:
> We only have an internal helper for checking a tag value. Add
> an exported helper that takes a request and hardware queue,
> and check against the driver tags.
> 
> Signed-off-by: Jens Axboe 
> ---
>  block/blk-mq-tag.c | 6 ++
>  include/linux/blk-mq.h | 1 +
>  2 files changed, 7 insertions(+)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

Re: [PATCH 3/4] blk-mq: use hw tag for scheduling if hw tag space is big enough

2017-05-01 Thread Omar Sandoval

On Mon, May 01, 2017 at 03:06:16PM +, Bart Van Assche wrote:
> On Sat, 2017-04-29 at 18:35 +0800, Ming Lei wrote:
> > On Fri, Apr 28, 2017 at 06:09:40PM +, Bart Van Assche wrote:
> > > On Fri, 2017-04-28 at 23:15 +0800, Ming Lei wrote:
> > > > +static inline bool blk_mq_sched_may_use_hw_tag(struct request_queue *q)
> > > > +{
> > > > +   if (q->tag_set->flags & BLK_MQ_F_TAG_SHARED)
> > > > +   return false;
> > > > +
> > > > +   if (blk_mq_get_queue_depth(q) < q->nr_requests)
> > > > +   return false;
> > > > +
> > > > +   return true;
> > > > +}
> > > 
> > > The only user of shared tag sets I know of is scsi-mq. I think it's really
> > > unfortunate that this patch systematically disables 
> > > BLK_MQ_F_SCHED_USE_HW_TAG
> > > for scsi-mq.
> > 
> > In previous patch, I actually allow driver to pass this flag, but this
> > feature is dropped in this post, just for making it simple & clean.
> > If you think we need it for shared tag set, I can add it in v1.
> > 
> > For shared tag sets, I suggest to not enable it at default, because
> > scheduler is per request queue now, and generaly more requests available,
> > better it performs.  When tags are shared among several request
> > queues, one of them may use tags up for its own scheduling, then
> > starve others. But it should be possible and not difficult to allocate
> > requests fairly for scheduling in this case if we switch to per-hctx
> > scheduling.
> 
> Hello Ming,
> 
> Have you noticed that there is already a mechanism in the block layer to
> avoid starvation if a tag set is shared? The hctx_may_queue() function
> guarantees that each user that shares a tag set gets at least some tags.
> The .active_queues counter keeps track of the number of hardware queues
> that share a tag set.
> 
> Bart.

The scheduler tags are there to abstract away the hardware, and
USE_HW_TAG should just be an optimization for when that abstraction is a
noop. That's not the case when there are shared tags, and I doubt that
the overhead of the scheduler tags is significant for scsi-mq. Let's
stick with the behavior Ming had here.

[PATCH 03/13] blk: make the bioset rescue_workqueue optional.

2017-05-01 Thread NeilBrown

This patch converts bioset_create() and
bioset_create_nobvec() to not create a workqueue so
alloctions will never trigger punt_bios_to_rescuer().  It
also introduces bioset_create_rescued() and
bioset_create_nobvec_rescued() which preserve the old
behaviour.

All callers of bioset_create() and bioset_create_nobvec(),
that are inside block device drivers, are converted to the
_rescued() version.

biosets used by filesystems or other top-level users do not
need rescuing as the bio can never be queued behind other
bios.  This includes fs_bio_set, blkdev_dio_pool,
btrfs_bioset, xfs_ioend_bioset, drbd_md_io_bio_set,
and one allocated by target_core_iblock.c.

biosets used by md/raid to not need rescuing as
their usage was recently audited to revised to never
risk deadlock.

It is hoped that most, if not all, of the remaining biosets
can end up being the non-rescued version.

Signed-off-by: NeilBrown 
---
 block/bio.c   |   12 ++--
 block/blk-core.c  |2 +-
 drivers/md/bcache/super.c |6 --
 drivers/md/dm-crypt.c |2 +-
 drivers/md/dm-io.c|2 +-
 drivers/md/dm.c   |5 +++--
 include/linux/bio.h   |1 +
 7 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 3f1286ed301e..257606e742e0 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -363,6 +363,8 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
struct bio_list punt, nopunt;
struct bio *bio;
 
+   if (!WARN_ON_ONCE(!bs->rescue_workqueue))
+   return;
/*
 * In order to guarantee forward progress we must punt only bios that
 * were allocated from this bio_set; otherwise, if there was a bio on
@@ -474,7 +476,8 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned int 
nr_iovecs,
 
if (current->bio_list &&
(!bio_list_empty(¤t->bio_list[0]) ||
-!bio_list_empty(¤t->bio_list[1])))
+!bio_list_empty(¤t->bio_list[1])) &&
+   bs->rescue_workqueue)
gfp_mask &= ~__GFP_DIRECT_RECLAIM;
 
p = mempool_alloc(bs->bio_pool, gfp_mask);
@@ -1925,7 +1928,7 @@ EXPORT_SYMBOL(bioset_free);
  * bioset_create  - Create a bio_set
  * @pool_size: Number of bio and bio_vecs to cache in the mempool
  * @front_pad: Number of bytes to allocate in front of the returned bio
- * @flags: Flags to modify behavior, currently only %BIOSET_NEED_BVECS
+ * @flags: Flags to modify behavior, currently %BIOSET_NEED_BVECS and 
%BIOSET_NEED_RESCUER
  *
  * Description:
  *Set up a bio_set to be used with @bio_alloc_bioset. Allows the caller
@@ -1936,6 +1939,8 @@ EXPORT_SYMBOL(bioset_free);
  *or things will break badly.
  *If %BIOSET_NEED_BVECS is set in @flags, a separate pool will be allocated
  *for allocating iovecs.  This pool is not needed e.g. for 
bio_clone_fast().
+ *If %BIOSET_NEED_RESCUER is set, workqueue is create which can be used to
+ *dispatch queued requests when the mempool runs out of space.
  *
  */
 struct bio_set *bioset_create(unsigned int pool_size,
@@ -1971,6 +1976,9 @@ struct bio_set *bioset_create(unsigned int pool_size,
goto bad;
}
 
+   if (!(flags & BIOSET_NEED_RESCUER))
+   return bs;
+
bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
if (!bs->rescue_workqueue)
goto bad;
diff --git a/block/blk-core.c b/block/blk-core.c
index 3797753f4085..ae51f159a2ca 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -730,7 +730,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
if (q->id < 0)
goto fail_q;
 
-   q->bio_split = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
+   q->bio_split = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS | 
BIOSET_NEED_RESCUER);
if (!q->bio_split)
goto fail_id;
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 8b9dae98cc7e..bb50991a82d3 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -786,7 +786,8 @@ static int bcache_device_init(struct bcache_device *d, 
unsigned block_size,
 
minor *= BCACHE_MINORS;
 
-   if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio), 
BIOSET_NEED_BVECS)) ||
+   if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio),
+  BIOSET_NEED_BVECS | 
BIOSET_NEED_RESCUER)) ||
!(d->disk = alloc_disk(BCACHE_MINORS))) {
ida_simple_remove(&bcache_minor, minor);
return -ENOMEM;
@@ -1520,7 +1521,8 @@ struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
sizeof(struct bbio) + sizeof(struct bio_vec) *
bucket_pages(c))) ||
!(c->fill_iter = mempool_create_kmalloc_pool(1, iter_size))

[PATCH 01/13] blk: remove bio_set arg from blk_queue_split()

2017-05-01 Thread NeilBrown

blk_queue_split() is always called with the last arg being q->bio_split,
where 'q' is the first arg.

Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
q->bio_split.

This is inconsistent and unnecessary.  Remove the last arg and always use
q->bio_split inside blk_queue_split()

Reviewed-by: Christoph Hellwig 
Reviewed-by: Ming Lei 
Credit-to: Javier González  (Noticed that lightnvm was missed)
Signed-off-by: NeilBrown 
---
 block/blk-core.c  |2 +-
 block/blk-merge.c |9 -
 block/blk-mq.c|2 +-
 drivers/block/drbd/drbd_req.c |2 +-
 drivers/block/pktcdvd.c   |2 +-
 drivers/block/ps3vram.c   |2 +-
 drivers/block/rsxx/dev.c  |2 +-
 drivers/block/umem.c  |2 +-
 drivers/block/zram/zram_drv.c |2 +-
 drivers/lightnvm/pblk-init.c  |4 ++--
 drivers/lightnvm/rrpc.c   |2 +-
 drivers/md/md.c   |2 +-
 drivers/s390/block/dcssblk.c  |2 +-
 drivers/s390/block/xpram.c|2 +-
 include/linux/blkdev.h|3 +--
 15 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 24886b69690f..1a05c505964d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1663,7 +1663,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, 
struct bio *bio)
 */
blk_queue_bounce(q, &bio);
 
-   blk_queue_split(q, &bio, q->bio_split);
+   blk_queue_split(q, &bio);
 
if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
bio->bi_error = -EIO;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3990ae406341..d59074556703 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -202,8 +202,7 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
return do_split ? new : NULL;
 }
 
-void blk_queue_split(struct request_queue *q, struct bio **bio,
-struct bio_set *bs)
+void blk_queue_split(struct request_queue *q, struct bio **bio)
 {
struct bio *split, *res;
unsigned nsegs;
@@ -211,13 +210,13 @@ void blk_queue_split(struct request_queue *q, struct bio 
**bio,
switch (bio_op(*bio)) {
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
-   split = blk_bio_discard_split(q, *bio, bs, &nsegs);
+   split = blk_bio_discard_split(q, *bio, q->bio_split, &nsegs);
break;
case REQ_OP_WRITE_ZEROES:
-   split = blk_bio_write_zeroes_split(q, *bio, bs, &nsegs);
+   split = blk_bio_write_zeroes_split(q, *bio, q->bio_split, 
&nsegs);
break;
case REQ_OP_WRITE_SAME:
-   split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
+   split = blk_bio_write_same_split(q, *bio, q->bio_split, &nsegs);
break;
default:
split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index bf90684a007a..45f084fb2052 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1543,7 +1543,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue 
*q, struct bio *bio)
return BLK_QC_T_NONE;
}
 
-   blk_queue_split(q, &bio, q->bio_split);
+   blk_queue_split(q, &bio);
 
if (!is_flush_fua && !blk_queue_nomerges(q) &&
blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq))
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index b5730e17b455..fa62dd8a4d46 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1557,7 +1557,7 @@ blk_qc_t drbd_make_request(struct request_queue *q, 
struct bio *bio)
struct drbd_device *device = (struct drbd_device *) q->queuedata;
unsigned long start_jif;
 
-   blk_queue_split(q, &bio, q->bio_split);
+   blk_queue_split(q, &bio);
 
start_jif = jiffies;
 
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 205b865ebeb9..1457e65c50b1 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2414,7 +2414,7 @@ static blk_qc_t pkt_make_request(struct request_queue *q, 
struct bio *bio)
 
blk_queue_bounce(q, &bio);
 
-   blk_queue_split(q, &bio, q->bio_split);
+   blk_queue_split(q, &bio);
 
pd = q->queuedata;
if (!pd) {
diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index 456b4fe21559..48072c0c1010 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -606,7 +606,7 @@ static blk_qc_t ps3vram_make_request(struct request_queue 
*q, struct bio *bio)
 
dev_dbg(&dev->core, "%s\n", __func__);
 
-   blk_queue_split(q, &bio, q->bio_split);
+   blk_queue_split(q, &bio);
 
spin_lock_irq(&priv->lock);
busy = !bio_list_empty(&priv->list);
diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index 9c566364ac9c..01624eaefcba 100644
--- a/drivers/block

[PATCH 04/13] blk: use non-rescuing bioset for q->bio_split.

2017-05-01 Thread NeilBrown

A rescuing bioset is only useful if there might be bios from
that same bioset on the bio_list_on_stack queue at a time
when bio_alloc_bioset() is called.  This never applies to
q->bio_split.

Allocations from q->bio_split are only ever made from
blk_queue_split() which is only ever called early in each of
various make_request_fn()s.  The original bio (call this A)
is then passed to generic_make_request() and is placed on
the bio_list_on_stack queue, and the bio that was allocated
from q->bio_split (B) is processed.

The processing of this may cause other bios to be passed to
generic_make_request() or may even cause the bio B itself to
be passed, possible after some prefix has been split off
(using some other bioset).

generic_make_request() now guarantees that all of these bios
(B and dependants) will be fully processed before the tail
of the original bio A gets handled.  None of these early bios
can possible trigger an allocation from the original
q->bio_split as they are either too small to require
splitting or (more likely) are destined for a different queue.

The next time that the original q->bio_split might be used
by this thread is when A is processed again, as it might
still be too big to handle directly.  By this time there
cannot be any other bios allocated from q->bio_split in the
generic_make_request() queue.  So no rescuing will ever be
needed.

Reviewed-by: Christoph Hellwig 
Signed-off-by: NeilBrown 
---
 block/blk-core.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ae51f159a2ca..3797753f4085 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -730,7 +730,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
if (q->id < 0)
goto fail_q;
 
-   q->bio_split = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS | 
BIOSET_NEED_RESCUER);
+   q->bio_split = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
if (!q->bio_split)
goto fail_id;

[PATCH 06/13] rbd: use bio_clone_fast() instead of bio_clone()

2017-05-01 Thread NeilBrown

bio_clone() makes a copy of the bi_io_vec, but rbd never changes that,
so there is no need for a copy.
bio_clone_fast() can be used instead, which avoids making the copy.

This requires that we provide a bio_set.  bio_clone() uses fs_bio_set,
but it isn't, in general, safe to use the same bio_set at different
levels of the stack, as that can lead to deadlocks.  As filesystems
use fs_bio_set, block devices shouldn't.

As rbd never stacks, it is safe to have a single global bio_set for
all rbd devices to use.  So allocate that when the module is
initialised, and use it with bio_clone_fast().

Reviewed-by: Christoph Hellwig 
Signed-off-by: NeilBrown 
---
 drivers/block/rbd.c |   16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 089ac4179919..52120ed9385f 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -441,6 +441,8 @@ static DEFINE_SPINLOCK(rbd_client_list_lock);
 static struct kmem_cache   *rbd_img_request_cache;
 static struct kmem_cache   *rbd_obj_request_cache;
 
+static struct bio_set  *rbd_bio_clone;
+
 static int rbd_major;
 static DEFINE_IDA(rbd_dev_id_ida);
 
@@ -1362,7 +1364,7 @@ static struct bio *bio_clone_range(struct bio *bio_src,
 {
struct bio *bio;
 
-   bio = bio_clone(bio_src, gfpmask);
+   bio = bio_clone_fast(bio_src, gfpmask, rbd_bio_clone);
if (!bio)
return NULL;/* ENOMEM */
 
@@ -6342,8 +6344,16 @@ static int rbd_slab_init(void)
if (!rbd_obj_request_cache)
goto out_err;
 
+   rbd_assert(!rbd_bio_clone);
+   rbd_bio_clone = bioset_create(BIO_POOL_SIZE, 0, 0);
+   if (!rbd_bio_clone)
+   goto out_err_clone;
+
return 0;
 
+out_err_clone:
+   kmem_cache_destroy(rbd_obj_request_cache);
+   rbd_obj_request_cache = NULL;
 out_err:
kmem_cache_destroy(rbd_img_request_cache);
rbd_img_request_cache = NULL;
@@ -6359,6 +6369,10 @@ static void rbd_slab_exit(void)
rbd_assert(rbd_img_request_cache);
kmem_cache_destroy(rbd_img_request_cache);
rbd_img_request_cache = NULL;
+
+   rbd_assert(rbd_bio_clone);
+   bioset_free(rbd_bio_clone);
+   rbd_bio_clone = NULL;
 }
 
 static int __init rbd_init(void)

[PATCH 08/13] pktcdvd: use bio_clone_fast() instead of bio_clone()

2017-05-01 Thread NeilBrown

pktcdvd doesn't change the bi_io_vec of the clone bio,
so it is more efficient to use bio_clone_fast(), and not clone
the bi_io_vec.
This requires providing a bio_set, and it is safest to
provide a dedicated bio_set rather than sharing
fs_bio_set, which filesytems use.
This new bio_set, pkt_bio_set, can also be use for the bio_split()
call as the two allocations (bio_clone_fast, and bio_split) are
independent, neither can block a bio allocated by the other.

Reviewed-by: Christoph Hellwig 
Signed-off-by: NeilBrown 
---
 drivers/block/pktcdvd.c |   12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 1457e65c50b1..6f9c31fae335 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -98,6 +98,7 @@ static int write_congestion_on  = PKT_WRITE_CONGESTION_ON;
 static int write_congestion_off = PKT_WRITE_CONGESTION_OFF;
 static struct mutex ctl_mutex; /* Serialize open/close/setup/teardown */
 static mempool_t *psd_pool;
+static struct bio_set *pkt_bio_set;
 
 static struct class*class_pktcdvd = NULL;/* /sys/class/pktcdvd */
 static struct dentry   *pkt_debugfs_root = NULL; /* /sys/kernel/debug/pktcdvd 
*/
@@ -2310,7 +2311,7 @@ static void pkt_end_io_read_cloned(struct bio *bio)
 
 static void pkt_make_request_read(struct pktcdvd_device *pd, struct bio *bio)
 {
-   struct bio *cloned_bio = bio_clone(bio, GFP_NOIO);
+   struct bio *cloned_bio = bio_clone_fast(bio, GFP_NOIO, pkt_bio_set);
struct packet_stacked_data *psd = mempool_alloc(psd_pool, GFP_NOIO);
 
psd->pd = pd;
@@ -2455,7 +2456,7 @@ static blk_qc_t pkt_make_request(struct request_queue *q, 
struct bio *bio)
 
split = bio_split(bio, last_zone -
  bio->bi_iter.bi_sector,
- GFP_NOIO, fs_bio_set);
+ GFP_NOIO, pkt_bio_set);
bio_chain(split, bio);
} else {
split = bio;
@@ -2919,6 +2920,11 @@ static int __init pkt_init(void)
sizeof(struct packet_stacked_data));
if (!psd_pool)
return -ENOMEM;
+   pkt_bio_set = bioset_create(BIO_POOL_SIZE, 0, 0);
+   if (!pkt_bio_set) {
+   mempool_destroy(psd_pool);
+   return -ENOMEM;
+   }
 
ret = register_blkdev(pktdev_major, DRIVER_NAME);
if (ret < 0) {
@@ -2951,6 +2957,7 @@ static int __init pkt_init(void)
unregister_blkdev(pktdev_major, DRIVER_NAME);
 out2:
mempool_destroy(psd_pool);
+   bioset_free(pkt_bio_set);
return ret;
 }
 
@@ -2964,6 +2971,7 @@ static void __exit pkt_exit(void)
 
unregister_blkdev(pktdev_major, DRIVER_NAME);
mempool_destroy(psd_pool);
+   bioset_free(pkt_bio_set);
 }
 
 MODULE_DESCRIPTION("Packet writing layer for CD/DVD drives");

[PATCH 11/13] bcache: use kmalloc to allocate bio in bch_data_verify()

2017-05-01 Thread NeilBrown

This function allocates a bio, then a collection
of pages.  It copes with failure.

It currently uses a mempool() to allocate the bio,
but alloc_page() to allocate the pages.  These fail
in different ways, so the usage is inconsistent.

Change the bio_clone() to bio_clone_kmalloc()
so that no pool is used either for the bio or the pages.

Reviewed-by: Christoph Hellwig 
Acked-by: Kent Overstreet 
Reviewed-by : Ming Lei 
Signed-off-by: NeilBrown 
---
 drivers/md/bcache/debug.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 06f55056aaae..35a5a7210e51 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -110,7 +110,7 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
struct bio_vec bv, cbv;
struct bvec_iter iter, citer = { 0 };
 
-   check = bio_clone(bio, GFP_NOIO);
+   check = bio_clone_kmalloc(bio, GFP_NOIO);
if (!check)
return;
check->bi_opf = REQ_OP_READ;

[PATCH 12/13] block: remove bio_clone() and all references.

2017-05-01 Thread NeilBrown

bio_clone() is no longer used.
Only bio_clone_bioset() or bio_clone_fast().
This is for the best, as bio_clone() used fs_bio_set,
and filesystems are unlikely to want to use bio_clone().

So remove bio_clone() and all references.
This includes a fix to some incorrect documentation.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Ming Lei 
Signed-off-by: NeilBrown 
---
 Documentation/block/biodoc.txt |2 +-
 block/bio.c|2 +-
 block/blk-merge.c  |6 +++---
 drivers/md/md.c|2 +-
 include/linux/bio.h|5 -
 5 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
index 01ddeaf64b0f..9490f2845f06 100644
--- a/Documentation/block/biodoc.txt
+++ b/Documentation/block/biodoc.txt
@@ -632,7 +632,7 @@ to i/o submission, if the bio fields are likely to be 
accessed after the
 i/o is issued (since the bio may otherwise get freed in case i/o completion
 happens in the meantime).
 
-The bio_clone() routine may be used to duplicate a bio, where the clone
+The bio_clone_fast() routine may be used to duplicate a bio, where the clone
 shares the bio_vec_list with the original bio (i.e. both point to the
 same bio_vec_list). This would typically be used for splitting i/o requests
 in lvm or md.
diff --git a/block/bio.c b/block/bio.c
index 257606e742e0..d82f6da94421 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -547,7 +547,7 @@ EXPORT_SYMBOL(zero_fill_bio);
  *
  * Description:
  *   Put a reference to a &struct bio, either one you have gotten with
- *   bio_alloc, bio_get or bio_clone. The last put of a bio will free it.
+ *   bio_alloc, bio_get or bio_clone_*. The last put of a bio will free it.
  **/
 void bio_put(struct bio *bio)
 {
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 51c84540d3bb..e7862e9dcc39 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -115,13 +115,13 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
 * With arbitrary bio size, the incoming bio may be very
 * big. We have to split the bio into small bios so that
 * each holds at most BIO_MAX_PAGES bvecs because
-* bio_clone() can fail to allocate big bvecs.
+* bio_clone_bioset() can fail to allocate big bvecs.
 *
-* Those drivers which will need to use bio_clone()
+* Those drivers which will need to use bio_clone_bioset()
 * should tell us in some way.  For now, impose the
 * BIO_MAX_PAGES limit on all queues.
 *
-* TODO: handle users of bio_clone() differently.
+* TODO: handle users of bio_clone_bioset() differently.
 */
if (bvecs++ >= BIO_MAX_PAGES)
goto split;
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1379fb636de2..cd2fa40e9af3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -185,7 +185,7 @@ static int start_readonly;
 static bool create_on_open = true;
 
 /* bio_clone_mddev
- * like bio_clone, but with a local bio set
+ * like bio_clone_bioset, but with a local bio set
  */
 
 struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 40e5d7b62f29..538c981ac26a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -395,11 +395,6 @@ static inline struct bio *bio_alloc(gfp_t gfp_mask, 
unsigned int nr_iovecs)
return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
 }
 
-static inline struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
-{
-   return bio_clone_bioset(bio, gfp_mask, fs_bio_set);
-}
-
 static inline struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
 {
return bio_alloc_bioset(gfp_mask, nr_iovecs, NULL);

[PATCH 10/13] xen-blkfront: remove bio splitting.

2017-05-01 Thread NeilBrown

bios that are re-submitted will pass through blk_queue_split() when
blk_queue_bio() is called, and this will split the bio if necessary.
There is no longer any need to do this splitting in xen-blkfront.

Acked-by: Roger Pau Monné 
Signed-off-by: NeilBrown 
---
 drivers/block/xen-blkfront.c |   54 ++
 1 file changed, 3 insertions(+), 51 deletions(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 39459631667c..fb963a010d4d 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -110,11 +110,6 @@ struct blk_shadow {
unsigned long associated_id;
 };
 
-struct split_bio {
-   struct bio *bio;
-   atomic_t pending;
-};
-
 struct blkif_req {
int error;
 };
@@ -1996,28 +1991,13 @@ static int blkfront_probe(struct xenbus_device *dev,
return 0;
 }
 
-static void split_bio_end(struct bio *bio)
-{
-   struct split_bio *split_bio = bio->bi_private;
-
-   if (atomic_dec_and_test(&split_bio->pending)) {
-   split_bio->bio->bi_phys_segments = 0;
-   split_bio->bio->bi_error = bio->bi_error;
-   bio_endio(split_bio->bio);
-   kfree(split_bio);
-   }
-   bio_put(bio);
-}
-
 static int blkif_recover(struct blkfront_info *info)
 {
-   unsigned int i, r_index;
+   unsigned int r_index;
struct request *req, *n;
int rc;
-   struct bio *bio, *cloned_bio;
-   unsigned int segs, offset;
-   int pending, size;
-   struct split_bio *split_bio;
+   struct bio *bio;
+   unsigned int segs;
 
blkfront_gather_backend_features(info);
/* Reset limits changed by blk_mq_update_nr_hw_queues(). */
@@ -2056,34 +2036,6 @@ static int blkif_recover(struct blkfront_info *info)
 
while ((bio = bio_list_pop(&info->bio_list)) != NULL) {
/* Traverse the list of pending bios and re-queue them */
-   if (bio_segments(bio) > segs) {
-   /*
-* This bio has more segments than what we can
-* handle, we have to split it.
-*/
-   pending = (bio_segments(bio) + segs - 1) / segs;
-   split_bio = kzalloc(sizeof(*split_bio), GFP_NOIO);
-   BUG_ON(split_bio == NULL);
-   atomic_set(&split_bio->pending, pending);
-   split_bio->bio = bio;
-   for (i = 0; i < pending; i++) {
-   offset = (i * segs * XEN_PAGE_SIZE) >> 9;
-   size = min((unsigned int)(segs * XEN_PAGE_SIZE) 
>> 9,
-  (unsigned int)bio_sectors(bio) - 
offset);
-   cloned_bio = bio_clone(bio, GFP_NOIO);
-   BUG_ON(cloned_bio == NULL);
-   bio_trim(cloned_bio, offset, size);
-   cloned_bio->bi_private = split_bio;
-   cloned_bio->bi_end_io = split_bio_end;
-   submit_bio(cloned_bio);
-   }
-   /*
-* Now we have to wait for all those smaller bios to
-* end, so we can also end the "parent" bio.
-*/
-   continue;
-   }
-   /* We don't need to split this bio */
submit_bio(bio);
}

[PATCH 13/13] block: don't check for BIO_MAX_PAGES in blk_bio_segment_split()

2017-05-01 Thread NeilBrown

blk_bio_segment_split() makes sure bios have no more than
BIO_MAX_PAGES entries in the bi_io_vec.
This was done because bio_clone_bioset() (when given a
mempool bioset) could not handle larger io_vecs.

No driver uses bio_clone_bioset() any more, they all
use bio_clone_fast() if anything, and bio_clone_fast()
doesn't clone the bi_io_vec.

The main user of of bio_clone_bioset() at this level
is bounce.c, and bouncing now happens before blk_bio_segment_split(),
so that is not of concern.

So remove the big helpful comment and the code.

Signed-off-by: NeilBrown 
---
 block/blk-merge.c |   16 
 1 file changed, 16 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index e7862e9dcc39..cea544ec5d96 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -108,25 +108,9 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
bool do_split = true;
struct bio *new = NULL;
const unsigned max_sectors = get_max_io_size(q, bio);
-   unsigned bvecs = 0;
 
bio_for_each_segment(bv, bio, iter) {
/*
-* With arbitrary bio size, the incoming bio may be very
-* big. We have to split the bio into small bios so that
-* each holds at most BIO_MAX_PAGES bvecs because
-* bio_clone_bioset() can fail to allocate big bvecs.
-*
-* Those drivers which will need to use bio_clone_bioset()
-* should tell us in some way.  For now, impose the
-* BIO_MAX_PAGES limit on all queues.
-*
-* TODO: handle users of bio_clone_bioset() differently.
-*/
-   if (bvecs++ >= BIO_MAX_PAGES)
-   goto split;
-
-   /*
 * If the queue doesn't support SG gaps and adding this
 * offset would create a gap, disallow it.
 */

[PATCH 09/13] lightnvm/pblk-read: use bio_clone_fast()

2017-05-01 Thread NeilBrown

pblk_submit_read() uses bio_clone_bioset() but doesn't change the
io_vec, so bio_clone_fast() is a better choice.

It also uses fs_bio_set which is intended for filesystems.  Using it
in a device driver can deadlock.
So allocate a new bioset, and and use bio_clone_fast().

Signed-off-by: NeilBrown 
---
 drivers/lightnvm/pblk-init.c |   12 +++-
 drivers/lightnvm/pblk-read.c |2 +-
 drivers/lightnvm/pblk.h  |1 +
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index b3fec8ec55b8..aaefbccce30e 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -23,6 +23,7 @@
 static struct kmem_cache *pblk_blk_ws_cache, *pblk_rec_cache, *pblk_r_rq_cache,
*pblk_w_rq_cache, *pblk_line_meta_cache;
 static DECLARE_RWSEM(pblk_lock);
+struct bio_set *pblk_bio_set;
 
 static int pblk_rw_io(struct request_queue *q, struct pblk *pblk,
  struct bio *bio)
@@ -946,11 +947,20 @@ static struct nvm_tgt_type tt_pblk = {
 
 static int __init pblk_module_init(void)
 {
-   return nvm_register_tgt_type(&tt_pblk);
+   int ret;
+
+   pblk_bio_set = bioset_create(BIO_POOL_SIZE, 0, 0);
+   if (!pblk_bio_set)
+   return -ENOMEM;
+   ret = nvm_register_tgt_type(&tt_pblk);
+   if (ret)
+   bioset_free(pblk_bio_set);
+   return ret;
 }
 
 static void pblk_module_exit(void)
 {
+   bioset_free(pblk_bio_set);
nvm_unregister_tgt_type(&tt_pblk);
 }
 
diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
index 4a12f14d78c6..8ee0c50a7a71 100644
--- a/drivers/lightnvm/pblk-read.c
+++ b/drivers/lightnvm/pblk-read.c
@@ -342,7 +342,7 @@ int pblk_submit_read(struct pblk *pblk, struct bio *bio)
struct pblk_r_ctx *r_ctx = nvm_rq_to_pdu(rqd);
 
/* Clone read bio to deal with read errors internally */
-   int_bio = bio_clone_bioset(bio, GFP_KERNEL, fs_bio_set);
+   int_bio = bio_clone_fast(bio, GFP_KERNEL, pblk_bio_set);
if (!int_bio) {
pr_err("pblk: could not clone read bio\n");
return NVM_IO_ERR;
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 99f3186b5288..95b665f23925 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -702,6 +702,7 @@ void pblk_write_should_kick(struct pblk *pblk);
 /*
  * pblk read path
  */
+extern struct bio_set *pblk_bio_set;
 int pblk_submit_read(struct pblk *pblk, struct bio *bio);
 int pblk_submit_read_gc(struct pblk *pblk, u64 *lba_list, void *data,
unsigned int nr_secs, unsigned int *secs_to_gc,

[PATCH 05/13] block: Improvements to bounce-buffer handling

2017-05-01 Thread NeilBrown

Since commit 23688bf4f830 ("block: ensure to split after potentially
bouncing a bio") blk_queue_bounce() is called *before*
blk_queue_split().
This means that:
 1/ the comments blk_queue_split() about bounce buffers are
irrelevant, and
 2/ a very large bio (more than BIO_MAX_PAGES) will no longer be
split before it arrives at blk_queue_bounce(), leading to the
possibility that bio_clone_bioset() will fail and a NULL
will be dereferenced.

Separately, blk_queue_bounce() shouldn't use fs_bio_set as the bio
being copied could be from the same set, and this could lead to a
deadlock.

So:
 - allocate 2 private biosets for blk_queue_bounce, one for
   splitting enormous bios and one for cloning bios.
 - add code to split a bio that exceeds BIO_MAX_PAGES.
 - Fix up the comments in blk_queue_split()

Credit-to: Ming Lei  (suggested using single 
bio_for_each_segment loop)
Signed-off-by: NeilBrown 
---
 block/blk-merge.c |   14 --
 block/bounce.c|   32 ++--
 2 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index d59074556703..51c84540d3bb 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -117,17 +117,11 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
 * each holds at most BIO_MAX_PAGES bvecs because
 * bio_clone() can fail to allocate big bvecs.
 *
-* It should have been better to apply the limit per
-* request queue in which bio_clone() is involved,
-* instead of globally. The biggest blocker is the
-* bio_clone() in bio bounce.
+* Those drivers which will need to use bio_clone()
+* should tell us in some way.  For now, impose the
+* BIO_MAX_PAGES limit on all queues.
 *
-* If bio is splitted by this reason, we should have
-* allowed to continue bios merging, but don't do
-* that now for making the change simple.
-*
-* TODO: deal with bio bounce's bio_clone() gracefully
-* and convert the global limit into per-queue limit.
+* TODO: handle users of bio_clone() differently.
 */
if (bvecs++ >= BIO_MAX_PAGES)
goto split;
diff --git a/block/bounce.c b/block/bounce.c
index 1cb5dd3a5da1..087ecc2dc66c 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -26,6 +26,7 @@
 #define POOL_SIZE  64
 #define ISA_POOL_SIZE  16
 
+struct bio_set *bounce_bio_set, *bounce_bio_split;
 static mempool_t *page_pool, *isa_page_pool;
 
 #if defined(CONFIG_HIGHMEM) || defined(CONFIG_NEED_BOUNCE_POOL)
@@ -40,6 +41,14 @@ static __init int init_emergency_pool(void)
BUG_ON(!page_pool);
pr_info("pool size: %d pages\n", POOL_SIZE);
 
+   bounce_bio_set = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
+   BUG_ON(!bounce_bio_set);
+   if (bioset_integrity_create(bounce_bio_set, BIO_POOL_SIZE))
+   BUG_ON(1);
+
+   bounce_bio_split = bioset_create(BIO_POOL_SIZE, 0, 0);
+   BUG_ON(!bounce_bio_split);
+
return 0;
 }
 
@@ -186,15 +195,26 @@ static void __blk_queue_bounce(struct request_queue *q, 
struct bio **bio_orig,
int rw = bio_data_dir(*bio_orig);
struct bio_vec *to, from;
struct bvec_iter iter;
-   unsigned i;
+   unsigned i = 0;
+   bool bounce = false;
+   int sectors = 0;
 
-   bio_for_each_segment(from, *bio_orig, iter)
+   bio_for_each_segment(from, *bio_orig, iter) {
+   if (i++ < BIO_MAX_PAGES)
+   sectors += from.bv_len >> 9;
if (page_to_pfn(from.bv_page) > queue_bounce_pfn(q))
-   goto bounce;
+   bounce = true;
+   }
+   if (!bounce)
+   return;
 
-   return;
-bounce:
-   bio = bio_clone_bioset(*bio_orig, GFP_NOIO, fs_bio_set);
+   if (sectors < bio_sectors(*bio_orig)) {
+   bio = bio_split(*bio_orig, sectors, GFP_NOIO, bounce_bio_split);
+   bio_chain(bio, *bio_orig);
+   generic_make_request(*bio_orig);
+   *bio_orig = bio;
+   }
+   bio = bio_clone_bioset(*bio_orig, GFP_NOIO, bounce_bio_set);
 
bio_for_each_segment_all(to, bio, i) {
struct page *page = to->bv_page;

[PATCH 07/13] drbd: use bio_clone_fast() instead of bio_clone()

2017-05-01 Thread NeilBrown

drbd does not modify the bi_io_vec of the cloned bio,
so there is no need to clone that part.  So bio_clone_fast()
is the better choice.
For bio_clone_fast() we need to specify a bio_set.
We could use fs_bio_set, which bio_clone() uses, or
drbd_md_io_bio_set, which drbd uses for metadata, but it is
generally best to avoid sharing bio_sets unless you can
be certain that there are no interdependencies.

So create a new bio_set, drbd_io_bio_set, and use bio_clone_fast().

Also remove a "XXX cannot fail ???" comment because it definitely
cannot fail - bio_clone_fast() doesn't fail if the GFP flags allow for
sleeping.

Reviewed-by: Christoph Hellwig 
Signed-off-by: NeilBrown 
---
 drivers/block/drbd/drbd_int.h  |3 +++
 drivers/block/drbd/drbd_main.c |9 +
 drivers/block/drbd/drbd_req.h  |2 +-
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index d5da45bb03a6..f91982515a6b 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1441,6 +1441,9 @@ extern struct bio_set *drbd_md_io_bio_set;
 /* to allocate from that set */
 extern struct bio *bio_alloc_drbd(gfp_t gfp_mask);
 
+/* And a bio_set for cloning */
+extern struct bio_set *drbd_io_bio_set;
+
 extern struct mutex resources_mutex;
 
 extern int conn_lowest_minor(struct drbd_connection *connection);
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index b395fe391171..df99dc8bcb63 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -128,6 +128,7 @@ mempool_t *drbd_request_mempool;
 mempool_t *drbd_ee_mempool;
 mempool_t *drbd_md_io_page_pool;
 struct bio_set *drbd_md_io_bio_set;
+struct bio_set *drbd_io_bio_set;
 
 /* I do not use a standard mempool, because:
1) I want to hand out the pre-allocated objects first.
@@ -2098,6 +2099,8 @@ static void drbd_destroy_mempools(void)
 
/* D_ASSERT(device, atomic_read(&drbd_pp_vacant)==0); */
 
+   if (drbd_io_bio_set)
+   bioset_free(drbd_io_bio_set);
if (drbd_md_io_bio_set)
bioset_free(drbd_md_io_bio_set);
if (drbd_md_io_page_pool)
@@ -2115,6 +2118,7 @@ static void drbd_destroy_mempools(void)
if (drbd_al_ext_cache)
kmem_cache_destroy(drbd_al_ext_cache);
 
+   drbd_io_bio_set  = NULL;
drbd_md_io_bio_set   = NULL;
drbd_md_io_page_pool = NULL;
drbd_ee_mempool  = NULL;
@@ -2142,6 +2146,7 @@ static int drbd_create_mempools(void)
drbd_pp_pool = NULL;
drbd_md_io_page_pool = NULL;
drbd_md_io_bio_set   = NULL;
+   drbd_io_bio_set  = NULL;
 
/* caches */
drbd_request_cache = kmem_cache_create(
@@ -2165,6 +2170,10 @@ static int drbd_create_mempools(void)
goto Enomem;
 
/* mempools */
+   drbd_io_bio_set = bioset_create(BIO_POOL_SIZE, 0, 0);
+   if (drbd_io_bio_set == NULL)
+   goto Enomem;
+
drbd_md_io_bio_set = bioset_create(DRBD_MIN_POOL_PAGES, 0, 
BIOSET_NEED_BVECS);
if (drbd_md_io_bio_set == NULL)
goto Enomem;
diff --git a/drivers/block/drbd/drbd_req.h b/drivers/block/drbd/drbd_req.h
index eb49e7f2da91..9e1866ab238f 100644
--- a/drivers/block/drbd/drbd_req.h
+++ b/drivers/block/drbd/drbd_req.h
@@ -263,7 +263,7 @@ enum drbd_req_state_bits {
 static inline void drbd_req_make_private_bio(struct drbd_request *req, struct 
bio *bio_src)
 {
struct bio *bio;
-   bio = bio_clone(bio_src, GFP_NOIO); /* XXX cannot fail?? */
+   bio = bio_clone_fast(bio_src, GFP_NOIO, drbd_io_bio_set);
 
req->private_bio = bio;

[PATCH 02/13] blk: replace bioset_create_nobvec() with a flags arg to bioset_create()

2017-05-01 Thread NeilBrown

"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().

To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().

Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().

Suggested-by: Christoph Hellwig 
Signed-off-by: NeilBrown 
---
 block/bio.c |   60 +--
 block/blk-core.c|2 +
 drivers/block/drbd/drbd_main.c  |2 +
 drivers/md/bcache/super.c   |4 +-
 drivers/md/dm-crypt.c   |2 +
 drivers/md/dm-io.c  |2 +
 drivers/md/dm.c |2 +
 drivers/md/md.c |2 +
 drivers/md/raid1.c  |2 +
 drivers/md/raid10.c |2 +
 drivers/md/raid5-cache.c|2 +
 drivers/md/raid5-ppl.c  |2 +
 drivers/md/raid5.c  |2 +
 drivers/target/target_core_iblock.c |2 +
 fs/block_dev.c  |2 +
 fs/btrfs/extent_io.c|3 +-
 fs/xfs/xfs_super.c  |3 +-
 include/linux/bio.h |6 ++--
 18 files changed, 45 insertions(+), 57 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 888e7801c638..3f1286ed301e 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1921,9 +1921,26 @@ void bioset_free(struct bio_set *bs)
 }
 EXPORT_SYMBOL(bioset_free);
 
-static struct bio_set *__bioset_create(unsigned int pool_size,
-  unsigned int front_pad,
-  bool create_bvec_pool)
+/**
+ * bioset_create  - Create a bio_set
+ * @pool_size: Number of bio and bio_vecs to cache in the mempool
+ * @front_pad: Number of bytes to allocate in front of the returned bio
+ * @flags: Flags to modify behavior, currently only %BIOSET_NEED_BVECS
+ *
+ * Description:
+ *Set up a bio_set to be used with @bio_alloc_bioset. Allows the caller
+ *to ask for a number of bytes to be allocated in front of the bio.
+ *Front pad allocation is useful for embedding the bio inside
+ *another structure, to avoid allocating extra data to go with the bio.
+ *Note that the bio must be embedded at the END of that structure always,
+ *or things will break badly.
+ *If %BIOSET_NEED_BVECS is set in @flags, a separate pool will be allocated
+ *for allocating iovecs.  This pool is not needed e.g. for 
bio_clone_fast().
+ *
+ */
+struct bio_set *bioset_create(unsigned int pool_size,
+ unsigned int front_pad,
+ int flags)
 {
unsigned int back_pad = BIO_INLINE_VECS * sizeof(struct bio_vec);
struct bio_set *bs;
@@ -1948,7 +1965,7 @@ static struct bio_set *__bioset_create(unsigned int 
pool_size,
if (!bs->bio_pool)
goto bad;
 
-   if (create_bvec_pool) {
+   if (flags & BIOSET_NEED_BVECS) {
bs->bvec_pool = biovec_create_pool(pool_size);
if (!bs->bvec_pool)
goto bad;
@@ -1963,41 +1980,8 @@ static struct bio_set *__bioset_create(unsigned int 
pool_size,
bioset_free(bs);
return NULL;
 }
-
-/**
- * bioset_create  - Create a bio_set
- * @pool_size: Number of bio and bio_vecs to cache in the mempool
- * @front_pad: Number of bytes to allocate in front of the returned bio
- *
- * Description:
- *Set up a bio_set to be used with @bio_alloc_bioset. Allows the caller
- *to ask for a number of bytes to be allocated in front of the bio.
- *Front pad allocation is useful for embedding the bio inside
- *another structure, to avoid allocating extra data to go with the bio.
- *Note that the bio must be embedded at the END of that structure always,
- *or things will break badly.
- */
-struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
-{
-   return __bioset_create(pool_size, front_pad, true);
-}
 EXPORT_SYMBOL(bioset_create);
 
-/**
- * bioset_create_nobvec  - Create a bio_set without bio_vec mempool
- * @pool_size: Number of bio to cache in the mempool
- * @front_pad: Number of bytes to allocate in front of the returned bio
- *
- * Description:
- *Same functionality as bioset_create() except that mempool is not
- *created for bio_vecs. Saving some memory for bio_clone_fast() users.
- */
-struct bio_set *bioset_create_nobvec(unsigned int pool_size, unsigned int 
front_pad)
-{
-   return __bioset_create(pool_size, front_pad, false);
-}
-EXPORT_SYMBOL(bioset_create_nobvec);
-
 #ifdef CONFIG_BLK_CGROUP
 
 /**
@@ -2112,7 +2096,7 @@ static int __init init_bio(void)
bio_integrity_init();
biovec_init_slabs();
 
-   fs_bio_set =

[PATCH 00/13] block: assorted cleanup for bio splitting and cloning.

2017-05-01 Thread NeilBrown

This is a revision of my series of patches working
towards removing the bioset work queues.

This set is based on Linus' tree as for today (2nd May) plus
the for-linus branch from Shaohua's md/raid tree.

This series adds a fix for the new lightnvm/pblk-read code
and discards bioset_create_nobvec() in favor of a flag arg to
bioset_create().  There are also minor fixes and a little
code clean-up.

I hope to eventually get rid of the new BIOSET_NEED_RESCUER flag,
but that needs work ing dm and probably bcache first.

Thanks,
NeilBrown


---

NeilBrown (13):
  blk: remove bio_set arg from blk_queue_split()
  blk: replace bioset_create_nobvec() with a flags arg to bioset_create()
  blk: make the bioset rescue_workqueue optional.
  blk: use non-rescuing bioset for q->bio_split.
  block: Improvements to bounce-buffer handling
  rbd: use bio_clone_fast() instead of bio_clone()
  drbd: use bio_clone_fast() instead of bio_clone()
  pktcdvd: use bio_clone_fast() instead of bio_clone()
  lightnvm/pblk-read: use bio_clone_fast()
  xen-blkfront: remove bio splitting.
  bcache: use kmalloc to allocate bio in bch_data_verify()
  block: remove bio_clone() and all references.
  block: don't check for BIO_MAX_PAGES in blk_bio_segment_split()


 Documentation/block/biodoc.txt  |2 -
 block/bio.c |   72 ---
 block/blk-core.c|4 +-
 block/blk-merge.c   |   31 ++-
 block/blk-mq.c  |2 -
 block/bounce.c  |   32 +---
 drivers/block/drbd/drbd_int.h   |3 +
 drivers/block/drbd/drbd_main.c  |   11 +
 drivers/block/drbd/drbd_req.c   |2 -
 drivers/block/drbd/drbd_req.h   |2 -
 drivers/block/pktcdvd.c |   14 +--
 drivers/block/ps3vram.c |2 -
 drivers/block/rbd.c |   16 +++-
 drivers/block/rsxx/dev.c|2 -
 drivers/block/umem.c|2 -
 drivers/block/xen-blkfront.c|   54 +-
 drivers/block/zram/zram_drv.c   |2 -
 drivers/lightnvm/pblk-init.c|   16 ++--
 drivers/lightnvm/pblk-read.c|2 -
 drivers/lightnvm/pblk.h |1 
 drivers/lightnvm/rrpc.c |2 -
 drivers/md/bcache/debug.c   |2 -
 drivers/md/bcache/super.c   |6 ++-
 drivers/md/dm-crypt.c   |2 -
 drivers/md/dm-io.c  |2 -
 drivers/md/dm.c |5 +-
 drivers/md/md.c |6 +--
 drivers/md/raid1.c  |2 -
 drivers/md/raid10.c |2 -
 drivers/md/raid5-cache.c|2 -
 drivers/md/raid5-ppl.c  |2 -
 drivers/md/raid5.c  |2 -
 drivers/s390/block/dcssblk.c|2 -
 drivers/s390/block/xpram.c  |2 -
 drivers/target/target_core_iblock.c |2 -
 fs/block_dev.c  |2 -
 fs/btrfs/extent_io.c|3 +
 fs/xfs/xfs_super.c  |3 +
 include/linux/bio.h |   12 ++
 include/linux/blkdev.h  |3 -
 40 files changed, 162 insertions(+), 174 deletions(-)

--
Signature

Re: [PATCH 02/11] blk: make the bioset rescue_workqueue optional.

2017-05-01 Thread NeilBrown

On Mon, May 01 2017, Jens Axboe wrote:

> On 04/30/2017 11:00 PM, NeilBrown wrote:
>> On Mon, Apr 24 2017, Christoph Hellwig wrote:
>> 
>>> On Mon, Apr 24, 2017 at 11:51:01AM +1000, NeilBrown wrote:

 I was following the existing practice exemplified by
 bioset_create_nobvec().
>>>
>>> Which is pretty ugly to start with..
>> 
>> That is a matter of personal taste.
>> As such, it is up to the maintainer to change it if they want it
>> changed.
>> 
>>>
 By not changing the signature of the function, I can avoid touching
 quite a few places where it is called.
>>>
>>> There are 13 callers of bioset_create and one caller of
>>> bioset_create_nobvec, and your series touches many of those.
>>>
>>> So just adding a flags argument to bioset_create and passing
>>> BIOSET_NEED_BVECS and BIOSET_NEED_RESUER flags to it doesn't seem
>>> to much of an effort, and it's going to create a much nicer and easier
>>> to extend interface.
>> 
>> If someone else submitted a patch to discard bioset_create_nobvec in
>> favour of BIOSET_NEED_BVECS and got it accepted, then I would rebase my
>> series on that.  As it is, I'm basing my patches on the style currently
>> present in the tree.
>> 
>> Of course, if Jens says he'll only take my patches if I change to style
>> to match your preference, I'll do that.
>
> I generally tend to prefer tree wide cleanups to improve our APIs, even
> if it does cause an extra bit of pain. Would you mind doing that as a
> prep patch?

OK, will do.

I have rebased and fixed up a couple of issues.  Will repost shortly.

Thanks,
NeilBrown


signature.asc
Description: PGP signature

Re: [PATCH 6/6] blk-mq-debugfs: Add 'kick' operation

2017-05-01 Thread Jens Axboe

On 05/01/2017 06:19 PM, Omar Sandoval wrote:
> On Thu, Apr 27, 2017 at 08:54:37AM -0700, Bart Van Assche wrote:
>> Running a queue causes the block layer to examine the per-CPU and
>> hw queues but not the requeue list. Hence add a 'kick' operation
>> that also examines the requeue list.
> 
> The naming of these operations isn't super intuitive, but it makes
> enough sense if you know the code, I guess.

I don't worry about that too much, but I do think it's important
that we have some way of knowing WHAT commands are available
for a given kernel, without having to consult the source. It's
no big deal you're running the latest and greatest debug kernels,
but it's a bigger issue if you're debugging kernel x.y.z for
a customer and you have to consult the source to find them.

Can we include it in the show output?

-- 
Jens Axboe

Re: [PATCH 6/6] blk-mq-debugfs: Add 'kick' operation

2017-05-01 Thread Omar Sandoval

On Thu, Apr 27, 2017 at 08:54:37AM -0700, Bart Van Assche wrote:
> Running a queue causes the block layer to examine the per-CPU and
> hw queues but not the requeue list. Hence add a 'kick' operation
> that also examines the requeue list.

The naming of these operations isn't super intuitive, but it makes
enough sense if you know the code, I guess.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Bart Van Assche 
> Cc: Omar Sandoval 
> Cc: Hannes Reinecke 
> ---
>  block/blk-mq-debugfs.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index a5e286e04569..aeca26c739d1 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -120,8 +120,10 @@ static ssize_t blk_queue_flags_store(struct file *file, 
> const char __user *ubuf,
>   blk_mq_run_hw_queues(q, true);
>   } else if (strcmp(op, "start") == 0) {
>   blk_mq_start_stopped_hw_queues(q, true);
> + } else if (strcmp(op, "kick") == 0) {
> + blk_mq_kick_requeue_list(q);
>   } else {
> - pr_err("%s: unsupported operation %s. Use either 'run' or 
> 'start'\n",
> + pr_err("%s: unsupported operation %s. Use 'run', 'start' or 
> 'kick'\n",
>  __func__, op);
>   return -EINVAL;
>   }
> -- 
> 2.12.2
>

Re: [PATCH 5/6] blk-mq-debugfs: Show busy requests

2017-05-01 Thread Omar Sandoval

On Thu, Apr 27, 2017 at 08:54:36AM -0700, Bart Van Assche wrote:
> Requests that got stuck in a block driver are neither on
> blk_mq_ctx.rq_list nor on any hw dispatch queue. Make these
> visible in debugfs through the "busy" attribute.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Omar Sandoval 
> Cc: Hannes Reinecke 
> ---
>  block/blk-mq-debugfs.c | 38 ++
>  1 file changed, 38 insertions(+)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 5092d90e37f6..a5e286e04569 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -428,6 +428,43 @@ static const struct file_operations hctx_dispatch_fops = 
> {
>   .release= seq_release,
>  };
>  
> +struct show_busy_ctx {
> + struct seq_file *m;
> + struct blk_mq_hw_ctx*hctx;
> +};
> +
> +static void hctx_show_busy(struct request *rq, void *data, bool reserved)
> +{
> + const struct show_busy_ctx *ctx = data;
> + struct request_queue *q = rq->q;
> +
> + if (q->queue_hw_ctx[rq->mq_ctx->index_hw] == ctx->hctx &&

This doesn't look right. ctx->index_hw is the index into hctx->ctxs for
this ctx. I think you mean blk_mq_map_queue(q, rq->mq_ctx->cpu).

> + test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> + blk_mq_debugfs_rq_show(ctx->m, &rq->queuelist);
> +}
> +
> +static int hctx_busy_show(struct seq_file *m, void *v)
> +{
> + struct blk_mq_hw_ctx *hctx = m->private;
> + struct show_busy_ctx ctx = { .m = m, .hctx = hctx };
> +
> + blk_mq_tagset_busy_iter(hctx->queue->tag_set, hctx_show_busy, &ctx);
> +
> + return 0;
> +}
> +
> +static int hctx_busy_open(struct inode *inode, struct file *file)
> +{
> + return single_open(file, hctx_busy_show, inode->i_private);
> +}
> +
> +static const struct file_operations hctx_busy_fops = {
> + .open   = hctx_busy_open,
> + .read   = seq_read,
> + .llseek = seq_lseek,
> + .release= single_release,
> +};
> +
>  static int hctx_ctx_map_show(struct seq_file *m, void *v)
>  {
>   struct blk_mq_hw_ctx *hctx = m->private;
> @@ -885,6 +922,7 @@ static const struct blk_mq_debugfs_attr 
> blk_mq_debugfs_hctx_attrs[] = {
>   {"state", 0400, &hctx_state_fops},
>   {"flags", 0400, &hctx_flags_fops},
>   {"dispatch", 0400, &hctx_dispatch_fops},
> + {"busy", 0400, &hctx_busy_fops},
>   {"ctx_map", 0400, &hctx_ctx_map_fops},
>   {"tags", 0400, &hctx_tags_fops},
>   {"tags_bitmap", 0400, &hctx_tags_bitmap_fops},
> -- 
> 2.12.2
>

[PATCH] block: don't call blk_mq_quiesce_queue() after queue is freezed

2017-05-01 Thread Ming Lei

After queue is freezed, no request in this queue can be in use
at all, so there can't be any .queue_rq() is running on this queue.
It isn't necessary to call blk_mq_quiesce_queue() any more, so
remove it in both elevator_switch_mq() and blk_mq_update_nr_requests().

Cc: Bart Van Assche 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c   | 2 --
 block/elevator.c | 3 ---
 2 files changed, 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index bf90684a007a..8f72f16b498a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2617,7 +2617,6 @@ int blk_mq_update_nr_requests(struct request_queue *q, 
unsigned int nr)
return -EINVAL;
 
blk_mq_freeze_queue(q);
-   blk_mq_quiesce_queue(q);
 
ret = 0;
queue_for_each_hw_ctx(q, hctx, i) {
@@ -2643,7 +2642,6 @@ int blk_mq_update_nr_requests(struct request_queue *q, 
unsigned int nr)
q->nr_requests = nr;
 
blk_mq_unfreeze_queue(q);
-   blk_mq_start_stopped_hw_queues(q, true);
 
return ret;
 }
diff --git a/block/elevator.c b/block/elevator.c
index bf11e70f008b..c7a4ee682033 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -950,7 +950,6 @@ static int elevator_switch_mq(struct request_queue *q,
int ret;
 
blk_mq_freeze_queue(q);
-   blk_mq_quiesce_queue(q);
 
if (q->elevator) {
if (q->elevator->registered)
@@ -978,9 +977,7 @@ static int elevator_switch_mq(struct request_queue *q,
 
 out:
blk_mq_unfreeze_queue(q);
-   blk_mq_start_stopped_hw_queues(q, true);
return ret;
-
 }
 
 /*
-- 
2.9.3

Re: [PATCH 25/27] block: remove the discard_zeroes_data flag

2017-05-01 Thread Bart Van Assche

On Wed, 2017-04-05 at 19:21 +0200, Christoph Hellwig wrote:
> Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
> kill this hack.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Martin K. Petersen 
> Reviewed-by: Hannes Reinecke 
> [ ... ]
> diff --git a/drivers/target/target_core_device.c 
> b/drivers/target/target_core_device.c
> index c754ae33bf7b..d2f089cfa9ae 100644
> --- a/drivers/target/target_core_device.c
> +++ b/drivers/target/target_core_device.c
> @@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct 
> se_dev_attrib *attrib,
>   attrib->unmap_granularity = q->limits.discard_granularity / block_size;
>   attrib->unmap_granularity_alignment = q->limits.discard_alignment /
>   block_size;
> - attrib->unmap_zeroes_data = q->limits.discard_zeroes_data;
> + attrib->unmap_zeroes_data = 0;
>   return true;
>  }
>  EXPORT_SYMBOL(target_configure_unmap_from_queue);

Hello Christoph,

Sorry that I hadn't noticed this before but I think that this patch
introduces a significant performance regressions for LIO users. Before
this patch the LBPRZ flag was reported correctly to initiator systems
through the thin provisioning VPD. With this patch applied that flag
will always be reported as zero, forcing initiators to submit WRITE
commands with zeroed data buffers instead of submitting the SCSI UNMAP
command to block devices for which discard_zeroes_data was set. From
target_core_spc.c:

/* Thin Provisioning VPD */
static sense_reason_t spc_emulate_evpd_b2(struct se_cmd *cmd, unsigned char 
*buf)
{
[ ... ]
/*
 * The unmap_zeroes_data set means that the underlying device supports
 * REQ_DISCARD and has the discard_zeroes_data bit set. This satisfies
 * the SBC requirements for LBPRZ, meaning that a subsequent read
 * will return zeroes after an UNMAP or WRITE SAME (16) to an LBA
 * See sbc4r36 6.6.4.
 */
if (((dev->dev_attrib.emulate_tpu != 0) ||
 (dev->dev_attrib.emulate_tpws != 0)) &&
 (dev->dev_attrib.unmap_zeroes_data != 0))
buf[5] |= 0x04;
[ ... ]
}

Please advise how to proceed.

Thanks,

Bart.

[PATCH v2 08/10] dm-linear: Add support for zoned block devices

2017-05-01 Thread damien . lemoal

From: Damien Le Moal 

Add support for zoned block devices by allowing host-managed zoned block
device mapped targets, the remapping of REQ_OP_ZONE_RESET and the post
processing (reply remapping) of REQ_OP_ZONE_REPORT.

Signed-off-by: Damien Le Moal 
Reviewed-by: Hannes Reinecke 
---
 drivers/md/dm-linear.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
index 4788b0b..9c4debd 100644
--- a/drivers/md/dm-linear.c
+++ b/drivers/md/dm-linear.c
@@ -87,7 +87,7 @@ static void linear_map_bio(struct dm_target *ti, struct bio 
*bio)
struct linear_c *lc = ti->private;
 
bio->bi_bdev = lc->dev->bdev;
-   if (bio_sectors(bio))
+   if (bio_sectors(bio) || bio_op(bio) == REQ_OP_ZONE_RESET)
bio->bi_iter.bi_sector =
linear_map_sector(ti, bio->bi_iter.bi_sector);
 }
@@ -99,6 +99,16 @@ static int linear_map(struct dm_target *ti, struct bio *bio)
return DM_MAPIO_REMAPPED;
 }
 
+static int linear_end_io(struct dm_target *ti, struct bio *bio, int error)
+{
+   struct linear_c *lc = ti->private;
+
+   if (!error && bio_op(bio) == REQ_OP_ZONE_REPORT)
+   dm_remap_zone_report(ti, bio, lc->start);
+
+   return error;
+}
+
 static void linear_status(struct dm_target *ti, status_type_t type,
  unsigned status_flags, char *result, unsigned maxlen)
 {
@@ -162,10 +172,12 @@ static long linear_direct_access(struct dm_target *ti, 
sector_t sector,
 static struct target_type linear_target = {
.name   = "linear",
.version = {1, 3, 0},
+   .features = DM_TARGET_ZONED_HM,
.module = THIS_MODULE,
.ctr= linear_ctr,
.dtr= linear_dtr,
.map= linear_map,
+   .end_io = linear_end_io,
.status = linear_status,
.prepare_ioctl = linear_prepare_ioctl,
.iterate_devices = linear_iterate_devices,
-- 
2.9.3

[PATCH v2 09/10] dm-kcopyd: Add sequential write feature

2017-05-01 Thread damien . lemoal

From: Damien Le Moal 

When copyying blocks to host-managed zoned block devices, writes must be
sequential. dm_kcopyd_copy() does not howerver guarantee this as writes
are issued in the completion order of reads, and reads may complete out
of order despite being issued sequentially.

Fix this by introducing the DM_KCOPYD_WRITE_SEQ flag. This can be
specified by the user when calling dm_kcopyd_copy() and is set
automatically if one of the destinations is a host-managed zoned block
device. For a split job, the master job maintains the write position at
which writes must be issued. This is checked with the pop() function
which is modify to not return any write I/O sub job that is not at the
correct write position.

When DM_KCOPYD_WRITE_SEQ is specified for a job, errors cannot be
ignored and the flag DM_KCOPYD_IGNORE_ERROR is ignored, even if
specified by the user.

Signed-off-by: Damien Le Moal 
Reviewed-by: Hannes Reinecke 
---
 drivers/md/dm-kcopyd.c| 68 +--
 include/linux/dm-kcopyd.h |  1 +
 2 files changed, 67 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index 9e9d04cb..477846e 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -356,6 +356,7 @@ struct kcopyd_job {
struct mutex lock;
atomic_t sub_jobs;
sector_t progress;
+   sector_t write_ofst;
 
struct kcopyd_job *master_job;
 };
@@ -386,6 +387,34 @@ void dm_kcopyd_exit(void)
  * Functions to push and pop a job onto the head of a given job
  * list.
  */
+static struct kcopyd_job *pop_io_job(struct list_head *jobs,
+struct dm_kcopyd_client *kc)
+{
+   struct kcopyd_job *job;
+
+   /*
+* For I/O jobs, pop any read, any write without sequential write
+* constraint and sequential writes that are at the right position.
+*/
+   list_for_each_entry(job, jobs, list) {
+
+   if (job->rw == READ ||
+   !test_bit(DM_KCOPYD_WRITE_SEQ, &job->flags)) {
+   list_del(&job->list);
+   return job;
+   }
+
+   if (job->write_ofst == job->master_job->write_ofst) {
+   job->master_job->write_ofst += job->source.count;
+   list_del(&job->list);
+   return job;
+   }
+
+   }
+
+   return NULL;
+}
+
 static struct kcopyd_job *pop(struct list_head *jobs,
  struct dm_kcopyd_client *kc)
 {
@@ -395,8 +424,12 @@ static struct kcopyd_job *pop(struct list_head *jobs,
spin_lock_irqsave(&kc->job_lock, flags);
 
if (!list_empty(jobs)) {
-   job = list_entry(jobs->next, struct kcopyd_job, list);
-   list_del(&job->list);
+   if (jobs == &kc->io_jobs) {
+   job = pop_io_job(jobs, kc);
+   } else {
+   job = list_entry(jobs->next, struct kcopyd_job, list);
+   list_del(&job->list);
+   }
}
spin_unlock_irqrestore(&kc->job_lock, flags);
 
@@ -506,6 +539,14 @@ static int run_io_job(struct kcopyd_job *job)
.client = job->kc->io_client,
};
 
+   /*
+* If we need to write sequentially and some reads or writes failed,
+* no point in continuing.
+*/
+   if (test_bit(DM_KCOPYD_WRITE_SEQ, &job->flags) &&
+   job->master_job->write_err)
+   return -EIO;
+
io_job_start(job->kc->throttle);
 
if (job->rw == READ)
@@ -655,6 +696,7 @@ static void segment_complete(int read_err, unsigned long 
write_err,
int i;
 
*sub_job = *job;
+   sub_job->write_ofst = progress;
sub_job->source.sector += progress;
sub_job->source.count = count;
 
@@ -723,6 +765,27 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct 
dm_io_region *from,
job->num_dests = num_dests;
memcpy(&job->dests, dests, sizeof(*dests) * num_dests);
 
+   /*
+* If one of the destination is a host-managed zoned block device,
+* we need to write sequentially. If one of the destination is a
+* host-aware device, then leave it to the caller to choose what to do.
+*/
+   if (!test_bit(DM_KCOPYD_WRITE_SEQ, &job->flags)) {
+   for (i = 0; i < job->num_dests; i++) {
+   if (bdev_zoned_model(dests[i].bdev) == BLK_ZONED_HM) {
+   set_bit(DM_KCOPYD_WRITE_SEQ, &job->flags);
+   break;
+   }
+   }
+   }
+
+   /*
+* If we need to write sequentially, errors cannot be ignored.
+*/
+   if (test_bit(DM_KCOPYD_WRITE_SEQ, &job->flags) &&
+   test_bit(DM_KCOPYD_IGNORE_ERROR, &job->flags))
+   clear_bit(DM_KCOPYD_IGNORE_E

[PATCH v2 07/10] dm-flakey: Add support for zoned block devices

2017-05-01 Thread damien . lemoal

From: Damien Le Moal 

With the development of file system support for zoned block devices
(e.g. f2fs), having dm-flakey support for these devices is interesting
to improve testing.

This patch adds support for zoned block devices in dm-flakey, both
host-aware and host-managed. The target type feature is set to
DM_TARGET_ZONED_HM indicate support for host-managed models. The
remaining of the support adds hooks for remapping of REQ_OP_ZONE_RESET
and REQ_OP_ZONE_REPORT bios. Additionally, in the bio completion path,
(backward) remapping of a zone report reply is also added.

Signed-off-by: Damien Le Moal 
Reviewed-by: Hannes Reinecke 
---
 drivers/md/dm-flakey.c | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-flakey.c b/drivers/md/dm-flakey.c
index 13305a1..b419c85 100644
--- a/drivers/md/dm-flakey.c
+++ b/drivers/md/dm-flakey.c
@@ -251,6 +251,8 @@ static int flakey_ctr(struct dm_target *ti, unsigned int 
argc, char **argv)
return 0;
 
 bad:
+   if (fc->dev)
+   dm_put_device(ti, fc->dev);
kfree(fc);
return r;
 }
@@ -275,7 +277,7 @@ static void flakey_map_bio(struct dm_target *ti, struct bio 
*bio)
struct flakey_c *fc = ti->private;
 
bio->bi_bdev = fc->dev->bdev;
-   if (bio_sectors(bio))
+   if (bio_sectors(bio) || bio_op(bio) == REQ_OP_ZONE_RESET)
bio->bi_iter.bi_sector =
flakey_map_sector(ti, bio->bi_iter.bi_sector);
 }
@@ -306,6 +308,14 @@ static int flakey_map(struct dm_target *ti, struct bio 
*bio)
struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct 
per_bio_data));
pb->bio_submitted = false;
 
+   /* Do not fail reset zone */
+   if (bio_op(bio) == REQ_OP_ZONE_RESET)
+   goto map_bio;
+
+   /* We need to remap reported zones, so remember the BIO iter */
+   if (bio_op(bio) == REQ_OP_ZONE_REPORT)
+   goto map_bio;
+
/* Are we alive ? */
elapsed = (jiffies - fc->start_time) / HZ;
if (elapsed % (fc->up_interval + fc->down_interval) >= fc->up_interval) 
{
@@ -363,6 +373,14 @@ static int flakey_end_io(struct dm_target *ti, struct bio 
*bio, int error)
struct flakey_c *fc = ti->private;
struct per_bio_data *pb = dm_per_bio_data(bio, sizeof(struct 
per_bio_data));
 
+   if (bio_op(bio) == REQ_OP_ZONE_RESET)
+   return error;
+
+   if (bio_op(bio) == REQ_OP_ZONE_REPORT) {
+   dm_remap_zone_report(ti, bio, fc->start);
+   return error;
+   }
+
if (!error && pb->bio_submitted && (bio_data_dir(bio) == READ)) {
if (fc->corrupt_bio_byte && (fc->corrupt_bio_rw == READ) &&
all_corrupt_bio_flags_match(bio, fc)) {
@@ -446,6 +464,7 @@ static int flakey_iterate_devices(struct dm_target *ti, 
iterate_devices_callout_
 static struct target_type flakey_target = {
.name   = "flakey",
.version = {1, 4, 0},
+   .features = DM_TARGET_ZONED_HM,
.module = THIS_MODULE,
.ctr= flakey_ctr,
.dtr= flakey_dtr,
-- 
2.9.3

[PATCH v2 06/10] dm: Introduce dm_remap_zone_report()

2017-05-01 Thread damien . lemoal

From: Damien Le Moal 

A target driver support zoned block devices and exposing it as such may
receive REQ_OP_ZONE_REPORT request for the user to determine the mapped
device zone configuration. To process properly such request, the target
driver may need to remap the zone descriptors provided in the report
reply. The helper function dm_remap_zone_report() does this generically
using only the target start offset and length and the start offset
within the target device.

dm_remap_zone_report() will remap the start sector of all zones
reported. If the report includes sequential zones, the write pointer
position of these zones will also be remapped.

Signed-off-by: Damien Le Moal 
Reviewed-by: Hannes Reinecke 
---
 drivers/md/dm.c   | 80 +++
 include/linux/device-mapper.h | 10 ++
 2 files changed, 90 insertions(+)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index cd44928..1f6558e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -975,6 +975,86 @@ void dm_accept_partial_bio(struct bio *bio, unsigned 
n_sectors)
 }
 EXPORT_SYMBOL_GPL(dm_accept_partial_bio);
 
+#ifdef CONFIG_BLK_DEV_ZONED
+/*
+ * The zone descriptors obtained with a zone report indicate
+ * zone positions within the target device. The zone descriptors
+ * must be remapped to match their position within the dm device.
+ * A target may call dm_remap_zone_report after completion of a
+ * REQ_OP_ZONE_REPORT bio to remap the zone descriptors obtained
+ * from the target device mapping to the dm device.
+ */
+void dm_remap_zone_report(struct dm_target *ti, struct bio *bio, sector_t 
start)
+{
+   struct dm_target_io *tio =
+   container_of(bio, struct dm_target_io, clone);
+   struct bio *report_bio = tio->io->bio;
+   struct blk_zone_report_hdr *hdr = NULL;
+   struct blk_zone *zone;
+   unsigned int nr_rep = 0;
+   unsigned int ofst;
+   struct bio_vec bvec;
+   struct bvec_iter iter;
+   void *addr;
+
+   if (bio->bi_error)
+   return;
+
+   /*
+* Remap the start sector of the reported zones. For sequential zones,
+* also remap the write pointer position.
+*/
+   bio_for_each_segment(bvec, report_bio, iter) {
+
+   addr = kmap_atomic(bvec.bv_page);
+
+   /* Remember the report header in the first page */
+   if (!hdr) {
+   hdr = addr;
+   ofst = sizeof(struct blk_zone_report_hdr);
+   } else {
+   ofst = 0;
+   }
+
+   /* Set zones start sector */
+   while (hdr->nr_zones && ofst < bvec.bv_len) {
+   zone = addr + ofst;
+   if (zone->start >= start + ti->len) {
+   hdr->nr_zones = 0;
+   break;
+   }
+   zone->start = zone->start + ti->begin - start;
+   if (zone->type != BLK_ZONE_TYPE_CONVENTIONAL) {
+   if (zone->cond == BLK_ZONE_COND_FULL)
+   zone->wp = zone->start + zone->len;
+   else if (zone->cond == BLK_ZONE_COND_EMPTY)
+   zone->wp = zone->start;
+   else
+   zone->wp = zone->wp + ti->begin - start;
+   }
+   ofst += sizeof(struct blk_zone);
+   hdr->nr_zones--;
+   nr_rep++;
+   }
+
+   if (addr != hdr)
+   kunmap_atomic(addr);
+
+   if (!hdr->nr_zones)
+   break;
+
+   }
+
+   if (hdr) {
+   hdr->nr_zones = nr_rep;
+   kunmap_atomic(hdr);
+   }
+
+   bio_advance(report_bio, report_bio->bi_iter.bi_size);
+}
+EXPORT_SYMBOL_GPL(dm_remap_zone_report);
+#endif
+
 /*
  * Flush current->bio_list when the target map method blocks.
  * This fixes deadlocks in snapshot and possibly in other targets.
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index b3c2408..d21c761 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -433,6 +433,16 @@ struct gendisk *dm_disk(struct mapped_device *md);
 int dm_suspended(struct dm_target *ti);
 int dm_noflush_suspending(struct dm_target *ti);
 void dm_accept_partial_bio(struct bio *bio, unsigned n_sectors);
+#ifdef CONFIG_BLK_DEV_ZONED
+void dm_remap_zone_report(struct dm_target *ti, struct bio *bio,
+ sector_t start);
+#else
+static inline void dm_remap_zone_report(struct dm_target *ti, struct bio *bio,
+   sector_t start)
+{
+   bio->bi_error = -ENOTSUPP;
+}
+#endif
 union map_info *dm_get_rq_mapinfo(struct request *rq);
 
 struct queue_limits *dm_get_queue_li

[PATCH v2 04/10] dm: Fix REQ_OP_ZONE_RESET bio handling

2017-05-01 Thread damien . lemoal

From: Damien Le Moal 

The REQ_OP_ZONE_RESET bio has no payload and zero sectors. Its position
is the only information used to indicate the zone to reset on the
device. Due to its zero length, this bio is not cloned and sent to the
target through the non-flush case in __split_and_process_bio().
Add an additional case in that function to call
__split_and_process_non_flush() without checking the clone info size.

Signed-off-by: Damien Le Moal 
Reviewed-by: Hannes Reinecke 
---
 drivers/md/dm.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dfb7597..1d98035 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1318,6 +1318,10 @@ static void __split_and_process_bio(struct mapped_device 
*md,
ci.sector_count = 0;
error = __send_empty_flush(&ci);
/* dec_pending submits any data associated with flush */
+   } else if (bio_op(bio) == REQ_OP_ZONE_RESET) {
+   ci.bio = bio;
+   ci.sector_count = 0;
+   error = __split_and_process_non_flush(&ci);
} else {
ci.bio = bio;
ci.sector_count = bio_sectors(bio);
-- 
2.9.3

[PATCH v2 03/10] dm-table: Check block devices zone model compatibility

2017-05-01 Thread damien . lemoal

From: Damien Le Moal 

When setting the dm device queue limits, several possibilities exists
for zoned block devices:
1) The dm target driver may want to expose a different zone model (e.g.
host-managed device emulation or regular block device on top of
host-managed zoned block devices)
2) Expose the underlying zone model of the devices as is

To allow both cases, the underlying block device zone model must be set
in the target limits in dm_set_device_limits() and the compatibility of
all devices checked similarly to the logical block size alignment. For
this last check, introduce the function validate_hardware_zone_model()
to check that all targets of a table have the same zone model and that
the zone size of the target devices are equal.

Signed-off-by: Damien Le Moal 
Reviewed-by: Hannes Reinecke 
---
 drivers/md/dm-table.c | 93 +++
 1 file changed, 93 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 6947f0f..cc89a78 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -505,6 +505,8 @@ static int dm_set_device_limits(struct dm_target *ti, 
struct dm_dev *dev,
   q->limits.alignment_offset,
   (unsigned long long) start << SECTOR_SHIFT);
 
+   limits->zoned = bdev_zoned_model(bdev);
+
return 0;
 }
 
@@ -720,6 +722,94 @@ static int 
validate_hardware_logical_block_alignment(struct dm_table *table,
return 0;
 }
 
+/*
+ * Check a devices's table for compatibility between zoned devices used by
+ * the table targets. The zone model may come directly from a target block
+ * device or may have been set by the target using the io_hints method.
+ * Overall, if any of the table device targets is advertized as a zoned
+ * block device, then all targets devices should also be advertized as
+ * using the same model and the devices zone size all equal.
+ */
+static int validate_hardware_zone_model(struct dm_table *table,
+   struct queue_limits *limits)
+{
+   struct dm_target *ti;
+   struct queue_limits ti_limits;
+   unsigned int zone_sectors = limits->chunk_sectors;
+   unsigned int num_targets = dm_table_get_num_targets(table);
+   int zone_model = -1;
+   unsigned int i;
+
+   if (!num_targets)
+   return 0;
+
+   /*
+* Check each entry in the table in turn.
+*/
+   for (i = 0; i < num_targets; i++) {
+
+   ti = dm_table_get_target(table, i);
+
+   /* Get the target device limits */
+   blk_set_stacking_limits(&ti_limits);
+   if (ti->type->iterate_devices)
+   ti->type->iterate_devices(ti, dm_set_device_limits,
+ &ti_limits);
+
+   /*
+* Let the target driver change the hardware limits, and
+* in particular the zone model if needed.
+*/
+   if (ti->type->io_hints)
+   ti->type->io_hints(ti, &ti_limits);
+
+   /* Check zone model compatibility */
+   if (zone_model == -1)
+   zone_model = ti_limits.zoned;
+   if (ti_limits.zoned != zone_model) {
+   zone_model = -1;
+   break;
+   }
+
+   if (zone_model != BLK_ZONED_NONE) {
+   /* Check zone size validity and compatibility */
+   if (!zone_sectors ||
+   !is_power_of_2(zone_sectors))
+   break;
+   if (ti_limits.chunk_sectors != zone_sectors) {
+   zone_sectors = ti_limits.chunk_sectors;
+   break;
+   }
+   }
+
+   }
+
+   if (i < num_targets) {
+   if (zone_model == -1)
+   DMWARN("%s: table line %u (start sect %llu len %llu) "
+  "has an incompatible zone model",
+  dm_device_name(table->md), i,
+  (unsigned long long) ti->begin,
+  (unsigned long long) ti->len);
+   else
+   DMWARN("%s: table line %u (start sect %llu len %llu) "
+  "has an incompatible zone size %u",
+  dm_device_name(table->md), i,
+  (unsigned long long) ti->begin,
+  (unsigned long long) ti->len,
+  zone_sectors);
+   return -EINVAL;
+   }
+
+   if (zone_model == BLK_ZONED_HA ||
+   zone_model == BLK_ZONED_HM) {
+   limits->zoned = zone_model;
+   limits->chunk_sectors = zone_sectors;
+   }
+
+   return 0;
+}
+
 int dm_table_add_target(struct dm_t

[PATCH v2 05/10] dm: Fix REQ_OP_ZONE_REPORT bio handling

2017-05-01 Thread damien . lemoal

From: Damien Le Moal 

A REQ_OP_ZONE_REPORT bio is not a medium access command. Its number of
sectors indicates the maximum size allowed for the report reply size
and not an amount of sectors accessed from the device.
REQ_OP_ZONE_REPORT bios should thus not be split depending on the
target device maximum I/O length but passed as is. Note that it is the
responsability of the target to remap and format the report reply.

Signed-off-by: Damien Le Moal 
Reviewed-by: Hannes Reinecke 
---
 drivers/md/dm.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 1d98035..cd44928 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1098,7 +1098,8 @@ static int clone_bio(struct dm_target_io *tio, struct bio 
*bio,
return r;
}
 
-   bio_advance(clone, to_bytes(sector - clone->bi_iter.bi_sector));
+   if (bio_op(bio) != REQ_OP_ZONE_REPORT)
+   bio_advance(clone, to_bytes(sector - clone->bi_iter.bi_sector));
clone->bi_iter.bi_size = to_bytes(len);
 
if (bio_integrity(bio))
@@ -1275,7 +1276,11 @@ static int __split_and_process_non_flush(struct 
clone_info *ci)
if (!dm_target_is_valid(ti))
return -EIO;
 
-   len = min_t(sector_t, max_io_len(ci->sector, ti), ci->sector_count);
+   if (bio_op(bio) == REQ_OP_ZONE_REPORT)
+   len = ci->sector_count;
+   else
+   len = min_t(sector_t, max_io_len(ci->sector, ti),
+   ci->sector_count);
 
r = __clone_and_map_data_bio(ci, ti, ci->sector, &len);
if (r < 0)
-- 
2.9.3

[PATCH v2 01/10] dm-table: Introduce DM_TARGET_ZONED_HM feature

2017-05-01 Thread damien . lemoal

From: Damien Le Moal 

The target drivers currently available will not operate correctly if a
table target maps onto a host-managed zoned block device.

To avoid problems, this patch introduces the new feature flag
DM_TARGET_ZONED_HM for a target driver to explicitly state that it
supports host-managed zoned block devices. This feature is checked
in dm_get_device() to prevent the addition to a table of a target
mapping to a host-managed zoned block device if the target type does
not have the feature enabled.

Note that as host-aware zoned block devices are backward compatible
with regular block devices, they can be used by any of the current
target types. This new feature is thus restricted to host-managed
zoned block devices.

Signed-off-by: Damien Le Moal 
Reviewed-by: Hannes Reinecke 
---
 drivers/md/dm-table.c | 23 +++
 include/linux/device-mapper.h |  6 ++
 2 files changed, 29 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 3ad16d9..06d3b7b 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -388,6 +388,24 @@ dev_t dm_get_dev_t(const char *path)
 EXPORT_SYMBOL_GPL(dm_get_dev_t);
 
 /*
+ * Check if the target supports supports host-managed zoned block devices.
+ */
+static bool device_supported(struct dm_target *ti, struct dm_dev *dev)
+{
+   struct block_device *bdev = dev->bdev;
+   char b[BDEVNAME_SIZE];
+
+   if (bdev_zoned_model(bdev) == BLK_ZONED_HM &&
+   !dm_target_zoned_hm(ti->type)) {
+   DMWARN("%s: Unsupported host-managed zoned block device %s",
+  dm_device_name(ti->table->md), bdevname(bdev, b));
+   return false;
+   }
+
+   return true;
+}
+
+/*
  * Add a device to the list, or just increment the usage count if
  * it's already present.
  */
@@ -426,6 +444,11 @@ int dm_get_device(struct dm_target *ti, const char *path, 
fmode_t mode,
}
atomic_inc(&dd->count);
 
+   if (!device_supported(ti, dd->dm_dev)) {
+   dm_put_device(ti, dd->dm_dev);
+   return -ENOTSUPP;
+   }
+
*result = dd->dm_dev;
return 0;
 }
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index a7e6903..b3c2408 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -214,6 +214,12 @@ struct target_type {
 #define dm_target_is_wildcard(type)((type)->features & DM_TARGET_WILDCARD)
 
 /*
+ * Indicates that a target supports host-managed zoned block devices.
+ */
+#define DM_TARGET_ZONED_HM 0x0010
+#define dm_target_zoned_hm(type)   ((type)->features & DM_TARGET_ZONED_HM)
+
+/*
  * Some targets need to be sent the same WRITE bio severals times so
  * that they can send copies of it to different devices.  This function
  * examines any supplied bio and returns the number of copies of it the
-- 
2.9.3

[PATCH v2 00/10] dm: zoned block device support

2017-05-01 Thread damien . lemoal

From: Damien Le Moal 

This series introduces zoned block device support to the device mapper
infrastructure. Pathces are as follows:

- Patch 1: Add a new target type feature flag to indicate if a target type
  supports host-managed zoned block devices. This prevents using these drives
  with the current target types since none of them have the proper support
  implemented and will not operate properly with these drives.
- Patch 2: If a target device is a zoned block device, check that the range of
  LBAs mapped is aligned to the device zone size and that the device start
  offset also aligns to zone boundaries. This is necessary for zone reset and
  zone report correct execution.
- Patch 3: Check that the different target devices of a table have compatible
  zone sizes and models. This is necessary for target types that expose a zone
  model different from the underlying device.
- Patch 4: Fix handling of REQ_OP_ZONE_RESET bios
- Patch 5: Fix handling of REQ_OP_ZONE_REPORT bios
- Patch 6: Introduce a new helper function to reverse map a device zone report
  to the target LBA range
- Patch 7: Add support for host-managed zoned block devices to dm-flakey. This
  is necessary for testing file systems supporting natively these drives (e.g.
  f2fs).
- Patch 8: Add support for for zoned block devices to dm-linear. This can have
  useful applications during development and testing (e.g. allow creating
  smaller zoned devices with different combinations and positions of zones).
  There are also interesting applications for production, for instance, the
  ability to aggregate conventional zones of different drives to create a
  regular disk.
- Patch 9: Add sequential write enforcement to dm_kcopyd_copy so that
  sequential zones of a host-managed zoned block device can be specified as
  destinations.
- Patch 10: New dm-zoned target type (this was already sent for review twice).
  This resend adds modifications suggested by Hannes to implement reclaim
  using dm-kcopyd. dm-zoned depends on patch 9.

As always, comments and reviews are welcome.

Changes from v1:
- Use for-loop in patch 3 as suggested by Bart
- Add memory shrinker to dm-zoned to shrink the metadata block cache under
  memory pressure (suggested by Bart)
- Added Hannes Reviewed-by tag

Damien Le Moal (10):
  dm-table: Introduce DM_TARGET_ZONED_HM feature
  dm-table: Check device area zone alignment
  dm-table: Check block devices zone model compatibility
  dm: Fix REQ_OP_ZONE_RESET bio handling
  dm: Fix REQ_OP_ZONE_REPORT bio handling
  dm: Introduce dm_remap_zone_report()
  dm-flakey: Add support for zoned block devices
  dm-linear: Add support for zoned block devices
  dm-kcopyd: Add sequential write feature
  dm-zoned: Drive-managed zoned block device target

 Documentation/device-mapper/dm-zoned.txt |  154 +++
 drivers/md/Kconfig   |   19 +
 drivers/md/Makefile  |2 +
 drivers/md/dm-flakey.c   |   21 +-
 drivers/md/dm-kcopyd.c   |   68 +-
 drivers/md/dm-linear.c   |   14 +-
 drivers/md/dm-table.c|  145 ++
 drivers/md/dm-zoned-io.c |  998 ++
 drivers/md/dm-zoned-metadata.c   | 2195 ++
 drivers/md/dm-zoned-reclaim.c|  535 
 drivers/md/dm-zoned.h|  528 +++
 drivers/md/dm.c  |   93 +-
 include/linux/device-mapper.h|   16 +
 include/linux/dm-kcopyd.h|1 +
 14 files changed, 4783 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/device-mapper/dm-zoned.txt
 create mode 100644 drivers/md/dm-zoned-io.c
 create mode 100644 drivers/md/dm-zoned-metadata.c
 create mode 100644 drivers/md/dm-zoned-reclaim.c
 create mode 100644 drivers/md/dm-zoned.h

-- 
2.9.3

[PATCH v2 02/10] dm-table: Check device area zone alignment

2017-05-01 Thread damien . lemoal

From: Damien Le Moal 

If a target maps to a zoned block device, check that the device area is
aligned on zone boundaries to avoid problems with REQ_OP_ZONE_RESET
operations (resetting a partially mapped sequential zone would not be
possible). This also greatly facilitate the processing of zone report
with REQ_OP_ZONE_REPORT bios.

Signed-off-by: Damien Le Moal 
Reviewed-by: Hannes Reinecke 
---
 drivers/md/dm-table.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 06d3b7b..6947f0f 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -339,6 +339,33 @@ static int device_area_is_invalid(struct dm_target *ti, 
struct dm_dev *dev,
return 1;
}
 
+   /*
+* If the target is mapped to a zoned block device, check
+* that the device zones are not partially mapped.
+*/
+   if (bdev_zoned_model(bdev) != BLK_ZONED_NONE) {
+   unsigned int zone_sectors = bdev_zone_sectors(bdev);
+
+   if (start & (zone_sectors - 1)) {
+   DMWARN("%s: start=%llu not aligned to h/w "
+  "zone size %u of %s",
+  dm_device_name(ti->table->md),
+  (unsigned long long)start,
+  zone_sectors, bdevname(bdev, b));
+   return 1;
+   }
+
+   if (start + len < dev_size &&
+   len & (zone_sectors - 1)) {
+   DMWARN("%s: len=%llu not aligned to h/w "
+  "zone size %u of %s",
+  dm_device_name(ti->table->md),
+  (unsigned long long)start,
+  zone_sectors, bdevname(bdev, b));
+   return 1;
+   }
+   }
+
return 0;
 }
 
-- 
2.9.3

Re: [PATCH v2 cosmetic] Remove trailing newline in elevator switch error message

2017-05-01 Thread mar...@trippelsdorf.de

Trying to switch to a non-existing elevator currently results in garbled
dmesg output, e.g.:

 # echo "foo" > /sys/block/sda/queue/scheduler 

elevator: type foo not found
elevator: switch to foo
 failed

(note the unintended line break.)

Fix by stripping the trailing newline.

Signed-off-by: Markus Trippelsdorf 

diff --git a/block/elevator.c b/block/elevator.c
index bf11e70f008b..1af23eee4395 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -1054,7 +1054,6 @@ static int elevator_switch(struct request_queue *q, 
struct elevator_type *new_e)
  */
 static int __elevator_change(struct request_queue *q, const char *name)
 {
-   char elevator_name[ELV_NAME_MAX];
struct elevator_type *e;
 
/*
@@ -1063,15 +1062,14 @@ static int __elevator_change(struct request_queue *q, 
const char *name)
if (q->mq_ops && !strncmp(name, "none", 4))
return elevator_switch(q, NULL);
 
-   strlcpy(elevator_name, name, sizeof(elevator_name));
-   e = elevator_get(strstrip(elevator_name), true);
+   e = elevator_get(name, true);
if (!e) {
-   printk(KERN_ERR "elevator: type %s not found\n", elevator_name);
+   printk(KERN_ERR "elevator: type %s not found\n", name);
return -EINVAL;
}
 
if (q->elevator &&
-   !strcmp(elevator_name, q->elevator->type->elevator_name)) {
+   !strcmp(name, q->elevator->type->elevator_name)) {
elevator_put(e);
return 0;
}
@@ -1112,16 +1110,19 @@ static inline bool elv_support_iosched(struct 
request_queue *q)
 ssize_t elv_iosched_store(struct request_queue *q, const char *name,
  size_t count)
 {
+   char elevator_name[ELV_NAME_MAX];
int ret;
 
if (!(q->mq_ops || q->request_fn) || !elv_support_iosched(q))
return count;
 
-   ret = __elevator_change(q, name);
+   strlcpy(elevator_name, name, sizeof(elevator_name));
+   strstrip(elevator_name);
+   ret = __elevator_change(q, elevator_name);
if (!ret)
return count;
 
-   printk(KERN_ERR "elevator: switch to %s failed\n", name);
+   printk(KERN_ERR "elevator: switch to %s failed\n", elevator_name);
return ret;
 }
 
-- 
Markus

Re: [PATCH 3/4] blk-mq: use hw tag for scheduling if hw tag space is big enough

2017-05-01 Thread Bart Van Assche

On Sat, 2017-04-29 at 18:35 +0800, Ming Lei wrote:
> On Fri, Apr 28, 2017 at 06:09:40PM +, Bart Van Assche wrote:
> > On Fri, 2017-04-28 at 23:15 +0800, Ming Lei wrote:
> > > +static inline bool blk_mq_sched_may_use_hw_tag(struct request_queue *q)
> > > +{
> > > + if (q->tag_set->flags & BLK_MQ_F_TAG_SHARED)
> > > + return false;
> > > +
> > > + if (blk_mq_get_queue_depth(q) < q->nr_requests)
> > > + return false;
> > > +
> > > + return true;
> > > +}
> > 
> > The only user of shared tag sets I know of is scsi-mq. I think it's really
> > unfortunate that this patch systematically disables 
> > BLK_MQ_F_SCHED_USE_HW_TAG
> > for scsi-mq.
> 
> In previous patch, I actually allow driver to pass this flag, but this
> feature is dropped in this post, just for making it simple & clean.
> If you think we need it for shared tag set, I can add it in v1.
> 
> For shared tag sets, I suggest to not enable it at default, because
> scheduler is per request queue now, and generaly more requests available,
> better it performs.  When tags are shared among several request
> queues, one of them may use tags up for its own scheduling, then
> starve others. But it should be possible and not difficult to allocate
> requests fairly for scheduling in this case if we switch to per-hctx
> scheduling.

Hello Ming,

Have you noticed that there is already a mechanism in the block layer to
avoid starvation if a tag set is shared? The hctx_may_queue() function
guarantees that each user that shares a tag set gets at least some tags.
The .active_queues counter keeps track of the number of hardware queues
that share a tag set.

Bart.

Re: [PATCH cosmetic] Remove trailing newline in elevator switch error message

2017-05-01 Thread Bart Van Assche

On Sat, 2017-04-29 at 07:38 +0200, Markus Trippelsdorf wrote:
> Trying to switch to a non-existing elevator currently results in garbled
> dmesg output, e.g.:
> 
>  # echo "foo" > /sys/block/sda/queue/scheduler 
> 
> elevator: type foo not found
> elevator: switch to foo
>  failed
> 
> (note the unintended line break.)
> 
> Fix by stripping the trailing newline.
> 
> diff --git a/block/elevator.c b/block/elevator.c
> index bf11e70f008b..4f13fcd3c626 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -1112,6 +1112,7 @@ static inline bool elv_support_iosched(struct 
> request_queue *q)
>  ssize_t elv_iosched_store(struct request_queue *q, const char *name,
> size_t count)
>  {
> + char elevator_name[ELV_NAME_MAX];
>   int ret;
>  
>   if (!(q->mq_ops || q->request_fn) || !elv_support_iosched(q))
> @@ -1121,7 +1122,9 @@ ssize_t elv_iosched_store(struct request_queue *q, 
> const char *name,
>   if (!ret)
>   return count;
>  
> - printk(KERN_ERR "elevator: switch to %s failed\n", name);
> + strlcpy(elevator_name, name, sizeof(elevator_name));
> + strstrip(elevator_name);
> + printk(KERN_ERR "elevator: switch to %s failed\n", elevator_name);
>   return ret;
>  }

Hello Markus,

Your patch duplicates the code to remove trailing whitespace which is not
very elegant. Please move the code that removes trailing whitespace from
__elevator_change() into  elv_iosched_store() instead of duplicating it.

Thanks,

Bart.

Re: [PATCH] block: Remove elevator_change()

2017-05-01 Thread Jens Axboe

On 05/01/2017 09:58 AM, Bart Van Assche wrote:
> Since commit 84253394927c ("remove the mg_disk driver") removed the
> only caller of elevator_change(), also remove the elevator_change()
> function itself.

Nice, thanks Bart. I thought s390 dasd used it as well, but looks like
that got killed a while back.

-- 
Jens Axboe

[PATCH] block: Remove elevator_change()

2017-05-01 Thread Bart Van Assche

Since commit 84253394927c ("remove the mg_disk driver") removed the
only caller of elevator_change(), also remove the elevator_change()
function itself.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Markus Trippelsdorf 
---
 block/elevator.c | 13 -
 include/linux/elevator.h |  1 -
 2 files changed, 14 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index c7a4ee682033..ab726a5c0bf6 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -1085,19 +1085,6 @@ static int __elevator_change(struct request_queue *q, 
const char *name)
return elevator_switch(q, e);
 }
 
-int elevator_change(struct request_queue *q, const char *name)
-{
-   int ret;
-
-   /* Protect q->elevator from elevator_init() */
-   mutex_lock(&q->sysfs_lock);
-   ret = __elevator_change(q, name);
-   mutex_unlock(&q->sysfs_lock);
-
-   return ret;
-}
-EXPORT_SYMBOL(elevator_change);
-
 static inline bool elv_support_iosched(struct request_queue *q)
 {
if (q->mq_ops && q->tag_set && (q->tag_set->flags &
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 3a216318ae73..d44840368ee7 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -214,7 +214,6 @@ extern ssize_t elv_iosched_store(struct request_queue *, 
const char *, size_t);
 
 extern int elevator_init(struct request_queue *, char *);
 extern void elevator_exit(struct request_queue *, struct elevator_queue *);
-extern int elevator_change(struct request_queue *, const char *);
 extern bool elv_bio_merge_ok(struct request *, struct bio *);
 extern struct elevator_queue *elevator_alloc(struct request_queue *,
struct elevator_type *);
-- 
2.12.2

Re: [PATCH cosmetic] Remove trailing newline in elevator switch error message

2017-05-01 Thread Bart Van Assche

On Mon, 2017-05-01 at 17:49 +0200, mar...@trippelsdorf.de wrote:
> On 2017.05.01 at 15:18 +, Bart Van Assche wrote:
> > Your patch duplicates the code to remove trailing whitespace which is not
> > very elegant. Please move the code that removes trailing whitespace from
> > __elevator_change() into  elv_iosched_store() instead of duplicating it.
> 
> Yes, I thought about it, but the problem is that elevator_change() calls 
> __elevator_change(), too.

Hello Markus,

In kernel v4.11 elevator_change() only has a single caller and that caller
passes an elevator name with no trailing whitespace. I am aware that the
elevator_change() function is exported and hence could be used by out-of-tree
kernel drivers. However, we do not care about potential behavior changes for
out-of-tree drivers.

Bart.

Re: [PATCH cosmetic] Remove trailing newline in elevator switch error message

2017-05-01 Thread mar...@trippelsdorf.de

On 2017.05.01 at 15:18 +, Bart Van Assche wrote:
> On Sat, 2017-04-29 at 07:38 +0200, Markus Trippelsdorf wrote:
> > Trying to switch to a non-existing elevator currently results in garbled
> > dmesg output, e.g.:
> > 
> >  # echo "foo" > /sys/block/sda/queue/scheduler 
> > 
> > elevator: type foo not found
> > elevator: switch to foo
> >  failed
> > 
> > (note the unintended line break.)
> > 
> > Fix by stripping the trailing newline.
> > 
> > diff --git a/block/elevator.c b/block/elevator.c
> > index bf11e70f008b..4f13fcd3c626 100644
> > --- a/block/elevator.c
> > +++ b/block/elevator.c
> > @@ -1112,6 +1112,7 @@ static inline bool elv_support_iosched(struct 
> > request_queue *q)
> >  ssize_t elv_iosched_store(struct request_queue *q, const char *name,
> >   size_t count)
> >  {
> > +   char elevator_name[ELV_NAME_MAX];
> > int ret;
> >  
> > if (!(q->mq_ops || q->request_fn) || !elv_support_iosched(q))
> > @@ -1121,7 +1122,9 @@ ssize_t elv_iosched_store(struct request_queue *q, 
> > const char *name,
> > if (!ret)
> > return count;
> >  
> > -   printk(KERN_ERR "elevator: switch to %s failed\n", name);
> > +   strlcpy(elevator_name, name, sizeof(elevator_name));
> > +   strstrip(elevator_name);
> > +   printk(KERN_ERR "elevator: switch to %s failed\n", elevator_name);
> > return ret;
> >  }
> 
> Hello Markus,
> 
> Your patch duplicates the code to remove trailing whitespace which is not
> very elegant. Please move the code that removes trailing whitespace from
> __elevator_change() into  elv_iosched_store() instead of duplicating it.

Yes, I thought about it, but the problem is that elevator_change() calls 
__elevator_change(), too.

-- 
Markus

Re: [PATCH 02/11] blk: make the bioset rescue_workqueue optional.

2017-05-01 Thread Jens Axboe

On 04/30/2017 11:00 PM, NeilBrown wrote:
> On Mon, Apr 24 2017, Christoph Hellwig wrote:
> 
>> On Mon, Apr 24, 2017 at 11:51:01AM +1000, NeilBrown wrote:
>>>
>>> I was following the existing practice exemplified by
>>> bioset_create_nobvec().
>>
>> Which is pretty ugly to start with..
> 
> That is a matter of personal taste.
> As such, it is up to the maintainer to change it if they want it
> changed.
> 
>>
>>> By not changing the signature of the function, I can avoid touching
>>> quite a few places where it is called.
>>
>> There are 13 callers of bioset_create and one caller of
>> bioset_create_nobvec, and your series touches many of those.
>>
>> So just adding a flags argument to bioset_create and passing
>> BIOSET_NEED_BVECS and BIOSET_NEED_RESUER flags to it doesn't seem
>> to much of an effort, and it's going to create a much nicer and easier
>> to extend interface.
> 
> If someone else submitted a patch to discard bioset_create_nobvec in
> favour of BIOSET_NEED_BVECS and got it accepted, then I would rebase my
> series on that.  As it is, I'm basing my patches on the style currently
> present in the tree.
> 
> Of course, if Jens says he'll only take my patches if I change to style
> to match your preference, I'll do that.

I generally tend to prefer tree wide cleanups to improve our APIs, even
if it does cause an extra bit of pain. Would you mind doing that as a
prep patch?

-- 
Jens Axboe

44 matches

Mail list logo