Re: [PATCH 25/27] block: remove the discard_zeroes_data flag

2017-05-02 Thread Nicholas A. Bellinger
On Tue, 2017-05-02 at 09:23 +0200, h...@lst.de wrote:
> On Tue, May 02, 2017 at 12:16:13AM -0700, Nicholas A. Bellinger wrote:
> > Or, another option is to use bdev_write_zeroes_sectors() to determine when
> > dev_attrib->unmap_zeroes_data should be set.
> 
> Yes, that in combination with your patch to use bdev_write_zeroes_sectors
> for zeroing from write same seems like the right fix.

The larger target/iblock conversion patch looks like post v4.12 material
at this point, so to avoid breakage wrt existing LBPRZ behavior, I'll
plan to push the following patch post -rc1.

diff --git a/drivers/target/target_core_device.c b/drivers/target/target_core_device.c
index d2f089c..e7caf78 100644
--- a/drivers/target/target_core_device.c
+++ b/drivers/target/target_core_device.c
@@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct se_dev_attrib *attrib,
attrib->unmap_granularity = q->limits.discard_granularity / block_size;
attrib->unmap_granularity_alignment = q->limits.discard_alignment /
block_size;
-   attrib->unmap_zeroes_data = 0;
+   attrib->unmap_zeroes_data = (q->limits.max_write_zeroes_sectors);
return true;
 }
 EXPORT_SYMBOL(target_configure_unmap_from_queue);



Re: [PATCH 03/13 V2] blk: make the bioset rescue_workqueue optional.

2017-05-02 Thread Ming Lei
On Wed, May 3, 2017 at 6:34 AM, NeilBrown  wrote:
> From 09017acf74ec4df674b78ca66f0924187f10d8a4 Mon Sep 17 00:00:00 2001
> From: NeilBrown 
> Date: Fri, 10 Mar 2017 13:59:50 +1100
> Subject: [PATCH] blk: make the bioset rescue_workqueue optional.
>
> This patch converts bioset_create() to not create a workqueue by
> default, so allocations will never trigger punt_bios_to_rescuer().  It
> also introduces a new flag BIOSET_NEED_RESCUER which tells
> bioset_create() to preserve the old behavior.
>
> All callers of bioset_create() that are inside block device drivers
> are given the BIOSET_NEED_RESCUER flag.
>
> biosets used by filesystems or other top-level users do not
> need rescuing as the bio can never be queued behind other
> bios.  This includes fs_bio_set, blkdev_dio_pool,
> btrfs_bioset, xfs_ioend_bioset, and one allocated by
> target_core_iblock.c.
>
> biosets used by md/raid do not need rescuing as
> their usage was recently audited and revised to never
> risk deadlock.
>
> It is hoped that most, if not all, of the remaining biosets
> can end up being the non-rescued version.
>
> Reviewed-by: Christoph Hellwig 
> Credit-to: Ming Lei  (minor fixes)
> Signed-off-by: NeilBrown 

Reviewed-by: Ming Lei 


Thanks,
Ming


Re: [PATCH 13/13] block: don't check for BIO_MAX_PAGES in blk_bio_segment_split()

2017-05-02 Thread Ming Lei
On Wed, May 03, 2017 at 08:54:55AM +1000, NeilBrown wrote:
> On Tue, May 02 2017, Ming Lei wrote:
> 
> > On Tue, May 02, 2017 at 01:42:26PM +1000, NeilBrown wrote:
> >> blk_bio_segment_split() makes sure bios have no more than
> >> BIO_MAX_PAGES entries in the bi_io_vec.
> >> This was done because bio_clone_bioset() (when given a
> >> mempool bioset) could not handle larger io_vecs.
> >> 
> >> No driver uses bio_clone_bioset() any more, they all
> >> use bio_clone_fast() if anything, and bio_clone_fast()
> >> doesn't clone the bi_io_vec.
> >
> Maybe in future some drivers may try to use
> bio_clone_bioset() again, so I suggest adding comments
> on bio_clone_bioset() to make this usage explicit. It would
> also be better to trigger a warning if a big src bio is passed to
> bio_clone_bioset().
> 
> There are now just two users for bio_clone_bioset(): bounce.c and btrfs.
> 
> Christoph wants to get rid of bounce.c, which would leave one.
> 
> I'd have to drill into the btrfs code to be sure, but it might be that
> btrfs only needs bio_clone_fast().  That would leave zero users.
> Then we wouldn't need a warning at all.
> 
> So I agree that we need to guard against future incorrect usage.  I'm not
> yet sure what the best approach is.

I think it would be helpful to simply document this function as obsolete in a comment.


Thanks,
Ming


[PATCH 07/13 V2] drbd: use bio_clone_fast() instead of bio_clone()

2017-05-02 Thread NeilBrown

drbd does not modify the bi_io_vec of the cloned bio,
so there is no need to clone that part.  So bio_clone_fast()
is the better choice.
For bio_clone_fast() we need to specify a bio_set.
We could use fs_bio_set, which bio_clone() uses, or
drbd_md_io_bio_set, which drbd uses for metadata, but it is
generally best to avoid sharing bio_sets unless you can
be certain that there are no interdependencies.

So create a new bio_set, drbd_io_bio_set, and use bio_clone_fast().

Also remove a "XXX cannot fail ???" comment because it definitely
cannot fail - bio_clone_fast() doesn't fail if the GFP flags allow for
sleeping.

Reviewed-by: Christoph Hellwig 
Signed-off-by: NeilBrown 
---

This patch needed to be refreshed after long lines were
wrapped in an earlier patch.
Also added the BIOSET_NEED_RESCUER flag for new bioset as
it is not trivially obvious that this isn't needed.

NeilBrown


 drivers/block/drbd/drbd_int.h  | 3 +++
 drivers/block/drbd/drbd_main.c | 9 +
 drivers/block/drbd/drbd_req.h  | 2 +-
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index d5da45bb03a6..f91982515a6b 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1441,6 +1441,9 @@ extern struct bio_set *drbd_md_io_bio_set;
 /* to allocate from that set */
 extern struct bio *bio_alloc_drbd(gfp_t gfp_mask);
 
+/* And a bio_set for cloning */
+extern struct bio_set *drbd_io_bio_set;
+
 extern struct mutex resources_mutex;
 
 extern int conn_lowest_minor(struct drbd_connection *connection);
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index bdf51b6977cf..90680034ef57 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -128,6 +128,7 @@ mempool_t *drbd_request_mempool;
 mempool_t *drbd_ee_mempool;
 mempool_t *drbd_md_io_page_pool;
 struct bio_set *drbd_md_io_bio_set;
+struct bio_set *drbd_io_bio_set;
 
 /* I do not use a standard mempool, because:
1) I want to hand out the pre-allocated objects first.
@@ -2098,6 +2099,8 @@ static void drbd_destroy_mempools(void)
 
/* D_ASSERT(device, atomic_read(&drbd_pp_vacant)==0); */
 
+   if (drbd_io_bio_set)
+   bioset_free(drbd_io_bio_set);
if (drbd_md_io_bio_set)
bioset_free(drbd_md_io_bio_set);
if (drbd_md_io_page_pool)
@@ -2115,6 +2118,7 @@ static void drbd_destroy_mempools(void)
if (drbd_al_ext_cache)
kmem_cache_destroy(drbd_al_ext_cache);
 
+   drbd_io_bio_set  = NULL;
drbd_md_io_bio_set   = NULL;
drbd_md_io_page_pool = NULL;
drbd_ee_mempool  = NULL;
@@ -2142,6 +2146,7 @@ static int drbd_create_mempools(void)
drbd_pp_pool = NULL;
drbd_md_io_page_pool = NULL;
drbd_md_io_bio_set   = NULL;
+   drbd_io_bio_set  = NULL;
 
/* caches */
drbd_request_cache = kmem_cache_create(
@@ -2165,6 +2170,10 @@ static int drbd_create_mempools(void)
goto Enomem;
 
/* mempools */
+   drbd_io_bio_set = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_RESCUER);
+   if (drbd_io_bio_set == NULL)
+   goto Enomem;
+
drbd_md_io_bio_set = bioset_create(DRBD_MIN_POOL_PAGES, 0,
   BIOSET_NEED_BVECS |
   BIOSET_NEED_RESCUER);
diff --git a/drivers/block/drbd/drbd_req.h b/drivers/block/drbd/drbd_req.h
index eb49e7f2da91..9e1866ab238f 100644
--- a/drivers/block/drbd/drbd_req.h
+++ b/drivers/block/drbd/drbd_req.h
@@ -263,7 +263,7 @@ enum drbd_req_state_bits {
 static inline void drbd_req_make_private_bio(struct drbd_request *req, struct bio *bio_src)
 {
struct bio *bio;
-   bio = bio_clone(bio_src, GFP_NOIO); /* XXX cannot fail?? */
+   bio = bio_clone_fast(bio_src, GFP_NOIO, drbd_io_bio_set);
 
req->private_bio = bio;
 
-- 
2.12.2





Re: [PATCH 13/13] block: don't check for BIO_MAX_PAGES in blk_bio_segment_split()

2017-05-02 Thread NeilBrown
On Tue, May 02 2017, Ming Lei wrote:

> On Tue, May 02, 2017 at 01:42:26PM +1000, NeilBrown wrote:
>> blk_bio_segment_split() makes sure bios have no more than
>> BIO_MAX_PAGES entries in the bi_io_vec.
>> This was done because bio_clone_bioset() (when given a
>> mempool bioset) could not handle larger io_vecs.
>> 
>> No driver uses bio_clone_bioset() any more, they all
>> use bio_clone_fast() if anything, and bio_clone_fast()
>> doesn't clone the bi_io_vec.
>
> Maybe in future some drivers may try to use
> bio_clone_bioset() again, so I suggest adding comments
> on bio_clone_bioset() to make this usage explicit. It would
> also be better to trigger a warning if a big src bio is passed to
> bio_clone_bioset().

There are now just two users for bio_clone_bioset(): bounce.c and btrfs.

Christoph wants to get rid of bounce.c, which would leave one.

I'd have to drill into the btrfs code to be sure, but it might be that
btrfs only needs bio_clone_fast().  That would leave zero users.
Then we wouldn't need a warning at all.

So I agree that we need to guard against future incorrect usage.  I'm not
yet sure what the best approach is.

Thanks,
NeilBrown




[PATCH 03/13 V2] blk: make the bioset rescue_workqueue optional.

2017-05-02 Thread NeilBrown
From 09017acf74ec4df674b78ca66f0924187f10d8a4 Mon Sep 17 00:00:00 2001
From: NeilBrown 
Date: Fri, 10 Mar 2017 13:59:50 +1100
Subject: [PATCH] blk: make the bioset rescue_workqueue optional.

This patch converts bioset_create() to not create a workqueue by
default, so allocations will never trigger punt_bios_to_rescuer().  It
also introduces a new flag BIOSET_NEED_RESCUER which tells
bioset_create() to preserve the old behavior.

All callers of bioset_create() that are inside block device drivers
are given the BIOSET_NEED_RESCUER flag.

biosets used by filesystems or other top-level users do not
need rescuing as the bio can never be queued behind other
bios.  This includes fs_bio_set, blkdev_dio_pool,
btrfs_bioset, xfs_ioend_bioset, and one allocated by
target_core_iblock.c.

biosets used by md/raid do not need rescuing as
their usage was recently audited and revised to never
risk deadlock.

It is hoped that most, if not all, of the remaining biosets
can end up being the non-rescued version.

Reviewed-by: Christoph Hellwig 
Credit-to: Ming Lei  (minor fixes)
Signed-off-by: NeilBrown 
---

This version fixes the WARN_ON_ONCE() typo, updates the above comment,
wraps long lines, and preserves the rescuer behavior for DRBD.

Thanks for the review.

NeilBrown


 block/bio.c| 13 +++--
 block/blk-core.c   |  3 ++-
 drivers/block/drbd/drbd_main.c |  4 +++-
 drivers/md/bcache/super.c  |  8 ++--
 drivers/md/dm-crypt.c  |  3 ++-
 drivers/md/dm-io.c |  3 ++-
 drivers/md/dm.c|  5 +++--
 include/linux/bio.h|  1 +
 8 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 3f1286ed301e..f3a6ef4f5c42 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -363,6 +363,8 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
struct bio_list punt, nopunt;
struct bio *bio;
 
+   if (WARN_ON_ONCE(!bs->rescue_workqueue))
+   return;
/*
 * In order to guarantee forward progress we must punt only bios that
 * were allocated from this bio_set; otherwise, if there was a bio on
@@ -474,7 +476,8 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned int nr_iovecs,
 
if (current->bio_list &&
(!bio_list_empty(&current->bio_list[0]) ||
-!bio_list_empty(&current->bio_list[1])))
+!bio_list_empty(&current->bio_list[1])) &&
+   bs->rescue_workqueue)
gfp_mask &= ~__GFP_DIRECT_RECLAIM;
 
p = mempool_alloc(bs->bio_pool, gfp_mask);
@@ -1925,7 +1928,8 @@ EXPORT_SYMBOL(bioset_free);
  * bioset_create  - Create a bio_set
  * @pool_size: Number of bio and bio_vecs to cache in the mempool
  * @front_pad: Number of bytes to allocate in front of the returned bio
- * @flags: Flags to modify behavior, currently only %BIOSET_NEED_BVECS
+ * @flags: Flags to modify behavior, currently %BIOSET_NEED_BVECS
+ *  and %BIOSET_NEED_RESCUER
  *
  * Description:
  *Set up a bio_set to be used with @bio_alloc_bioset. Allows the caller
@@ -1936,6 +1940,8 @@ EXPORT_SYMBOL(bioset_free);
  *or things will break badly.
  *If %BIOSET_NEED_BVECS is set in @flags, a separate pool will be allocated
 *for allocating iovecs.  This pool is not needed e.g. for bio_clone_fast().
+ *If %BIOSET_NEED_RESCUER is set, a workqueue is created which can be used to
+ *dispatch queued requests when the mempool runs out of space.
  *
  */
 struct bio_set *bioset_create(unsigned int pool_size,
@@ -1971,6 +1977,9 @@ struct bio_set *bioset_create(unsigned int pool_size,
goto bad;
}
 
+   if (!(flags & BIOSET_NEED_RESCUER))
+   return bs;
+
bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
if (!bs->rescue_workqueue)
goto bad;
diff --git a/block/blk-core.c b/block/blk-core.c
index 3797753f4085..bc66cd595bef 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -730,7 +730,8 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
if (q->id < 0)
goto fail_q;
 
-   q->bio_split = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
+   q->bio_split = bioset_create(BIO_POOL_SIZE, 0, (BIOSET_NEED_BVECS |
+   BIOSET_NEED_RESCUER));
if (!q->bio_split)
goto fail_id;
 
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index b395fe391171..bdf51b6977cf 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2165,7 +2165,9 @@ static int drbd_create_mempools(void)
goto Enomem;
 
/* mempools */
-   drbd_md_io_bio_set = bioset_create(DRBD_MIN_POOL_PAGES, 0, BIOSET_NEED_BVECS);
+   drbd_md_io_bio_set = 

Re: [PATCH 03/13] blk: make the bioset rescue_workqueue optional.

2017-05-02 Thread NeilBrown
On Tue, May 02 2017, Ming Lei wrote:

> On Tue, May 02, 2017 at 01:42:24PM +1000, NeilBrown wrote:
>> This patch converts bioset_create() and
>> bioset_create_nobvec() to not create a workqueue so
>> allocations will never trigger punt_bios_to_rescuer().  It
>> also introduces bioset_create_rescued() and
>> bioset_create_nobvec_rescued() which preserve the old
>> behaviour.
>> 
>> All callers of bioset_create() and bioset_create_nobvec(),
>> that are inside block device drivers, are converted to the
>> _rescued() version.
>
> The above comment about *_nobvec() need to be updated.

Yes - thanks.  I'll post an updated version.


>
>> 
>> biosets used by filesystems or other top-level users do not
>> need rescuing as the bio can never be queued behind other
>> bios.  This includes fs_bio_set, blkdev_dio_pool,
>> btrfs_bioset, xfs_ioend_bioset, drbd_md_io_bio_set,
>> and one allocated by target_core_iblock.c.
>> 
>> biosets used by md/raid do not need rescuing as
>> their usage was recently audited and revised to never
>> risk deadlock.
>> 
>> It is hoped that most, if not all, of the remaining biosets
>> can end up being the non-rescued version.
>> 
>> Signed-off-by: NeilBrown 
>> ---
>>  block/bio.c   |   12 ++--
>>  block/blk-core.c  |2 +-
>>  drivers/md/bcache/super.c |6 --
>>  drivers/md/dm-crypt.c |2 +-
>>  drivers/md/dm-io.c|2 +-
>>  drivers/md/dm.c   |5 +++--
>>  include/linux/bio.h   |1 +
>>  7 files changed, 21 insertions(+), 9 deletions(-)
>
> DRBD can stack on block device too, so I guess it might need
> BIOSET_NEED_RESCUER?

A bioset only needs a rescuer if an allocation is made from the bioset
inside a make_request_fn.
The bioset in DRBD, drbd_md_io_bio_set, is used for metadata IO, and
I think it is only used in a worker thread, where rescuing cannot help.
But that is probably subtle enough that I should leave the removal
of the rescuer to a subsequent patch.
So I'll give DRBD the flag when I update the changelog comment.

Thanks,
NeilBrown




Re: [PATCH v2 10/10] dm-zoned: Drive-managed zoned block device target

2017-05-02 Thread Bart Van Assche
On Tue, 2017-05-02 at 02:53 +0900, damien.lem...@wdc.com wrote:
> +static unsigned long dmz_mblock_shrinker_count(struct shrinker *shrink,
> +  struct shrink_control *sc)
> +{
> +   struct dmz_target *dmz =
> +   container_of(shrink, struct dmz_target, mblk_shrinker);
> +
> +   return atomic_read(&dmz->nr_mblks);
> +}

Hello Damien,

dmz_mblock_shrinker_count() probably should return the following value, since
dmz_shrink_mblock_cache() won't free more than this number of elements:

max(atomic_read(&dmz->nr_mblks) - dmz->min_nr_mblks, 0)

But since v2 is IMHO good enough to be merged, for the whole series:

Reviewed-by: Bart Van Assche 

Re: [PATCH 09/13] lightnvm/pblk-read: use bio_clone_fast()

2017-05-02 Thread NeilBrown
On Tue, May 02 2017, Javier González wrote:

>> On 2 May 2017, at 05.42, NeilBrown  wrote:
>> 
>> pblk_submit_read() uses bio_clone_bioset() but doesn't change the
>> io_vec, so bio_clone_fast() is a better choice.
>> 
>> It also uses fs_bio_set which is intended for filesystems.  Using it
>> in a device driver can deadlock.
>> So allocate a new bioset, and use bio_clone_fast().
>> 
>> Signed-off-by: NeilBrown 
>> ---
>> drivers/lightnvm/pblk-init.c |   12 +++-
>> drivers/lightnvm/pblk-read.c |2 +-
>> drivers/lightnvm/pblk.h  |1 +
>> 3 files changed, 13 insertions(+), 2 deletions(-)
>> 
>> diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
>> index b3fec8ec55b8..aaefbccce30e 100644
>> --- a/drivers/lightnvm/pblk-init.c
>> +++ b/drivers/lightnvm/pblk-init.c
>> @@ -23,6 +23,7 @@
>> static struct kmem_cache *pblk_blk_ws_cache, *pblk_rec_cache, *pblk_r_rq_cache,
>>  *pblk_w_rq_cache, *pblk_line_meta_cache;
>> static DECLARE_RWSEM(pblk_lock);
>> +struct bio_set *pblk_bio_set;
>> 
>> static int pblk_rw_io(struct request_queue *q, struct pblk *pblk,
>>struct bio *bio)
>> @@ -946,11 +947,20 @@ static struct nvm_tgt_type tt_pblk = {
>> 
>> static int __init pblk_module_init(void)
>> {
>> -return nvm_register_tgt_type(&tt_pblk);
>> +int ret;
>> +
>> +pblk_bio_set = bioset_create(BIO_POOL_SIZE, 0, 0);
>> +if (!pblk_bio_set)
>> +return -ENOMEM;
>> +ret = nvm_register_tgt_type(&tt_pblk);
>> +if (ret)
>> +bioset_free(pblk_bio_set);
>> +return ret;
>> }
>> 
>> static void pblk_module_exit(void)
>> {
>> +bioset_free(pblk_bio_set);
>>  nvm_unregister_tgt_type(&tt_pblk);
>> }
>> 
>> diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
>> index 4a12f14d78c6..8ee0c50a7a71 100644
>> --- a/drivers/lightnvm/pblk-read.c
>> +++ b/drivers/lightnvm/pblk-read.c
>> @@ -342,7 +342,7 @@ int pblk_submit_read(struct pblk *pblk, struct bio *bio)
>>  struct pblk_r_ctx *r_ctx = nvm_rq_to_pdu(rqd);
>> 
>>  /* Clone read bio to deal with read errors internally */
>> -int_bio = bio_clone_bioset(bio, GFP_KERNEL, fs_bio_set);
>> +int_bio = bio_clone_fast(bio, GFP_KERNEL, pblk_bio_set);
>>  if (!int_bio) {
>>  pr_err("pblk: could not clone read bio\n");
>>  return NVM_IO_ERR;
>> diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
>> index 99f3186b5288..95b665f23925 100644
>> --- a/drivers/lightnvm/pblk.h
>> +++ b/drivers/lightnvm/pblk.h
>> @@ -702,6 +702,7 @@ void pblk_write_should_kick(struct pblk *pblk);
>> /*
>>  * pblk read path
>>  */
>> +extern struct bio_set *pblk_bio_set;
>> int pblk_submit_read(struct pblk *pblk, struct bio *bio);
>> int pblk_submit_read_gc(struct pblk *pblk, u64 *lba_list, void *data,
>>  unsigned int nr_secs, unsigned int *secs_to_gc,
>
> Hi Neil,
>
> Looks good. Thanks for fixing this. I did not know that bio_clone_bioset
> was not supposed to be used in drivers.

Prior to my patchset, using bio_clone_bioset() wasn't wrong in drivers,
though it was a waste when bio_clone_fast() would do just as well and is
more efficient.
After my patchset, using it can be problematic.  I'm wondering what I
should do to make those problems more visible, so that if people do use
it, they'll get a warning or something.

Thanks,
NeilBrown


>
> Reviewed-by: Javier González 
> Tested-by: Javier González 




Re: [PATCH 02/13] blk: replace bioset_create_nobvec() with a flags arg to bioset_create()

2017-05-02 Thread NeilBrown
On Tue, May 02 2017, Christoph Hellwig wrote:

>> diff --git a/include/linux/bio.h b/include/linux/bio.h
>> index d1b04b0e99cf..0975da6bebd9 100644
>> --- a/include/linux/bio.h
>> +++ b/include/linux/bio.h
>> @@ -373,8 +373,10 @@ static inline struct bio *bio_next_split(struct bio 
>> *bio, int sectors,
>>  return bio_split(bio, sectors, gfp, bs);
>>  }
>>  
>> -extern struct bio_set *bioset_create(unsigned int, unsigned int);
>> -extern struct bio_set *bioset_create_nobvec(unsigned int, unsigned int);
>> +extern struct bio_set *bioset_create(unsigned int, unsigned int, int flags);
>> +enum {
>> +BIOSET_NEED_BVECS = BIT(0),
>> +};
>
> I really hate the BIT macro as it obfuscates what's going on.

So post a patch to remove it from the tree.
I happen to like it, but I wouldn't fight for it.

>
> Why not just
>
>   BIOSET_NEED_BVECS   = (1 << 0),
>
> which is a lot more intuitive.
>
> Otherwise looks fine to me:
>
> Reviewed-by: Christoph Hellwig 

Thanks,
NeilBrown




Re: BUG: KASAN: use-after-free in scsi_exit_rq

2017-05-02 Thread Bart Van Assche
On Tue, 2017-05-02 at 16:41 +0200, Jan Kara wrote:
> So I'm also not aware of any particular breakage this would cause. However
> logically the freeing of request mempools really belongs in
> blk_release_queue(), so it seems a bit dumb to move blk_exit_rl() just
> because SCSI stores the slab cache it allocated the sense buffer from in a
> structure (shost) that it frees out from under itself by the time
> blk_release_queue() is called. :-|

Hello Jan,

My concern when I wrote my previous e-mail was that I didn't want to add a
scsi_host_get() / scsi_host_put() pair to the hot path in the SCSI core. But
I just realized that scsi_init_rq() and scsi_exit_rq() are not in the hot
path so adding a scsi_host_get() / scsi_host_put() pair should work fine. I
will post a patch.

Bart.


Re: [PATCH] block: don't call blk_mq_quiesce_queue() after queue is freezed

2017-05-02 Thread Bart Van Assche
On Tue, 2017-05-02 at 07:28 +0800, Ming Lei wrote:
> After the queue is frozen, no request in this queue can be in use
> at all, so there can't be any .queue_rq() running on this queue.
> It isn't necessary to call blk_mq_quiesce_queue() any more, so
> remove it in both elevator_switch_mq() and blk_mq_update_nr_requests().

Reviewed-by: Bart Van Assche 


Re: [PATCH] block: don't call blk_mq_quiesce_queue() after queue is freezed

2017-05-02 Thread Jens Axboe
On 05/01/2017 05:28 PM, Ming Lei wrote:
> After the queue is frozen, no request in this queue can be in use
> at all, so there can't be any .queue_rq() running on this queue.
> It isn't necessary to call blk_mq_quiesce_queue() any more, so
> remove it in both elevator_switch_mq() and blk_mq_update_nr_requests().

Added, thanks Ming.

-- 
Jens Axboe



Re: [PATCH ] block: don't call blk_mq_quiesce_queue() during switching mq sched

2017-05-02 Thread Jens Axboe
On 04/28/2017 09:59 AM, Bart Van Assche wrote:
> On Fri, 2017-04-28 at 07:42 -0600, Jens Axboe wrote:
>> On 04/28/2017 01:32 AM, Ming Lei wrote:
>>> We have frozen the queue already, so it is not necessary to call
>>> blk_mq_quiesce_queue() any more; remove it.
>>
>> Are you sure? It ensures that we also aren't in the middle of
>> blk_mq_make_request(), we need a stable view of the sched
>> status throughout that.
> 
> Hello Jens,
> 
> My understanding is that blk_mq_freeze_queue() provides stronger
> guarantees than blk_mq_quiesce_queue(). The former waits until all
> pending requests have finished while the latter only waits until
> pending .queue_rq() calls have finished. blk_mq_freeze_queue() also
> causes new blk_get_request() calls to wait until
> blk_mq_unfreeze_queue() is called while blk_get_request() can still
> succeed after blk_mq_quiesce_queue() returned and before
> blk_mq_start_stopped_hw_queues() is called.
> 
> Regarding blk_mq_make_request(): I think that the blk_queue_enter()
> call in generic_make_request() prevents that blk_mq_make_request()
> gets called after a queue has been frozen.

Bart, you are right, I'm fine with the patch.

-- 
Jens Axboe



Re: Treat REQ_FUA and REQ_PREFLUSH as synchronous

2017-05-02 Thread Jan Kara
On Tue 02-05-17 08:18:15, Christoph Hellwig wrote:
> On Tue, May 02, 2017 at 05:15:58PM +0200, Jan Kara wrote:
> > in generic_make_request_checks()? Or just set it unconditionally in that
> > function if we see REQ_FUA | REQ_PREFLUSH set?
> 
> Just add REQ_FUA and REQ_PREFLUSH to the test in op_is_sync.

It is there. The problem happens when REQ_FUA and REQ_PREFLUSH get stripped
from the bio because the underlying storage doesn't have a volatile cache;
the bio suddenly becomes treated as async, which regresses performance.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 6/6] blk-mq-debugfs: Add 'kick' operation

2017-05-02 Thread Jens Axboe
On 05/02/2017 10:23 AM, Bart Van Assche wrote:
> On Mon, 2017-05-01 at 18:24 -0600, Jens Axboe wrote:
>> On 05/01/2017 06:19 PM, Omar Sandoval wrote:
>>> On Thu, Apr 27, 2017 at 08:54:37AM -0700, Bart Van Assche wrote:
 Running a queue causes the block layer to examine the per-CPU and
 hw queues but not the requeue list. Hence add a 'kick' operation
 that also examines the requeue list.
>>>
>>> The naming of these operations isn't super intuitive, but it makes
>>> enough sense if you know the code, I guess.
>>
>> I don't worry about that too much, but I do think it's important
>> that we have some way of knowing WHAT commands are available
>> for a given kernel, without having to consult the source. It's
>> no big deal you're running the latest and greatest debug kernels,
>> but it's a bigger issue if you're debugging kernel x.y.z for
>> a customer and you have to consult the source to find them.
>>
>> Can we include it in the show output?
> 
> Hello Jens,
> 
> Sorry but I'm afraid that including the list of supported commands in the
> 'show' output would make that output harder to read. Figuring out what the
> supported commands are is not that hard as the output below shows:
> 
> # dmesg -c >/dev/null; for d in /sys/kernel/debug/block/*/mq/state; do \
>   if [ -e "$d" ]; then echo help >$d 2>/dev/null; break; fi; done; dmesg
> blk_queue_flags_store: unsupported operation help. Use 'run', 'start' or 'kick'

Ah perfect, I missed that. That's perfectly fine, all I care about is
that there is some way to tell what the valid commands are, without
having to find the source.

-- 
Jens Axboe



Re: [PATCH 6/6] blk-mq-debugfs: Add 'kick' operation

2017-05-02 Thread Bart Van Assche
On Mon, 2017-05-01 at 18:24 -0600, Jens Axboe wrote:
> On 05/01/2017 06:19 PM, Omar Sandoval wrote:
> > On Thu, Apr 27, 2017 at 08:54:37AM -0700, Bart Van Assche wrote:
> > > Running a queue causes the block layer to examine the per-CPU and
> > > hw queues but not the requeue list. Hence add a 'kick' operation
> > > that also examines the requeue list.
> > 
> > The naming of these operations isn't super intuitive, but it makes
> > enough sense if you know the code, I guess.
> 
> I don't worry about that too much, but I do think it's important
> that we have some way of knowing WHAT commands are available
> for a given kernel, without having to consult the source. It's
> no big deal you're running the latest and greatest debug kernels,
> but it's a bigger issue if you're debugging kernel x.y.z for
> a customer and you have to consult the source to find them.
> 
> Can we include it in the show output?

Hello Jens,

Sorry but I'm afraid that including the list of supported commands in the
'show' output would make that output harder to read. Figuring out what the
supported commands are is not that hard as the output below shows:

# dmesg -c >/dev/null; for d in /sys/kernel/debug/block/*/mq/state; do \
  if [ -e "$d" ]; then echo help >$d 2>/dev/null; break; fi; done; dmesg
blk_queue_flags_store: unsupported operation help. Use 'run', 'start' or 'kick'

Bart.

Re: [PATCH 5/6] blk-mq-debugfs: Show busy requests

2017-05-02 Thread Bart Van Assche
On Mon, 2017-05-01 at 17:17 -0700, Omar Sandoval wrote:
> On Thu, Apr 27, 2017 at 08:54:36AM -0700, Bart Van Assche wrote:
> > +   if (q->queue_hw_ctx[rq->mq_ctx->index_hw] == ctx->hctx &&
> 
> This doesn't look right. ctx->index_hw is the index into hctx->ctxs for
> this ctx. I think you mean blk_mq_map_queue(q, rq->mq_ctx->cpu).

Hello Omar,

Thanks for the feedback. I will integrate this fix when I repost this patch
series (after the merge window has closed).

Bart.

[PATCH 2/2] mtip32xx: convert internal commands to regular block infrastructure

2017-05-02 Thread Jens Axboe
Get rid of the private end_io handlers and data, and just use the
regular block IO path for these requests. This removes a lot of
redundant code.

Signed-off-by: Jens Axboe 
---
 drivers/block/mtip32xx/mtip32xx.c | 198 ++
 drivers/block/mtip32xx/mtip32xx.h |  10 --
 2 files changed, 30 insertions(+), 178 deletions(-)

diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 3204623f746a..3a779a4f5653 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -205,66 +205,12 @@ static struct mtip_cmd *mtip_get_int_command(struct driver_data *dd)
return blk_mq_rq_to_pdu(rq);
 }
 
-static void mtip_put_int_command(struct driver_data *dd, struct mtip_cmd *cmd)
-{
-   blk_put_request(blk_mq_rq_from_pdu(cmd));
-}
-
-/*
- * Once we add support for one hctx per mtip group, this will change a bit
- */
-static struct request *mtip_rq_from_tag(struct driver_data *dd,
-   unsigned int tag)
-{
-   struct blk_mq_hw_ctx *hctx = dd->queue->queue_hw_ctx[0];
-
-   return blk_mq_tag_to_rq(hctx->tags, tag);
-}
-
 static struct mtip_cmd *mtip_cmd_from_tag(struct driver_data *dd,
  unsigned int tag)
 {
-   struct request *rq = mtip_rq_from_tag(dd, tag);
-
-   return blk_mq_rq_to_pdu(rq);
-}
-
-/*
- * IO completion function.
- *
- * This completion function is called by the driver ISR when a
- * command that was issued by the kernel completes. It first calls the
- * asynchronous completion function which normally calls back into the block
- * layer passing the asynchronous callback data, then unmaps the
- * scatter list associated with the completed command, and finally
- * clears the allocated bit associated with the completed command.
- *
- * @port   Pointer to the port data structure.
- * @tagTag of the command.
- * @data   Pointer to driver_data.
- * @status Completion status.
- *
- * return value
- * None
- */
-static void mtip_async_complete(struct mtip_port *port,
-   int tag, struct mtip_cmd *cmd, int status)
-{
-   struct driver_data *dd = port->dd;
-   struct request *rq;
-
-   if (unlikely(!dd) || unlikely(!port))
-   return;
-
-   if (unlikely(status == PORT_IRQ_TF_ERR)) {
-   dev_warn(&port->dd->pdev->dev,
-   "Command tag %d failed due to TFE\n", tag);
-   }
-
-   rq = mtip_rq_from_tag(dd, tag);
+   struct blk_mq_hw_ctx *hctx = dd->queue->queue_hw_ctx[0];
 
-   cmd->status = status;
-   blk_mq_complete_request(rq);
+   return blk_mq_rq_to_pdu(blk_mq_tag_to_rq(hctx->tags, tag));
 }
 
 /*
@@ -581,38 +527,19 @@ static void print_tags(struct driver_data *dd,
"%d command(s) %s: tagmap [%s]", cnt, msg, tagmap);
 }
 
-/*
- * Internal command completion callback function.
- *
- * This function is normally called by the driver ISR when an internal
- * command completed. This function signals the command completion by
- * calling complete().
- *
- * @port   Pointer to the port data structure.
- * @tagTag of the command that has completed.
- * @data   Pointer to a completion structure.
- * @status Completion status.
- *
- * return value
- * None
- */
-static void mtip_completion(struct mtip_port *port,
-   int tag, struct mtip_cmd *command, int status)
-{
-   struct completion *waiting = command->comp_data;
-   if (unlikely(status == PORT_IRQ_TF_ERR))
-   dev_warn(&port->dd->pdev->dev,
-   "Internal command %d completed with TFE\n", tag);
-
-   command->comp_func = NULL;
-   command->comp_data = NULL;
-   complete(waiting);
-}
-
 static int mtip_read_log_page(struct mtip_port *port, u8 page, u16 *buffer,
dma_addr_t buffer_dma, unsigned int sectors);
 static int mtip_get_smart_attr(struct mtip_port *port, unsigned int id,
struct smart_attr *attrib);
+
+static void mtip_complete_command(struct mtip_cmd *cmd, int status)
+{
+   struct request *req = blk_mq_rq_from_pdu(cmd);
+
+   cmd->status = status;
+   blk_mq_complete_request(req);
+}
+
 /*
  * Handle an error.
  *
@@ -641,11 +568,7 @@ static void mtip_handle_tfe(struct driver_data *dd)
if (test_bit(MTIP_PF_IC_ACTIVE_BIT, &port->flags)) {
cmd = mtip_cmd_from_tag(dd, MTIP_TAG_INTERNAL);
dbg_printk(MTIP_DRV_NAME " TFE for the internal command\n");
-
-   if (cmd->comp_data && cmd->comp_func) {
-   cmd->comp_func(port, MTIP_TAG_INTERNAL,
-   cmd, PORT_IRQ_TF_ERR);
-   }
+   mtip_complete_command(cmd, -EIO);
return;
}
 
@@ -672,19 +595,9 @@ static void mtip_handle_tfe(struct driver_data *dd)
 

[PATCH 1/2] mtip32xx: cleanup internal tag assumptions

2017-05-02 Thread Jens Axboe
We don't decode the internal tag to the proper group or tag
index. This works fine because we have hard-wired it as 0 for now,
but could break if we get rid of that.

Signed-off-by: Jens Axboe 
---
 drivers/block/mtip32xx/mtip32xx.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/block/mtip32xx/mtip32xx.c 
b/drivers/block/mtip32xx/mtip32xx.c
index 96fe6500e941..3204623f746a 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -847,16 +847,15 @@ static inline void mtip_process_legacy(struct driver_data 
*dd, u32 port_stat)
struct mtip_port *port = dd->port;
struct mtip_cmd *cmd = mtip_cmd_from_tag(dd, MTIP_TAG_INTERNAL);
 
-   if (test_bit(MTIP_PF_IC_ACTIVE_BIT, &port->flags) &&
-   (cmd != NULL) && !(readl(port->cmd_issue[MTIP_TAG_INTERNAL])
-   & (1 << MTIP_TAG_INTERNAL))) {
-   if (cmd->comp_func) {
-   cmd->comp_func(port, MTIP_TAG_INTERNAL, cmd, 0);
-   return;
+   if (test_bit(MTIP_PF_IC_ACTIVE_BIT, &port->flags) && cmd) {
+   int group = MTIP_TAG_INDEX(MTIP_TAG_INTERNAL);
+   int status = readl(port->cmd_issue[group]);
+
+   if (!(status & (1 << MTIP_TAG_BIT(MTIP_TAG_INTERNAL)))) {
+   if (cmd->comp_func)
+   cmd->comp_func(port, MTIP_TAG_INTERNAL, cmd, 0);
}
}
-
-   return;
 }
 
 /*
@@ -1213,8 +1212,8 @@ static int mtip_exec_internal_command(struct mtip_port 
*port,
goto exec_ic_exit;
}
 
-   if (readl(port->cmd_issue[MTIP_TAG_INTERNAL])
-   & (1 << MTIP_TAG_INTERNAL)) {
+   if (readl(port->cmd_issue[MTIP_TAG_INDEX(MTIP_TAG_INTERNAL)])
+   & (1 << MTIP_TAG_BIT(MTIP_TAG_INTERNAL))) {
rv = -ENXIO;
if (!test_bit(MTIP_DDF_REMOVE_PENDING_BIT, &dd->dd_flag)) {
mtip_device_reset(dd);
-- 
2.7.4



Re: [PATCH 4/6] mtip32xx: convert internal command issue to block IO path

2017-05-02 Thread Bart Van Assche
On Tue, 2017-05-02 at 09:25 -0600, Jens Axboe wrote:
> On 05/02/2017 09:16 AM, Christoph Hellwig wrote:
> > Any reason for the move from ->end_io_data to ->special?  I thought
> > that ->special was something we'd get rid of sooner or later now
> > that we can have additional per-cmd data even for !mq.
> 
> With the switch to blk_execute_rq(), we can't be using end_io_data
> and end_io, as we use that internally for the wakeup. So I have to
> stuff it somewhere.
> 
> The obvious option would be to move it to mtip_cmd, but we can't
> safely access that prior to having a driver tag assigned, which doesn't
> happen until we end up in our ->queue_rq(). So we need to stuff it
> somewhere.

Hello Jens,

Do you think it would be a good idea to allow blk_get_request() callers
to specify that a driver tag has to be allocated even if a scheduler has
been configured? That would make it possible to store completion data in
mtip_cmd for the mtip driver.

Thanks,

Bart.

Re: Treat REQ_FUA and REQ_PREFLUSH as synchronous

2017-05-02 Thread Jan Kara
On Tue 02-05-17 08:49:14, Jens Axboe wrote:
> On 05/02/2017 08:45 AM, Christoph Hellwig wrote:
> > On Tue, May 02, 2017 at 12:21:23PM +0200, Jan Kara wrote:
> >> it makes sense to treat REQ_FUA and REQ_PREFLUSH ops as synchronous in
> >> op_is_sync() since callers cannot rely on this anyway... Thoughts?
> > 
> > I'm fine with treating them as sync.
> 
> Yes me too. It makes sense to do so implicitly, and it avoids having
> to go and add REQ_SYNC in a bunch of places. That will also be more
> error proof in the future.
> 
> Jan, will you send a patch for that?

Hum, so I've just sent out fixes to individual filesystems to set REQ_SYNC
explicitly :-|. So would you prefer to do something like:

if (op_is_flush(bio->bi_opf) &&
!test_bit(QUEUE_FLAG_WC, &q->queue_flags)) {
bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
+   bio->bi_opf |= REQ_SYNC;
if (!nr_sectors) {
err = 0;
goto end_io;
}
}

in generic_make_request_checks()? Or just set it unconditionally in that
function if we see REQ_FUA | REQ_PREFLUSH set? Because we cannot just add
REQ_SYNC into REQ_FUA and REQ_PREFLUSH definitions as it used to be with
WRITE_FUA & co. definitions...

In the end I think that modifying filesystems (and drivers/md) to set REQ_SYNC
was the cleanest way to go as much as it was annoying...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 4/6] mtip32xx: convert internal command issue to block IO path

2017-05-02 Thread Christoph Hellwig
This looks reasonable to me, although of course I don't have a way
to test it.

Any reason for the move from ->end_io_data to ->special?  I thought
that ->special was something we'd get rid of sooner or later now
that we can have additional per-cmd data even for !mq.


Re: [PATCH 4/6] mtip32xx: convert internal command issue to block IO path

2017-05-02 Thread Jens Axboe
On 05/02/2017 07:53 AM, Jens Axboe wrote:
> On 05/02/2017 01:28 AM, Christoph Hellwig wrote:
>> Looks fine for now:
>>
>> Reviewed-by: Christoph Hellwig 
>>
>> But rather sooner than later we need to make this path at least go
>> through the normal end_request processing.  Without that we're just
>> bound to run into problems like we had with the tag changes again
>> when the driver is using the block layer APIs incompletely.
> 
> I completely agree, I'll see if I can find some time to do that
> soon.

Seems like a fine thing to do over morning coffee. How about something
like the below? It's on top of the previous series. Basically this kills
the private completion handler and data, which were basically redundant
as they did nothing but spew a warning in case of an error.

It'd be nice to unify the INTERNAL and regular commands, but we don't
have to do that to use the block layer APIs. If someone else wants to do
that, be my guest.

It compiles and loads fine on my setup, and some basic testing went fine.
It would be nice if someone else could look it over as well, to
verify that I didn't drop any of the 0-vs-error setting. I think it's
fine, we just need to ensure that cmd->status is always set to the
appropriate error, then we propagate that through the mtip32xx softirq
handler to the block layer.

 mtip32xx.c |  194 +
 mtip32xx.h |   10 ---
 2 files changed, 29 insertions(+), 175 deletions(-)

diff --git a/drivers/block/mtip32xx/mtip32xx.c 
b/drivers/block/mtip32xx/mtip32xx.c
index 96fe6500e941..60b62a72bc22 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -205,66 +205,12 @@ static struct mtip_cmd *mtip_get_int_command(struct 
driver_data *dd)
return blk_mq_rq_to_pdu(rq);
 }
 
-static void mtip_put_int_command(struct driver_data *dd, struct mtip_cmd *cmd)
-{
-   blk_put_request(blk_mq_rq_from_pdu(cmd));
-}
-
-/*
- * Once we add support for one hctx per mtip group, this will change a bit
- */
-static struct request *mtip_rq_from_tag(struct driver_data *dd,
-   unsigned int tag)
-{
-   struct blk_mq_hw_ctx *hctx = dd->queue->queue_hw_ctx[0];
-
-   return blk_mq_tag_to_rq(hctx->tags, tag);
-}
-
 static struct mtip_cmd *mtip_cmd_from_tag(struct driver_data *dd,
  unsigned int tag)
 {
-   struct request *rq = mtip_rq_from_tag(dd, tag);
-
-   return blk_mq_rq_to_pdu(rq);
-}
-
-/*
- * IO completion function.
- *
- * This completion function is called by the driver ISR when a
- * command that was issued by the kernel completes. It first calls the
- * asynchronous completion function which normally calls back into the block
- * layer passing the asynchronous callback data, then unmaps the
- * scatter list associated with the completed command, and finally
- * clears the allocated bit associated with the completed command.
- *
- * @port   Pointer to the port data structure.
- * @tagTag of the command.
- * @data   Pointer to driver_data.
- * @status Completion status.
- *
- * return value
- * None
- */
-static void mtip_async_complete(struct mtip_port *port,
-   int tag, struct mtip_cmd *cmd, int status)
-{
-   struct driver_data *dd = port->dd;
-   struct request *rq;
-
-   if (unlikely(!dd) || unlikely(!port))
-   return;
-
-   if (unlikely(status == PORT_IRQ_TF_ERR)) {
-   dev_warn(&port->dd->pdev->dev,
-   "Command tag %d failed due to TFE\n", tag);
-   }
-
-   rq = mtip_rq_from_tag(dd, tag);
+   struct blk_mq_hw_ctx *hctx = dd->queue->queue_hw_ctx[0];
 
-   cmd->status = status;
-   blk_mq_complete_request(rq);
+   return blk_mq_rq_to_pdu(blk_mq_tag_to_rq(hctx->tags, tag));
 }
 
 /*
@@ -581,38 +527,18 @@ static void print_tags(struct driver_data *dd,
"%d command(s) %s: tagmap [%s]", cnt, msg, tagmap);
 }
 
-/*
- * Internal command completion callback function.
- *
- * This function is normally called by the driver ISR when an internal
- * command completed. This function signals the command completion by
- * calling complete().
- *
- * @port   Pointer to the port data structure.
- * @tagTag of the command that has completed.
- * @data   Pointer to a completion structure.
- * @status Completion status.
- *
- * return value
- * None
- */
-static void mtip_completion(struct mtip_port *port,
-   int tag, struct mtip_cmd *command, int status)
-{
-   struct completion *waiting = command->comp_data;
-   if (unlikely(status == PORT_IRQ_TF_ERR))
-   dev_warn(&port->dd->pdev->dev,
-   "Internal command %d completed with TFE\n", tag);
-
-   command->comp_func = NULL;
-   command->comp_data = NULL;
-   complete(waiting);
-}
-
 static int mtip_read_log_page(struct mtip_port *port, u8 page, 

Re: Treat REQ_FUA and REQ_PREFLUSH as synchronous

2017-05-02 Thread Christoph Hellwig
On Tue, May 02, 2017 at 12:21:23PM +0200, Jan Kara wrote:
> it makes sense to treat REQ_FUA and REQ_PREFLUSH ops as synchronous in
> op_is_sync() since callers cannot rely on this anyway... Thoughts?

I'm fine with treating them as sync.


Re: BUG: KASAN: use-after-free in scsi_exit_rq

2017-05-02 Thread Jan Kara
On Fri 28-04-17 17:46:47, Tejun Heo wrote:
> On Fri, Apr 21, 2017 at 09:49:17PM +, Bart Van Assche wrote:
> > On Thu, 2017-04-20 at 15:18 -0600, Scott Bauer wrote:
> > > [  642.638860] BUG: KASAN: use-after-free in scsi_exit_rq+0xf3/0x120 at 
> > > addr 8802b7fedf00
> > > [  642.639362] Read of size 1 by task rcuos/5/53
> > > [  642.639713] CPU: 7 PID: 53 Comm: rcuos/6 Not tainted 4.11.0-rc5+ #13
> > > [  642.640170] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> > > BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 
> > > 04/01/2014
> > > [  642.640923] Call Trace:
> > > [  642.641080]  dump_stack+0x63/0x8f
> > > [  642.641289]  kasan_object_err+0x21/0x70
> > > [  642.641531]  kasan_report.part.1+0x231/0x500
> > > [  642.641823]  ? scsi_exit_rq+0xf3/0x120
> > > [  642.642054]  ? _raw_spin_unlock_irqrestore+0xe/0x10
> > > [  642.642353]  ? free_percpu+0x1b7/0x340
> > > [  642.642586]  ? put_task_stack+0x117/0x2b0
> > > [  642.642837]  __asan_report_load1_noabort+0x2e/0x30
> > > [  642.643138]  scsi_exit_rq+0xf3/0x120
> > > [  642.643366]  free_request_size+0x44/0x60
> > > [  642.643614]  mempool_destroy.part.6+0x9b/0x150
> > > [  642.643899]  ? kasan_slab_free+0x87/0xb0
> > > [  642.644152]  mempool_destroy+0x13/0x20
> > > [  642.644394]  blk_exit_rl+0x36/0x40
> > > [  642.644614]  blkg_free+0x146/0x200
> > > [  642.644836]  __blkg_release_rcu+0x121/0x220
> > > [  642.645112]  rcu_nocb_kthread+0x61f/0xca0
> > > [  642.645376]  ? get_state_synchronize_rcu+0x20/0x20
> > > [  642.645690]  ? pci_mmcfg_check_reserved+0x110/0x110
> > > [  642.646011]  kthread+0x298/0x390
> > > [  642.646224]  ? get_state_synchronize_rcu+0x20/0x20
> > > [  642.646535]  ? kthread_park+0x160/0x160
> > > [  642.646787]  ret_from_fork+0x2c/0x40
> > 
> > I'm not familiar with cgroups but seeing this makes me wonder whether it 
> > would
> > be possible to move the blk_exit_rl() calls from blk_release_queue() into
> > blk_cleanup_queue()? The SCSI core frees a SCSI host after 
> > blk_cleanup_queue()
> > has finished for all associated SCSI devices. This is why I think that 
> > calling
> > blk_exit_rl() earlier would be sufficient to avoid that scsi_exit_rq()
> > dereferences a SCSI host pointer after it has been freed.
> 
> Hmm... I see.  Didn't know request put path involved derefing those
> structs.  The blk_exit_rl() call just replaced open coded
> mempool_destroy call, so the destruction order was always like this.
> We shouldn't have any request in flight by cleanup, so moving it there
> should be fine.

So I'm also not aware of any particular breakage this would cause. However,
logically the freeing of request mempools really belongs in
blk_release_queue(), so it seems a bit dumb to move blk_exit_rl() just
because SCSI records which slab cache it allocated the sense buffer from
in a structure (shost) that it has already freed out from under us by the
time blk_release_queue() is called. :-|

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 4/6] mtip32xx: convert internal command issue to block IO path

2017-05-02 Thread Jens Axboe
On 05/02/2017 01:28 AM, Christoph Hellwig wrote:
> Looks fine for now:
> 
> Reviewed-by: Christoph Hellwig 
> 
> But rather sooner than later we need to make this path at least go
> through the normal end_request processing.  Without that we're just
> bound to run into problems like we had with the tag changes again
> when the driver is using the block layer APIs incompletely.

I completely agree, I'll see if I can find some time to do that
soon.

-- 
Jens Axboe



Re: [PATCH v4 0/6] Fixup mtip32xx for scheduling

2017-05-02 Thread Ming Lei
On Fri, Apr 28, 2017 at 10:54:27AM -0600, Jens Axboe wrote:
> OK, I think we're getting there now. I dropped the RQF_RESERVED
> and the tag helper, since mtip32xx can just call
> blk_rq_is_passthrough() and know it's an internal request.
> 
> We can do another cleanup series of mtip32xx making the
> internal requests use the normal blk end_io infrastructure,
> and use the provider helpers for mapping data. I think
> that should be done separately as a cleanup, keeping this series
> as a straight forward conversion.
> 
> I've tested this on the following card:
> 
> [   56.394162] mtip32xx :81:00.0: Model: P320h-MTFDGAR350SAH
> 
> and it works fine for me.

Tested this patchset against the latest Linus tree; it works fine
on my system with the mq-deadline scheduler enabled.

Tested-by: Ming Lei 

Thanks,
Ming


Re: Playing with BFQ

2017-05-02 Thread Markus Trippelsdorf
On 2017.05.02 at 14:07 +0200, Sedat Dilek wrote:
> On Tue, May 2, 2017 at 10:00 AM, Markus Trippelsdorf
>  wrote:
> > On 2017.05.02 at 09:54 +0200, Sedat Dilek wrote:
> >> Hi,
> >>
> >> I want to play with BFQ.
> >>
> >> My base is block-next as of 28-Apr-2017.
> [...]
> >> Not sure if the attached patches make sense (right now).
> >
> > No, it doesn't make sense at all.
> 
> Hmm, I looked at 4.11.0-v8r11 and 0001 has exactly what my 2 patches do :-).

BFQ started as a conventional scheduler. But because blk-mq is the way of
the future, it was ported to it before it was accepted into mainline.

-- 
Markus


Re: Playing with BFQ

2017-05-02 Thread Sedat Dilek
On Tue, May 2, 2017 at 10:00 AM, Markus Trippelsdorf
 wrote:
> On 2017.05.02 at 09:54 +0200, Sedat Dilek wrote:
>> Hi,
>>
>> I want to play with BFQ.
>>
>> My base is block-next as of 28-Apr-2017.
[...]
>> Not sure if the attached patches make sense (right now).
>
> No, it doesn't make sense at all.

Hmm, I looked at 4.11.0-v8r11 and 0001 has exactly what my 2 patches do :-).

- Sedat -

[1] 
http://algo.ing.unimo.it/people/paolo/disk_sched/patches/4.11.0-v8r11/0001-block-cgroups-kconfig-build-bits-for-BFQ-v7r11-4.11..patch
[2] 
http://algo.ing.unimo.it/people/paolo/disk_sched/patches/4.11.0-v8r11/0002-block-introduce-the-BFQ-v7r11-I-O-sched-for-4.11.0.patch
[3] 
http://algo.ing.unimo.it/people/paolo/disk_sched/patches/4.11.0-v8r11/0003-block-bfq-add-Early-Queue-Merge-EQM-to-BFQ-v7r11-for.patch
[4] 
http://algo.ing.unimo.it/people/paolo/disk_sched/patches/4.11.0-v8r11/0004-blk-bfq-turn-BFQ-v7r11-for-4.11.0-into-BFQ-v8r11-for.patch


Re: [PATCH 05/13] block: Improvements to bounce-buffer handling

2017-05-02 Thread Ming Lei
On Tue, May 02, 2017 at 01:42:24PM +1000, NeilBrown wrote:
> Since commit 23688bf4f830 ("block: ensure to split after potentially
> bouncing a bio") blk_queue_bounce() is called *before*
> blk_queue_split().
> This means that:
>  1/ the comments in blk_queue_split() about bounce buffers are
> irrelevant, and
>  2/ a very large bio (more than BIO_MAX_PAGES) will no longer be
> split before it arrives at blk_queue_bounce(), leading to the
> possibility that bio_clone_bioset() will fail and a NULL
> will be dereferenced.
> 
> Separately, blk_queue_bounce() shouldn't use fs_bio_set as the bio
> being copied could be from the same set, and this could lead to a
> deadlock.
> 
> So:
>  - allocate 2 private biosets for blk_queue_bounce, one for
>splitting enormous bios and one for cloning bios.
>  - add code to split a bio that exceeds BIO_MAX_PAGES.
>  - Fix up the comments in blk_queue_split()
> 
> Credit-to: Ming Lei  (suggested using single 
> bio_for_each_segment loop)
> Signed-off-by: NeilBrown 
> ---
>  block/blk-merge.c |   14 --
>  block/bounce.c|   32 ++--
>  2 files changed, 30 insertions(+), 16 deletions(-)
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index d59074556703..51c84540d3bb 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -117,17 +117,11 @@ static struct bio *blk_bio_segment_split(struct 
> request_queue *q,
>* each holds at most BIO_MAX_PAGES bvecs because
>* bio_clone() can fail to allocate big bvecs.
>*
> -  * It should have been better to apply the limit per
> -  * request queue in which bio_clone() is involved,
> -  * instead of globally. The biggest blocker is the
> -  * bio_clone() in bio bounce.
> +  * Those drivers which will need to use bio_clone()
> +  * should tell us in some way.  For now, impose the
> +  * BIO_MAX_PAGES limit on all queues.
>*
> -  * If bio is splitted by this reason, we should have
> -  * allowed to continue bios merging, but don't do
> -  * that now for making the change simple.
> -  *
> -  * TODO: deal with bio bounce's bio_clone() gracefully
> -  * and convert the global limit into per-queue limit.
> +  * TODO: handle users of bio_clone() differently.
>*/
>   if (bvecs++ >= BIO_MAX_PAGES)
>   goto split;
> diff --git a/block/bounce.c b/block/bounce.c
> index 1cb5dd3a5da1..087ecc2dc66c 100644
> --- a/block/bounce.c
> +++ b/block/bounce.c
> @@ -26,6 +26,7 @@
>  #define POOL_SIZE64
>  #define ISA_POOL_SIZE16
>  
> +struct bio_set *bounce_bio_set, *bounce_bio_split;
>  static mempool_t *page_pool, *isa_page_pool;
>  
>  #if defined(CONFIG_HIGHMEM) || defined(CONFIG_NEED_BOUNCE_POOL)
> @@ -40,6 +41,14 @@ static __init int init_emergency_pool(void)
>   BUG_ON(!page_pool);
>   pr_info("pool size: %d pages\n", POOL_SIZE);
>  
> + bounce_bio_set = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
> + BUG_ON(!bounce_bio_set);
> + if (bioset_integrity_create(bounce_bio_set, BIO_POOL_SIZE))
> + BUG_ON(1);
> +
> + bounce_bio_split = bioset_create(BIO_POOL_SIZE, 0, 0);
> + BUG_ON(!bounce_bio_split);
> +
>   return 0;
>  }
>  
> @@ -186,15 +195,26 @@ static void __blk_queue_bounce(struct request_queue *q, 
> struct bio **bio_orig,
>   int rw = bio_data_dir(*bio_orig);
>   struct bio_vec *to, from;
>   struct bvec_iter iter;
> - unsigned i;
> + unsigned i = 0;
> + bool bounce = false;
> + int sectors = 0;
>  
> - bio_for_each_segment(from, *bio_orig, iter)
> + bio_for_each_segment(from, *bio_orig, iter) {
> + if (i++ < BIO_MAX_PAGES)
> + sectors += from.bv_len >> 9;
>   if (page_to_pfn(from.bv_page) > queue_bounce_pfn(q))
> - goto bounce;
> + bounce = true;
> + }
> + if (!bounce)
> + return;
>  
> - return;
> -bounce:
> - bio = bio_clone_bioset(*bio_orig, GFP_NOIO, fs_bio_set);
> + if (sectors < bio_sectors(*bio_orig)) {
> + bio = bio_split(*bio_orig, sectors, GFP_NOIO, bounce_bio_split);
> + bio_chain(bio, *bio_orig);
> + generic_make_request(*bio_orig);
> + *bio_orig = bio;
> + }
> + bio = bio_clone_bioset(*bio_orig, GFP_NOIO, bounce_bio_set);
>  
>   bio_for_each_segment_all(to, bio, i) {
>   struct page *page = to->bv_page;

Reviewed-by: Ming Lei 

Thanks,
Ming


Re: [PATCH 04/13] blk: use non-rescuing bioset for q->bio_split.

2017-05-02 Thread Ming Lei
On Tue, May 02, 2017 at 01:42:24PM +1000, NeilBrown wrote:
> A rescuing bioset is only useful if there might be bios from
> that same bioset on the bio_list_on_stack queue at a time
> when bio_alloc_bioset() is called.  This never applies to
> q->bio_split.
> 
> Allocations from q->bio_split are only ever made from
> blk_queue_split() which is only ever called early in each of
> various make_request_fn()s.  The original bio (call this A)
> is then passed to generic_make_request() and is placed on
> the bio_list_on_stack queue, and the bio that was allocated
> from q->bio_split (B) is processed.
> 
> The processing of this may cause other bios to be passed to
> generic_make_request() or may even cause the bio B itself to
> be passed, possibly after some prefix has been split off
> (using some other bioset).
> 
> generic_make_request() now guarantees that all of these bios
> (B and dependants) will be fully processed before the tail
> of the original bio A gets handled.  None of these early bios
> can possibly trigger an allocation from the original
> q->bio_split as they are either too small to require
> splitting or (more likely) are destined for a different queue.
> 
> The next time that the original q->bio_split might be used
> by this thread is when A is processed again, as it might
> still be too big to handle directly.  By this time there
> cannot be any other bios allocated from q->bio_split in the
> generic_make_request() queue.  So no rescuing will ever be
> needed.
> 
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: NeilBrown 
> ---
>  block/blk-core.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index ae51f159a2ca..3797753f4085 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -730,7 +730,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t 
> gfp_mask, int node_id)
>   if (q->id < 0)
>   goto fail_q;
>  
> - q->bio_split = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS | 
> BIOSET_NEED_RESCUER);
> + q->bio_split = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
>   if (!q->bio_split)
>   goto fail_id;
>  
> 
> 

Reviewed-by: Ming Lei 

Thanks,
Ming


Re: [PATCH 09/13] lightnvm/pblk-read: use bio_clone_fast()

2017-05-02 Thread Javier González
> On 2 May 2017, at 05.42, NeilBrown  wrote:
> 
> pblk_submit_read() uses bio_clone_bioset() but doesn't change the
> io_vec, so bio_clone_fast() is a better choice.
> 
> It also uses fs_bio_set which is intended for filesystems.  Using it
> in a device driver can deadlock.
> So allocate a new bioset, and and use bio_clone_fast().
> 
> Signed-off-by: NeilBrown 
> ---
> drivers/lightnvm/pblk-init.c |   12 +++-
> drivers/lightnvm/pblk-read.c |2 +-
> drivers/lightnvm/pblk.h  |1 +
> 3 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
> index b3fec8ec55b8..aaefbccce30e 100644
> --- a/drivers/lightnvm/pblk-init.c
> +++ b/drivers/lightnvm/pblk-init.c
> @@ -23,6 +23,7 @@
> static struct kmem_cache *pblk_blk_ws_cache, *pblk_rec_cache, 
> *pblk_r_rq_cache,
>   *pblk_w_rq_cache, *pblk_line_meta_cache;
> static DECLARE_RWSEM(pblk_lock);
> +struct bio_set *pblk_bio_set;
> 
> static int pblk_rw_io(struct request_queue *q, struct pblk *pblk,
> struct bio *bio)
> @@ -946,11 +947,20 @@ static struct nvm_tgt_type tt_pblk = {
> 
> static int __init pblk_module_init(void)
> {
> - return nvm_register_tgt_type(&tt_pblk);
> + int ret;
> +
> + pblk_bio_set = bioset_create(BIO_POOL_SIZE, 0, 0);
> + if (!pblk_bio_set)
> + return -ENOMEM;
> + ret = nvm_register_tgt_type(&tt_pblk);
> + if (ret)
> + bioset_free(pblk_bio_set);
> + return ret;
> }
> 
> static void pblk_module_exit(void)
> {
> + bioset_free(pblk_bio_set);
>   nvm_unregister_tgt_type(_pblk);
> }
> 
> diff --git a/drivers/lightnvm/pblk-read.c b/drivers/lightnvm/pblk-read.c
> index 4a12f14d78c6..8ee0c50a7a71 100644
> --- a/drivers/lightnvm/pblk-read.c
> +++ b/drivers/lightnvm/pblk-read.c
> @@ -342,7 +342,7 @@ int pblk_submit_read(struct pblk *pblk, struct bio *bio)
>   struct pblk_r_ctx *r_ctx = nvm_rq_to_pdu(rqd);
> 
>   /* Clone read bio to deal with read errors internally */
> - int_bio = bio_clone_bioset(bio, GFP_KERNEL, fs_bio_set);
> + int_bio = bio_clone_fast(bio, GFP_KERNEL, pblk_bio_set);
>   if (!int_bio) {
>   pr_err("pblk: could not clone read bio\n");
>   return NVM_IO_ERR;
> diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
> index 99f3186b5288..95b665f23925 100644
> --- a/drivers/lightnvm/pblk.h
> +++ b/drivers/lightnvm/pblk.h
> @@ -702,6 +702,7 @@ void pblk_write_should_kick(struct pblk *pblk);
> /*
>  * pblk read path
>  */
> +extern struct bio_set *pblk_bio_set;
> int pblk_submit_read(struct pblk *pblk, struct bio *bio);
> int pblk_submit_read_gc(struct pblk *pblk, u64 *lba_list, void *data,
>   unsigned int nr_secs, unsigned int *secs_to_gc,

Hi Neil,

Looks good. Thanks for fixing this. I did not know that bio_clone_bioset
was not supposed to be used in drivers.

Reviewed-by: Javier González 
Tested-by: Javier González 





Re: [PATCH 03/13] blk: make the bioset rescue_workqueue optional.

2017-05-02 Thread Ming Lei
On Tue, May 02, 2017 at 01:42:24PM +1000, NeilBrown wrote:
> This patch converts bioset_create() and
> bioset_create_nobvec() to not create a workqueue so
> allocations will never trigger punt_bios_to_rescuer().  It
> also introduces bioset_create_rescued() and
> bioset_create_nobvec_rescued() which preserve the old
> behaviour.
> 
> All callers of bioset_create() and bioset_create_nobvec(),
> that are inside block device drivers, are converted to the
> _rescued() version.

The above comment about *_nobvec() needs to be updated.

> 
> biosets used by filesystems or other top-level users do not
> need rescuing as the bio can never be queued behind other
> bios.  This includes fs_bio_set, blkdev_dio_pool,
> btrfs_bioset, xfs_ioend_bioset, drbd_md_io_bio_set,
> and one allocated by target_core_iblock.c.
> 
> biosets used by md/raid do not need rescuing as
> their usage was recently audited and revised to never
> risk deadlock.
> 
> It is hoped that most, if not all, of the remaining biosets
> can end up being the non-rescued version.
> 
> Signed-off-by: NeilBrown 
> ---
>  block/bio.c   |   12 ++--
>  block/blk-core.c  |2 +-
>  drivers/md/bcache/super.c |6 --
>  drivers/md/dm-crypt.c |2 +-
>  drivers/md/dm-io.c|2 +-
>  drivers/md/dm.c   |5 +++--
>  include/linux/bio.h   |1 +
>  7 files changed, 21 insertions(+), 9 deletions(-)

DRBD can stack on block devices too, so I guess it might need
BIOSET_NEED_RESCUER?

> 
> diff --git a/block/bio.c b/block/bio.c
> index 3f1286ed301e..257606e742e0 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -363,6 +363,8 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>   struct bio_list punt, nopunt;
>   struct bio *bio;
>  
> + if (!WARN_ON_ONCE(!bs->rescue_workqueue))
> + return;

I guess the above check should have been the following?

if (WARN_ON_ONCE(!bs->rescue_workqueue))
return;

>   /*
>* In order to guarantee forward progress we must punt only bios that
>* were allocated from this bio_set; otherwise, if there was a bio on
> @@ -474,7 +476,8 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned int 
> nr_iovecs,
>  
>   if (current->bio_list &&
>   (!bio_list_empty(&current->bio_list[0]) ||
> -  !bio_list_empty(&current->bio_list[1])))
> +  !bio_list_empty(&current->bio_list[1])) &&
> + bs->rescue_workqueue)
>   gfp_mask &= ~__GFP_DIRECT_RECLAIM;
>  
>   p = mempool_alloc(bs->bio_pool, gfp_mask);
> @@ -1925,7 +1928,7 @@ EXPORT_SYMBOL(bioset_free);
>   * bioset_create  - Create a bio_set
>   * @pool_size:   Number of bio and bio_vecs to cache in the mempool
>   * @front_pad:   Number of bytes to allocate in front of the returned bio
> - * @flags:   Flags to modify behavior, currently only %BIOSET_NEED_BVECS
> + * @flags:   Flags to modify behavior, currently %BIOSET_NEED_BVECS and 
> %BIOSET_NEED_RESCUER
>   *
>   * Description:
>   *Set up a bio_set to be used with @bio_alloc_bioset. Allows the caller
> @@ -1936,6 +1939,8 @@ EXPORT_SYMBOL(bioset_free);
>   *or things will break badly.
>   *If %BIOSET_NEED_BVECS is set in @flags, a separate pool will be 
> allocated
>   *for allocating iovecs.  This pool is not needed e.g. for 
> bio_clone_fast().
> + *If %BIOSET_NEED_RESCUER is set, a workqueue is created which can be used 
> to
> + *dispatch queued requests when the mempool runs out of space.
>   *
>   */
>  struct bio_set *bioset_create(unsigned int pool_size,
> @@ -1971,6 +1976,9 @@ struct bio_set *bioset_create(unsigned int pool_size,
>   goto bad;
>   }
>  
> + if (!(flags & BIOSET_NEED_RESCUER))
> + return bs;
> +
>   bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
>   if (!bs->rescue_workqueue)
>   goto bad;
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3797753f4085..ae51f159a2ca 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -730,7 +730,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t 
> gfp_mask, int node_id)
>   if (q->id < 0)
>   goto fail_q;
>  
> - q->bio_split = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
> + q->bio_split = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS | 
> BIOSET_NEED_RESCUER);
>   if (!q->bio_split)
>   goto fail_id;
>  
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 8b9dae98cc7e..bb50991a82d3 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -786,7 +786,8 @@ static int bcache_device_init(struct bcache_device *d, 
> unsigned block_size,
>  
>   minor *= BCACHE_MINORS;
>  
> - if (!(d->bio_split = bioset_create(4, offsetof(struct bbio, bio), 
> BIOSET_NEED_BVECS)) ||
> + if (!(d->bio_split = bioset_create(4, offsetof(struct 

Re: [PATCH 01/13] blk: remove bio_set arg from blk_queue_split()

2017-05-02 Thread javigon

> On 2 May 2017, at 05.42, NeilBrown  wrote:
> 
> blk_queue_split() is always called with the last arg being q->bio_split,
> where 'q' is the first arg.
> 
> Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
> q->bio_split.
> 
> This is inconsistent and unnecessary.  Remove the last arg and always use
> q->bio_split inside blk_queue_split()
> 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Ming Lei 
> Credit-to: Javier González  (Noticed that lightnvm was 
> missed)
> Signed-off-by: NeilBrown 
> ---
> block/blk-core.c  |2 +-
> block/blk-merge.c |9 -
> block/blk-mq.c|2 +-
> drivers/block/drbd/drbd_req.c |2 +-
> drivers/block/pktcdvd.c   |2 +-
> drivers/block/ps3vram.c   |2 +-
> drivers/block/rsxx/dev.c  |2 +-
> drivers/block/umem.c  |2 +-
> drivers/block/zram/zram_drv.c |2 +-
> drivers/lightnvm/pblk-init.c  |4 ++--
> drivers/lightnvm/rrpc.c   |2 +-
> drivers/md/md.c   |2 +-
> drivers/s390/block/dcssblk.c  |2 +-
> drivers/s390/block/xpram.c|2 +-
> include/linux/blkdev.h|3 +--
> 15 files changed, 19 insertions(+), 21 deletions(-)
> 

> [..]

> diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
> index ae8cd6d5af8b..b3fec8ec55b8 100644
> --- a/drivers/lightnvm/pblk-init.c
> +++ b/drivers/lightnvm/pblk-init.c
> @@ -33,7 +33,7 @@ static int pblk_rw_io(struct request_queue *q, struct pblk 
> *pblk,
>* constraint. Writes can be of arbitrary size.
>*/
>   if (bio_data_dir(bio) == READ) {
> - blk_queue_split(q, , q->bio_split);
> + blk_queue_split(q, );
>   ret = pblk_submit_read(pblk, bio);
>   if (ret == NVM_IO_DONE && bio_flagged(bio, BIO_CLONED))
>   bio_put(bio);
> @@ -46,7 +46,7 @@ static int pblk_rw_io(struct request_queue *q, struct pblk 
> *pblk,
>* available for user I/O.
>*/
>   if (unlikely(pblk_get_secs(bio) >= pblk_rl_sysfs_rate_show(>rl)))
> - blk_queue_split(q, , q->bio_split);
> + blk_queue_split(q, );
> 
>   return pblk_write_to_cache(pblk, bio, PBLK_IOTYPE_USER);
> }
> diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
> index cf0e28a0ff61..8e241056b141 100644
> --- a/drivers/lightnvm/rrpc.c
> +++ b/drivers/lightnvm/rrpc.c
> @@ -994,7 +994,7 @@ static blk_qc_t rrpc_make_rq(struct request_queue *q, 
> struct bio *bio)
>   struct nvm_rq *rqd;
>   int err;
> 
> - blk_queue_split(q, , q->bio_split);
> + blk_queue_split(q, );
> 
>   if (bio_op(bio) == REQ_OP_DISCARD) {
>   rrpc_discard(rrpc, bio);
> 

Hi Neil,

Thanks for adding the LightNVM changes. It works fine on our test setup.

Reviewed-by: Javier González 
Tested-by: Javier González 





Re: [PATCH 13/13] block: don't check for BIO_MAX_PAGES in blk_bio_segment_split()

2017-05-02 Thread Ming Lei
On Tue, May 02, 2017 at 01:42:26PM +1000, NeilBrown wrote:
> blk_bio_segment_split() makes sure bios have no more than
> BIO_MAX_PAGES entries in the bi_io_vec.
> This was done because bio_clone_bioset() (when given a
> mempool bioset) could not handle larger io_vecs.
> 
> No driver uses bio_clone_bioset() any more, they all
> use bio_clone_fast() if anything, and bio_clone_fast()
> doesn't clone the bi_io_vec.

Maybe in the future some drivers will still try to use
bio_clone_bioset() again, so I suggest adding a comment
on bio_clone_bioset() to make this usage explicit. It would
also be better to trigger a warning if a big src bio is passed to
bio_clone_bioset().

Thanks,
Ming


Treat REQ_FUA and REQ_PREFLUSH as synchronous

2017-05-02 Thread Jan Kara
Hello,

Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as synchronous"
made requests with REQ_FUA and REQ_PREFLUSH to be treated as synchronous
and dropped REQ_SYNC from definitions of WRITE_FUA and friends. This
however introduced a bunch of bugs to filesystems (I know about ext4,
btrfs, f2fs, gfs2, reiserfs regressing because of this) because they
implicitly expected that REQ_FUA or REQ_PREFLUSH implies a synchronous request.
At first sight they do; however, generic_make_request_checks() will strip the
REQ_FUA and REQ_PREFLUSH flags from a bio if the underlying storage does
not have a volatile write cache, and so writes suddenly become async. I will
go and fix the filesystems to explicitly set REQ_SYNC if they want sync
behavior, as that is a good cleanup anyway, but I wanted to question whether
it makes sense to treat REQ_FUA and REQ_PREFLUSH ops as synchronous in
op_is_sync() since callers cannot rely on this anyway... Thoughts?

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 02/13] blk: replace bioset_create_nobvec() with a flags arg to bioset_create()

2017-05-02 Thread Ming Lei
On Tue, May 02, 2017 at 01:42:24PM +1000, NeilBrown wrote:
> "flags" arguments are often seen as good API design as they allow
> easy extensibility.
> bioset_create_nobvec() is implemented internally as a variation in
> flags passed to __bioset_create().

From a driver's view, this flag has documentation benefit too.

> 
> To support future extension, make the internal structure part of the
> API.
> i.e. add a 'flags' argument to bioset_create() and discard
> bioset_create_nobvec().
> 
> Note that the bio_split allocations in drivers/md/raid* do not need
> the bvec mempool - they should have used bioset_create_nobvec().
> 
> Suggested-by: Christoph Hellwig 
> Signed-off-by: NeilBrown 

Reviewed-by: Ming Lei 

Thanks,
Ming


Re: Playing with BFQ

2017-05-02 Thread Sedat Dilek
On Tue, May 2, 2017 at 10:07 AM, Sedat Dilek  wrote:
> On Tue, May 2, 2017 at 10:00 AM, Markus Trippelsdorf
>  wrote:
>> On 2017.05.02 at 09:54 +0200, Sedat Dilek wrote:
>>> Hi,
>>>
>>> I want to play with BFQ.
>>>
>>> My base is block-next as of 28-Apr-2017.
>>>
>>> First I looked through the Kconfigs.
>>> What is a good setting?
>>> Built as module?
>>>
>>> How can I switch the IO-scheduler at runtime?
>>>
>>> Not sure if the attached patches make sense (right now).
>>
>> No, it doesn't make sense at all.
>> BFQ is a mq scheduler. To use it you should enable
>> SCSI_MQ_DEFAULT. And then you can switch schedulers like:
>>
>>  echo "kyber" > /sys/block/sda/queue/scheduler
>>  echo "bfq" > /sys/block/sdb/queue/scheduler
>>
>
> Great, I got this information before starting a kernel build.
>
> So, I have now...
>
> $ ./scripts/diffconfig /boot/config-4.11.0-1-iniza-amd64 .config
> -BLK_DEV_HD n
>  SCSI_MQ_DEFAULT n -> y
> +BFQ_GROUP_IOSCHED y
> +BLK_DEV_THROTTLING_LOW n
> +IOSCHED_BFQ y
> +MQ_IOSCHED_KYBER y
>
> Thanks, Markus.
>

This worked...

# cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none

# echo "bfq" > /sys/block/sda/queue/scheduler

# cat /sys/block/sda/queue/scheduler
mq-deadline kyber [bfq] none

- sed@ -


Re: [PATCH 3/4] blk-mq: use hw tag for scheduling if hw tag space is big enough

2017-05-02 Thread Ming Lei
On Mon, May 01, 2017 at 03:06:16PM +, Bart Van Assche wrote:
> On Sat, 2017-04-29 at 18:35 +0800, Ming Lei wrote:
> > On Fri, Apr 28, 2017 at 06:09:40PM +, Bart Van Assche wrote:
> > > On Fri, 2017-04-28 at 23:15 +0800, Ming Lei wrote:
> > > > +static inline bool blk_mq_sched_may_use_hw_tag(struct request_queue *q)
> > > > +{
> > > > +   if (q->tag_set->flags & BLK_MQ_F_TAG_SHARED)
> > > > +   return false;
> > > > +
> > > > +   if (blk_mq_get_queue_depth(q) < q->nr_requests)
> > > > +   return false;
> > > > +
> > > > +   return true;
> > > > +}
> > > 
> > > The only user of shared tag sets I know of is scsi-mq. I think it's really
> > > unfortunate that this patch systematically disables 
> > > BLK_MQ_F_SCHED_USE_HW_TAG
> > > for scsi-mq.
> > 
> > In a previous patch I actually allowed the driver to pass this flag, but
> > that feature was dropped in this post, just to keep it simple & clean.
> > If you think we need it for shared tag sets, I can add it in v1.
> > 
> > For shared tag sets, I suggest not enabling it by default, because the
> > scheduler is per request queue now, and generally the more requests
> > available, the better it performs.  When tags are shared among several
> > request queues, one of them may use up the tags for its own scheduling and
> > starve the others. But it should be possible and not difficult to allocate
> > requests fairly for scheduling in this case if we switch to per-hctx
> > scheduling.
> 
> Hello Ming,
> 
> Have you noticed that there is already a mechanism in the block layer to
> avoid starvation if a tag set is shared? The hctx_may_queue() function
> guarantees that each user that shares a tag set gets at least some tags.
> The .active_queues counter keeps track of the number of hardware queues
> that share a tag set.

OK, from hctx_may_queue(), each hw queue can be allocated at most
(queue depth / nr_hw_queues) tags, and we can allow use of the hw
tag for the scheduler too if the following condition is true for
shared tags:

q->nr_requests <= (blk_mq_get_queue_depth(q) / nr_hw_queues)

It should often be true for some SCSI devices (such as virtio-scsi)
in some configurations.


thanks,
Ming


Re: [PATCH 13/13] block: don't check for BIO_MAX_PAGES in blk_bio_segment_split()

2017-05-02 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 09/13] lightnvm/pblk-read: use bio_clone_fast()

2017-05-02 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 10/13] xen-blkfront: remove bio splitting.

2017-05-02 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 03/13] blk: make the bioset rescue_workqueue optional.

2017-05-02 Thread Christoph Hellwig
On Tue, May 02, 2017 at 01:42:24PM +1000, NeilBrown wrote:
> This patch converts bioset_create() and
> bioset_create_nobvec() to not create a workqueue so
> allocations will never trigger punt_bios_to_rescuer().  It
> also introduces bioset_create_rescued() and
> bioset_create_nobvec_rescued() which preserve the old
> behaviour.
> 
> All callers of bioset_create() and bioset_create_nobvec(),
> that are inside block device drivers, are converted to the
> _rescued() version.
> 
> biosets used by filesystems or other top-level users do not
> need rescuing as the bio can never be queued behind other
> bios.  This includes fs_bio_set, blkdev_dio_pool,
> btrfs_bioset, xfs_ioend_bioset, drbd_md_io_bio_set,
> and one allocated by target_core_iblock.c.
> 
> biosets used by md/raid do not need rescuing as
> their usage was recently audited and revised to never
> risk deadlock.
> 
> It is hoped that most, if not all, of the remaining biosets
> can end up being the non-rescued version.
> 
> Signed-off-by: NeilBrown 
> ---
>  block/bio.c   |   12 ++--
>  block/blk-core.c  |2 +-
>  drivers/md/bcache/super.c |6 --
>  drivers/md/dm-crypt.c |2 +-
>  drivers/md/dm-io.c|2 +-
>  drivers/md/dm.c   |5 +++--
>  include/linux/bio.h   |1 +
>  7 files changed, 21 insertions(+), 9 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 3f1286ed301e..257606e742e0 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -363,6 +363,8 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>   struct bio_list punt, nopunt;
>   struct bio *bio;
>  
> + if (WARN_ON_ONCE(!bs->rescue_workqueue))
> + return;
>   /*
>* In order to guarantee forward progress we must punt only bios that
>* were allocated from this bio_set; otherwise, if there was a bio on
> @@ -474,7 +476,8 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned int 
> nr_iovecs,
>  
>   if (current->bio_list &&
> + (!bio_list_empty(&current->bio_list[0]) ||
> -  !bio_list_empty(&current->bio_list[1])))
> +  !bio_list_empty(&current->bio_list[1])) &&
> + bs->rescue_workqueue)
>   gfp_mask &= ~__GFP_DIRECT_RECLAIM;
>  
>   p = mempool_alloc(bs->bio_pool, gfp_mask);
> @@ -1925,7 +1928,7 @@ EXPORT_SYMBOL(bioset_free);
>   * bioset_create  - Create a bio_set
>   * @pool_size:   Number of bio and bio_vecs to cache in the mempool
>   * @front_pad:   Number of bytes to allocate in front of the returned bio
> - * @flags:   Flags to modify behavior, currently only %BIOSET_NEED_BVECS
> + * @flags:   Flags to modify behavior, currently %BIOSET_NEED_BVECS and 
> %BIOSET_NEED_RESCUER
>   *
>   * Description:
>   *Set up a bio_set to be used with @bio_alloc_bioset. Allows the caller
> @@ -1936,6 +1939,8 @@ EXPORT_SYMBOL(bioset_free);
>   *or things will break badly.
>   *If %BIOSET_NEED_BVECS is set in @flags, a separate pool will be 
> allocated
>   *for allocating iovecs.  This pool is not needed e.g. for 
> bio_clone_fast().
> + *If %BIOSET_NEED_RESCUER is set, a workqueue is created which can be
> + *used to dispatch queued requests when the mempool runs out of space.
>   *
>   */
>  struct bio_set *bioset_create(unsigned int pool_size,
> @@ -1971,6 +1976,9 @@ struct bio_set *bioset_create(unsigned int pool_size,
>   goto bad;
>   }
>  
> + if (!(flags & BIOSET_NEED_RESCUER))
> + return bs;
> +
>   bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
>   if (!bs->rescue_workqueue)
>   goto bad;
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3797753f4085..ae51f159a2ca 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -730,7 +730,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t 
> gfp_mask, int node_id)
>   if (q->id < 0)
>   goto fail_q;
>  
> - q->bio_split = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
> + q->bio_split = bioset_create(BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS | 
> BIOSET_NEED_RESCUER);

Please avoid > 80 char lines.

Otherwise looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 05/13] block: Improvements to bounce-buffer handling

2017-05-02 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 

although I think we really should kill off the block level bouncing
rather sooner than later.


Re: Playing with BFQ

2017-05-02 Thread Sedat Dilek
On Tue, May 2, 2017 at 10:00 AM, Markus Trippelsdorf
 wrote:
> On 2017.05.02 at 09:54 +0200, Sedat Dilek wrote:
>> Hi,
>>
>> I want to play with BFQ.
>>
>> My base is block-next as of 28-Apr-2017.
>>
>> First I looked through the Kconfigs.
>> What is a good setting?
>> Built as module?
>>
>> How can I switch the IO-scheduler at runtime?
>>
>> Not sure if the attached patches make sense (right now).
>
> No, it doesn't make sense at all.
> BFQ is a mq scheduler. To use it you should enable
> SCSI_MQ_DEFAULT. And then you can switch schedulers like:
>
>  echo "kyber" > /sys/block/sda/queue/scheduler
>  echo "bfq" > /sys/block/sdb/queue/scheduler
>

Great, I got this information before starting a kernel build.

So, I have now...

$ ./scripts/diffconfig /boot/config-4.11.0-1-iniza-amd64 .config
-BLK_DEV_HD n
 SCSI_MQ_DEFAULT n -> y
+BFQ_GROUP_IOSCHED y
+BLK_DEV_THROTTLING_LOW n
+IOSCHED_BFQ y
+MQ_IOSCHED_KYBER y

Thanks, Markus.

- Sedat -


Playing with BFQ

2017-05-02 Thread Sedat Dilek
Hi,

I want to play with BFQ.

My base is block-next as of 28-Apr-2017.

First I looked through the Kconfigs.
What is a good setting?
Built as module?

How can I switch the IO-scheduler at runtime?

Not sure if the attached patches make sense (right now).

Please comment.


Kind Regards,
- Sedat -
From 7d922302a3af8934da46e8fc04d256025d8ddb27 Mon Sep 17 00:00:00 2001
From: Sedat Dilek 
Date: Tue, 2 May 2017 09:31:37 +0200
Subject: [PATCH] block: bfq: Move CONFIG_IOSCHED_BFQ and
 CONFIG_BFQ_GROUP_IOSCHED

---
 block/Kconfig.iosched | 40 
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index fd2cefa47d35..8ea2c8d77100 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -39,6 +39,26 @@ config CFQ_GROUP_IOSCHED
 	---help---
 	  Enable group IO scheduling in CFQ.
 
+config IOSCHED_BFQ
+tristate "BFQ I/O scheduler"
+default n
+---help---
+BFQ I/O scheduler for BLK-MQ. BFQ distributes the bandwidth of
+of the device among all processes according to their weights,
+regardless of the device parameters and with any workload. It
+also guarantees a low latency to interactive and soft
+real-time applications.  Details in
+Documentation/block/bfq-iosched.txt
+
+config BFQ_GROUP_IOSCHED
+   bool "BFQ hierarchical scheduling support"
+   depends on IOSCHED_BFQ && BLK_CGROUP
+   default n
+   ---help---
+
+   Enable hierarchical scheduling in BFQ, using the blkio
+   (cgroups-v1) or io (cgroups-v2) controller.
+
 choice
 
 	prompt "Default I/O scheduler"
@@ -79,26 +99,6 @@ config MQ_IOSCHED_KYBER
 	  synchronous writes, it will self-tune queue depths to achieve that
 	  goal.
 
-config IOSCHED_BFQ
-	tristate "BFQ I/O scheduler"
-	default n
-	---help---
-	BFQ I/O scheduler for BLK-MQ. BFQ distributes the bandwidth of
-	of the device among all processes according to their weights,
-	regardless of the device parameters and with any workload. It
-	also guarantees a low latency to interactive and soft
-	real-time applications.  Details in
-	Documentation/block/bfq-iosched.txt
-
-config BFQ_GROUP_IOSCHED
-   bool "BFQ hierarchical scheduling support"
-   depends on IOSCHED_BFQ && BLK_CGROUP
-   default n
-   ---help---
-
-   Enable hierarchical scheduling in BFQ, using the blkio
-   (cgroups-v1) or io (cgroups-v2) controller.
-
 endmenu
 
 endif
-- 
2.11.0

From da3379424fc0237d5de25e946ed5482a625aaac2 Mon Sep 17 00:00:00 2001
From: Sedat Dilek 
Date: Tue, 2 May 2017 09:33:27 +0200
Subject: [PATCH] block: bfq: Introduce DEFAULT_BFQ

---
 block/Kconfig.iosched | 4 
 1 file changed, 4 insertions(+)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 8ea2c8d77100..f37455dcf381 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -73,6 +73,9 @@ choice
 	config DEFAULT_CFQ
 		bool "CFQ" if IOSCHED_CFQ=y
 
+	config DEFAULT_BFQ
+		bool "BFQ" if IOSCHED_BFQ=y
+
 	config DEFAULT_NOOP
 		bool "No-op"
 
@@ -82,6 +85,7 @@ config DEFAULT_IOSCHED
 	string
 	default "deadline" if DEFAULT_DEADLINE
 	default "cfq" if DEFAULT_CFQ
+	default "bfq" if DEFAULT_BFQ
 	default "noop" if DEFAULT_NOOP
 
 config MQ_IOSCHED_DEADLINE
-- 
2.11.0



Re: [PATCH 0/9] block: T10/DIF Fixes and cleanups v3

2017-05-02 Thread Christoph Hellwig
Hi Dmitry,

can you resend this series?  I really think we should get this into
4.12 at least.


Re: [PATCH 4/6] mtip32xx: convert internal command issue to block IO path

2017-05-02 Thread Christoph Hellwig
Looks fine for now:

Reviewed-by: Christoph Hellwig 

But rather sooner than later we need to make this path at least go
through the normal end_request processing.  Without that we're just
bound to run into problems like we had with the tag changes again
when the driver is using the block layer APIs incompletely.


Re: [PATCH 5/6] blk-mq-sched: remove hack that bypasses scheduler for reserved requests

2017-05-02 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 25/27] block: remove the discard_zeroes_data flag

2017-05-02 Thread Nicholas A. Bellinger
On Mon, 2017-05-01 at 23:43 -0700, Nicholas A. Bellinger wrote:
> On Mon, 2017-05-01 at 20:45 +, Bart Van Assche wrote:
> > On Wed, 2017-04-05 at 19:21 +0200, Christoph Hellwig wrote:
> > > Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
> > > kill this hack.
> > > 
> > > Signed-off-by: Christoph Hellwig 
> > > Reviewed-by: Martin K. Petersen 
> > > Reviewed-by: Hannes Reinecke 
> > > [ ... ]
> > > diff --git a/drivers/target/target_core_device.c 
> > > b/drivers/target/target_core_device.c
> > > index c754ae33bf7b..d2f089cfa9ae 100644
> > > --- a/drivers/target/target_core_device.c
> > > +++ b/drivers/target/target_core_device.c
> > > @@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct 
> > > se_dev_attrib *attrib,
> > >   attrib->unmap_granularity = q->limits.discard_granularity / block_size;
> > >   attrib->unmap_granularity_alignment = q->limits.discard_alignment /
> > >   block_size;
> > > - attrib->unmap_zeroes_data = q->limits.discard_zeroes_data;
> > > + attrib->unmap_zeroes_data = 0;
> > >   return true;
> > >  }
> > >  EXPORT_SYMBOL(target_configure_unmap_from_queue);
> > 
> > Hello Christoph,
> > 
> > Sorry that I hadn't noticed this before but I think that this patch
> > introduces a significant performance regressions for LIO users. Before
> > this patch the LBPRZ flag was reported correctly to initiator systems
> > through the thin provisioning VPD. With this patch applied that flag
> > will always be reported as zero, forcing initiators to submit WRITE
> > commands with zeroed data buffers instead of submitting the SCSI UNMAP
> > command to block devices for which discard_zeroes_data was set. From
> > target_core_spc.c:
> > 
> > /* Thin Provisioning VPD */
> > static sense_reason_t spc_emulate_evpd_b2(struct se_cmd *cmd, unsigned char 
> > *buf)
> > {
> > [ ... ]
> > /*
> >  * The unmap_zeroes_data set means that the underlying device supports
> >  * REQ_DISCARD and has the discard_zeroes_data bit set. This satisfies
> >  * the SBC requirements for LBPRZ, meaning that a subsequent read
> >  * will return zeroes after an UNMAP or WRITE SAME (16) to an LBA
> >  * See sbc4r36 6.6.4.
> >  */
> > if (((dev->dev_attrib.emulate_tpu != 0) ||
> >  (dev->dev_attrib.emulate_tpws != 0)) &&
> >  (dev->dev_attrib.unmap_zeroes_data != 0))
> > buf[5] |= 0x04;
> > [ ... ]
> > }
> > 
> 
> According to sd_config_discard(), it's SD_LBP_WS16, SD_LBP_WS10 and
> SD_LBP_ZERO that were ever setting unmap_zeroes_data = 1 to begin with.
> For UNMAP, q->limits.discard_zeroes_data was never set.
> 
> That said, it's pretty much implied that supporting DISCARD means
> subsequent READs return zeros, so target_configure_unmap_from_queue()
> should be setting attrib->unmap_zeroes_data = 1, or just dropping it
> all-together.
> 
> Post -rc1, I'll push a patch to do the latter.
> 

Or, another option is to use bdev_write_zeroes_sectors() to determine when
dev_attrib->unmap_zeroes_data should be set.




Re: [PATCH 25/27] block: remove the discard_zeroes_data flag

2017-05-02 Thread Nicholas A. Bellinger
On Mon, 2017-05-01 at 20:45 +, Bart Van Assche wrote:
> On Wed, 2017-04-05 at 19:21 +0200, Christoph Hellwig wrote:
> > Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
> > kill this hack.
> > 
> > Signed-off-by: Christoph Hellwig 
> > Reviewed-by: Martin K. Petersen 
> > Reviewed-by: Hannes Reinecke 
> > [ ... ]
> > diff --git a/drivers/target/target_core_device.c 
> > b/drivers/target/target_core_device.c
> > index c754ae33bf7b..d2f089cfa9ae 100644
> > --- a/drivers/target/target_core_device.c
> > +++ b/drivers/target/target_core_device.c
> > @@ -851,7 +851,7 @@ bool target_configure_unmap_from_queue(struct 
> > se_dev_attrib *attrib,
> > attrib->unmap_granularity = q->limits.discard_granularity / block_size;
> > attrib->unmap_granularity_alignment = q->limits.discard_alignment /
> > block_size;
> > -   attrib->unmap_zeroes_data = q->limits.discard_zeroes_data;
> > +   attrib->unmap_zeroes_data = 0;
> > return true;
> >  }
> >  EXPORT_SYMBOL(target_configure_unmap_from_queue);
> 
> Hello Christoph,
> 
> Sorry that I hadn't noticed this before but I think that this patch
> introduces a significant performance regressions for LIO users. Before
> this patch the LBPRZ flag was reported correctly to initiator systems
> through the thin provisioning VPD. With this patch applied that flag
> will always be reported as zero, forcing initiators to submit WRITE
> commands with zeroed data buffers instead of submitting the SCSI UNMAP
> command to block devices for which discard_zeroes_data was set. From
> target_core_spc.c:
> 
> /* Thin Provisioning VPD */
> static sense_reason_t spc_emulate_evpd_b2(struct se_cmd *cmd, unsigned char 
> *buf)
> {
>   [ ... ]
>   /*
>* The unmap_zeroes_data set means that the underlying device supports
>* REQ_DISCARD and has the discard_zeroes_data bit set. This satisfies
>* the SBC requirements for LBPRZ, meaning that a subsequent read
>* will return zeroes after an UNMAP or WRITE SAME (16) to an LBA
>* See sbc4r36 6.6.4.
>*/
>   if (((dev->dev_attrib.emulate_tpu != 0) ||
>(dev->dev_attrib.emulate_tpws != 0)) &&
>(dev->dev_attrib.unmap_zeroes_data != 0))
>   buf[5] |= 0x04;
>   [ ... ]
> }
> 

According to sd_config_discard(), it's SD_LBP_WS16, SD_LBP_WS10 and
SD_LBP_ZERO that were ever setting unmap_zeroes_data = 1 to begin with.
For UNMAP, q->limits.discard_zeroes_data was never set.

That said, it's pretty much implied that supporting DISCARD means
subsequent READs return zeros, so target_configure_unmap_from_queue()
should be setting attrib->unmap_zeroes_data = 1, or just dropping it
altogether.

Post -rc1, I'll push a patch to do the latter.



Re: [PATCH 7/7] Revert "mtip32xx: pass BLK_MQ_F_NO_SCHED"

2017-05-02 Thread Hannes Reinecke
On 04/28/2017 04:55 PM, Jens Axboe wrote:
> This reverts commit 4981d04dd8f1ab19e2cce008da556d7f099b6e68.
> 
> The driver has been converted to using the proper infrastructure
> for issuing internal commands. This means it's now safe to use with
> the scheduling infrastruture, so we can now revert the change
> that turned off scheduling for mtip32xx.
> 
> Reviewed-by: Ming Lei 
> Signed-off-by: Jens Axboe 
> ---
>  drivers/block/mtip32xx/mtip32xx.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/block/mtip32xx/mtip32xx.c 
> b/drivers/block/mtip32xx/mtip32xx.c
> index 9780c4d3697c..c598c5ac2fd0 100644
> --- a/drivers/block/mtip32xx/mtip32xx.c
> +++ b/drivers/block/mtip32xx/mtip32xx.c
> @@ -3964,7 +3964,7 @@ static int mtip_block_initialize(struct driver_data *dd)
>   dd->tags.reserved_tags = 1;
>   dd->tags.cmd_size = sizeof(struct mtip_cmd);
>   dd->tags.numa_node = dd->numa_node;
> - dd->tags.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NO_SCHED;
> + dd->tags.flags = BLK_MQ_F_SHOULD_MERGE;
>   dd->tags.driver_data = dd;
>   dd->tags.timeout = MTIP_NCQ_CMD_TIMEOUT_MS;
>  
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 6/7] blk-mq-sched: remove hack that bypasses scheduler for reserved requests

2017-05-02 Thread Hannes Reinecke
On 04/28/2017 04:55 PM, Jens Axboe wrote:
> We have update the troublesome driver (mtip32xx) to deal with this
> appropriately. So kill the hack that bypassed scheduler allocation
> and insertion for reserved requests.
> 
updated?

> Signed-off-by: Jens Axboe 
> ---
>  block/blk-mq-sched.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 8b361e192e8a..e79e9f18d7c2 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -82,11 +82,7 @@ struct request *blk_mq_sched_get_request(struct 
> request_queue *q,
>   if (likely(!data->hctx))
>   data->hctx = blk_mq_map_queue(q, data->ctx->cpu);
>  
> - /*
> -  * For a reserved tag, allocate a normal request since we might
> -  * have driver dependencies on the value of the internal tag.
> -  */
> - if (e && !(data->flags & BLK_MQ_REQ_RESERVED)) {
> + if (e) {
>   data->flags |= BLK_MQ_REQ_INTERNAL;
>  
>   /*
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes


Re: [PATCH 5/7] mtip32xx: convert internal command issue to block IO path

2017-05-02 Thread Hannes Reinecke
On 04/28/2017 04:55 PM, Jens Axboe wrote:
> The driver special cases certain things for command issue, depending
> on whether it's an internal command or not. Make the internal commands
> use the regular infrastructure for issuing IO.
> 
> Since this is an 8-group souped up AHCI variant, we have to deal
> with NCQ vs non-queueable commands. Do this from the queue_rq
> handler, by backing off unless the drive is idle.
> 
> Signed-off-by: Jens Axboe 
> ---
>  drivers/block/mtip32xx/mtip32xx.c | 106 
> +++---
>  1 file changed, 76 insertions(+), 30 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes