Re: [PATCH v4 0/3] AIO add per-command iopriority

2018-05-17 Thread Jens Axboe
On 5/17/18 2:38 PM, adam.manzana...@wdc.com wrote:
> From: Adam Manzanares 
> 
> This is the per-I/O equivalent of the ioprio_set system call.
> See the following link for performance implications on a SATA HDD:
> https://lkml.org/lkml/2016/12/6/495
> 
> First patch factors ioprio_check_cap function out of ioprio_set system call to
> also be used by the aio ioprio interface.
> 
> Second patch converts kiocb ki_hint field to a u16 to avoid kiocb bloat.
> 
> Third patch passes ioprio hint from aio iocb to kiocb and enables block_dev
> usage of the per I/O ioprio feature.
> 
> v2: merge patches
> use IOCB_FLAG_IOPRIO
> validate intended use with IOCB_IOPRIO
> add linux-api and linux-block to cc
> 
> v3: add ioprio_check_cap function
> convert kiocb ki_hint to u16
> use ioprio_check_cap when adding ioprio to kiocb in aio.c
> 
> v4: handle IOCB_IOPRIO in aio_prep_rw
> note patch 3 depends on patch 1 in commit msg
> 
> Adam Manzanares (3):
>   block: add ioprio_check_cap function
>   fs: Convert kiocb rw_hint from enum to u16
>   fs: Add aio iopriority support for block_dev
> 
>  block/ioprio.c   | 22 --
>  fs/aio.c | 16 
>  fs/block_dev.c   |  2 ++
>  include/linux/fs.h   | 17 +++--
>  include/linux/ioprio.h   |  2 ++
>  include/uapi/linux/aio_abi.h |  1 +
>  6 files changed, 52 insertions(+), 8 deletions(-)

This looks fine to me now. I can pick up #1 for 4.18 - and 2+3 as well,
unless someone else wants to take them.

-- 
Jens Axboe



Re: [PATCH V6 11/11] nvme: pci: support nested EH

2018-05-17 Thread Christoph Hellwig
On Wed, May 16, 2018 at 08:20:31PM -0600, Keith Busch wrote:
> On Thu, May 17, 2018 at 07:10:59AM +0800, Ming Lei wrote:
> > All simulation in block/011 may happen in reality.
> 
> If this test actually simulates reality, then the following one line
> patch (plus explanation for why) would be a real "fix" as this is very
> successful in passing block/011. :)
> 
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 1faa32cd07da..dcc5746304c4 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2118,6 +2118,12 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>  
>   if (pci_enable_device_mem(pdev))
>   return result;
> + /*
> +  * blktests block/011 disables the device without the driver knowing.
> +  * We'll just enable the device twice to get the enable_cnt > 1
> +  * so that the test's disabling does absolutely nothing.
> +  */
> + pci_enable_device_mem(pdev);

Heh.  But yes, this test and the PCI "enable" interface in sysfs look
horribly wrong. pci_disable_device/pci_enable_device aren't something we
can just do underneath the driver.  Even if injecting the low-level
config space manipulations done by it might be useful and a good test,
the Linux state ones are just killing the driver.

The enable attribute appears to have been added by Arjan for the
Xorg driver.  I think if we have a driver bound to the device we
should not allow it.
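
For illustration, a minimal sketch of what such a guard could look like in the
sysfs "enable" store handler (hypothetical code, not the actual
drivers/pci/pci-sysfs.c implementation; the pdev->driver check is the
assumption here):

#include <linux/device.h>
#include <linux/pci.h>

/* Hypothetical sketch: refuse the sysfs "enable" toggle while a driver owns
 * the device, instead of letting userspace disable it underneath the driver. */
static ssize_t enable_store(struct device *dev, struct device_attribute *attr,
			    const char *buf, size_t count)
{
	struct pci_dev *pdev = to_pci_dev(dev);
	unsigned long val;
	ssize_t result = 0;

	if (kstrtoul(buf, 0, &val) < 0)
		return -EINVAL;

	device_lock(dev);
	if (pdev->driver) {
		/* A driver is bound; don't pull the device out from under it. */
		result = -EBUSY;
	} else if (val) {
		result = pci_enable_device(pdev);
	} else if (pci_is_enabled(pdev)) {
		pci_disable_device(pdev);
	}
	device_unlock(dev);

	return result < 0 ? result : count;
}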


[PATCH v2] bcache: return 0 from bch_debug_init() if CONFIG_DEBUG_FS=n

2018-05-17 Thread Coly Li
Commit 539d39eb2708 ("bcache: fix wrong return value in bch_debug_init()")
returns the return value of debugfs_create_dir() to bcache_init(). When
CONFIG_DEBUG_FS=n, bch_debug_init() always returns 1 and makes
bcache_init() fail.

This patch makes bch_debug_init() always return 0 if CONFIG_DEBUG_FS=n,
so bcache can continue to work for kernels which don't have debugfs
enabled.

Fixes: Commit 539d39eb2708 ("bcache: fix wrong return value in bch_debug_init()")
Cc: sta...@vger.kernel.org
Signed-off-by: Coly Li 
Reported-by: Massimo B. 
Reported-by: Kai Krakow 
Tested-by: Kai Krakow 
Cc: Kent Overstreet 
---
 drivers/md/bcache/bcache.h | 5 +
 drivers/md/bcache/debug.c  | 8 
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 3a0cfb237af9..5b3fe87f32ee 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -994,8 +994,13 @@ void bch_open_buckets_free(struct cache_set *);
 
 int bch_cache_allocator_start(struct cache *ca);
 
+#ifdef CONFIG_DEBUG_FS
 void bch_debug_exit(void);
 int bch_debug_init(struct kobject *);
+#else
+static inline void bch_debug_exit(void) {};
+static inline int bch_debug_init(struct kobject *kobj) { return 0; };
+#endif
 void bch_request_exit(void);
 int bch_request_init(void);
 
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 4e63c6f6c04d..20e5e524e88e 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -17,8 +17,6 @@
 #include 
 #include 
 
-struct dentry *bcache_debug;
-
 #ifdef CONFIG_BCACHE_DEBUG
 
 #define for_each_written_bset(b, start, i) \
@@ -151,6 +149,8 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
 
 /* XXX: cache set refcounting */
 
+struct dentry *bcache_debug;
+
 struct dump_iterator {
charbuf[PAGE_SIZE];
size_t  bytes;
@@ -240,8 +240,6 @@ void bch_debug_init_cache_set(struct cache_set *c)
}
 }
 
-#endif
-
 void bch_debug_exit(void)
 {
if (!IS_ERR_OR_NULL(bcache_debug))
@@ -254,3 +252,5 @@ int __init bch_debug_init(struct kobject *kobj)
 
return IS_ERR_OR_NULL(bcache_debug);
 }
+
+#endif
-- 
2.16.3



Re: [PATCH V6 11/11] nvme: pci: support nested EH

2018-05-17 Thread Johannes Thumshirn
On Thu, May 17, 2018 at 08:20:51AM -0600, Keith Busch wrote:
> > Heh.  But yes, this test and the PCI "enable" interface in sysfs look
> > horribly wrong. pci_disable_device/pci_enable_device aren't something we
> > can just do underneath the driver.  Even if injecting the low-level
> > config space manipulations done by it might be useful and a good test,
> > the Linux state ones are just killing the driver.
> 
> Yes, I'm totally fine with injecting errors into the config space, but
> for goodness sakes, don't fuck with the internal kernel structures out
> from under drivers using them.
> 
> My suggestion is to use 'setpci' to disable the device if you want to
> inject this scenario. That way you get the desired broken device
> scenario you want to test, but struct pci_dev hasn't been altered.
>  
> > The enable attribute appears to have been added by Arjan for the
> > Xorg driver.  I think if we have a driver bound to the device we
> > should not allow it.
> 
> Agreed. Alternatively, possibly call the driver's reset_prepare/done
> callbacks.

Exactly, but as long as we can issue the reset via sysfs the test-case
is still valid.

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH blktests] Documentation: document prerequisite scriptlets

2018-05-17 Thread Johannes Thumshirn
Omar, ping?
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH V6 11/11] nvme: pci: support nested EH

2018-05-17 Thread Keith Busch
On Thu, May 17, 2018 at 10:41:29AM +0200, Christoph Hellwig wrote:
> On Wed, May 16, 2018 at 08:20:31PM -0600, Keith Busch wrote:
> > On Thu, May 17, 2018 at 07:10:59AM +0800, Ming Lei wrote:
> > > All simulation in block/011 may happen in reality.
> > 
> > If this test actually simulates reality, then the following one line
> > patch (plus explanation for why) would be a real "fix" as this is very
> > successful in passing block/011. :)
> > 
> > ---
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 1faa32cd07da..dcc5746304c4 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -2118,6 +2118,12 @@ static int nvme_pci_enable(struct nvme_dev *dev)
> >  
> > if (pci_enable_device_mem(pdev))
> > return result;
> > +   /*
> > +* blktests block/011 disables the device without the driver knowing.
> > +* We'll just enable the device twice to get the enable_cnt > 1
> > +* so that the test's disabling does absolutely nothing.
> > +*/
> > +   pci_enable_device_mem(pdev);
> 
> Heh.  But yes, this test and the PCI "enable" interface in sysfs look
> horribly wrong. pci_disable_device/pci_enable_device aren't something we
> can just do underneath the driver.  Even if injecting the low-level
> config space manipulations done by it might be useful and a good test,
> the Linux state ones are just killing the driver.

Yes, I'm totally fine with injecting errors into the config space, but
for goodness sakes, don't fuck with the internal kernel structures out
from under drivers using them.

My suggestion is to use 'setpci' to disable the device if you want to
inject this scenario. That way you get the desired broken device
scenario you want to test, but struct pci_dev hasn't been altered.
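
For reference, a rough userspace sketch of that kind of config-space-only
injection (hypothetical device address, little-endian host assumed; roughly
what "setpci -s 0000:01:00.0 COMMAND=0x0" does), which leaves struct pci_dev
untouched:

/* Sketch only: clear I/O space, memory space and bus-master enable in the
 * PCI command register via the device's sysfs config file, so the device
 * stops responding without the kernel's struct pci_dev being touched. */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical BDF; substitute the device under test. */
	const char *cfg = "/sys/bus/pci/devices/0000:01:00.0/config";
	uint16_t cmd;
	int fd = open(cfg, O_RDWR);

	if (fd < 0 || pread(fd, &cmd, sizeof(cmd), 0x04) != sizeof(cmd))
		return 1;

	cmd &= ~0x7;	/* bit 0: I/O space, bit 1: memory space, bit 2: bus master */
	if (pwrite(fd, &cmd, sizeof(cmd), 0x04) != sizeof(cmd))
		return 1;

	close(fd);
	return 0;
}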
 
> The enable attribute appears to have been added by Arjan for the
> Xorg driver.  I think if we have a driver bound to the device we
> should not allow it.

Agreed. Alternatively, possibly call the driver's reset_prepare/done
callbacks.


Re: [PATCH 04/33] fs: remove the buffer_unwritten check in page_seek_hole_data

2018-05-17 Thread Andreas Grünbacher
2018-05-09 9:48 GMT+02:00 Christoph Hellwig :
> We only call into this function through the iomap iterators, so we already
> know the buffer is unwritten.  In addition to that we always require the
> uptodate flag that is ORed with the result anyway.

Please update the page_cache_seek_hole_data description as well:

--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -647,8 +647,8 @@
  * Seek for SEEK_DATA / SEEK_HOLE in the page cache.
  *
  * Within unwritten extents, the page cache determines which parts are holes
- * and which are data: unwritten and uptodate buffer heads count as data;
- * everything else counts as a hole.
+ * and which are data: uptodate buffer heads count as data; everything else
+ * counts as a hole.
  *
  * Returns the resulting offset on successs, and -ENOENT otherwise.
  */

Thanks,
Andreas


Re: [PATCH v2] bcache: return 0 from bch_debug_init() if CONFIG_DEBUG_FS=n

2018-05-17 Thread Kent Overstreet
On Thu, May 17, 2018 at 05:53:48PM +0800, Coly Li wrote:
> Commit 539d39eb2708 ("bcache: fix wrong return value in bch_debug_init()")
> returns the return value of debugfs_create_dir() to bcache_init(). When
> CONFIG_DEBUG_FS=n, bch_debug_init() always returns 1 and makes
> bcache_init() fail.
> 
> This patch makes bch_debug_init() always return 0 if CONFIG_DEBUG_FS=n,
> so bcache can continue to work for kernels which don't have debugfs
> enabled.
> 
> Fixes: Commit 539d39eb2708 ("bcache: fix wrong return value in bch_debug_init()")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Coly Li 
> Reported-by: Massimo B. 
> Reported-by: Kai Krakow 
> Tested-by: Kai Krakow 
> Cc: Kent Overstreet 
> ---
>  drivers/md/bcache/bcache.h | 5 +
>  drivers/md/bcache/debug.c  | 8 
>  2 files changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 3a0cfb237af9..5b3fe87f32ee 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -994,8 +994,13 @@ void bch_open_buckets_free(struct cache_set *);
>  
>  int bch_cache_allocator_start(struct cache *ca);
>  
> +#ifdef CONFIG_DEBUG_FS

you could just use if (IS_ENABLED(CONFIG_DEBUG_FS)) in bch_debug_init, that way
you don't need to add an #ifdef

>  void bch_debug_exit(void);
>  int bch_debug_init(struct kobject *);
> +#else
> +static inline void bch_debug_exit(void) {};
> +static inline int bch_debug_init(struct kobject *kobj) { return 0; };
> +#endif
>  void bch_request_exit(void);
>  int bch_request_init(void);
>  
> diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
> index 4e63c6f6c04d..20e5e524e88e 100644
> --- a/drivers/md/bcache/debug.c
> +++ b/drivers/md/bcache/debug.c
> @@ -17,8 +17,6 @@
>  #include 
>  #include 
>  
> -struct dentry *bcache_debug;
> -
>  #ifdef CONFIG_BCACHE_DEBUG
>  
>  #define for_each_written_bset(b, start, i)   \
> @@ -151,6 +149,8 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
>  
>  /* XXX: cache set refcounting */
>  
> +struct dentry *bcache_debug;
> +
>  struct dump_iterator {
>   charbuf[PAGE_SIZE];
>   size_t  bytes;
> @@ -240,8 +240,6 @@ void bch_debug_init_cache_set(struct cache_set *c)
>   }
>  }
>  
> -#endif
> -
>  void bch_debug_exit(void)
>  {
>   if (!IS_ERR_OR_NULL(bcache_debug))
> @@ -254,3 +252,5 @@ int __init bch_debug_init(struct kobject *kobj)
>  
>   return IS_ERR_OR_NULL(bcache_debug);
>  }
> +
> +#endif
> -- 
> 2.16.3
> 


Re: [PATCH v3] bcache: return 0 from bch_debug_init() if CONFIG_DEBUG_FS=n

2018-05-17 Thread Kent Overstreet
On Thu, May 17, 2018 at 10:32:54PM +0800, Coly Li wrote:
> Commit 539d39eb2708 ("bcache: fix wrong return value in bch_debug_init()")
> returns the return value of debugfs_create_dir() to bcache_init(). When
> CONFIG_DEBUG_FS=n, bch_debug_init() always returns 1 and makes
> bcache_init() fail.
> 
> This patch makes bch_debug_init() always return 0 if CONFIG_DEBUG_FS=n,
> so bcache can continue to work for kernels which don't have debugfs
> enabled.
> 
> Fixes: Commit 539d39eb2708 ("bcache: fix wrong return value in bch_debug_init()")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Coly Li 
> Reported-by: Massimo B. 
> Reported-by: Kai Krakow 
> Tested-by: Kai Krakow 
> Cc: Kent Overstreet 

Acked-by: Kent Overstreet 


[PATCH v3] bcache: return 0 from bch_debug_init() if CONFIG_DEBUG_FS=n

2018-05-17 Thread Coly Li
Commit 539d39eb2708 ("bcache: fix wrong return value in bch_debug_init()")
returns the return value of debugfs_create_dir() to bcache_init(). When
CONFIG_DEBUG_FS=n, bch_debug_init() always returns 1 and makes
bcache_init() fail.

This patch makes bch_debug_init() always return 0 if CONFIG_DEBUG_FS=n,
so bcache can continue to work for kernels which don't have debugfs
enabled.

Fixes: Commit 539d39eb2708 ("bcache: fix wrong return value in bch_debug_init()")
Cc: sta...@vger.kernel.org
Signed-off-by: Coly Li 
Reported-by: Massimo B. 
Reported-by: Kai Krakow 
Tested-by: Kai Krakow 
Cc: Kent Overstreet 
---
 drivers/md/bcache/debug.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 4e63c6f6c04d..d030ce3025a6 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -250,7 +250,9 @@ void bch_debug_exit(void)
 
 int __init bch_debug_init(struct kobject *kobj)
 {
-   bcache_debug = debugfs_create_dir("bcache", NULL);
+   if (!IS_ENABLED(CONFIG_DEBUG_FS))
+   return 0;
 
+   bcache_debug = debugfs_create_dir("bcache", NULL);
return IS_ERR_OR_NULL(bcache_debug);
 }
-- 
2.16.3



Re: [PATCH v2] bcache: return 0 from bch_debug_init() if CONFIG_DEBUG_FS=n

2018-05-17 Thread Coly Li
On 2018/5/17 7:35 PM, Kent Overstreet wrote:
> On Thu, May 17, 2018 at 05:53:48PM +0800, Coly Li wrote:
>> Commit 539d39eb2708 ("bcache: fix wrong return value in bch_debug_init()")
>> returns the return value of debugfs_create_dir() to bcache_init(). When
>> CONFIG_DEBUG_FS=n, bch_debug_init() always returns 1 and makes
>> bcache_init() fail.
>>
>> This patch makes bch_debug_init() always return 0 if CONFIG_DEBUG_FS=n,
>> so bcache can continue to work for kernels which don't have debugfs
>> enabled.
>>
>> Fixes: Commit 539d39eb2708 ("bcache: fix wrong return value in bch_debug_init()")
>> Cc: sta...@vger.kernel.org
>> Signed-off-by: Coly Li 
>> Reported-by: Massimo B. 
>> Reported-by: Kai Krakow 
>> Tested-by: Kai Krakow 
>> Cc: Kent Overstreet 
>> ---
>>  drivers/md/bcache/bcache.h | 5 +
>>  drivers/md/bcache/debug.c  | 8 
>>  2 files changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>> index 3a0cfb237af9..5b3fe87f32ee 100644
>> --- a/drivers/md/bcache/bcache.h
>> +++ b/drivers/md/bcache/bcache.h
>> @@ -994,8 +994,13 @@ void bch_open_buckets_free(struct cache_set *);
>>  
>>  int bch_cache_allocator_start(struct cache *ca);
>>  
>> +#ifdef CONFIG_DEBUG_FS
> 
> you could just use if (IS_ENABLED(CONFIG_DEBUG_FS)) in bch_debug_init, that way
> you don't need to add an #ifdef
> 

Yes, it is much simpler. v3 patch resent for your review.

Thanks for the hint.

Coly Li


Re: [PATCH 1/1] bcache: return 0 from bch_debug_init() if CONFIG_DEBUG_FS=n

2018-05-17 Thread Jens Axboe
On 5/17/18 9:33 AM, Coly Li wrote:
> Commit 539d39eb2708 ("bcache: fix wrong return value in bch_debug_init()")
> returns the return value of debugfs_create_dir() to bcache_init(). When
> CONFIG_DEBUG_FS=n, bch_debug_init() always returns 1 and makes
> bcache_init() fail.
> 
> This patch makes bch_debug_init() always return 0 if CONFIG_DEBUG_FS=n,
> so bcache can continue to work for kernels which don't have debugfs
> enabled.

Applied, thanks.


-- 
Jens Axboe



[PATCH 0/1] bcache fix for 4.17-rc6

2018-05-17 Thread Coly Li
Hi Jens,

We have reports that bcache does not work when CONFIG_DEBUG_FS=n, here is
the bug report on bugzilla.kernel.org:
https://bugzilla.kernel.org/show_bug.cgi?id=199683
This problem has existed since v4.16 and I believe it still happens in v4.17.

This submission has one patch to fix this issue; the patch is tested by
one of the reporters. If it is not too late for 4.17, could you please
pick it up for 4.17-rc6?

Thanks in advance.

Coly Li
---

Coly Li (1):
  bcache: return 0 from bch_debug_init() if CONFIG_DEBUG_FS=n

 drivers/md/bcache/debug.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

-- 
2.16.3



[PATCH 1/1] bcache: return 0 from bch_debug_init() if CONFIG_DEBUG_FS=n

2018-05-17 Thread Coly Li
Commit 539d39eb2708 ("bcache: fix wrong return value in bch_debug_init()")
returns the return value of debugfs_create_dir() to bcache_init(). When
CONFIG_DEBUG_FS=n, bch_debug_init() always returns 1 and makes
bcache_init() fail.

This patch makes bch_debug_init() always return 0 if CONFIG_DEBUG_FS=n,
so bcache can continue to work for kernels which don't have debugfs
enabled.

Changelog:
v4: Add Acked-by from Kent Overstreet.
v3: Use IS_ENABLED(CONFIG_DEBUG_FS) to replace #ifdef DEBUG_FS.
v2: Remove a warning message.
v1: Initial version.

Fixes: Commit 539d39eb2708 ("bcache: fix wrong return value in bch_debug_init()")
Cc: sta...@vger.kernel.org
Signed-off-by: Coly Li 
Reported-by: Massimo B. 
Reported-by: Kai Krakow 
Tested-by: Kai Krakow 
Acked-by: Kent Overstreet 
---
 drivers/md/bcache/debug.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 4e63c6f6c04d..d030ce3025a6 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -250,7 +250,9 @@ void bch_debug_exit(void)
 
 int __init bch_debug_init(struct kobject *kobj)
 {
-   bcache_debug = debugfs_create_dir("bcache", NULL);
+   if (!IS_ENABLED(CONFIG_DEBUG_FS))
+   return 0;
 
+   bcache_debug = debugfs_create_dir("bcache", NULL);
return IS_ERR_OR_NULL(bcache_debug);
 }
-- 
2.16.3



Re: [PATCH 2/2] nbd: don't start req until after the dead connection logic

2018-05-17 Thread Bart Van Assche
On Thu, 2017-10-19 at 16:21 -0400, Josef Bacik wrote:
> + blk_mq_start_request(req);
>   if (unlikely(nsock->pending && nsock->pending != req)) {
>   blk_mq_requeue_request(req, true);
>   ret = 0;

(replying to an e-mail from seven months ago)

Hello Josef,

Are you aware that the nbd driver is one of the very few block drivers that
calls blk_mq_requeue_request() after a request has been started? I think that
can lead the block layer core to undesired behavior, e.g. the timeout
handler firing concurrently with a request being reinserted. Can you or a
colleague have a look at this? I would like to add the following code to the
block layer core and I think that the nbd driver would trigger this warning:

 void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
 {
+   WARN_ON_ONCE(old_state != MQ_RQ_COMPLETE);
+
__blk_mq_requeue_request(rq);

Thanks,

Bart.

Re: [PATCH 00/10] Misc block layer patches for bcachefs

2018-05-17 Thread Bart Van Assche
On Tue, 2018-05-08 at 21:33 -0400, Kent Overstreet wrote:
> [ ... ]

Hello Kent,

With Jens' latest for-next branch I hit the kernel warning shown below. Can
you have a look?

Thanks,

Bart.


==
BUG: KASAN: use-after-free in bio_advance+0x110/0x1b0
Read of size 4 at addr 880156c5e6d0 by task ksoftirqd/10/72

CPU: 10 PID: 72 Comm: ksoftirqd/10 Tainted: GW 4.17.0-rc4-dbg+ 
#5
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.0.0-prebuilt.qemu-project.org 04/01/2014
Call Trace:
dump_stack+0x9a/0xeb
print_address_description+0x65/0x270
kasan_report+0x232/0x350
bio_advance+0x110/0x1b0
blk_update_request+0x9d/0x5a0
scsi_end_request+0x4c/0x300 [scsi_mod]
scsi_io_completion+0x71e/0xa40 [scsi_mod]
__blk_mq_complete_request+0x143/0x220
srp_recv_done+0x454/0x1100 [ib_srp]
__ib_process_cq+0x9a/0xf0 [ib_core]
ib_poll_handler+0x2d/0x90 [ib_core]
irq_poll_softirq+0xe5/0x1e0
__do_softirq+0x112/0x5f0
run_ksoftirqd+0x29/0x50
smpboot_thread_fn+0x30f/0x410
kthread+0x1b2/0x1d0
ret_from_fork+0x24/0x30

Allocated by task 1356:
kasan_kmalloc+0xa0/0xd0
kmem_cache_alloc+0xed/0x320
mempool_alloc+0xc6/0x210
bio_alloc_bioset+0x128/0x2d0
submit_bh_wbc+0x95/0x2d0
__block_write_full_page+0x2a6/0x5c0
__writepage+0x37/0x80
write_cache_pages+0x305/0x7c0
generic_writepages+0xb9/0x110
do_writepages+0x96/0x180
__filemap_fdatawrite_range+0x162/0x1b0
file_write_and_wait_range+0x4d/0xb0
blkdev_fsync+0x3c/0x70
do_fsync+0x33/0x60
__x64_sys_fsync+0x18/0x20
do_syscall_64+0x6d/0x220
entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 72:
__kasan_slab_free+0x130/0x180
kmem_cache_free+0xcd/0x380
blk_update_request+0xc4/0x5a0
blk_update_request+0xc4/0x5a0
scsi_end_request+0x4c/0x300 [scsi_mod]
scsi_io_completion+0x71e/0xa40 [scsi_mod]
__blk_mq_complete_request+0x143/0x220
srp_recv_done+0x454/0x1100 [ib_srp]
__ib_process_cq+0x9a/0xf0 [ib_core]
ib_poll_handler+0x2d/0x90 [ib_core]
irq_poll_softirq+0xe5/0x1e0
__do_softirq+0x112/0x5f0

The buggy address belongs to the object at 880156c5e640
which belongs to the cache bio-0 of size 200
The buggy address is located 144 bytes inside of
200-byte region [880156c5e640, 880156c5e708)
The buggy address belongs to the page:
page:ea00055b1780 count:1 mapcount:0 mapping: index:0x0 
compound_mapcount: 0
ib_srpt:srpt_zerolength_write: ib_srpt 10.196.159.179-24: queued zerolength 
write
flags: 0x80008100(slab|head)
raw: 80008100   000100190019
raw: ea000543a800 00020002 88015a8f3a00 
ib_srpt:srpt_zerolength_write: ib_srpt 10.196.159.179-22: queued zerolength 
write
page dumped because: kasan: bad access detected
ib_srpt:srpt_zerolength_write: ib_srpt 10.196.159.179-20: queued zerolength 
write

Memory state around the buggy address:
ib_srpt:srpt_zerolength_write: ib_srpt 10.196.159.179-18: queued zerolength 
write
880156c5e580: 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc
ib_srpt:srpt_zerolength_write_done: ib_srpt 10.196.159.179-24 wc->status 5
880156c5e600: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
ib_srpt:srpt_zerolength_write_done: ib_srpt 10.196.159.179-22 wc->status 5
>880156c5e680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ib_srpt:srpt_zerolength_write_done: ib_srpt 10.196.159.179-20 wc->status 5
^
880156c5e700: fb fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ib_srpt:srpt_zerolength_write_done: ib_srpt 10.196.159.179-18 wc->status 5
880156c5e780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ib_srpt:srpt_release_channel_work: ib_srpt 10.196.159.179-24
==

(gdb) list *(bio_advance+0x110)
0x81450090 is in bio_advance (./include/linux/bvec.h:82).
77  iter->bi_size = 0;
78  return false;
79  }
80
81  while (bytes) {
82  unsigned iter_len = bvec_iter_len(bv, *iter);
83  unsigned len = min(bytes, iter_len);
84
85  bytes -= len;
86  iter->bi_size -= len;








Re: [PATCH 2/2] nbd: don't start req until after the dead connection logic

2018-05-17 Thread Bart Van Assche
On Thu, 2018-05-17 at 14:41 -0400, Josef Bacik wrote:
> Yup I can tell you why, on 4.11 where I originally did this work
> __blk_mq_requeue_request() did this
> 
> static void __blk_mq_requeue_request(struct request *rq)
> {
> struct request_queue *q = rq->q;
> 
> trace_block_rq_requeue(q, rq);
> rq_qos_requeue(q, &rq->issue_stat);
> blk_mq_sched_requeue_request(rq);
> 
> if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
> if (q->dma_drain_size && blk_rq_bytes(rq))
> rq->nr_phys_segments--;
> }
> }
> 
> So it was clearing the started part when it did the requeue.  If that's not what
> I'm supposed to be doing anymore then I can send a patch to fix it.  What is
> supposed to be done if I did already do blk_mq_start_request, because I can
> avoid doing the start until after that chunk of code, but there's a part further
> down that needs to have start done before we reach it, so I'll have to do
> whatever the special thing is now there.  Thanks,

Hello Josef,

After having had a closer look I think calling blk_mq_start_request() before
blk_mq_requeue_request() is fine anyway. The v4.16 block layer core defers
timeout processing until after .queue_rq() has returned, so the timeout code
shouldn't see the requests for which both blk_mq_start_request() and
blk_mq_requeue_request() are called from inside .queue_rq(). I will make sure
this behavior is preserved.

Bart.
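
To make the discussed pattern concrete, a minimal .queue_rq sketch of
start-then-requeue (hypothetical helpers; a sketch of the idea, not the
actual nbd code):

#include <linux/blk-mq.h>

/* Sketch of the pattern under discussion: start the request, then requeue it
 * when the submission path cannot make progress.  Safe on v4.16+ because
 * timeout processing is deferred until .queue_rq() has returned. */
static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
				     const struct blk_mq_queue_data *bd)
{
	struct request *req = bd->rq;

	blk_mq_start_request(req);

	if (!example_connection_alive(hctx)) {		/* hypothetical helper */
		blk_mq_requeue_request(req, true);	/* put it back and kick the list */
		return BLK_STS_OK;
	}

	return example_submit(req);			/* hypothetical helper */
}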





[PATCH v4 3/3] fs: Add aio iopriority support for block_dev

2018-05-17 Thread adam . manzanares
From: Adam Manzanares 

This is the per-I/O equivalent of the ioprio_set system call.

When IOCB_FLAG_IOPRIO is set on the iocb aio_flags field, we set the
newly added kiocb ki_ioprio field to the value in the iocb aio_reqprio field.

When a bio is created for an aio request by the block dev, we set the priority
value of the bio to the user-supplied value.

This patch depends on block: add ioprio_check_cap function

Signed-off-by: Adam Manzanares 
---
 fs/aio.c | 16 
 fs/block_dev.c   |  2 ++
 include/linux/fs.h   |  2 ++
 include/uapi/linux/aio_abi.h |  1 +
 4 files changed, 21 insertions(+)

diff --git a/fs/aio.c b/fs/aio.c
index f3eae5d5771b..ff3107aa82d5 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1451,6 +1451,22 @@ static int aio_prep_rw(struct kiocb *req, struct iocb *iocb)
if (iocb->aio_flags & IOCB_FLAG_RESFD)
req->ki_flags |= IOCB_EVENTFD;
req->ki_hint = file_write_hint(req->ki_filp);
+   if (iocb->aio_flags & IOCB_FLAG_IOPRIO) {
+   /*
+* If the IOCB_FLAG_IOPRIO flag of aio_flags is set, then
+* aio_reqprio is interpreted as an I/O scheduling
+* class and priority.
+*/
+   ret = ioprio_check_cap(iocb->aio_reqprio);
+   if (ret) {
+   pr_debug("aio ioprio check cap error\n");
+   return -EINVAL;
+   }
+
+   req->ki_ioprio = iocb->aio_reqprio;
+   req->ki_flags |= IOCB_IOPRIO;
+   }
+
ret = kiocb_set_rw_flags(req, iocb->aio_rw_flags);
if (unlikely(ret))
fput(req->ki_filp);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 7ec920e27065..970bef79caa6 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -355,6 +355,8 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, int nr_pages)
bio->bi_write_hint = iocb->ki_hint;
bio->bi_private = dio;
bio->bi_end_io = blkdev_bio_end_io;
+   if (iocb->ki_flags & IOCB_IOPRIO)
+   bio->bi_ioprio = iocb->ki_ioprio;
 
ret = bio_iov_iter_get_pages(bio, iter);
if (unlikely(ret)) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7a90ce387e00..3415e83f6350 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -294,6 +294,7 @@ enum rw_hint {
 #define IOCB_SYNC  (1 << 5)
 #define IOCB_WRITE (1 << 6)
 #define IOCB_NOWAIT(1 << 7)
+#define IOCB_IOPRIO(1 << 8)
 
 struct kiocb {
struct file *ki_filp;
@@ -302,6 +303,7 @@ struct kiocb {
void*private;
int ki_flags;
u16 ki_hint;
+   u16 ki_ioprio; /* See linux/ioprio.h */
 } __randomize_layout;
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
diff --git a/include/uapi/linux/aio_abi.h b/include/uapi/linux/aio_abi.h
index 2c0a3415beee..d4e768d55d14 100644
--- a/include/uapi/linux/aio_abi.h
+++ b/include/uapi/linux/aio_abi.h
@@ -55,6 +55,7 @@ enum {
  *   is valid.
  */
 #define IOCB_FLAG_RESFD(1 << 0)
+#define IOCB_FLAG_IOPRIO   (1 << 1)
 
 /* read() from /dev/aio returns these structures. */
 struct io_event {
-- 
2.15.1



[PATCH v4 1/3] block: add ioprio_check_cap function

2018-05-17 Thread adam . manzanares
From: Adam Manzanares 

Aio per command iopriority support introduces a second interface between
userland and the kernel capable of passing iopriority. The aio interface also
needs the ability to verify that the submitting context has sufficient
privileges to submit IOPRIO_RT commands. This patch creates the
ioprio_check_cap function to be used by the ioprio_set system call and also by
the aio interface.
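
For illustration, a small sketch of how a caller composes and validates an
ioprio value with this helper (a sketch only, assuming the existing macros in
include/linux/ioprio.h; example_build_ioprio is a made-up name):

#include <linux/ioprio.h>

/* Sketch: build a best-effort class/level pair and run the same validation
 * that the ioprio_set() syscall performs.  IOPRIO_CLASS_RT is rejected by
 * ioprio_check_cap() unless the caller has CAP_SYS_ADMIN. */
static int example_build_ioprio(void)
{
	int ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 4);
	int ret = ioprio_check_cap(ioprio);

	if (ret)
		return ret;	/* -EPERM or -EINVAL */

	return ioprio;
}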

Signed-off-by: Adam Manzanares 
---
 block/ioprio.c | 22 --
 include/linux/ioprio.h |  2 ++
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/block/ioprio.c b/block/ioprio.c
index 6f5d0b6625e3..f9821080c92c 100644
--- a/block/ioprio.c
+++ b/block/ioprio.c
@@ -61,15 +61,10 @@ int set_task_ioprio(struct task_struct *task, int ioprio)
 }
 EXPORT_SYMBOL_GPL(set_task_ioprio);
 
-SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
+int ioprio_check_cap(int ioprio)
 {
int class = IOPRIO_PRIO_CLASS(ioprio);
int data = IOPRIO_PRIO_DATA(ioprio);
-   struct task_struct *p, *g;
-   struct user_struct *user;
-   struct pid *pgrp;
-   kuid_t uid;
-   int ret;
 
switch (class) {
case IOPRIO_CLASS_RT:
@@ -92,6 +87,21 @@ SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
return -EINVAL;
}
 
+   return 0;
+}
+
+SYSCALL_DEFINE3(ioprio_set, int, which, int, who, int, ioprio)
+{
+   struct task_struct *p, *g;
+   struct user_struct *user;
+   struct pid *pgrp;
+   kuid_t uid;
+   int ret;
+
+   ret = ioprio_check_cap(ioprio);
+   if (ret)
+   return ret;
+
ret = -ESRCH;
rcu_read_lock();
switch (which) {
diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index 627efac73e6d..4a28cec49ec3 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -77,4 +77,6 @@ extern int ioprio_best(unsigned short aprio, unsigned short bprio);
 
 extern int set_task_ioprio(struct task_struct *task, int ioprio);
 
+extern int ioprio_check_cap(int ioprio);
+
 #endif
-- 
2.15.1



[PATCH v4 2/3] fs: Convert kiocb rw_hint from enum to u16

2018-05-17 Thread adam . manzanares
From: Adam Manzanares 

In order to avoid kiocb bloat for per command iopriority support, rw_hint
is converted from enum to a u16. Added a guard around ki_hint assignment.

Signed-off-by: Adam Manzanares 
---
 include/linux/fs.h | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 760d8da1b6c7..7a90ce387e00 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -284,6 +284,8 @@ enum rw_hint {
WRITE_LIFE_EXTREME  = RWH_WRITE_LIFE_EXTREME,
 };
 
+#define MAX_KI_HINT((1 << 16) - 1) /* ki_hint type is u16 */
+
 #define IOCB_EVENTFD   (1 << 0)
 #define IOCB_APPEND(1 << 1)
 #define IOCB_DIRECT(1 << 2)
@@ -299,7 +301,7 @@ struct kiocb {
void (*ki_complete)(struct kiocb *iocb, long ret, long ret2);
void*private;
int ki_flags;
-   enum rw_hintki_hint;
+   u16 ki_hint;
 } __randomize_layout;
 
 static inline bool is_sync_kiocb(struct kiocb *kiocb)
@@ -1927,12 +1929,21 @@ static inline enum rw_hint file_write_hint(struct file *file)
 
 static inline int iocb_flags(struct file *file);
 
+/* ki_hint changed from enum to u16, make sure rw_hint fits into u16 */
+static inline u16 ki_hint_valid(enum rw_hint hint)
+{
+   if (hint > MAX_KI_HINT)
+   return 0;
+
+   return hint;
+}
+
 static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
 {
*kiocb = (struct kiocb) {
.ki_filp = filp,
.ki_flags = iocb_flags(filp),
-   .ki_hint = file_write_hint(filp),
+   .ki_hint = ki_hint_valid(file_write_hint(filp)),
};
 }
 
-- 
2.15.1



[PATCH v4 0/3] AIO add per-command iopriority

2018-05-17 Thread adam . manzanares
From: Adam Manzanares 

This is the per-I/O equivalent of the ioprio_set system call.
See the following link for performance implications on a SATA HDD:
https://lkml.org/lkml/2016/12/6/495

First patch factors ioprio_check_cap function out of ioprio_set system call to
also be used by the aio ioprio interface.

Second patch converts kiocb ki_hint field to a u16 to avoid kiocb bloat.

Third patch passes ioprio hint from aio iocb to kiocb and enables block_dev
usage of the per I/O ioprio feature.
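
As a rough usage sketch (hypothetical device path; assumes this series is
applied), a userspace caller would set IOCB_FLAG_IOPRIO and fill aio_reqprio
before io_submit():

/* Userspace sketch: one aio read with a per-I/O priority.  The IOPRIO_*
 * helpers are open-coded here since there is no uapi ioprio.h to include. */
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define IOPRIO_CLASS_SHIFT		13
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_CLASS_RT			1

int main(void)
{
	aio_context_t ctx = 0;
	char buf[4096];
	struct iocb cb;
	struct iocb *cbs[1] = { &cb };
	int fd = open("/dev/nvme0n1", O_RDONLY);	/* hypothetical device */

	if (fd < 0 || syscall(SYS_io_setup, 1, &ctx) < 0)
		return 1;

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_lio_opcode = IOCB_CMD_PREAD;
	cb.aio_buf = (unsigned long)buf;
	cb.aio_nbytes = sizeof(buf);
	cb.aio_flags = IOCB_FLAG_IOPRIO;
	cb.aio_reqprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0);	/* needs CAP_SYS_ADMIN */

	return syscall(SYS_io_submit, ctx, 1, cbs) == 1 ? 0 : 1;
}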

v2: merge patches
use IOCB_FLAG_IOPRIO
validate intended use with IOCB_IOPRIO
add linux-api and linux-block to cc

v3: add ioprio_check_cap function
convert kiocb ki_hint to u16
use ioprio_check_cap when adding ioprio to kiocb in aio.c

v4: handle IOCB_IOPRIO in aio_prep_rw
note patch 3 depends on patch 1 in commit msg

Adam Manzanares (3):
  block: add ioprio_check_cap function
  fs: Convert kiocb rw_hint from enum to u16
  fs: Add aio iopriority support for block_dev

 block/ioprio.c   | 22 --
 fs/aio.c | 16 
 fs/block_dev.c   |  2 ++
 include/linux/fs.h   | 17 +++--
 include/linux/ioprio.h   |  2 ++
 include/uapi/linux/aio_abi.h |  1 +
 6 files changed, 52 insertions(+), 8 deletions(-)

-- 
2.15.1



Re: [PATCH 2/2] nbd: don't start req until after the dead connection logic

2018-05-17 Thread Josef Bacik
On Thu, May 17, 2018 at 06:21:40PM +, Bart Van Assche wrote:
> On Thu, 2017-10-19 at 16:21 -0400, Josef Bacik wrote:
> > +   blk_mq_start_request(req);
> > if (unlikely(nsock->pending && nsock->pending != req)) {
> > blk_mq_requeue_request(req, true);
> > ret = 0;
> 
> (replying to an e-mail from seven months ago)
> 
> Hello Josef,
> 
> Are you aware that the nbd driver is one of the very few block drivers that
> calls blk_mq_requeue_request() after a request has been started? I think that
> can lead the block layer core to undesired behavior, e.g. the timeout
> handler firing concurrently with a request being reinserted. Can you or a
> colleague have a look at this? I would like to add the following code to the
> block layer core and I think that the nbd driver would trigger this warning:
> 
>  void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
>  {
> +   WARN_ON_ONCE(old_state != MQ_RQ_COMPLETE);
> +
> __blk_mq_requeue_request(rq);
> 

Yup I can tell you why, on 4.11 where I originally did this work
__blk_mq_requeue_request() did this

static void __blk_mq_requeue_request(struct request *rq)
{
struct request_queue *q = rq->q;

trace_block_rq_requeue(q, rq);
rq_qos_requeue(q, &rq->issue_stat);
blk_mq_sched_requeue_request(rq);

if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
if (q->dma_drain_size && blk_rq_bytes(rq))
rq->nr_phys_segments--;
}
}

So it was clearing the started part when it did the requeue.  If that's not what
I'm supposed to be doing anymore then I can send a patch to fix it.  What is
supposed to be done if I did already do blk_mq_start_request, because I can
avoid doing the start until after that chunk of code, but there's a part further
down that needs to have start done before we reach it, so I'll have to do
whatever the special thing is now there.  Thanks,

Josef


Re: [PATCH V6 11/11] nvme: pci: support nested EH

2018-05-17 Thread Ming Lei
On Fri, May 18, 2018 at 08:19:58AM +0800, Ming Lei wrote:
> On Wed, May 16, 2018 at 08:20:31PM -0600, Keith Busch wrote:
> > Hi Ming,
> > 
> > I'm developing the answers in code to the issues you raised. It will just
> > take a moment to complete flushing those out. In the meantime I just want
> > to point out why I think block/011 isn't a real test.
> > 
> > On Thu, May 17, 2018 at 07:10:59AM +0800, Ming Lei wrote:
> > > All simulation in block/011 may happen in reality.
> > 
> > If this test actually simulates reality, then the following one line
> > patch (plus explanation for why) would be a real "fix" as this is very
> > successful in passing block/011. :)
> > 
> > ---
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 1faa32cd07da..dcc5746304c4 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -2118,6 +2118,12 @@ static int nvme_pci_enable(struct nvme_dev *dev)
> >  
> > if (pci_enable_device_mem(pdev))
> > return result;
> > +   /*
> > +* blktests block/011 disables the device without the driver knowing.
> > +* We'll just enable the device twice to get the enable_cnt > 1
> > +* so that the test's disabling does absolutely nothing.
> > +*/
> > +   pci_enable_device_mem(pdev);
> 
> What I think makes block/011 helpful is that it can trigger IO timeouts
> during reset, which can be triggered in reality too.
> 
> That is one big problem of the NVMe driver, IMO.
> 
> And this patch does fix this issue, together with other timeout-related
> races.

BTW, the above patch may not be enough to 'fix' this NVMe issue; I guess
you may have to figure out another one to remove fault-injection, or at
least disable io-timeout-fail on NVMe.

Thanks,
Ming


Re: [PATCH V6 11/11] nvme: pci: support nested EH

2018-05-17 Thread Ming Lei
On Wed, May 16, 2018 at 08:20:31PM -0600, Keith Busch wrote:
> Hi Ming,
> 
> I'm developing the answers in code to the issues you raised. It will just
> take a moment to complete flushing those out. In the meantime I just want
> to point out why I think block/011 isn't a real test.
> 
> On Thu, May 17, 2018 at 07:10:59AM +0800, Ming Lei wrote:
> > All simulation in block/011 may happen in reality.
> 
> If this test actually simulates reality, then the following one line
> patch (plus explanation for why) would be a real "fix" as this is very
> successful in passing block/011. :)
> 
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 1faa32cd07da..dcc5746304c4 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2118,6 +2118,12 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>  
>   if (pci_enable_device_mem(pdev))
>   return result;
> + /*
> +  * blktests block/011 disables the device without the driver knowing.
> +  * We'll just enable the device twice to get the enable_cnt > 1
> +  * so that the test's disabling does absolutely nothing.
> +  */
> + pci_enable_device_mem(pdev);

What I think makes block/011 helpful is that it can trigger IO timeouts
during reset, which can be triggered in reality too.

That is one big problem of the NVMe driver, IMO.

And this patch does fix this issue, together with other timeout-related
races.


Thanks,
Ming